Logistic Regression will estimate binary (Cox (1970)) and multinomial (Anderson (1972)) logistic models.
Logistic Regression is designed for analyzing the determinants of a categorical dependent variable.
Typically, the dependent variable is binary and coded as 0 or 1; however, it may be multinomial and is
often coded as an integer ranging from 1 to K but could be, for instance, coded as a series of character
strings, e.g., "Republican", "Democrat", "Independent", and "Non Voter". Studies one can conduct with
Logistic Regression include bioassay, epidemiology of disease (cohort or case-control), clinical trials,
market research, transportation research (mode of travel), psychometric studies, and voter choice
analysis.
This manual contains a brief introduction to logistic regression and a full description of the commands and
features of the module. If you are unfamiliar with logistic regression, the textbook by Hosmer and
Lemeshow (1989) is an excellent place to begin; Breslow and Day (1980) provide an introduction in the
context of case-control studies, Train (1986) and Ben-Akiva and Lerman (1985) introduce the discrete
choice model for econometrics, Wrigley (1985) discusses the model for geographers, and Hoffman and
Duncan (1988) review discrete choice in a demographic-sociological context. Valuable surveys appear in
Amemiya (1981), McFadden (1984, 1982, 1976) and Maddala (1983). This is just a small sampling from
a rather large literature; other specialty references are cited later in this chapter.
The best way to learn to use Logistic Regression is to read the QUICKSTART section which follows and
try the program out. Later you can selectively read the more detailed documentation or refer to the
appendices containing reference material on each command.
Select the [Model…] button to open the Model Setup dialog, where model parameters and other options are chosen. The binary variable TARGET, which takes on values 0 and 1, will be the dependent variable (or target). A value of 1 represents a "good" loan, that is, one that did not default (fail to repay). On the Model Setup dialog, TARGET is indicated as the dependent variable; all other variables will be considered predictors. TARGET and several of the predictors are treated as categorical, so we set the GENDER and OCCUP_BLANK variables to categorical. The Analysis Engine to select is Logistic Regression, and the Classification/Logistic Binary Target Type will then be selected automatically.
The Logistic Regression algorithm does not use a test sample in the estimation of the model, so we wish to have all the data included in the learn sample. To arrange this, we visit the Testing tab on the Model Setup dialog and make sure that "No independent testing – exploratory model" is selected.
Finally, to get the estimation started, we click the [Start] button at lower right. The data will be read from
our dataset GOODBAD.CSV, prepared for analysis, and the logistic regression model will be built:
If you prefer to use commands, the same model setup can be accomplished with just four simple
commands, followed by LOGIT GO which will launch the model estimation. USE identifies the input
dataset, MODEL identifies the target (dependent variable), CATEGORY identifies categorical variables
including the target and, if any, categorical predictors. PARTITION specifies what, if any, test and holdout
sample to use; in this case we use all data for the learn sample:
USE GOODBAD
MODEL TARGET
CATEGORY TARGET, GENDER, OCCUP_BLANK
PARTITION NONE
LOGIT GO
Once the run completes, the results are summarized in the GUI, with coefficients available on a separate tab, together with standard errors, t-ratios and p-values:
Similar to what is done for linear regression, we can identify the strongest predictors by the absolute values of their t-statistics (t-ratios) and their p-values. Predictors with high absolute t-statistics, and thus near-zero p-values, are strong predictors. In this case, POSTBIN, N_INQUIRIES, NUMCARDS and GENDER are strong predictors of whether a loan will default (fail to repay) or not.
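For readers who want to verify these quantities by hand, the relationship between a coefficient, its standard error, the t-ratio, and the two-sided p-value can be sketched in a few lines of Python (the coefficient and standard error below are made-up values, not taken from this model):

from scipy.stats import norm

coef, se = 1.20, 0.25                  # hypothetical coefficient and std. error
t_ratio = coef / se                    # 4.8
p_value = 2.0 * norm.sf(abs(t_ratio))  # two-sided p-value, ~1.6e-06
print(t_ratio, p_value)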
Coefficients are presented in the Classic Output as well, which may be useful if you wish to cut/paste
into a separate report or if you are using a non-GUI version of the SPM®:
=====================
Results of Estimation
=====================
95.0% bounds
Parameter Odds ratio Upper Lower
----------------------------------------------------------------
1 EDUCATION$ = "College 0.52725E+07 . 0.00000
2 EDUCATION$ = "HS" 0.833116E+07 . 0.00000
3 GENDER = -4 2.69807 5.36650 1.35648
4 MARITAL$ = "Married" 0.70829 1.57434 0.31866
5 MARITAL$ = "Other" 6.50126 80.31349 0.52627
6 OCCUP_BLANK = 1 5.91333 26.48617 1.32022
7 OWNRENT$ = "Other" 0.94201 19.63631 0.04519
8 OWNRENT$ = "Own" 0.94919 6.44018 0.13990
9 OWNRENT$ = "Parents" 1.19494 8.52905 0.16741
10 AGE 1.00352 1.05034 0.95879
11 CREDIT_LIMIT 1.00001 1.00004 0.99998
12 HH_SIZE 0.93004 1.14229 0.75722
13 INCOME 0.99979 0.99998 0.99960
14 N_INQUIRIES 1.42522 1.59095 1.27676
15 NUMCARDS 0.74300 0.94916 0.58161
16 TIME_EMPLOYED 1.10568 1.20046 1.01838
17 POSTBIN 3.33787 4.68044 2.38041
----------------------------------------------------------------
Log Likelihood of constants only model = ll(0) = -249.12754
2*[ll(n)-ll(0)] = 220.79564 with 17 DOF, Chi-Sq P-value = 0.00000
Mcfadden's Rho-Squared = 0.44314
-----------------------------------------------------------------------------
You may have noticed that the results of the estimation are based on 368 records, while the full GOODBAD.CSV dataset contains 664 records. The Logistic Regression method requires non-missing data for all predictors and the target, and will use listwise deletion to remove any record that contains a missing value for one or more of those variables. In this case, 296 records were removed. A report identifying which variables were responsible is available under the Record Deletions tab in the GUI:
as well as further information about which variables were missing, and how often, in the Classic Output:
=============================================
Predictors Affecting Listwise Record Deletion
=============================================
NMiss Predictor
---------------------------------------------
11 EDUCATION$
75 GENDER
1 MARITAL$
143 OWNRENT$
84 AGE
120 HH_SIZE
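To reproduce this accounting outside SPM, listwise deletion is easy to sketch in Python with pandas (assuming the CSV file is available locally; in this run every variable on the dataset entered the model, so any missing value disqualifies a record):

import pandas as pd

df = pd.read_csv("GOODBAD.CSV")   # assumes the file is in the working directory
used = df.dropna()                # listwise deletion: drop any record with a
                                  # missing value in any model variable
print(len(df), len(used))         # 664 before deletion, 368 after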
The standard error for each coefficient is the square root of the corresponding main diagonal element of
the variance-covariance matrix. By default, the variance-covariance matrix is not presented in the classic
output. However, adding the command
PRINT=LONG
will present the covariance matrix in the Classic Output as well as in the GUI:
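The relationship between the covariance matrix and the reported standard errors can be sketched as follows (the matrix entries are made-up values for illustration):

import numpy as np

# vcov: variance-covariance matrix of the estimates, as printed with PRINT=LONG
vcov = np.array([[0.0441, 0.0010],
                 [0.0010, 0.0025]])
std_errors = np.sqrt(np.diag(vcov))   # one standard error per coefficient
print(std_errors)                     # [0.21 0.05]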
In this example, our model shows an integrated ROC statistic of 0.90838 and provides a "lift" of 2.32 in the "top" 10.87% of the data. The lift drops slightly to 2.17 in the second bin, which is composed of records that are less likely to be class 1 than those in the first bin. The gains chart sorts records by predicted probability of class 1, with the most probable in bin 1 and the least probable in bin 10. Selecting class 0 changes the display to show performance when the criterion of interest is predicting class 0 instead of class 1. In this example, the top two bins both have a lift of 1.65, containing 80 records (21.74% of the data). The ROC statistic remains the same, since it is symmetric for binary (two-class) situations.
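The lift computation behind such a gains chart can be sketched as follows, with synthetic data standing in for the model's actual predictions: sort records by predicted probability of class 1, split them into ten bins, and compare each bin's response rate to the overall rate:

import numpy as np

rng = np.random.default_rng(0)               # synthetic stand-in data
p = rng.uniform(size=368)                    # predicted P(class 1) per record
y = (rng.uniform(size=368) < p).astype(int)  # observed 0/1 target

order = np.argsort(-p)                       # most probable class-1 records first
bins = np.array_split(order, 10)             # ten roughly equal-sized bins
overall = y.mean()                           # baseline response rate
lift = [y[b].mean() / overall for b in bins]
print(np.round(lift, 2))                     # bin 1 should show the largest lift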
Classification Tables
The model's classificatory power -- how well it predicts the two outcomes -- is described with a "prediction
success table" via the Confusion Matrix (formerly known as the Prediction Success) tab. Suppose class
1 is considered a "response". For each record, Logistic Regression produces a predicted response
probability, that is, the probability that that record is a 1. (In this simple binary example, the
complementary probability, that the record is a non-response or 0, is simply 1.0 minus the response probability.) In
order to classify each record, the record's response probability is compared to some threshold between
0.0 and 1.0. If the probability is greater than or equal to the threshold, the record is classified as a
response (1), otherwise it is classified as a non-response (0). By default, the Logistic Regression GUI
results use a "baseline threshold" based on the observed distribution of 0's and 1's, i.e., from the original
target. The observed target has a distribution of 217 zeroes and 151 ones. The proportion of 1's, which
forms the "baseline classification threshold", is 0.41. This is then the threshold that is used, along with the
response probability for each record, to discriminate predicted responses from predicted non-responses.
Each record will contribute its entire weight to only one of the four cells, based on the observed and
predicted classes for the record. (The case weight, or simply "weight", of each record defaults to 1.0.
However, you may specify variable case weights with the WEIGHT command in order to control the
impact each record has on the model and resulting classification tables, gains charts, etc.)
A cross tabulation, depicting predicted classes (columns) by observed classes (rows) is presented in the
GUI in this way:
Note that the threshold for discriminating a predicted response (1) from a predicted non-response (0)
defaults to the baseline of 0.41, but you can select any threshold you wish between 0.0 and 1.0. Suppose
you wish to consider how the model classifies the original data using a simple midpoint threshold of 0.50:
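The thresholding logic itself can be sketched as follows (synthetic data again; the baseline threshold is the observed share of 1's, 151/368, or about 0.41):

import numpy as np

rng = np.random.default_rng(0)                # synthetic stand-in data
p = rng.uniform(size=368)                     # predicted response probabilities
y = (rng.uniform(size=368) < p).astype(int)   # observed classes

def confusion(y, p, threshold):
    pred = (p >= threshold).astype(int)       # classify by comparing p to threshold
    table = np.zeros((2, 2), dtype=int)       # rows: observed, columns: predicted
    np.add.at(table, (y, pred), 1)            # each record adds its full weight
    return table

print(confusion(y, p, 151 / 368))             # baseline threshold, ~0.41
print(confusion(y, p, 0.50))                  # simple midpoint threshold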
If you would like to see how sensitivity and specificity change as a function of the classification threshold,
select the [Show Table] button and you will be presented with a threshold table that looks like:
The classic output presents this same table, using the midpoint threshold (0.50), with a slightly different
layout:
====================
Classification Table
====================
The above classification tables accumulate cell counts record-by-record. However, you may instead wish
to have each record contribute to multiple cells in the classification table according to (weighted by) the
predicted probabilities of that record, rather than committing to a single predicted class. In this binary
example, each record will contribute to two cells based on the record's observed class and two predicted
probabilities. This is presented in the classic output "Classification Table Using Predicted Probabilities":
==================================================
Classification Table Using Predicted Probabilities
==================================================
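A sketch of how such a probability-weighted table is accumulated (synthetic data; each record contributes its two predicted probabilities to the two cells of its observed row):

import numpy as np

rng = np.random.default_rng(0)                # synthetic stand-in data
p = rng.uniform(size=368)                     # predicted P(response)
y = (rng.uniform(size=368) < p).astype(int)   # observed classes

table = np.zeros((2, 2))                      # rows: observed, columns: predicted
for obs, prob in zip(y, p):
    table[obs, 0] += 1.0 - prob               # weight on predicted non-response
    table[obs, 1] += prob                     # weight on predicted response
print(table.round(2))                         # cells are expected, not whole, counts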
If your interest is in assessing how records are misclassified by your Logistic Regression model, you can
certainly glean that from the tables described above, either in the GUI or the classic output. However,
Logistic Regression provides a table emphasizing just misclassification (how many records are
misclassified for each target class) on the Misclassification tab. For example, at the baseline threshold (a response probability of 0.41), the model misclassifies 18.43% (N=40) of non-response records and 15.23% (N=23) of response records:
The first model considered is the simple regression of LOW on a constant and LWD, a dummy variable
indicating if LWT is less than 110 pounds. (See H&L Table 3.17.) LWD and LWT are similar variable
names (but note that LWD is set as a categorical variable). Be sure to note which is being used in the
models which follow. We estimate this model with:
or with commands:
USE "HOSLEM_CHAR.CSV"
MODEL LOW$
KEEP LWD$
CATEGORY LOW$, LWD$
LOGIT GO
The classic output presents a summary of the sample split (distribution of target classes) as well as the progress of the maximum likelihood iterations to convergence (Low Weight = 1, and OK = 0):
SAMPLE SPLIT
============
WEIGHTED WEIGHTED
CATEGORY COUNT Prop COUNT %
---------------------------------------------------------------
Low Weight | 59 0.31217 | 59.00000 0.31217
OK | 130 0.68783 | 130.00000 0.68783
-------------+----------------------+--------------------------
| 189 | 189.00000
CONVERGENCE ACHIEVED
=====================
Results of Estimation
=====================
The output begins with a listing of the dependent variable, the sample split of target classes, and a brief maximum likelihood iteration history showing the progress of the procedure to convergence. Finally, the parameter estimates, standard errors, standardized coefficients, p-values and the log-likelihood are presented.
We can evaluate these results much like a linear regression. The coefficient on LWD is large relative to its standard error (Coefficient/S.E.) and so appears to be an important predictor of low birth weight. The interpretation of the coefficient is quite different from ordinary regression, however. The coefficient tells how much the logit (the log-odds of response) increases for a unit increase in the independent variable, but the probability of the response outcome is a nonlinear function of the logit (see the Appendix for technical details).
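A small illustration of this nonlinearity (the constant below is hypothetical; the slope, roughly ln 2.868 = 1.054, is the value implied by the odds ratio reported next):

import numpy as np

b0, b1 = -1.054, 1.054                    # hypothetical constant; slope for LWD
for lwd in (0, 1):
    logit = b0 + b1 * lwd                 # the logit rises by b1 per unit of LWD
    prob = 1.0 / (1.0 + np.exp(-logit))   # probability is nonlinear in the logit
    print(lwd, round(logit, 3), round(prob, 3))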
The odds-ratio table provides a more intuitively meaningful quantity for each coefficient. The odds of the response are given by p/(1-p), where p is the probability of response, and the odds ratio is the multiplicative
factor by which the odds change when the independent variable increases by one unit. In the first model,
being a low weight mother increases the odds of a low birth weight baby by a multiplicative factor of 2.87,
with lower and upper confidence bounds of 1.41 and 5.83 respectively. Since the lower bound is greater
than one, the variable appears to represent a genuine risk factor. (See Kleinbaum, Kupper and Chambless (1982) for a discussion.)
95.0% bounds
Parameter Odds ratio Upper Lower
----------------------------------------------------------------
1 LWD$ = "Less than 100 lbs"
2.86842 5.82648 1.41215
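These bounds can be reproduced on the coefficient scale: the odds ratio is exp(coefficient) and the 95% bounds are exp(coefficient ± 1.96 × SE). A sketch that backs the implied standard error out of the printed table:

import numpy as np

odds_ratio, upper, lower = 2.86842, 5.82648, 1.41215   # from the table above
coef = np.log(odds_ratio)                              # ~1.0537
se = (np.log(upper) - np.log(lower)) / (2 * 1.96)      # implied SE, ~0.3616
print(np.exp(coef - 1.96 * se), np.exp(coef + 1.96 * se))   # recovers the bounds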
The model above contains only a constant and a single dummy variable. We now consider the addition of
the continuous variable AGE:
MODEL LOW$
KEEP LWD$, AGE
CATEGORY LOW$, LWD$
LOGIT GO
CONVERGENCE ACHIEVED
=====================
Results of Estimation
=====================
95.0% bounds
Parameter Odds ratio Upper Lower
----------------------------------------------------------------
1 LWD$ = "Less than 100 lbs"
2.74594 5.60727 1.34471
2 AGE 0.95673 1.01911 0.89817
----------------------------------------------------------------
Log Likelihood of constants only model = ll(0) = -117.33600
2*[ll(n)-ll(0)] = 10.38523 with 2 DOF, Chi-sq P-value = 0.00556
Mcfadden's Rho-Squared = 0.04425
-----------------------------------------------------------------------------
Consider the means of the independent variables overall and by target class. In this sample there is a substantial difference in mean LWD across birth weight groups, but an apparently small AGE difference. AGE is clearly not significant by conventional standards if we look at the coefficient/standard-error ratio. The confidence interval for the odds ratio (0.898, 1.019) includes 1.00, indicating no significant effect on relative risk after adjusting for LWD.
Before concluding that AGE does not belong in the model, H&L consider the interaction of AGE and LWD. To generate this, as well as some other useful output, we create a new interaction variable with a small bit of BASIC code:
%let age_lwd=0
%if lwd$="Less than 100 lbs" then let age_lwd=age
keep age lwd$ age_lwd
Logit go
=====================
Results of Estimation
=====================
95.0% bounds
Parameter Odds ratio Upper Lower
----------------------------------------------------------------
1 LWD$ = "Less than 100 lbs"
0.14312 4.20566 0.00487
2 AGE 0.92351 0.99811 0.85449
3 AGE_LWD 1.14133 1.32387 0.98396
----------------------------------------------------------------
Log Likelihood of constants only model = ll(0) = -117.33600
2*[ll(n)-ll(0)] = 13.53207 with 3 DOF, Chi-sq P-value = 0.00362
Mcfadden's Rho-Squared = 0.05766
-----------------------------------------------------------------------------
Now the AGE coefficient becomes more significant, LWD becomes less significant and the interaction is
borderline.
This statistic tests the hypothesis that all coefficients except the constant are zero, much like the F-test
reported below linear regressions. The likelihood ratio statistic (LR for short) of 13.532 is given in the
second line in the display above and is chi-squared with three degrees of freedom and a p-value of
.00362. The degrees of freedom are equal to the number of covariates in the model not including the
constant. McFadden’s Rho-squared is a transformation of the LR statistic intended to mimic an R-
squared. It is always between 0 and 1 and a higher Rho-squared corresponds to more significant results.
Rho-squared tends to be much lower than R-squared though, and a low number does not necessarily
imply a poor fit. Values between .20 and .40 are considered very satisfactory (Hensher and Johnson
(1981)).
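McFadden's Rho-squared is computed as 1 - ll(n)/ll(0); a quick check against the output above:

ll_0 = -117.33600            # log likelihood of the constants-only model
lr = 13.53207                # 2*[ll(n) - ll(0)] as reported
ll_n = ll_0 + lr / 2.0       # full-model log likelihood, ~-110.570
rho_sq = 1.0 - ll_n / ll_0   # McFadden's Rho-squared
print(round(rho_sq, 5))      # 0.05766, matching the output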
Models can also be assessed relative to one another. A likelihood ratio test is formally conducted by computing twice the difference in log-likelihoods for any pair of nested models; commonly called the G-statistic, it has degrees of freedom equal to the difference in the number of parameters estimated in the two models. Comparing the current model with the previous one we have G = 13.53207 - 10.38523 = 3.14684 with one degree of freedom, which has a p-value of .07607. This result corresponds to the bottom row of H&L's table 3.17. The conclusion of the test is that the interaction is borderline significant.
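A quick check of this computation, using scipy for the chi-squared tail probability:

from scipy.stats import chi2

g = 13.53207 - 10.38523      # difference of the two models' LR statistics
dof = 3 - 2                  # difference in number of estimated parameters
print(g, chi2.sf(g, dof))    # 3.14684, p ~ 0.0761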
==================================================
Classification Table Using Predicted Probabilities
==================================================
The rows of the table show how observations from each level of the dependent variable are allocated to
predicted outcomes. Reading across the first ("Low Weight") row we see that of the 59 cases of low birth
weight, 21.28 are correctly predicted and 37.72 are incorrectly predicted. The second row shows that of
the 130 "OK" cases, 37.72 are incorrectly predicted and 92.28 are correctly predicted.
The table also includes additional analytic results. The "Correct" row is the proportion successfully
predicted, defined as the diagonal table entry divided by the column total, and "Tot. Correct" is the ratio of
the sum of the diagonal elements in the table to the total number of observations. In the "Low Weight"
column 21.28 are correctly predicted out of a column total of 59, giving a correct rate of .36068. Overall,
21.28 + 92.28 out of a total of 189 are correct giving a total correct rate of .60085.
The "Success Index" is the gain this model shows over a purely random model which assigned the same
probability to every observation in the data. The model produces a gain of .04851 over the random model
for "Low Weight" and .02202 for "OK" classes. Based on these results, we would not think too highly of
this model.
In the biostatistical literature another terminology is used for these quantities. The "Correct" quantity is
also known as sensitivity for the RESPONSE group and specificity for the REFERENCE group. In this
example, Logistic Regression is treating "Low Weight" as the reference and "OK" as the response. Thus,
sensitivity, associated with response class "OK", is .70985, while specificity, associated with reference
class "Low Weight", is 0.36068. The FALSE REFERENCE rate is the fraction of those predicted to
respond ("OK") that actually did not respond ("Low Weight") while the FALSE RESPONSE rate is the
fraction of those predicted to not respond ("Low Weight") that actually responded ("OK"). We prefer the
Prediction Success terminology because it is applicable to the multinomial case as well (see the
Multinomial Logistic Regression section for further discussion).
Before turning to more detailed model diagnostics, we examine H&L’s final model and provide some
interpretation of the results. As a result of experimenting with more variables and a large number of
interactions H&L arrive at the following model (which includes an additional interaction between smoking
status and LWD):
use "hoslem_char.csv"
model low$
category low$, lwd$, race$
%let age_lwd=0
%if lwd$="Less than 100 lbs" then let age_lwd=age
%let smoke_lwd=0
%if lwd$="Less than 100 lbs" then let smoke_lwd=smoke
keep age race$ smoke ht ui ptd lwd$ age_lwd smoke_lwd
Logit go
CONVERGENCE ACHIEVED
=====================
Results of Estimation
=====================
This table is generated by partitioning the sample into ten groups based on the predicted probability of the
observations. The row labeled "Prb Cut Point" in the classic output version of the table gives the end
points of the cells defining a group. Thus, the first group consists of all observations with predicted
probability between zero and 0.33068, the second group covers the interval 0.33068 to 0.44101, and the
last group contains observations with predicted probability greater than 0.92576.
=====================================
Learn Deciles of Risk - Balanced Bins
=====================================
---------------------------------------------------------------------------------
Response Obs | 3.00 7.00 7.00 9.00 12.00
Exp | 2.79 5.35 8.37 9.64 11.50
Reference Obs | 10.00 7.00 10.00 8.00 6.00
Exp | 10.21 8.65 8.63 7.36 6.50
---------------------------------------------------------------------------------
Prb Cut Point | 0.33068 0.44101 0.55157 0.60069 0.67257
Avg Prob Obs | 0.23077 0.50000 0.41176 0.52941 0.66667
Exp | 0.21446 0.38248 0.49230 0.56706 0.63882
Log Odds Obs | -1.20397 0.00000 -0.35667 0.11778 0.69315
Exp | -1.29822 -0.47902 -0.03079 0.26986 0.57026
---------------------------------------------------------------------------------
HL ChiSq Comp | 0.02052 0.81857 0.44117 0.09814 0.06048
N In Bin | 13.00 14.00 17.00 17.00 18.00
% In Bin | 6.88 7.41 8.99 8.99 9.52
---------------------------------------------------------------------------------
---------------------------------------------------------------------------------
Response Obs | 14.00 15.00 19.00 20.00 24.00
Exp | 14.10 15.41 19.31 19.90 23.63
Reference Obs | 6.00 5.00 4.00 2.00 1.00
Exp | 5.90 4.59 3.69 2.10 1.37
---------------------------------------------------------------------------------
Prb Cut Point | 0.73817 0.80716 0.87674 0.92576 1.00000
Avg Prob Obs | 0.70000 0.75000 0.82609 0.90909 0.96000
Exp | 0.70511 0.77030 0.83947 0.90469 0.94521
Log Odds Obs | 0.84730 1.09861 1.55814 2.30259 3.17805
Exp | 0.87174 1.20999 1.65429 2.25047 2.84782
---------------------------------------------------------------------------------
HL ChiSq Comp | 0.00251 0.04657 0.03057 0.00494 0.10565
N In Bin | 20.00 20.00 23.00 22.00 25.00
% In Bin | 10.58 10.58 12.17 11.64 13.23
---------------------------------------------------------------------------------
Within each cell we are given a breakdown of the observed and expected occurrences of "Low Weight"
(Reference) and "OK" (Response) calculated as in the prediction success table. Expected response are
just the sum of the predicted probabilities of response in the cell. From the table it is apparent that
observed totals are close to expected totals everywhere, indicating a fairly good fit. This conclusion is
borne out by the Hosmer-Lemeshow statistic of 1.62912 which is approximately chi-squared with 8
degrees of freedom. See H&L for a discussion of this statistic. It should be noted that the Hosmer-
Lemeshow statistic will depend on the binning and that not all statistics programs will use the same
binning. Logistic Regression provides an alternative binning in which bins are equally wide (instead of
approximately equally populated as shown above). The table also provides the Pearson Chi-square and
the sum of squared deviance residuals assuming each observation has a unique covariate pattern.
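The Hosmer-Lemeshow computation can be sketched from the binned counts above; because the expected counts in the table are rounded, the components and total will differ slightly from the printed values:

import numpy as np

# observed responses, expected responses and bin sizes, copied from the table
obs = np.array([3, 7, 7, 9, 12, 14, 15, 19, 20, 24], dtype=float)
exp = np.array([2.79, 5.35, 8.37, 9.64, 11.50, 14.10, 15.41, 19.31, 19.90, 23.63])
n = np.array([13, 14, 17, 17, 18, 20, 20, 23, 22, 25], dtype=float)

comp = (obs - exp) ** 2 / (exp * (1.0 - exp / n))   # per-bin chi-square component
print(comp.round(5))                                # compare to "HL ChiSq Comp"
print(comp.sum())                                   # ~1.629 in total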
There are important differences between binary and multinomial models however. Chiefly, the multinomial
output is more complicated than that of the binary model, and care must be taken in the interpretation of
the results. Fortunately, Logistic Regression provides tools which make the task of interpretation much
easier. There is also a difference in dependent variable coding. The binary logistic regression dependent
variable is normally coded 0 or 1 (but can be any two distinct values), whereas the multinomial dependent
is often coded 1,2,...,K (but can be any K distinct values).
We will illustrate multinomial modeling with an example, emphasizing what is new in this context. If you
have not already read the section on binary logistic regression, this is a good time to do so.
Example
The data used below have been extracted from the National Longitudinal Survey of Young Men, 1979.
Information on 200 individuals is supplied on school enrollment status (NOTENR=1 if not enrolled, 0
otherwise), base-10 log of wage (LW), age, highest completed grade (EDUC), mother’s education (MED),
father’s education (FED), an index of reading material available in the home (CULTURE=1 for least, 3 for
most), mean income of persons in father’s occupation in 1960 (FOMY), an IQ measure, a race dummy
(BLACK=0 for white), a region dummy (SOUTH=0 for non-South) and the number of siblings (NSIBS).
The data appear in the SPM installation files as NLS.CSV. We estimate a model to analyze the
CULTURE variable, predicting its value with several demographic characteristics. In this example, we
ignore the fact that the dependent variable is ordinal and treat it as a nominal variable. (See Agresti
(1990) for a discussion of the distinction.)
With the Model Setup dialog, we indicate the target is CULTURE and that two predictors, MED and
FOMY, should be used:
use "nls.csv"
model culture
category culture
keep med,fomy
logit go
SAMPLE SPLIT
============
WEIGHTED WEIGHTED
CATEGORY COUNT Prop COUNT %
---------------------------------------------------------------
1 | 12 0.06000 | 12.00000 0.06000
2 | 49 0.24500 | 49.00000 0.24500
3 | 139 0.69500 | 139.00000 0.69500
-------------+----------------------+--------------------------
| 200 | 200.00000
Parameter 1 2 3 Overall
-----------------------------------------------------------------------------
Intercept 1.00000 1.00000 1.00000 1.00000
1 MED 8.75000 10.18367 11.44604 10.97500
2 FOMY 4551.50000 5368.85714 6116.13669 5839.17500
-----------------------------------------------------------------------------
CONVERGENCE ACHIEVED
=====================
Results of Estimation
=====================
The output begins with a report on the number of records read and retained for analysis and, on some hardware, a report on whether the data in its entirety could fit into RAM for faster processing. This is followed by a frequency table of the dependent variable; both weighted and unweighted counts would be provided if the WEIGHT option had been used. Next, an abbreviated history of the optimization process lists the log-likelihood at each iteration, and finally the estimation results are printed.
Note that the regression results consist of two sets of estimates, labeled CHOICE GROUP 1 and CHOICE GROUP 2. It is this multiplicity of parameter estimates that differentiates multinomial from binary logistic regression. If there had been 5 target classes, there would have been four sets of estimates, and correspondingly more output. This volume of output is what makes the results challenging to digest. The output becomes more intelligible when you realize that we have really estimated a series of binary logistic regressions simultaneously. The first sub-model consists of the two dependent variable categories 1 and 3, and the second consists of categories 2 and 3. These sub-models always include the highest level of the dependent variable as the reference class and one other level as the response class.
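A sketch of this structure (the coefficient values are hypothetical): each non-reference class receives its own linear score, the reference class is scored zero, and the log-odds of any class against the reference equal that class's score:

import numpy as np

def probabilities(x, b1, b2):
    s = np.array([x @ b1, x @ b2, 0.0])   # class scores; class 3 is the reference
    e = np.exp(s)
    return e / e.sum()                    # normalized class probabilities

x = np.array([1.0, 12.0, 5800.0])         # constant, MED, FOMY for one record
b1 = np.array([2.0, -0.30, -0.0004])      # hypothetical CHOICE GROUP 1 estimates
b2 = np.array([1.5, -0.10, -0.0002])      # hypothetical CHOICE GROUP 2 estimates
p = probabilities(x, b1, b2)
print(p.round(4), np.log(p[0] / p[2]))    # log(p1/p3) equals x @ b1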
Wald Tests
The coefficient/standard-error ratios ("t-ratios") reported next to each coefficient are a guide to the
significance of an individual parameter. But when there are more than two target classes each variable
corresponds to more than one parameter. The Wald test table automatically conducts the hypothesis test
of dropping all parameters associated with a variable and the degrees of freedom indicates how many
parameters were involved. Since in this example each variable generates two coefficients, the Wald tests
have two degrees of freedom each. Given the high individual t-ratios it is not surprising that every variable
is also significant overall.
WALD CHI-SQ
PARAMETER STATISTIC SIGNIF DOF
----------------------------------------------------------------
Intercept 12.00309 0.00247 2.00000
1 MED 12.14107 0.00231 2.00000
2 FOMY 9.45778 0.00884 2.00000
----------------------------------------------------------------
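A sketch of the underlying computation, assuming b holds a variable's two coefficients and V the matching 2x2 block of the coefficient covariance matrix (made-up values here):

import numpy as np
from scipy.stats import chi2

b = np.array([-0.25, -0.12])           # the variable's two coefficients
V = np.array([[0.0060, 0.0015],
              [0.0015, 0.0030]])       # their covariance block
wald = b @ np.linalg.solve(V, b)       # Wald statistic: b' V^-1 b
print(wald, chi2.sf(wald, df=len(b)))  # statistic and its p-value, 2 DOF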
Parameter 1 2 3 Overall
-----------------------------------------------------------------------------
Intercept 1.00000 1.00000 1.00000 1.00000
1 MED 8.75000 10.18367 11.44604 10.97500
2 FOMY 4551.50000 5368.85714 6116.13669 5839.17500
-----------------------------------------------------------------------------
The Independent Variable Means table provides means of the independent variables by target class
and overall, and can provide some insight into likely outcomes from the regression. We observe that the
highest educational and income values are associated with the most reading material in the home.
==============================
Model Prediction Success Table
==============================
Each row of the table takes all cases having a specific value of the dependent variable and shows how
the model allocates those cases across the possible outcomes. Thus in row 1, the twelve cases who
actually had CULTURE=1 were distributed by the predictive model as: 1.88 to CULTURE=1, 4.09 to
CULTURE=2, and 6.03 to CULTURE=3. These numbers are obtained by summing the predicted
probability of being in each category across all the cases with CULTURE actually equal to 1. A similar
allocation is provided for every value of the dependent variable.
The prediction success table is also bordered by additional information: row totals are observed sums and
column totals are predicted sums and will be equal for any model containing a constant. The CORRECT
row gives the ratio of the number correctly predicted in a column to the column total. Thus, among cases
for which CULTURE=1 the fraction correct is 1.8761/12=.1563, and for CULTURE=3 the ratio is
101.4862/139=.7301. The total correct (TOT. CORRECT) gives the fraction correctly predicted overall
and is computed as the sum CORRECT in each column divided by the table total. This is (1.8761 +
13.8862 + 101.4862)/200 = .5862. Finally, the success index measures the gain the model exhibits in number correctly predicted in each column over a purely random model (a model with just a constant). A purely random model would assign the same probabilities of the three outcomes, namely the predicted sample fractions, to each case, so for each column:

Success Index = CORRECT - predicted sample fraction

For example, for CULTURE=1 the index is .1563 - .06 = .0963.
Thus, the smaller the success index in each column the poorer the performance of the model, and the
index can even be negative. Normally one prediction success table is produced for each model
estimated. However, if the data has been separated into learning, test and/or holdout samples a separate
prediction success table will be produced for each portion of the data. This can provide a clear picture of
the strengths and weaknesses of the model when applied to fresh data.
Classification Tables
Classification tables are similar to prediction success tables except that predicted choices are added into
the table instead of predicted probabilities. Predicted choice is the choice with the highest probability.
Mathematically, the classification table is a prediction success table with the predicted probabilities
changed, setting the highest probability of each case to one and the other probabilities to zero.
==========================
Model Classification Table
==========================
In the absence of fractional case weighting, each cell of the main table will contain an integer instead of a
real number. All other quantities are computed as they would be for the prediction success table. In our
judgment the classification table is not as good a diagnostic tool as the prediction success table. The
option is included primarily for the binary logistic regression to provide comparability with results reported
in the literature.
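A sketch of the difference, with synthetic data: each record's predicted probabilities are collapsed to a one-hot predicted choice before tabulation:

import numpy as np

rng = np.random.default_rng(0)
probs = rng.dirichlet(alpha=[1, 2, 5], size=200)   # synthetic 3-class probabilities
obs = rng.integers(0, 3, size=200)                 # synthetic observed classes

choice = probs.argmax(axis=1)                      # predicted choice: highest probability
table = np.zeros((3, 3), dtype=int)                # rows: observed, columns: choice
np.add.at(table, (obs, choice), 1)                 # integer cell counts
print(table)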
Logistic Regression produces a predicted probability for each level of the dependent variable. If you wish to save these predictions, you may issue a SAVE command prior to the LOGIT GO command. For a multinomial Logistic Regression this will produce a dataset with the variables PROB(1) through PROB(NClasses), while for a binary logistic regression only the probability of the response, PROB (without a subscript), is saved.
For example, the following commands estimate a model and save probabilities:
use "nls.csv"
model culture
category culture
save "predictions.csv"
keep age,iq,educ
idvar fomy
logit go
The dataset predictions.csv will contain the predicted probabilities along with the variables specified by the IDVAR command. IDVAR can be used to attach to each record one or more identifying variables, or any other variable useful for subsequent analysis or merging. The abbreviated output from the above command set is:
Parameter 1 2 3 Overall
-----------------------------------------------------------------------------
Intercept 1.00000 1.00000 1.00000 1.00000
1 EDUC 12.58333 12.85714 14.43165 13.93500
2 AGE 26.25000 26.34694 25.46763 25.73000
3 IQ 95.66667 100.97959 109.43165 106.53500
-----------------------------------------------------------------------------
CONVERGENCE ACHIEVED
=====================
Results of Estimation
=====================
95.0% bounds
Parameter Odds ratio Upper Lower
----------------------------------------------------------------
Choice group: 1
1 EDUC 0.88028 1.29754 0.59720
2 AGE 1.12738 1.37623 0.92353
3 IQ 0.94950 1.00547 0.89665
Choice group: 2
1 EDUC 0.79175 0.97967 0.63987
2 AGE 1.12649 1.26057 1.00667
3 IQ 0.98383 1.01774 0.95105
----------------------------------------------------------------
Log Likelihood of constants only model = ll(0) = -153.25352
2*[ll(n)-ll(0)] = 28.55812 with 6 DOF, Chi-sq P-value = 0.00007
Mcfadden's Rho-Squared = 0.09317
-----------------------------------------------------------------------------
FOMY is the "id variable", which can be used to link this dataset to the original input dataset.
If you wish to have all of the variables on your input dataset present in the output dataset, simply use the
/COMPLETE option with the SAVE command:
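SAVE "PREDICTIONS.CSV" /COMPLETE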
PREDICTIONS.CSV will then contain many more variables. You could then use this dataset to build a second model, perhaps using the predicted probabilities (or predicted class) as part of the model:
use "nls.csv"
model culture
category culture
keep age,iq,educ
logit go
Visiting the Model Translation dialog allows us to specify which programming language we wish to use,
optionally with the name of the file that will contain the translation:
By default, the translation will be presented on screen, but you may request that it be saved to a text file
by selecting "Save Output To File" on the Model Translation dialog. While saving the translation to a
text file is particularly helpful for lengthy translations such as CART®, TreeNet® and Random Forests®,
Logistic Regression translations are quite simple and you may prefer to work with them on-screen:
Note that the model coefficients lead to two scores for target classes CULTURE=1 and CULTURE=2,
while the third target class is "left out". This left out class is characteristic of Logistic Regression models.
The expressions are used to produce a score for each of the first two classes, which are then
exponentiated and normalized to produce predicted probabilities.
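A sketch of that final step (the two score values are hypothetical):

import numpy as np

score1, score2 = -1.20, 0.35        # hypothetical values of the two expressions
e = np.exp([score1, score2, 0.0])   # exponentiate; the left-out class gets exp(0)=1
p = e / e.sum()                     # normalize to predicted probabilities
print(p.round(4), p.sum())          # three probabilities summing to 1.0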
Test sample records will have no influence on the model estimation, but performance measures for the test
sample will be computed and presented alongside comparable measures based on the learning sample.
Let's consider a dataset concerning "spam" email named SPAMBASE.CSV (or SPAM.CSV). The
dependent variable SPAM is 1 if an email was considered spam by the research team that assembled the
dataset, and 0 if not. In other words, an incoming email that receives a predicted class of 1 is an email
with which we would not want to be bothered, while instead we wish to have all the emails that are
predicted to be of class 0 reach our inbox. A variety of predictive variables are available on the dataset,
and most are used in this example but not explained in detail. In addition, the variable TESTVAR is provided, which takes on values 0 and 1. We would like SPM to treat records with TESTVAR=1 as a test sample, such that they are processed and available for model performance evaluation but are not actually used in the model estimation itself. In this way we can see not only how the model performs on the data with which it was built, but also how it performs on data it has never seen before.
The Model Setup dialog looks like this, where we identify the target and predictors. In particular, note that
the "test separation variable" TESTVAR is not checked as a predictor, which will ensure it is available in
the next step to be used to define the test sample.
On the Testing tab, we are able to select how the test sample is created. We could have the SPM select
a random 20% sample, which is often a good choice. But here we will illustrate the use of the "separation
variable" TESTVAR:
If you prefer to use commands, this example would consist of the following:
USE "SPAM.CSV"
MODEL SPAM
KEEP MAKE-ALL, OUR-HPL, N650-TELNET, DATA-DIRECT, MEETING-TABLE, SEMICOLO-TOTAL,
RANDOM
CATEGORY SPAM
PARTITION SEPVAR=TESTVAR
LOGIT GO
Selecting START gets the model estimation going, at which point we can see how the learn and test
samples are composed:
======================
Target Frequency Table
======================
Variable: SPAM
N Classes: 2
Data Value N % Wgt Count %
----------------------------------------------------------
0 L 1433 61.13 1433 61.13
T (1355 60.04) (1355 60.04)
1 L 911 38.87 911 38.87
T (902 39.96) (902 39.96)
----------------------------------------------------------
Totals
0 2788 60.60 2788 60.60
1 1813 39.40 1813 39.40
----------------------------------------------------------
Total 4601 4601
Total Learn 2344 2344
Total Test 2257 2257
The model is built using only the learn sample (only the first few coefficients are presented below for
brevity). We see that many of the coefficients have large t-ratios and p-values near 0.0:
=====================
Results of Estimation
=====================
After the model estimation details are reported, performance measures for the model are reported. It is at
this point that the test sample becomes important. Performance measures for both the learn and test
samples are presented side-by-side, so the analyst can easily see whether the model holds up (or breaks
down) on an independent testing sample:
====================================
Classification Performance By Sample
====================================
=============================================
Class Table Learn (Test) Classification Table
=============================================
In this example, the classification error is only about one percentage point higher (worse) for the test
sample. And from the classification table we can see that the proportion correct and success indices all
drop just slightly when moving from the learn sample to the test sample, but the differences are not great.
So, if the goal is to commit to a yes/no classification of whether an email is spam or not, this model
appears to work relatively well.
On the other hand, we may instead prefer to consider the predicted probability that an email is spam,
rather than committing to an actual yes/no prediction. In this case, the goal is to rank order (sort) a group
of incoming email by their likelihood of being spam. The ROC and Lift statistics are helpful here. Both of
these agree well between learn and test, meaning the model rank orders email nearly as well on new
data as it did on the data upon which the model was built. So, if our Logistic Regression model is used to
sort incoming email by likelihood of being spam, it appears that actual spam emails would be sorted at or
near the "front of the pack" making them relatively easy for our email recipient to identify and remove.