This document provides an overview of multinomial logistic regression. It discusses key topics such as:
- Multinomial logistic regression compares multiple groups through a combination of binary logistic regressions.
- It predicts probabilities of group membership and compares predicted vs. actual groups to determine classification accuracy.
- For a model to be considered useful, its classification accuracy must be at least 25% higher than chance-level accuracy.
- Relationships between individual predictors and the outcome are evaluated through likelihood ratio and Wald tests.
Sample Problems SW388R7 Data Analysis & Computers II
Slide 2 Multinomial logistic regression Multinomial logistic regression is used to analyze relationships between a non-metric dependent variable and metric or dichotomous independent variables.
Multinomial logistic regression compares multiple groups through a combination of binary logistic regressions.
The group comparisons are equivalent to the comparisons for a dummy-coded dependent variable, with the group with the highest numeric score used as the reference group.
For example, if we wanted to study differences in BSW, MSW, and PhD students using multinomial logistic regression, the analysis would compare BSW students to PhD students and MSW students to PhD students. For each independent variable, there would be two comparisons.
Slide 3 What multinomial logistic regression predicts Multinomial logistic regression provides a set of coefficients for each of the two comparisons. The coefficients for the reference group are all zeros, similar to the coefficients for the reference group for a dummy-coded variable.
Thus, there are three equations, one for each of the groups defined by the dependent variable.
The three equations can be used to compute the probability that a subject is a member of each of the three groups. A case is predicted to belong to the group associated with the highest probability.
Predicted group membership can be compared to actual group membership to obtain a measure of classification accuracy.
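The three equations can be turned into probabilities by exponentiating each group's linear score and dividing by the sum over all groups. The sketch below illustrates this for the BSW/MSW/PhD example; the coefficient values, the single predictor, and the function name `group_probabilities` are hypothetical and chosen only for illustration.

```python
import math

def group_probabilities(coefficients, predictors):
    """Return the predicted probability of each group.

    coefficients maps group -> (intercept, [slopes]). The reference group
    carries all-zero coefficients, so its linear score is always 0.
    """
    scores = {
        group: intercept + sum(b * x for b, x in zip(slopes, predictors))
        for group, (intercept, slopes) in coefficients.items()
    }
    denominator = sum(math.exp(score) for score in scores.values())
    return {group: math.exp(score) / denominator for group, score in scores.items()}

# Hypothetical coefficients for one predictor (illustration only).
coefficients = {
    "BSW": (1.0, [-0.5]),
    "MSW": (0.5, [-0.2]),
    "PhD": (0.0, [0.0]),  # reference group: all zeros
}

probabilities = group_probabilities(coefficients, [2.0])

# A case is predicted to belong to the group with the highest probability.
predicted_group = max(probabilities, key=probabilities.get)
```

Because the reference group's score is fixed at zero, only two sets of coefficients have to be estimated even though three probabilities are produced.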
Slide 4 Level of measurement requirements Multinomial logistic regression analysis requires that the dependent variable be non-metric. Dichotomous, nominal, and ordinal variables satisfy the level of measurement requirement.
Multinomial logistic regression analysis requires that the independent variables be metric or dichotomous. Because SPSS automatically dummy-codes nominal level variables, they can also be included: the dummy coding dichotomizes them in the analysis.
In SPSS, non-metric independent variables are included as factors. SPSS will dummy-code non-metric IVs.
In SPSS, metric independent variables are included as covariates. If an independent variable is ordinal, we will attach the usual caution.
Slide 5 Assumptions and outliers Multinomial logistic regression does not make any assumptions of normality, linearity, or homogeneity of variance for the independent variables.
Because it does not impose these requirements, it is preferred to discriminant analysis when the data does not satisfy these assumptions.
SPSS does not compute any diagnostic statistics for outliers. To evaluate outliers, the advice is to run multiple binary logistic regressions and use those results to test the exclusion of outliers or influential cases.
Slide 6 Sample size requirements The minimum number of cases per independent variable is 10, using a guideline provided by Hosmer and Lemeshow, authors of Applied Logistic Regression, one of the main resources for Logistic Regression.
For preferred case-to-variable ratios, we will use 20 to 1.
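The two ratios can be checked with a short calculation. The sketch below uses the 167 valid cases and three independent variables of the highways and bridges example reported in the Case Processing Summary; the function and constant names are illustrative assumptions.

```python
MINIMUM_RATIO = 10    # Hosmer and Lemeshow guideline cited above
PREFERRED_RATIO = 20  # preferred case-to-variable ratio used in these slides

def cases_per_variable(valid_cases, n_independent_variables):
    """Number of valid cases available per independent variable."""
    return valid_cases / n_independent_variables

# 167 valid cases; independent variables: age, educ, conlegis.
ratio = cases_per_variable(167, 3)
meets_minimum = ratio >= MINIMUM_RATIO
meets_preferred = ratio >= PREFERRED_RATIO
```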
Slide 7 Methods for including variables The only method for selecting independent variables in SPSS is simultaneous or direct entry.
Slide 8 Overall test of relationship - 1 The overall test of the relationship between the independent variables and the groups defined by the dependent variable is based on the reduction in the likelihood values between a model that does not contain any independent variables and the model that contains the independent variables.
This difference in likelihood follows a chi-square distribution, and is referred to as the model chi-square.
The significance test for the final model chi-square (after the independent variables have been added) is our statistical evidence of the presence of a relationship between the dependent variable and the combination of the independent variables.
Slide 9 Overall test of relationship - 2
Model Fitting Information
| Model | -2 Log Likelihood | Chi-Square | df | Sig. |
|---|---|---|---|---|
| Intercept Only | 284.429 | | | |
| Final | 265.972 | 18.457 | 6 | .005 |
The presence of a relationship between the dependent variable and combination of independent variables is based on the statistical significance of the final model chi-square in the SPSS table titled "Model Fitting Information".
In this analysis, the probability of the model chi-square (18.457) was 0.005, less than or equal to the level of significance of 0.05. The null hypothesis that there was no difference between the model without independent variables and the model with independent variables was rejected. The existence of a relationship between the independent variables and the dependent variable was supported.
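The model chi-square and its probability can be reproduced from the two -2 log likelihood values in the Model Fitting Information table. The sketch below uses only the Python standard library; it relies on the closed form of the chi-square upper-tail probability that holds for even degrees of freedom (here df = 6), and the function name is an illustrative assumption.

```python
import math

def chi_square_p_value_even_df(statistic, df):
    """Upper-tail probability of a chi-square statistic for even df.

    For df = 2k the survival function has the closed form
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only applies to positive even df")
    half = statistic / 2.0
    series = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * series

intercept_only_neg2ll = 284.429
final_neg2ll = 265.972

# Reduction in -2 log likelihood when the independent variables are added.
model_chi_square = intercept_only_neg2ll - final_neg2ll
p_value = chi_square_p_value_even_df(model_chi_square, df=6)
reject_null = p_value <= 0.05
```

For odd degrees of freedom this shortcut does not apply, and a statistics library would be the more practical choice.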
Slide 10 Strength of multinomial logistic regression relationship While multinomial logistic regression does compute correlation measures to estimate the strength of the relationship (pseudo R square measures, such as Nagelkerke's R²), these correlation measures do not really tell us much about the accuracy or errors associated with the model.
A more useful measure to assess the utility of a multinomial logistic regression model is classification accuracy, which compares predicted group membership based on the logistic model to the actual, known group membership, which is the value for the dependent variable.
Slide 11 Evaluating usefulness for logistic models The benchmark that we will use to characterize a multinomial logistic regression model as useful is a 25% improvement over the rate of accuracy achievable by chance alone.
Even if the independent variables had no relationship to the groups defined by the dependent variable, we would still expect to be correct in our predictions of group membership some percentage of the time. This is referred to as by chance accuracy.
The estimate of by chance accuracy that we will use is the proportional by chance accuracy rate, computed by summing the squared proportion of cases in each group. The only difference between by chance accuracy for binary logistic models and by chance accuracy for multinomial logistic models is the number of groups defined by the dependent variable.
Slide 12 Computing by chance accuracy The percentage of cases in each group defined by the dependent variable is found in the Case Processing Summary table.
Case Processing Summary
| HIGHWAYS AND BRIDGES | N | Marginal Percentage |
|---|---|---|
| 1 | 62 | 37.1% |
| 2 | 93 | 55.7% |
| 3 | 12 | 7.2% |
| Valid | 167 | 100.0% |
| Missing | 103 | |
| Total | 270 | |
| Subpopulation | 153 (a) | |
a. The dependent variable has only one value observed in 146 (95.4%) subpopulations.
The proportional by chance accuracy rate was computed by calculating the proportion of cases for each group based on the number of cases in each group in the 'Case Processing Summary', and then squaring and summing the proportion of cases in each group (0.371² + 0.557² + 0.072² = 0.453).
The proportional by chance accuracy criterion is 56.6% (1.25 x 45.3% = 56.6%).
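The calculation above can be written out as a short script. This is a minimal sketch using the group counts from the Case Processing Summary; the variable names are illustrative.

```python
# Number of valid cases in each group of the dependent variable.
group_counts = {1: 62, 2: 93, 3: 12}

total_cases = sum(group_counts.values())
proportions = {group: n / total_cases for group, n in group_counts.items()}

# Proportional by chance accuracy: sum of the squared group proportions.
chance_accuracy = sum(p ** 2 for p in proportions.values())

# Benchmark for a useful model: 25% improvement over chance.
accuracy_criterion = 1.25 * chance_accuracy

print(f"by chance accuracy: {chance_accuracy:.3f}")
print(f"criterion:          {accuracy_criterion:.3f}")
```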
Slide 13 Comparing accuracy rates To characterize our model as useful, we compare the overall percentage accuracy rate produced by SPSS at the last step in which variables are entered to 25% more than the proportional by chance accuracy. (Note: SPSS does not compute a cross-validated accuracy rate for multinomial logistic regression.)
Classification
| Observed | Predicted 1 | Predicted 2 | Predicted 3 | Percent Correct |
|---|---|---|---|---|
| 1 | 15 | 47 | 0 | 24.2% |
| 2 | 7 | 86 | 0 | 92.5% |
| 3 | 5 | 7 | 0 | .0% |
| Overall Percentage | 16.2% | 83.8% | .0% | 60.5% |
The classification accuracy rate was 60.5%, which was greater than or equal to the proportional by chance accuracy criterion of 56.6% (1.25 x 45.3% = 56.6%).
The criterion for classification accuracy is satisfied in this example.
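The overall and per-group accuracy rates follow directly from the counts in the Classification table. The sketch below recomputes them and applies the 56.6% criterion; the variable names are illustrative.

```python
# Rows are observed groups 1-3, columns are predicted groups 1-3.
classification = [
    [15, 47, 0],
    [7, 86, 0],
    [5, 7, 0],
]

# Correct predictions sit on the diagonal of the table.
correct = sum(classification[i][i] for i in range(len(classification)))
total = sum(sum(row) for row in classification)
accuracy = correct / total

# Percent correct within each observed group.
percent_correct = [100 * row[i] / sum(row) for i, row in enumerate(classification)]

accuracy_criterion = 0.566  # 1.25 x proportional by chance accuracy
is_useful = accuracy >= accuracy_criterion
```

Note that no case is predicted to fall in group 3, so the overall accuracy rests entirely on groups 1 and 2.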
Slide 14 Numerical problems The maximum likelihood method used to calculate multinomial logistic regression is an iterative fitting process that attempts to cycle through repetitions to find an answer. Sometimes the method will break down and fail to converge on an answer. Sometimes the method will produce wildly improbable results, reporting that a one-unit change in an independent variable increases the odds of the modeled event by hundreds of thousands or millions.
These implausible results can be produced by multicollinearity, categories of predictors having no cases (zero cells), and complete separation, whereby the groups are perfectly separated by the scores on one or more independent variables.
The clue that we have numerical problems and should not interpret the results is a standard error larger than 2.0 for one or more independent variables.
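The standard error check can be applied mechanically. The sketch below screens the standard errors of the independent variables reported in the Parameter Estimates table of the highways and bridges example; intercepts are left out because the rule above refers to independent variables, and the names used are illustrative.

```python
SE_THRESHOLD = 2.0

# Standard errors keyed by (comparison group, independent variable).
standard_errors = {
    ("TOO LITTLE", "AGE"): 0.020,
    ("TOO LITTLE", "EDUC"): 0.108,
    ("TOO LITTLE", "CONLEGIS"): 0.620,
    ("ABOUT RIGHT", "AGE"): 0.020,
    ("ABOUT RIGHT", "EDUC"): 0.110,
    ("ABOUT RIGHT", "CONLEGIS"): 0.613,
}

# Any entry listed here would signal a numerical problem.
flagged = [key for key, se in standard_errors.items() if se > SE_THRESHOLD]
has_numerical_problems = len(flagged) > 0
```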
Slide 15 Relationship of individual independent variables and the dependent variable There are two types of tests for individual independent variables:
- The likelihood ratio test evaluates the overall relationship between an independent variable and the dependent variable.
- The Wald test evaluates whether or not the independent variable is statistically significant in differentiating between the two groups in each of the embedded binary logistic comparisons.
If an independent variable has an overall relationship to the dependent variable, it might or might not be statistically significant in differentiating between pairs of groups defined by the dependent variable.
Slide 16 Relationship of individual independent variables and the dependent variable The interpretation for an independent variable focuses on its ability to distinguish between pairs of groups and the contribution which it makes to changing the odds of being in one dependent variable group rather than the other.
We should not interpret the significance of an independent variable's role in distinguishing between pairs of groups unless the independent variable also has an overall relationship to the dependent variable in the likelihood ratio test.
The interpretation of an independent variable's role in differentiating dependent variable groups is the same as we used in binary logistic regression. The difference in multinomial logistic regression is that we can have multiple interpretations for an independent variable in relation to different pairs of groups.
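The two-step rule (likelihood ratio test first, Wald tests second) can be sketched as a small function. The significance values below are taken from the Likelihood Ratio Tests and Parameter Estimates tables of the highways and bridges example; the function name and data layout are illustrative assumptions.

```python
ALPHA = 0.05

# Sig. values from the Likelihood Ratio Tests table.
likelihood_ratio_sig = {"AGE": 0.265, "EDUC": 0.110, "CONLEGIS": 0.010}

# Sig. values of the Wald tests for each embedded binary comparison.
wald_sig = {
    "AGE": {"1 vs 3": 0.341, "2 vs 3": 0.897},
    "EDUC": {"1 vs 3": 0.514, "2 vs 3": 0.117},
    "CONLEGIS": {"1 vs 3": 0.027, "2 vs 3": 0.007},
}

def interpretable_comparisons(variable):
    """Comparisons we may interpret for a variable.

    A Wald test is considered only when the variable has an overall
    relationship to the dependent variable in the likelihood ratio test.
    """
    if likelihood_ratio_sig[variable] > ALPHA:
        return []
    return [pair for pair, sig in wald_sig[variable].items() if sig <= ALPHA]

results = {variable: interpretable_comparisons(variable) for variable in wald_sig}
```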
Slide 17 Relationship of individual independent variables and the dependent variable
Parameter Estimates
| HIGHWAYS AND BRIDGES (a) | Effect | B | Std. Error | Wald | df | Sig. | Exp(B) | 95% CI for Exp(B): Lower Bound | Upper Bound |
|---|---|---|---|---|---|---|---|---|---|
| TOO LITTLE (1) | Intercept | 3.240 | 2.478 | 1.709 | 1 | .191 | | | |
| | AGE | .019 | .020 | .906 | 1 | .341 | 1.019 | .980 | 1.061 |
| | EDUC | .071 | .108 | .427 | 1 | .514 | 1.073 | .868 | 1.327 |
| | CONLEGIS | -1.373 | .620 | 4.913 | 1 | .027 | .253 | .075 | .853 |
| ABOUT RIGHT (2) | Intercept | 3.639 | 2.456 | 2.195 | 1 | .138 | | | |
| | AGE | .003 | .020 | .017 | 1 | .897 | 1.003 | .963 | 1.043 |
| | EDUC | .172 | .110 | 2.463 | 1 | .117 | 1.188 | .958 | 1.474 |
| | CONLEGIS | -1.657 | .613 | 7.298 | 1 | .007 | .191 | .057 | .635 |
a. The reference category is: TOO MUCH (3).
SPSS identifies the comparisons it makes for groups defined by the dependent variable in the table of Parameter Estimates, using either the value codes or the value labels, depending on the options settings for pivot table labeling.
The reference category is identified in the footnote to the table.
In this analysis, two comparisons will be made:
- the TOO LITTLE group (coded 1, shaded blue) will be compared to the TOO MUCH group (coded 3, shaded purple)
- the ABOUT RIGHT group (coded 2, shaded orange) will be compared to the TOO MUCH group (coded 3, shaded purple).
The reference category plays the same role in multinomial logistic regression that it plays in the dummy-coding of a nominal variable: it is the category that would be coded with zeros for all of the dummy-coded variables that all other categories are interpreted against.
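The Wald, Exp(B), and confidence interval columns of the Parameter Estimates table can be approximately recovered from B and its standard error. The sketch below assumes the usual definitions (Wald = (B / SE)², Exp(B) = e^B, and a 95% interval of exp(B ± 1.96 x SE)); the slides do not state these formulas, and small differences from the table reflect the rounding of the published B and Std. Error values.

```python
import math

def summarize_coefficient(b, std_error, z=1.96):
    """Recompute Wald, Exp(B), and the 95% interval for Exp(B)."""
    return {
        "wald": (b / std_error) ** 2,
        "exp_b": math.exp(b),
        "lower": math.exp(b - z * std_error),
        "upper": math.exp(b + z * std_error),
    }

# CONLEGIS in the TOO LITTLE versus TOO MUCH comparison.
conlegis_too_little = summarize_coefficient(b=-1.373, std_error=0.620)

for name, value in conlegis_too_little.items():
    print(f"{name}: {value:.3f}")
```

An interval for Exp(B) that excludes 1.0, as it does here, agrees with a Wald test that is significant at the 0.05 level.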
Slide 18 Relationship of individual independent variables and the dependent variable
Likelihood Ratio Tests
| Effect | -2 Log Likelihood of Reduced Model | Chi-Square | df | Sig. |
|---|---|---|---|---|
| Intercept | 268.323 | 2.350 | 2 | .309 |
| AGE | 268.625 | 2.652 | 2 | .265 |
| EDUC | 270.395 | 4.423 | 2 | .110 |
| CONLEGIS | 275.194 | 9.221 | 2 | .010 |
The chi-square statistic is the difference in -2 log-likelihoods between the final model and a reduced model. The reduced model is formed by omitting an effect from the final model. The null hypothesis is that all parameters of that effect are 0.
In this example, there is a statistically significant relationship between the independent variable CONLEGIS and the dependent variable (0.010 < 0.05). As well, the independent variable CONLEGIS is significant in distinguishing category 1 of the dependent variable from category 3 of the dependent variable (0.027 < 0.05), and in distinguishing category 2 of the dependent variable from category 3 of the dependent variable (0.007 < 0.05).
Slide 19 Interpreting relationship of individual independent variables to the dependent variable Survey respondents who had less confidence in Congress (higher values correspond to lower confidence) were less likely to be in the group of survey respondents who thought we spend too little money on highways and bridges (DV category 1), rather than the group of survey respondents who thought we spend too much money on highways and bridges (DV category 3).
For each unit increase in the confidence in Congress score (that is, each step toward lower confidence), the odds of being in the group of survey respondents who thought we spend too little money on highways and bridges decreased by 74.7% (0.253 - 1.0 = -0.747).
Slide 20 Interpreting relationship of individual independent variables to the dependent variable Survey respondents who had less confidence in Congress (higher values correspond to lower confidence) were less likely to be in the group of survey respondents who thought we spend about the right amount of money on highways and bridges (DV category 2), rather than the group of survey respondents who thought we spend too much money on highways and bridges (DV category 3).
For each unit increase in the confidence in Congress score (that is, each step toward lower confidence), the odds of being in the group of survey respondents who thought we spend about the right amount of money on highways and bridges decreased by 80.9% (0.191 - 1.0 = -0.809).
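The percentage change in the odds is obtained by subtracting 1.0 from Exp(B) and expressing the result as a percentage. A minimal sketch using the two Exp(B) values for CONLEGIS from the Parameter Estimates table follows; the function name is illustrative.

```python
def percent_change_in_odds(exp_b):
    """Percentage change in the odds for a one-unit increase in the predictor."""
    return (exp_b - 1.0) * 100.0

# Exp(B) for CONLEGIS in each comparison against the TOO MUCH group.
too_little_change = percent_change_in_odds(0.253)
about_right_change = percent_change_in_odds(0.191)

print(f"too little vs. too much:  {too_little_change:.1f}%")
print(f"about right vs. too much: {about_right_change:.1f}%")
```

A negative result is a decrease in the odds, and an Exp(B) above 1.0 would give a positive result, that is, an increase.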
Slide 21 Relationship of individual independent variables and the dependent variable
Likelihood Ratio Tests
| Effect | -2 Log Likelihood of Reduced Model | Chi-Square | df | Sig. |
|---|---|---|---|---|
| Intercept | 327.463 (a) | .000 | 0 | . |
| AGE | 333.440 | 5.976 | 2 | .050 |
| EDUC | 329.606 | 2.143 | 2 | .343 |
| POLVIEWS | 334.636 | 7.173 | 2 | .028 |
| SEX | 338.985 | 11.521 | 2 | .003 |
The chi-square statistic is the difference in -2 log-likelihoods between the final model and a reduced model. The reduced model is formed by omitting an effect from the final model. The null hypothesis is that all parameters of that effect are 0.
a. This reduced model is equivalent to the final model because omitting the effect does not increase the degrees of freedom.
Parameter Estimates
| NATCHLD (a) | Effect | B | Std. Error | Wald | df | Sig. | Exp(B) | 95% CI for Exp(B): Lower Bound | Upper Bound |
|---|---|---|---|---|---|---|---|---|---|
| TOO LITTLE | Intercept | 8.434 | 2.233 | 14.261 | 1 | .000 | | | |
| | AGE | -.023 | .017 | 1.756 | 1 | .185 | .977 | .944 | 1.011 |
| | EDUC | -.066 | .102 | .414 | 1 | .520 | .936 | .766 | 1.144 |
| | POLVIEWS | -.575 | .251 | 5.234 | 1 | .022 | .563 | .344 | .921 |
| | [SEX=1] | -2.167 | .805 | 7.242 | 1 | .007 | .115 | .024 | .555 |
| | [SEX=2] | 0 (b) | . | . | 0 | . | . | . | . |
| ABOUT RIGHT | Intercept | 4.485 | 2.255 | 3.955 | 1 | .047 | | | |
| | AGE | -.001 | .018 | .003 | 1 | .955 | .999 | .965 | 1.034 |
| | EDUC | .011 | .104 | .011 | 1 | .916 | 1.011 | .824 | 1.240 |
| | POLVIEWS | -.397 | .257 | 2.375 | 1 | .123 | .673 | .406 | 1.114 |
| | [SEX=1] | -1.606 | .824 | 3.800 | 1 | .051 | .201 | .040 | 1.009 |
| | [SEX=2] | 0 (b) | . | . | 0 | . | . | . | . |
a. The reference category is: TOO MUCH.
b. This parameter is set to zero because it is redundant.
In this example, there is a statistically significant relationship between SEX and the dependent variable, spending on childcare assistance (0.003 < 0.05). As well, SEX plays a statistically significant role in differentiating the TOO LITTLE group from the TOO MUCH (reference) group (0.007 < 0.05). However, SEX does not differentiate the ABOUT RIGHT group from the TOO MUCH (reference) group (0.051 > 0.05).
Slide 22 Interpreting relationship of individual independent variables and the dependent variable Survey respondents who were male (code 1 for sex) were less likely to be in the group of survey respondents who thought we spend too little money on childcare assistance (DV category 1), rather than the group of survey respondents who thought we spend too much money on childcare assistance (DV category 3).
For survey respondents who were male, the odds of being in the group of survey respondents who thought we spend too little money on childcare assistance were 88.5% lower (0.115 - 1.0 = -0.885).
Slide 23 Interpreting relationships for independent variable in problems In the multinomial logistic regression problems, the problem statement will ask about only one of the independent variables. The answer will be true or false based on only the relationship between the specified independent variable and the dependent variable. The individual relationships between other independent variables and the dependent variable are not used in determining whether or not the answer is true or false.
Slide 24 Problem 1 11. In the dataset GSS2000, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data, outliers, or influential cases, and that the validation analysis will confirm the generalizability of the results. Use a level of significance of 0.05 for evaluating the statistical relationships. The variables "age" [age], "highest year of school completed" [educ] and "confidence in Congress" [conlegis] were useful predictors for distinguishing between groups based on responses to "opinion about spending on highways and bridges" [natroad]. These predictors differentiate survey respondents who thought we spend too little money on highways and bridges from survey respondents who thought we spend too much money on highways and bridges and survey respondents who thought we spend about the right amount of money on highways and bridges from survey respondents who thought we spend too much money on highways and bridges. Among this set of predictors, confidence in Congress was helpful in distinguishing among the groups defined by responses to opinion about spending on highways and bridges. Survey respondents who had less confidence in congress were less likely to be in the group of survey respondents who thought we spend too little money on highways and bridges, rather than the group of survey respondents who thought we spend too much money on highways and bridges. For each unit increase in confidence in Congress, the odds of being in the group of survey respondents who thought we spend too little money on highways and bridges decreased by 74.7%. Survey respondents who had less confidence in congress were less likely to be in the group of survey respondents who thought we spend about the right amount of money on highways and bridges, rather than the group of survey respondents who thought we spend too much money on highways and bridges. 
For each unit increase in confidence in Congress, the odds of being in the group of survey respondents who thought we spend about the right amount of money on highways and bridges decreased by 80.9%.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
Slide 25 Dissecting problem 1 - 1 For these problems, we will assume that there is no problem with missing data, outliers, or influential cases, and that the validation analysis will confirm the generalizability of the results.
In this problem, we are told to use 0.05 as alpha for the multinomial logistic regression.
Slide 26 Dissecting problem 1 - 2 SPSS only supports direct or simultaneous entry of independent variables in multinomial logistic regression, so we have no choice of method for entering variables. The variables listed first in the problem statement are the independent variables (IVs): "age" [age], "highest year of school completed" [educ] and "confidence in Congress" [conlegis]. The variable used to define groups is the dependent variable (DV): "opinion about spending on highways and bridges" [natroad].
Slide 27 Dissecting problem 1 - 3 SPSS multinomial logistic regression models the relationship by comparing each of the groups defined by the dependent variable to the group with the highest code value.
The responses to opinion about spending on highways and bridges were: 1 = Too little, 2 = About right, and 3 = Too much. The analysis will result in two comparisons:
- survey respondents who thought we spend too little money versus survey respondents who thought we spend too much money on highways and bridges
- survey respondents who thought we spend about the right amount of money versus survey respondents who thought we spend too much money on highways and bridges
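The pairing rule described above (each group compared to the group with the highest code) can be sketched in a few lines of Python. This is only an illustration of the rule, not SPSS output; the helper name `reference_comparisons` is hypothetical.

```python
# Response codes for natroad as listed in the text.
categories = {1: "Too little", 2: "About right", 3: "Too much"}


def reference_comparisons(codes):
    """Pair every non-reference code with the reference (highest) code."""
    reference = max(codes)
    return [(code, reference) for code in sorted(codes) if code != reference]


comparisons = reference_comparisons(categories)
for group, reference in comparisons:
    print(f"{categories[group]} versus {categories[reference]}")
```

With three response codes the sketch yields two comparisons, which is why each independent variable has two coefficients in the output.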
Slide 28 Dissecting problem 1 - 4
Each problem includes a statement about the relationship between one independent variable and the dependent variable.
The answer to the problem is based on the stated relationship, ignoring the relationships between the other independent variables and the dependent variable.
This problem identifies a difference for both of the comparisons among groups modeled by the multinomial logistic regression.
Slide 29 Dissecting problem 1 - 5
In order for the multinomial logistic regression question to be true, the overall relationship must be statistically significant, there must be no evidence of numerical problems, the classification accuracy rate must be substantially better than could be obtained by chance alone, and the stated individual relationship must be statistically significant and interpreted correctly.
Slide 30 Request multinomial logistic regression
Select the Regression | Multinomial Logistic command from the Analyze menu.
Slide 31 Selecting the dependent variable
First, highlight the dependent variable natroad in the list of variables.
Second, click on the right arrow button to move the dependent variable to the Dependent text box.
Slide 32 Selecting metric independent variables
Move the metric independent variables, age, educ and conlegis, to the Covariate(s) list box.
Metric independent variables are specified as covariates in multinomial logistic regression. Metric variables can be either interval or, by convention, ordinal.
In this analysis, there are no non-metric independent variables. Non-metric independent variables would be moved to the Factor(s) list box.
Slide 33 Specifying statistics to include in the output
While we will accept most of the SPSS defaults for the analysis, we need to specifically request the classification table.
Click on the Statistics button to make a request.
Slide 34 Requesting the classification table
First, keep the SPSS defaults for Summary statistics, Likelihood ratio test, and Parameter estimates.
Second, mark the checkbox for the Classification table.
Third, click on the Continue button to complete the request.
Slide 35 Completing the multinomial logistic regression request
Click on the OK button to request the output for the multinomial logistic regression.
The multinomial logistic procedure supports additional commands to specify the model computed for the relationships (we will use the default main effects model), additional specifications for computing the regression, and saving classification results. We will not make use of these options.
Slide 36 LEVEL OF MEASUREMENT - 1
Multinomial logistic regression requires that the dependent variable be non-metric and the independent variables be metric or dichotomous.
"Opinion about spending on highways and bridges" [natroad] is ordinal, satisfying the non-metric level of measurement requirement for the dependent variable.
It contains three categories: survey respondents who thought we spend too little money, about the right amount of money, and too much money on highways and bridges.
Slide 37 LEVEL OF MEASUREMENT - 2
"Age" [age] and "highest year of school completed" [educ] are interval, satisfying the metric or dichotomous level of measurement requirement for independent variables.
"Confidence in Congress" [conlegis] is ordinal, satisfying the metric or dichotomous level of measurement requirement for independent variables. If we follow the convention of treating ordinal level variables as metric variables, the level of measurement requirement for the analysis is satisfied. Since some data analysts do not agree with this convention, a note of caution should be included in our interpretation.
Slide 38 Sample size ratio of cases to variables
Case Processing Summary
                             N     Marginal Percentage
HIGHWAYS AND BRIDGES   1     62    37.1%
                       2     93    55.7%
                       3     12    7.2%
Valid                        167   100.0%
Missing                      103
Total                        270
Subpopulation                153 (a)
a. The dependent variable has only one value observed in 146 (95.4%) subpopulations.
Multinomial logistic regression requires that the minimum ratio of valid cases to independent variables be at least 10 to 1. The ratio of valid cases (167) to number of independent variables (3) was 55.7 to 1, which was equal to or greater than the minimum ratio. The requirement for a minimum ratio of cases to independent variables was satisfied.
The preferred ratio of valid cases to independent variables is 20 to 1. The ratio of 55.7 to 1 was equal to or greater than the preferred ratio. The preferred ratio of cases to independent variables was satisfied.
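The ratio check above is simple arithmetic. A minimal Python sketch, using the counts quoted in the text (167 valid cases, 3 independent variables) and a hypothetical helper name:

```python
def cases_per_variable(valid_cases, n_independent_variables):
    """Ratio of valid cases to independent variables."""
    return valid_cases / n_independent_variables


ratio = cases_per_variable(167, 3)
meets_minimum = ratio >= 10    # minimum ratio of 10 to 1
meets_preferred = ratio >= 20  # preferred ratio of 20 to 1

print(f"{ratio:.1f} to 1; minimum met: {meets_minimum}; preferred met: {meets_preferred}")
```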
Slide 39 OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES
Model Fitting Information
Model            -2 Log Likelihood   Chi-Square   df   Sig.
Intercept Only   284.429
Final            265.972             18.457       6    .005
The presence of a relationship between the dependent variable and combination of independent variables is based on the statistical significance of the final model chi-square in the SPSS table titled "Model Fitting Information".
In this analysis, the probability of the model chi-square (18.457) was 0.005, less than or equal to the level of significance of 0.05. The null hypothesis that there was no difference between the model without independent variables and the model with independent variables was rejected. The existence of a relationship between the independent variables and the dependent variable was supported.
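The model chi-square is the difference between the two -2 log likelihood values in the table, and its probability can be checked by hand. The sketch below is not how SPSS computes it; it uses the closed-form upper-tail probability of the chi-square distribution that holds only for even degrees of freedom (here 6: three predictors times two comparisons). The function name is an assumption introduced for illustration.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square statistic, valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half = x / 2.0
    terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * terms


neg2ll_intercept_only = 284.429  # from "Model Fitting Information"
neg2ll_final = 265.972

model_chi_square = neg2ll_intercept_only - neg2ll_final
p_value = chi2_sf_even_df(model_chi_square, df=6)

print(f"chi-square = {model_chi_square:.3f}, p = {p_value:.3f}")
print("reject null hypothesis:", p_value <= 0.05)
```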
Slide 40 NUMERICAL PROBLEMS
Parameter Estimates
HIGHWAYS AND BRIDGES (a)   Effect      B        Std. Error   Wald    df   Sig.   Exp(B)   Lower Bound   Upper Bound
1                          Intercept   3.240    2.478        1.709   1    .191
1                          AGE         .019     .020         .906    1    .341   1.019    .980          1.061
1                          EDUC        .071     .108         .427    1    .514   1.073    .868          1.327
1                          CONLEGIS    -1.373   .620         4.913   1    .027   .253     .075          .853
2                          Intercept   3.639    2.456        2.195   1    .138
2                          AGE         .003     .020         .017    1    .897   1.003    .963          1.043
2                          EDUC        .172     .110         2.463   1    .117   1.188    .958          1.474
2                          CONLEGIS    -1.657   .613         7.298   1    .007   .191     .057          .635
Lower Bound and Upper Bound give the 95% Confidence Interval for Exp(B).
a. The reference category is: 3.
Multicollinearity in the multinomial logistic regression solution is detected by examining the standard errors for the b coefficients. A standard error larger than 2.0 indicates numerical problems, such as multicollinearity among the independent variables, zero cells for a dummy-coded independent variable because all of the subjects have the same value for the variable, and 'complete separation', whereby two groups in the dependent variable can be perfectly separated by scores on one of the independent variables. Analyses that indicate numerical problems should not be interpreted.
None of the independent variables in this analysis had a standard error larger than 2.0. (We are not interested in the standard errors associated with the intercept.)
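The standard error screen can be expressed as a simple filter. The sketch below copies the standard errors for the independent variables from the Parameter Estimates table (intercepts are left out, as the text instructs); the dictionary layout and function name are illustrative assumptions.

```python
# Standard errors keyed by (comparison group, variable); intercepts omitted.
standard_errors = {
    ("1", "AGE"): 0.020,
    ("1", "EDUC"): 0.108,
    ("1", "CONLEGIS"): 0.620,
    ("2", "AGE"): 0.020,
    ("2", "EDUC"): 0.110,
    ("2", "CONLEGIS"): 0.613,
}


def flag_large_standard_errors(errors, threshold=2.0):
    """Return the keys whose standard error exceeds the threshold."""
    return [key for key, se in errors.items() if se > threshold]


flagged = flag_large_standard_errors(standard_errors)
print("numerical problems suspected for:", flagged if flagged else "none")
```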
Slide 41 RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 1
Likelihood Ratio Tests
Effect      -2 Log Likelihood of Reduced Model   Chi-Square   df   Sig.
Intercept   268.323                              2.350        2    .309
AGE         268.625                              2.652        2    .265
EDUC        270.395                              4.423        2    .110
CONLEGIS    275.194                              9.221        2    .010
The chi-square statistic is the difference in -2 log-likelihoods between the final model and a reduced model. The reduced model is formed by omitting an effect from the final model. The null hypothesis is that all parameters of that effect are 0.
The statistical significance of the relationship between confidence in Congress and opinion about spending on highways and bridges is based on the statistical significance of the chi-square statistic in the SPSS table titled "Likelihood Ratio Tests".
For this relationship, the probability of the chi-square statistic (9.221) was 0.010, less than or equal to the level of significance of 0.05. The null hypothesis that all of the b coefficients associated with confidence in Congress were equal to zero was rejected. The existence of a relationship between confidence in Congress and opinion about spending on highways and bridges was supported.
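The likelihood ratio chi-square for confidence in Congress can be reproduced from the two -2 log likelihood values. In this sketch the final-model value (265.972) is taken from the "Model Fitting Information" table; with 2 degrees of freedom the upper-tail chi-square probability reduces to exp(-x/2). Because the inputs are rounded to three decimals, the recomputed statistic differs from the reported 9.221 in the last digit.

```python
import math

neg2ll_final = 265.972             # final model, "Model Fitting Information"
neg2ll_without_conlegis = 275.194  # reduced model, "Likelihood Ratio Tests"

lr_chi_square = neg2ll_without_conlegis - neg2ll_final

# For df = 2 the chi-square upper-tail probability is exp(-x / 2).
p_value = math.exp(-lr_chi_square / 2.0)

print(f"chi-square = {lr_chi_square:.3f}, p = {p_value:.3f}")
print("reject null hypothesis:", p_value <= 0.05)
```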
Slide 42 RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 2
Parameter Estimates, comparison 1, CONLEGIS: B = -1.373, Std. Error = .620, Wald = 4.913, df = 1, Sig. = .027, Exp(B) = .253, 95% Confidence Interval for Exp(B) = .075 to .853. The reference category is: 3.
In the comparison of survey respondents who thought we spend too little money on highways and bridges to survey respondents who thought we spend too much money on highways and bridges, the probability of the Wald statistic (4.913) for the variable confidence in Congress [conlegis] was 0.027. Since the probability was less than or equal to the level of significance of 0.05, the null hypothesis that the b coefficient for confidence in Congress was equal to zero for this comparison was rejected.
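The Wald statistic in the table is the squared ratio of a coefficient to its standard error, tested against a chi-square distribution with 1 degree of freedom. The sketch below recomputes it from the rounded B and standard error for conlegis, so the result is close to, but not exactly, the reported 4.913. The function name is hypothetical.

```python
import math


def wald_test(b, se):
    """Wald chi-square (df = 1) and its two-sided probability."""
    wald = (b / se) ** 2
    # P(chi-square with 1 df > wald) equals erfc(sqrt(wald / 2)).
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return wald, p_value


wald, p_value = wald_test(b=-1.373, se=0.620)
print(f"Wald = {wald:.3f}, p = {p_value:.3f}")
print("reject null hypothesis:", p_value <= 0.05)
```

The same function applies to the second comparison, where the table lists B = -1.657 and a standard error of .613.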
Slide 43 RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 3
Parameter Estimates, comparison 1, CONLEGIS: B = -1.373, Std. Error = .620, Wald = 4.913, df = 1, Sig. = .027, Exp(B) = .253, 95% Confidence Interval for Exp(B) = .075 to .853. The reference category is: 3.
The value of Exp(B) was 0.253, which implies that for each unit increase in confidence in Congress the odds decreased by 74.7% (0.253 - 1.0 = -0.747).
The relationship stated in the problem is supported. Survey respondents who had less confidence in Congress were less likely to be in the group of survey respondents who thought we spend too little money on highways and bridges, rather than the group of survey respondents who thought we spend too much money on highways and bridges. For each unit increase in confidence in Congress, the odds of being in the group of survey respondents who thought we spend too little money on highways and bridges decreased by 74.7%.
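The conversion from a b coefficient to a percentage change in odds can be sketched as follows. The 95% interval here uses the usual normal approximation exp(B ± 1.96 × SE), which is an assumption about how the table's bounds were produced; because the inputs are rounded, the recomputed bounds agree with the table only to about the third decimal. The function name is hypothetical.

```python
import math


def odds_ratio_summary(b, se, z=1.96):
    """Exp(B), percent change in odds per unit increase, and an approximate 95% CI."""
    odds_ratio = math.exp(b)
    return {
        "exp_b": odds_ratio,
        "percent_change": (odds_ratio - 1.0) * 100.0,
        "ci_lower": math.exp(b - z * se),
        "ci_upper": math.exp(b + z * se),
    }


# conlegis, "too little" versus "too much" (reference category 3)
too_little = odds_ratio_summary(b=-1.373, se=0.620)
# conlegis, "about right" versus "too much"
about_right = odds_ratio_summary(b=-1.657, se=0.613)

for label, s in (("too little", too_little), ("about right", about_right)):
    print(f"{label}: Exp(B) = {s['exp_b']:.3f}, change in odds = {s['percent_change']:.1f}%")
```

A negative percent change corresponds to the "decreased by" wording in the problem statement.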
Slide 44 RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 4
Parameter Estimates, comparison 2, CONLEGIS: B = -1.657, Std. Error = .613, Wald = 7.298, df = 1, Sig. = .007, Exp(B) = .191, 95% Confidence Interval for Exp(B) = .057 to .635. The reference category is: 3.
In the comparison of survey respondents who thought we spend about the right amount of money on highways and bridges to survey respondents who thought we spend too much money on highways and bridges, the probability of the Wald statistic (7.298) for the variable confidence in Congress [conlegis] was 0.007. Since the probability was less than or equal to the level of significance of 0.05, the null hypothesis that the b coefficient for confidence in Congress was equal to zero for this comparison was rejected.
Slide 45 RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 5
Parameter Estimates, comparison 2, CONLEGIS: B = -1.657, Std. Error = .613, Wald = 7.298, df = 1, Sig. = .007, Exp(B) = .191, 95% Confidence Interval for Exp(B) = .057 to .635. The reference category is: 3.
The value of Exp(B) was 0.191, which implies that for each unit increase in confidence in Congress the odds decreased by 80.9% (0.191 - 1.0 = -0.809).
The relationship stated in the problem is supported. Survey respondents who had less confidence in Congress were less likely to be in the group of survey respondents who thought we spend about the right amount of money on highways and bridges, rather than the group of survey respondents who thought we spend too much money on highways and bridges. For each unit increase in confidence in Congress, the odds of being in the group of survey respondents who thought we spend about the right amount of money on highways and bridges decreased by 80.9%.
Slide 46 CLASSIFICATION USING THE MULTINOMIAL LOGISTIC REGRESSION MODEL: BY CHANCE ACCURACY RATE
Case Processing Summary, HIGHWAYS AND BRIDGES: group 1, N = 62 (37.1%); group 2, N = 93 (55.7%); group 3, N = 12 (7.2%); Valid N = 167 (100.0%).
The proportional by chance accuracy rate was computed by calculating the proportion of cases for each group based on the number of cases in each group in the 'Case Processing Summary', and then squaring and summing the proportion of cases in each group (0.371² + 0.557² + 0.072² = 0.453).
The independent variables could be characterized as useful predictors distinguishing survey respondents who thought we spend too little money on highways and bridges, survey respondents who thought we spend about the right amount of money on highways and bridges and survey respondents who thought we spend too much money on highways and bridges if the classification accuracy rate was substantially higher than the accuracy attainable by chance alone. Operationally, the classification accuracy rate should be at least 25% higher than the proportional by chance accuracy rate.
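The proportional by chance accuracy rate and the 25% criterion can be recomputed from the group counts in the 'Case Processing Summary'. A minimal Python sketch, with a hypothetical function name:

```python
# Valid cases per group, from the 'Case Processing Summary'.
group_counts = {1: 62, 2: 93, 3: 12}


def proportional_chance_accuracy(counts):
    """Sum of squared group proportions."""
    total = sum(counts.values())
    return sum((n / total) ** 2 for n in counts.values())


chance_rate = proportional_chance_accuracy(group_counts)
criterion = 1.25 * chance_rate  # accuracy must be at least 25% higher than chance

print(f"by chance accuracy = {chance_rate:.3f}, criterion = {criterion:.3f}")
```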
Slide 47 CLASSIFICATION USING THE MULTINOMIAL LOGISTIC REGRESSION MODEL: CLASSIFICATION ACCURACY
Classification
                     Predicted
Observed             1       2       3      Percent Correct
1                    15      47      0      24.2%
2                    7       86      0      92.5%
3                    5       7       0      .0%
Overall Percentage   16.2%   83.8%   .0%    60.5%
The classification accuracy rate was 60.5%, which was greater than or equal to the proportional by chance accuracy criterion of 56.6% (1.25 x 45.3% = 56.6%).
The criterion for classification accuracy is satisfied.
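The overall accuracy in the Classification table is the share of cases on the diagonal, that is, cases whose predicted group equals their observed group. The sketch below recomputes it from the counts in the table and compares it with the 56.6% criterion quoted above; the function name is illustrative.

```python
# Rows are observed groups 1-3, columns are predicted groups 1-3.
classification = [
    [15, 47, 0],
    [7, 86, 0],
    [5, 7, 0],
]


def classification_accuracy(table):
    """Proportion of cases whose predicted group matches the observed group."""
    correct = sum(table[i][i] for i in range(len(table)))
    total = sum(sum(row) for row in table)
    return correct / total


accuracy = classification_accuracy(classification)
criterion = 0.566  # 1.25 x proportional by chance accuracy rate

print(f"accuracy = {accuracy:.1%}; criterion met: {accuracy >= criterion}")
```

Note that no case is predicted to fall in group 3, so the overall rate is carried entirely by groups 1 and 2.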
Slide 48 Answering the question in problem 1 - 1
11. In the dataset GSS2000, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data, outliers, or influential cases, and that the validation analysis will confirm the generalizability of the results. Use a level of significance of 0.05 for evaluating the statistical relationships.
The variables "age" [age], "highest year of school completed" [educ] and "confidence in Congress" [conlegis] were useful predictors for distinguishing between groups based on responses to "opinion about spending on highways and bridges" [natroad]. These predictors differentiate survey respondents who thought we spend too little money on highways and bridges from survey respondents who thought we spend too much money on highways and bridges and survey respondents who thought we spend about the right amount of money on highways and bridges from survey respondents who thought we spend too much money on highways and bridges.
Among this set of predictors, confidence in Congress was helpful in distinguishing among the groups defined by responses to opinion about spending on highways and bridges. Survey respondents who had less confidence in Congress were less likely to be in the group of survey respondents who thought we spend too little money on highways and bridges, rather than the group of survey respondents who thought we spend too much money on highways and bridges. For each unit increase in confidence in Congress, the odds of being in the group of survey respondents who thought we spend too little money on highways and bridges decreased by 74.7%. Survey respondents who had less confidence in Congress were less likely to be in the group of survey respondents who thought we spend about the right amount of money on highways and bridges, rather than the group of survey respondents who thought we spend too much money on highways and bridges. For each unit increase in confidence in Congress, the odds of being in the group of survey respondents who thought we spend about the right amount of money on highways and bridges decreased by 80.9%.
1. True 2. True with caution 3. False 4. Inappropriate application of a statistic
We found a statistically significant overall relationship between the combination of independent variables and the dependent variable.
There was no evidence of numerical problems in the solution.
Moreover, the classification accuracy surpassed the proportional by chance accuracy criterion, supporting the utility of the model.
Slide 49 Answering the question in problem 1 - 2
We verified that each statement about the relationship between an independent variable and the dependent variable was correct in both the direction of the relationship and the change in likelihood associated with a one-unit change of the independent variable, for both of the comparisons between groups stated in the problem. The answer to the question is true with caution.
A caution is added because of the inclusion of ordinal level variables.
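The checks walked through for problem 1 can be collected into one decision function. This is only a sketch: the function name, the argument names, and in particular the mapping of each failed check to "False" or "Inappropriate application of a statistic" are assumptions made for illustration; the slides above only establish the "True with caution" outcome for this problem.

```python
def answer_problem(overall_significant, numerical_problems, accuracy_criterion_met,
                   relationship_supported, ordinal_treated_as_metric,
                   level_of_measurement_ok=True, sample_size_ok=True):
    """Combine the individual checks into one of the four answer options."""
    # Assumption: violated requirements make the analysis inappropriate.
    if not (level_of_measurement_ok and sample_size_ok):
        return "Inappropriate application of a statistic"
    # Assumption: any failed statistical check makes the statement false.
    if numerical_problems or not (overall_significant
                                  and accuracy_criterion_met
                                  and relationship_supported):
        return "False"
    # From the text: ordinal variables treated as metric call for a caution.
    if ordinal_treated_as_metric:
        return "True with caution"
    return "True"


# Values established for problem 1 in the preceding slides.
answer = answer_problem(
    overall_significant=True,      # model chi-square p = 0.005
    numerical_problems=False,      # no standard error above 2.0
    accuracy_criterion_met=True,   # 60.5% versus criterion of 56.6%
    relationship_supported=True,   # both conlegis comparisons verified
    ordinal_treated_as_metric=True,
)
print(answer)
```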
Slide 50 Problem 2 1. In the dataset GSS2000, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data, outliers, or influential cases, and that the validation analysis will confirm the generalizability of the results. Use a level of significance of 0.05 for evaluating the statistical relationships.
The variables "highest year of school completed" [educ], "sex" [sex] and "total family income" [income98] were useful predictors for distinguishing between groups based on responses to "opinion about spending on space exploration" [natspac]. These predictors differentiate survey respondents who thought we spend too little money on space exploration from survey respondents who thought we spend too much money on space exploration and survey respondents who thought we spend about the right amount of money on space exploration from survey respondents who thought we spend too much money on space exploration.
Among this set of predictors, total family income was helpful in distinguishing among the groups defined by responses to opinion about spending on space exploration. Survey respondents who had higher total family incomes were more likely to be in the group of survey respondents who thought we spend about the right amount of money on space exploration, rather than the group of survey respondents who thought we spend too much money on space exploration. For each unit increase in total family income, the odds of being in the group of survey respondents who thought we spend about the right amount of money on space exploration increased by 6.0%.
1. True 2. True with caution 3. False 4. Inappropriate application of a statistic
Slide 51 Dissecting problem 2 - 1
For these problems, we will assume that there is no problem with missing data, outliers, or influential cases, and that the validation analysis will confirm the generalizability of the results.
In this problem, we are told to use 0.05 as alpha for the multinomial logistic regression.
Slide 52 Dissecting problem 2 - 2
SPSS only supports direct or simultaneous entry of independent variables in multinomial logistic regression, so we have no choice of method for entering variables.
The variables listed first in the problem statement are the independent variables (IVs): "highest year of school completed" [educ], "sex" [sex] and "total family income" [income98].
The variable used to define groups is the dependent variable (DV): "opinion about spending on space exploration" [natspac].
Slide 53 Dissecting problem 2 - 3
SPSS multinomial logistic regression models the relationship by comparing each of the groups defined by the dependent variable to the group with the highest code value.
The responses to opinion about spending on the space program were: 1 = Too little, 2 = About right, and 3 = Too much. The analysis will result in two comparisons:
- survey respondents who thought we spend too little money versus survey respondents who thought we spend too much money on space exploration
- survey respondents who thought we spend about the right amount of money versus survey respondents who thought we spend too much money on space exploration
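The two comparisons can be sketched in code. Each non-reference group gets its own set of coefficients, and the reference group (3 = Too much) has coefficients fixed at zero. The sketch below is only an illustration, not part of the SPSS procedure: the coefficients are copied from the Parameter Estimates table shown later on slide 67, while the function name and the respondent's values are hypothetical.

```python
import math

# Coefficients copied from the Parameter Estimates table (slide 67).
# Each set compares one group with the reference group (3 = Too much).
COEFFICIENTS = {
    1: {"intercept": -4.136, "educ": 0.101, "income98": 0.097, "sex_1": 0.672},
    2: {"intercept": -2.487, "educ": 0.108, "income98": 0.058, "sex_1": 0.501},
}
REFERENCE_GROUP = 3  # highest code value; all of its coefficients are zero


def predicted_probabilities(educ, income98, sex):
    """Return {group code: predicted probability} for one respondent."""
    logits = {REFERENCE_GROUP: 0.0}
    for group, b in COEFFICIENTS.items():
        logits[group] = (
            b["intercept"]
            + b["educ"] * educ
            + b["income98"] * income98
            + (b["sex_1"] if sex == 1 else 0.0)  # [SEX=2] is redundant (zero)
        )
    denominator = sum(math.exp(value) for value in logits.values())
    return {group: math.exp(value) / denominator for group, value in logits.items()}


# Hypothetical respondent; the values are chosen only for illustration.
probs = predicted_probabilities(educ=16, income98=20, sex=1)
for group in sorted(probs):
    print(group, round(probs[group], 3))
```

The predicted group for a case is the group with the largest probability. Comparing predicted and actual groups in this way is what produces the classification table.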
Slide 54 Dissecting problem 2 - 4
Each problem includes a statement about the relationship between one independent variable and the dependent variable. The answer to the problem is based on the stated relationship, ignoring the relationships between the other independent variables and the dependent variable.
This problem identifies a difference for only one of the two comparisons based on the three values of the dependent variable. Other problems will specify both of the possible comparisons.
Slide 55 Dissecting problem 2 - 5
In order for the multinomial logistic regression question to be true, the overall relationship must be statistically significant, there must be no evidence of numerical problems, the classification accuracy rate must be substantially better than could be obtained by chance alone, and the stated individual relationship must be statistically significant and interpreted correctly.
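To make "substantially better than chance" concrete, here is a minimal sketch of the benchmark. It assumes the proportional by chance accuracy rate is the sum of the squared proportions of cases in each group, and that the model's accuracy must be at least 25% higher than that rate. The proportions are the marginal percentages for natspac reported in the Case Processing Summary (slide 65); the function name is hypothetical.

```python
# Marginal percentages of the dependent variable from the
# Case Processing Summary (slide 65), expressed as proportions.
marginal_proportions = [0.159, 0.433, 0.409]


def proportional_by_chance_accuracy(proportions):
    """Sum of squared group proportions (assumed definition of the chance rate)."""
    return sum(p ** 2 for p in proportions)


chance_rate = proportional_by_chance_accuracy(marginal_proportions)
criterion = 1.25 * chance_rate  # accuracy must be at least 25% higher than chance

print(f"by-chance accuracy rate: {chance_rate:.3f}")
print(f"accuracy rate required:  {criterion:.3f}")
```

The overall percentage from the SPSS classification table would then be compared with the required rate.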
Slide 56 LEVEL OF MEASUREMENT - 1
Multinomial logistic regression requires that the dependent variable be non-metric and the independent variables be metric or dichotomous.
"Opinion about spending on space exploration" [natspac] is ordinal, satisfying the non-metric level of measurement requirement for the dependent variable.
It contains three categories: survey respondents who thought we spend too little money, about the right amount of money, and too much money on space exploration.
Slide 57 LEVEL OF MEASUREMENT - 2
"Sex" [sex] is dichotomous, satisfying the metric or dichotomous level of measurement requirement for independent variables.
"Highest year of school completed" [educ] is interval, satisfying the metric or dichotomous level of measurement requirement for independent variables.
"Total family income" [income98] is ordinal, so it satisfies the requirement only if we follow the convention of treating ordinal level variables as metric variables. Since some data analysts do not agree with this convention, a note of caution should be included in our interpretation.
Slide 58 Request multinomial logistic regression
Select the Regression | Multinomial Logistic command from the Analyze menu.
Slide 59 Selecting the dependent variable
First, highlight the dependent variable natspac in the list of variables. Second, click on the right arrow button to move the dependent variable to the Dependent text box.
Slide 60 Selecting non-metric independent variables
Non-metric independent variables are specified as factors in multinomial logistic regression. Non-metric variables can be dichotomous, nominal, or ordinal. Select the dichotomous variable sex and move it to the Factor(s) list box. These variables will be dummy coded as needed and each value will be listed separately in the output.
Slide 61 Selecting metric independent variables
Metric independent variables are specified as covariates in multinomial logistic regression. Metric variables can be either interval or, by convention, ordinal. Move the metric independent variables, educ and income98, to the Covariate(s) list box.
Slide 62 Specifying statistics to include in the output While we will accept most of the SPSS defaults for the analysis, we need to specifically request the classification table.
Click on the Statistics button to make a request.
Slide 63 Requesting the classification table
First, keep the SPSS defaults for Summary statistics, Likelihood ratio test, and Parameter estimates. Second, mark the checkbox for the Classification table. Third, click on the Continue button to complete the request.
Slide 64 Completing the multinomial logistic regression request
Click on the OK button to request the output for the multinomial logistic regression. The multinomial logistic procedure supports additional commands to specify the model computed for the relationships (we will use the default main effects model), additional specifications for computing the regression, and saving classification results. We will not make use of these options.
Slide 65 Sample size ratio of cases to variables
Case Processing Summary
                                     N      Marginal Percentage
SPACE EXPLORATION PROGRAM   1        33     15.9%
                            2        90     43.3%
                            3        85     40.9%
RESPONDENTS SEX             1        94     45.2%
                            2        114    54.8%
Valid                                208    100.0%
Missing                              62
Total                                270
Subpopulation                        138 (a)
a. The dependent variable has only one value observed in 112 (81.2%) subpopulations.
Multinomial logistic regression requires that the minimum ratio of valid cases to independent variables be at least 10 to 1. The ratio of valid cases (208) to number of independent variables (3) was 69.3 to 1, which was equal to or greater than the minimum ratio. The requirement for a minimum ratio of cases to independent variables was satisfied.
The preferred ratio of valid cases to independent variables is 20 to 1. The ratio of 69.3 to 1 was equal to or greater than the preferred ratio. The preferred ratio of cases to independent variables was satisfied.
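The ratio arithmetic above can be checked with a few lines. This is a minimal sketch using the counts reported in the Case Processing Summary; the function and variable names are hypothetical.

```python
def case_ratio(valid_cases, n_independent_variables):
    """Ratio of valid cases to independent variables."""
    return valid_cases / n_independent_variables


ratio = case_ratio(valid_cases=208, n_independent_variables=3)
meets_minimum = ratio >= 10    # minimum requirement: at least 10 to 1
meets_preferred = ratio >= 20  # preferred ratio: at least 20 to 1

print(f"{ratio:.1f} to 1, minimum met: {meets_minimum}, preferred met: {meets_preferred}")
```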
Slide 66 OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES
Model Fitting Information
Model             -2 Log Likelihood    Chi-Square    df    Sig.
Intercept Only    354.268
Final             334.967              19.301        6     .004
The presence of a relationship between the dependent variable and combination of independent variables is based on the statistical significance of the final model chi-square in the SPSS table titled "Model Fitting Information".
In this analysis, the probability of the model chi-square (19.301) was 0.004, less than or equal to the level of significance of 0.05. The null hypothesis that there was no difference between the model without independent variables and the model with independent variables was rejected. The existence of a relationship between the independent variables and the dependent variable was supported.
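The model chi-square is the difference between the two -2 log likelihood values in the table, and its probability comes from the chi-square distribution with 6 degrees of freedom. The sketch below reproduces both numbers using only the Python standard library. It relies on the closed form for the upper-tail probability that holds when the degrees of freedom are even; the helper name is hypothetical, and a statistics package would normally be used instead.

```python
import math


def chi_square_sf_even_df(x, df):
    """Upper-tail probability of a chi-square statistic (closed form, even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half = x / 2.0
    terms = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * terms


# Values copied from the Model Fitting Information table.
neg2ll_intercept_only = 354.268
neg2ll_final = 334.967

model_chi_square = neg2ll_intercept_only - neg2ll_final
p_value = chi_square_sf_even_df(model_chi_square, df=6)

print(f"chi-square = {model_chi_square:.3f}, p = {p_value:.3f}")
print("overall relationship supported:", p_value <= 0.05)
```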
Slide 67 NUMERICAL PROBLEMS
Parameter Estimates
SPACE EXPLORATION                                                               95% Confidence Interval for Exp(B)
PROGRAM (a)    Term         B         Std. Error    Wald      df    Sig.    Exp(B)    Lower Bound    Upper Bound
1              Intercept    -4.136    1.157         12.779    1     .000
               EDUC         .101      .089          1.276     1     .259    1.106     .929           1.317
               INCOME98     .097      .050          3.701     1     .054    1.102     .998           1.216
               [SEX=1]      .672      .426          2.488     1     .115    1.959     .850           4.515
               [SEX=2]      0 (b)     .             .         0     .       .         .              .
2              Intercept    -2.487    .840          8.774     1     .003
               EDUC         .108      .068          2.521     1     .112    1.114     .975           1.273
               INCOME98     .058      .034          2.932     1     .087    1.060     .992           1.133
               [SEX=1]      .501      .317          2.492     1     .114    1.650     .886           3.072
               [SEX=2]      0 (b)     .             .         0     .       .         .              .
a. The reference category is: 3.
b. This parameter is set to zero because it is redundant.
Multicollinearity in the multinomial logistic regression solution is detected by examining the standard errors for the b coefficients. A standard error larger than 2.0 indicates numerical problems, such as multicollinearity among the independent variables, zero cells for a dummy-coded independent variable because all of the subjects have the same value for the variable, and 'complete separation' whereby the two groups in the dependent event variable can be perfectly separated by scores on one of the independent variables. Analyses that indicate numerical problems should not be interpreted.
None of the independent variables in this analysis had a standard error larger than 2.0.
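The check on standard errors, and the way Exp(B) translates into a percentage change in odds, can be sketched as follows. The values are copied from the Parameter Estimates table; the variable names are hypothetical. The Wald probability uses the fact that a chi-square statistic with 1 degree of freedom has upper-tail probability erfc(sqrt(x / 2)).

```python
import math

# (dependent group, term) -> (B, standard error), copied from the
# Parameter Estimates table. The redundant [SEX=2] rows are omitted.
ESTIMATES = {
    (1, "Intercept"): (-4.136, 1.157),
    (1, "EDUC"): (0.101, 0.089),
    (1, "INCOME98"): (0.097, 0.050),
    (1, "[SEX=1]"): (0.672, 0.426),
    (2, "Intercept"): (-2.487, 0.840),
    (2, "EDUC"): (0.108, 0.068),
    (2, "INCOME98"): (0.058, 0.034),
    (2, "[SEX=1]"): (0.501, 0.317),
}

# Numerical problems: any standard error larger than 2.0.
flagged = [key for key, (_, std_error) in ESTIMATES.items() if std_error > 2.0]
print("terms with s.e. > 2.0:", flagged)

# Exp(B) for income in the "about right" versus "too much" comparison.
b, _ = ESTIMATES[(2, "INCOME98")]
odds_ratio = math.exp(b)
percent_change = (odds_ratio - 1.0) * 100.0
print(f"Exp(B) = {odds_ratio:.3f}, change in odds = {percent_change:.1f}%")

# Probability of the Wald statistic reported for that coefficient (df = 1).
wald = 2.932
p_value = math.erfc(math.sqrt(wald / 2.0))
print(f"Wald = {wald}, p = {p_value:.3f}")
```

The 6.0% figure in the problem statement corresponds to this Exp(B) of 1.060. Whether it can be interpreted still depends on the significance tests.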
Slide 68 RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 1
Likelihood Ratio Tests
Effect       -2 Log Likelihood of Reduced Model    Chi-Square    df    Sig.
Intercept    334.967 (a)                           .000          0     .
EDUC         337.788                               2.821         2     .244
INCOME98     340.154                               5.187         2     .075
SEX          338.511                               3.544         2     .170
The chi-square statistic is the difference in -2 log-likelihoods between the final model and a reduced model. The reduced model is formed by omitting an effect from the final model. The null hypothesis is that all parameters of that effect are 0.
a. This reduced model is equivalent to the final model because omitting the effect does not increase the degrees of freedom.
The statistical significance of the relationship between total family income and opinion about spending on space exploration is based on the statistical significance of the chi-square statistic in the SPSS table titled "Likelihood Ratio Tests".
For this relationship, the probability of the chi-square statistic (5.187) was 0.075, greater than the level of significance of 0.05. The null hypothesis that all of the b coefficients associated with total family income were equal to zero was not rejected. The existence of a relationship between total family income and opinion about spending on space exploration was not supported.
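The likelihood ratio test for each effect can be reproduced from the -2 log likelihood values in the table. This is a minimal sketch with hypothetical names. With 2 degrees of freedom (one coefficient per comparison with the reference group), the upper-tail chi-square probability reduces to exp(-x / 2).

```python
import math

ALPHA = 0.05
NEG2LL_FINAL = 334.967

# -2 log likelihood of each reduced model, from the Likelihood Ratio Tests table.
NEG2LL_REDUCED = {"EDUC": 337.788, "INCOME98": 340.154, "SEX": 338.511}


def chi_square_sf_df2(x):
    """Upper-tail probability of a chi-square statistic with 2 degrees of freedom."""
    return math.exp(-x / 2.0)


results = {}
for effect, neg2ll in NEG2LL_REDUCED.items():
    chi_square = neg2ll - NEG2LL_FINAL
    p_value = chi_square_sf_df2(chi_square)
    results[effect] = (chi_square, p_value, p_value <= ALPHA)
    print(f"{effect}: chi-square = {chi_square:.3f}, p = {p_value:.3f}")
```

None of the three effects reaches the 0.05 level on its own, even though the overall model does.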
Slide 69 Answering the question in problem 2
1. In the dataset GSS2000, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data, outliers, or influential cases, and that the validation analysis will confirm the generalizability of the results. Use a level of significance of 0.05 for evaluating the statistical relationships.
The variables "highest year of school completed" [educ], "sex" [sex] and "total family income" [income98] were useful predictors for distinguishing between groups based on responses to "opinion about spending on space exploration" [natspac]. These predictors differentiate survey respondents who thought we spend too little money on space exploration from survey respondents who thought we spend too much money on space exploration and survey respondents who thought we spend about the right amount of money on space exploration from survey respondents who thought we spend too much money on space exploration.
Among this set of predictors, total family income was helpful in distinguishing among the groups defined by responses to opinion about spending on space exploration. Survey respondents who had higher total family incomes were more likely to be in the group of survey respondents who thought we spend about the right amount of money on space exploration, rather than the group of survey respondents who thought we spend too much money on space exploration. For each unit increase in total family income, the odds of being in the group of survey respondents who thought we spend about the right amount of money on space exploration increased by 6.0%.
1. True 2. True with caution 3. False 4. Inappropriate application of a statistic
We found a statistically significant overall relationship between the combination of independent variables and the dependent variable.
There was no evidence of numerical problems in the solution.
However, the individual relationship between total family income and spending on space was not statistically significant.
The answer to the question is false.
Slide 70 Steps in multinomial logistic regression: level of measurement and initial sample size
The following is a guide to the decision process for answering problems about the basic relationships in multinomial logistic regression:
- Dependent non-metric? Independent variables metric or dichotomous? If no: inappropriate application of a statistic. If yes: continue.
- Ratio of cases to independent variables at least 10 to 1? If no: inappropriate application of a statistic. If yes: run multinomial logistic regression.
Slide 71 Steps in multinomial logistic regression: overall relationship and numerical problems
- Overall relationship statistically significant? (model chi-square test) If no: false. If yes: continue.
- Standard errors of coefficients indicate no numerical problems (s.e. <= 2.0)? If no: false. If yes: continue.
Slide 72 Steps in multinomial logistic regression: relationships between IV's and DV
- Overall relationship between specific IV and DV is statistically significant? (likelihood ratio test) If no: false. If yes: continue.
- Role of specific IV and DV groups statistically significant and interpreted correctly? (Wald test and Exp(B)) If no: false. If yes: continue.
Slide 73 Steps in multinomial logistic regression: classification accuracy and adding cautions
- Overall accuracy rate is 25% greater than the proportional by chance accuracy rate? If no: false. If yes: continue.
- One or more IV's are ordinal level treated as metric? If yes: true with caution. If no: continue.
- Satisfies preferred ratio of cases to IV's of 20 to 1? If yes: true.
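The decision process on slides 70 to 73 can be summarized as a single function. This is only a sketch of the steps as listed, with hypothetical names and parameters. One branch is an assumption: the outcome when the preferred 20 to 1 ratio is not met is not stated, and the sketch treats it as "True with caution".

```python
INAPPROPRIATE = "Inappropriate application of a statistic"
FALSE = "False"
CAUTION = "True with caution"
TRUE = "True"


def answer_problem(dv_non_metric, ivs_metric_or_dichotomous, valid_cases, n_ivs,
                   model_p, max_std_error, iv_lr_p, iv_role_ok,
                   accuracy=None, chance_accuracy=None,
                   ordinal_iv_as_metric=False, alpha=0.05):
    """Walk through the decision steps and return one of the four answers."""
    # Slide 70: level of measurement and minimum sample size.
    if not (dv_non_metric and ivs_metric_or_dichotomous):
        return INAPPROPRIATE
    ratio = valid_cases / n_ivs
    if ratio < 10:
        return INAPPROPRIATE
    # Slide 71: overall relationship and numerical problems.
    if model_p > alpha or max_std_error > 2.0:
        return FALSE
    # Slide 72: likelihood ratio test, then Wald test and Exp(B) interpretation.
    if iv_lr_p > alpha or not iv_role_ok:
        return FALSE
    # Slide 73: classification accuracy, then cautions.
    if accuracy is None or chance_accuracy is None:
        raise ValueError("accuracy rates are required at this step")
    if accuracy < 1.25 * chance_accuracy:
        return FALSE
    if ordinal_iv_as_metric:
        return CAUTION
    if ratio < 20:
        return CAUTION  # assumption: this branch is not stated in the slides
    return TRUE


# Problem 2: the accuracy rates are never needed because the
# likelihood ratio test for income (p = 0.075) fails first.
problem_2 = answer_problem(
    dv_non_metric=True, ivs_metric_or_dichotomous=True,
    valid_cases=208, n_ivs=3,
    model_p=0.004, max_std_error=1.157,
    iv_lr_p=0.075, iv_role_ok=False,
    ordinal_iv_as_metric=True,
)
print(problem_2)
```

Ordering the checks this way mirrors the slides: a problem is answered as soon as one step fails, so later statistics are only examined when every earlier requirement is met.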