Discriminant Analysis
Content list
Purposes of discriminant analysis
Discriminant analysis linear equation
Assumptions of discriminant analysis
SPSS activity: discriminant analysis
Stepwise discriminant analysis
Introduction
This chapter introduces another extension of regression, where the DV may have more than two conditions at a categorical level and the IVs are scale data.
DA is used when:
- the dependent is categorical, with the predictor IVs at interval level, such as age, income, attitudes, perceptions, and years of education, although dummy variables can be used as predictors, as in multiple regression (logistic regression IVs, by contrast, can be of any level of measurement);
- there are more than two DV categories, unlike logistic regression, which is limited to a dichotomous dependent variable.
A discriminant score is a weighted linear combination (sum) of the discriminating variables.
Assumptions of discriminant analysis
- The allocations for the dependent categories in the initial classification are correctly classified.
- There must be at least two groups or categories, with each case belonging to only one group, so that the groups are mutually exclusive and collectively exhaustive (all cases can be placed in a group).
- Each group or category must be well defined, clearly differentiated from any other group(s) and natural. Putting a median split on an attitude scale is not a natural way to form groups. Partitioning quantitative variables is only justifiable if there are easily identifiable gaps at the points of division; for instance, three groups taking three available levels of amounts of housing loan.
- The groups or categories should be defined before collecting the data.
- The attribute(s) used to separate the groups should discriminate quite clearly between the groups, so that group or category overlap is non-existent or minimal.
- Group sizes of the dependent should not be grossly different and should be at least five times the number of independent variables.

There are several purposes of DA:
- To investigate differences between groups on the basis of the attributes of the cases, indicating which attributes contribute most to group separation. The descriptive technique successively identifies the linear combinations of attributes, known as canonical discriminant functions (equations), which contribute maximally to group separation.
- To predict group membership. Predictive DA addresses the question of how to assign new cases to groups: the DA function uses a person's scores on the predictor variables to predict the category to which the individual belongs.
- To determine the most parsimonious way to distinguish between groups.
- To classify cases into groups. Statistical significance tests using chi-square enable you to see how well the function separates the groups.
- To test theory by checking whether cases are classified as predicted.
Discriminant analysis creates an equation which will minimize the possibility of misclassifying cases into their respective groups or categories.
The aim of the statistical analysis in DA is to combine (weight) the variable scores in some way so that a single new composite variable, the discriminant score, is produced. One way of thinking about this is in terms of a food recipe, where changing the proportions (weights) of the ingredients will change the characteristics of the finished cakes. Hopefully the weighted combinations of ingredients will produce two different types of cake. Similarly, at the end of the DA process, it is hoped that each group will have a normal distribution of discriminant scores. The degree of overlap between the discriminant score distributions can then be used as a measure of the success of the technique, so that, like the different types of cake mix, we have two different types of groups (Fig. 25.1).

Figure 25.1 Discriminant score distributions showing poor (top pair) and good (bottom pair) separation between two groups.

For example, the top two distributions in Figure 25.1 overlap too much and do not discriminate well compared to the bottom set. Misclassification will be minimal in the lower pair, whereas many cases will be misclassified in the top pair. Standardizing the variables ensures that scale differences between the variables are eliminated. When all variables are standardized, absolute weights (i.e. ignoring the sign) can be used to rank variables in terms of their discriminating power, the largest weight being associated with the most powerful discriminating variable. Variables with large weights are those which contribute most to differentiating the groups.

As with most other multivariate methods, it is possible to present a pictorial explanation of the technique. The following example uses a very simple data set: two groups and two variables. If scattergraphs are plotted for scores against the two variables, distributions like those in Figure 25.2 are obtained. The new axis represents a new variable which is a linear combination of x and y, i.e. it is a discriminant function (Fig. 25.3). Obviously, with more than two groups or variables this graphical method becomes impossible.
Clearly, the two groups can be separated by these two variables, but there is a large amount of overlap on each single axis (although the y variable is the better discriminator). It is possible to construct a new axis which passes through the two group centroids (means), such that the groups do not overlap on the new axis. In a two-group situation, predicted membership is calculated by first producing a D score for each case using the discriminant function. Cases with D values smaller than the cut-off value are classified as belonging to one group, while those with larger values are classified into the other group. SPSS will save the predicted group membership and D scores as new variables. The group centroid is the mean value of the discriminant score for a given category of the dependent variable; there are as many centroids as there are groups or categories. The cut-off is the mean of the two centroids: if the discriminant score of the function is less than or equal to the cut-off the case is classed as 0, whereas if it is above, it is classed as 1.
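To make this "new axis" idea concrete, here is a minimal sketch in Python (an assumption on our part; the chapter itself works through SPSS menus) with two synthetic groups measured on two variables. Fitting a linear discriminant yields the weights that combine x and y into a single D score for each case; all data and names are invented for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(42)
# Two synthetic groups measured on two variables, x and y
group_a = rng.normal(loc=[10, 20], scale=2.0, size=(50, 2))
group_b = rng.normal(loc=[14, 26], scale=2.0, size=(50, 2))
X = np.vstack([group_a, group_b])
labels = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis().fit(X, labels)
D = lda.decision_function(X)   # discriminant scores along the new axis
print(lda.coef_)               # the weights combining x and y
# Scores cluster around two centroids, one per group
print(D[labels == 0].mean(), D[labels == 1].mean())
```

The two printed means play the role of the group centroids: cases are assigned to whichever side of the cut-off their D score falls on.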
3 Click the Define Range button and enter the lowest and highest codes for your groups (here 1 and 2) (Fig. 25.5).
4 Click Continue.
5 Select your predictors (IVs), enter them into the Independents box (Fig. 25.6) and select Enter Independents Together. If you planned a stepwise analysis you would at this point select Use Stepwise Method instead.
6 Click on the Statistics button and select Means, Univariate ANOVAs, Box's M, Unstandardized and Within-Groups Correlation (Fig. 25.7).
7 Click Continue >> Classify. Select Compute From Group Sizes, Summary Table, Leave One Out Classification, Within-Groups, and all Plots (Fig. 25.8).
8 Click Continue >> Save and select Predicted Group Membership and Discriminant Scores (Fig. 25.9).
9 Click OK.
Interpreting the printout: Tables 25.1 to 25.12
The initial case processing summary as usual indicates sample size and any missing data.

Group statistics tables
In discriminant analysis we are trying to predict group membership, so firstly we examine whether there are any significant differences between groups on each of the independent variables, using group means and ANOVA results. The Group Statistics and Tests of Equality of Group Means tables provide this information. If there are no significant group differences it is not worthwhile proceeding any further with the analysis. A rough idea of variables that may be important can be obtained by inspecting the group means and standard deviations. For example, the mean differences between self-concept scores and anxiety scores depicted in Table 25.1 suggest that these may be good discriminators, as the separations are large. Table 25.2 provides strong statistical evidence of significant differences between the means of the smoke and no smoke groups for all IVs, with self-concept and anxiety producing very high F values. The Pooled Within-Groups Matrices table (Table 25.3) also supports the use of these IVs, as intercorrelations are low.
Log determinants and Box's M tables
In ANOVA, an assumption is that the variances are equivalent for each group, but in DA the basic assumption is that the variance-covariance matrices are equivalent. Box's M tests the null hypothesis that the covariance matrices do not differ between the groups formed by the dependent. The researcher wants this test not to be significant, so that the null hypothesis that the groups do not differ can be retained. For this assumption to hold, the log determinants should be similar (the ranks and natural logarithms of the determinants printed are those of the group covariance matrices). When tested by Box's M, we are looking for a non-significant M to show similarity and lack of significant differences. In this case the log determinants appear similar, but Box's M is 176.474 with F = 11.615, which is significant at p < .001 (Tables 25.4 and 25.5). However, with large samples, a significant result is not regarded as too important. Where three or more groups exist, and M is significant, groups with very small log determinants should be deleted from the analysis.

Table of eigenvalues
This provides information on each of the discriminant functions (equations) produced. The maximum number of discriminant functions produced is the number of groups minus 1. We are only using two groups here, namely smoke and no smoke, so only one function is displayed. The canonical correlation is the multiple correlation between the predictors and the discriminant function. With only one function it provides an index of overall model fit, which is interpreted as the proportion of variance explained (R squared). In our example (Table 25.6), a canonical correlation of .802 suggests the model explains 64.32% of the variation in the grouping variable, i.e. whether a respondent smokes or not.
Table 25.6 Eigenvalues table

Eigenvalues
Function   Eigenvalue   % of Variance   Cumulative %   Canonical Correlation
1          1.806(a)     100.0           100.0          .802

a. First 1 canonical discriminant functions were used in the analysis.
Wilks' lambda
Wilks' lambda indicates the significance of the discriminant function. This table (Table 25.7) indicates a highly significant function (p < .001) and provides the proportion of total variability not explained, i.e. it is the converse of the squared canonical correlation. So we have 35.6% unexplained.
Table 25.7 Wilks' lambda table

Wilks' Lambda
Test of Function(s)   Wilks' Lambda   Chi-square   df   Sig.
1                     .356            447.227      5    .000
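As a quick arithmetic check of the relationships just described, the sketch below (Python) uses the values reported in Tables 25.6 and 25.7. The Bartlett chi-square approximation is a standard formula and our own addition, not part of the SPSS printout.

```python
import math

r_c = 0.802                 # canonical correlation (Table 25.6)
print(1 - r_c ** 2)         # ~0.357, matching the reported Wilks' lambda of .356

# Bartlett's approximation: chi2 = -(N - 1 - (p + g)/2) * ln(lambda),
# with N = 438 cases, p = 5 predictors, g = 2 groups
N, p, g = 438, 5, 2
chi2 = -(N - 1 - (p + g) / 2) * math.log(0.356)
print(chi2)                 # ~447.7, close to the reported 447.227
```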
The standardized canonical discriminant function coefficients table
The interpretation of the discriminant coefficients (or weights) is like that in multiple regression. Table 25.8 provides an index of the importance of each predictor, just as the standardized regression coefficients (betas) did in multiple regression. The sign indicates the direction of the relationship. Self-concept score was the strongest predictor, while low anxiety (note the negative sign) was next in importance as a predictor. These two variables with large coefficients stand out as those that strongly predict allocation to the smoke or do not smoke group. Age, absence from work and anti-smoking attitude score were less successful as predictors.
Table 25.8 Standardized canonical discriminant function coefficients table
Standardized Canonical Discriminant Function Coefficients
                                          Function 1
age                                       .212
self concept score                        .763
anxiety score                             -.614
days absent last year                     .073
total anti-smoking policies subtest B     .378
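Since predictors are ranked by the absolute size of these standardized weights, a short sketch (Python; coefficient values copied from Table 25.8) makes the ordering explicit:

```python
# Standardized coefficients from Table 25.8; rank by absolute size
coefs = {"age": .212, "self concept": .763, "anxiety": -.614,
         "days absent": .073, "anti-smoking subtest B": .378}
for name, beta in sorted(coefs.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:25s} {beta:+.3f}")
# Output order: self concept, anxiety, anti-smoking, age, days absent
```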
The structure matrix table
Table 25.9 provides another way of indicating the relative importance of the predictors, and it can be seen that the same pattern holds. Many researchers use the structure matrix correlations because they are considered more accurate than the standardized canonical discriminant function coefficients. The structure matrix table (Table 25.9) shows the correlations of each variable with each discriminant function. These Pearson coefficients are structure coefficients or discriminant loadings. They serve like factor loadings in factor analysis. By identifying the largest loadings for each discriminant function the researcher gains insight into how to name each function. Here we have self-concept and anxiety (low scores), which suggest a label of personal confidence and effectiveness for the function that discriminates between non-smokers and smokers. Generally, just as with factor loadings, 0.30 is seen as the cut-off between important and less important variables. Absence is clearly not loaded on the discriminant function, i.e. it is the weakest predictor, which suggests that work absence is not associated with smoking behaviour but is a function of other unassessed factors.
Note to Table 25.9: pooled within-groups correlations between discriminating variables and standardized canonical discriminant functions; variables ordered by absolute size of correlation within function.
Table 25.10 Canonical discriminant function coefficients table
Canonical Discriminant Function Coefficients
                                          Function 1
age                                       .024
self concept score                        .080
anxiety score                             -.100
days absent last year                     .012
total anti-smoking policies subtest B     .134
(Constant)                                -4.543

Unstandardized coefficients.
The canonical discriminant function coefficients table
These unstandardized coefficients (b) are used to create the discriminant function (equation). It operates just like a regression equation. In this case (Table 25.10) we have:

D = (.024 x age) + (.080 x self-concept) + (-.100 x anxiety) + (.012 x days absent) + (.134 x anti-smoking score) - 4.543

The discriminant function coefficients b, or their standardized form beta, both indicate the partial contribution of each variable to the discriminant function, controlling for all other variables in the equation. They can be used to assess each IV's unique contribution to the discriminant function and therefore provide information on the relative importance of each variable. If there are any dummy variables, as in regression, individual beta weights cannot be used, and dummy variables must be assessed as a group through hierarchical DA, running the analysis first without the dummy variables and then with them. The difference in squared canonical correlation indicates the explanatory effect of the set of dummy variables.

Group centroids table
A further way of interpreting discriminant analysis results is to describe each group in terms of its profile, using the group means of the predictor variables. These group means are called centroids. These are displayed in the Functions at Group Centroids table (Table 25.11). In our example, non-smokers have a mean of 1.125 while smokers produce a mean of -1.598. Cases with scores near to a centroid are predicted as belonging to that group (a worked sketch of applying the equation follows Table 25.11).
Table 25.11 Functions at group centroids table

Functions at Group Centroids
smoke or not        Function 1
non-smoker          1.125
smoker              -1.598

Unstandardized canonical discriminant functions evaluated at group means.
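As a worked illustration, the sketch below (Python) applies the unstandardized equation from Table 25.10 to a single hypothetical respondent and classifies the case against the cut-off. The chapter takes the cut-off as the mean of the two centroids; the respondent's scores are invented for the example.

```python
# Coefficients and constant from Table 25.10
coeffs = {"age": 0.024, "self_concept": 0.080, "anxiety": -0.100,
          "days_absent": 0.012, "anti_smoking": 0.134}
constant = -4.543

def d_score(case):
    """Weighted sum of predictor scores plus the constant."""
    return sum(coeffs[k] * case[k] for k in coeffs) + constant

# Hypothetical respondent (scores invented for illustration)
new_case = {"age": 35, "self_concept": 40, "anxiety": 22,
            "days_absent": 5, "anti_smoking": 30}
d = d_score(new_case)                       # = 1.377 for these scores

# Group centroids from Table 25.11; cut-off taken as their mean
cutoff = (1.125 + (-1.598)) / 2             # about -0.24
print(d, "non-smoker" if d > cutoff else "smoker")
```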
Classification table
Finally, there is the classification phase. The classification table, also called a confusion table, is simply a table in which the rows are the observed categories of the dependent and the columns are the predicted categories. When prediction is perfect, all cases will lie on the diagonal. The percentage of cases on the diagonal is the percentage of correct classifications. The cross-validated set of data is a more honest presentation of the power of the discriminant function than that provided by the original classifications, and often produces a poorer outcome. The cross-validation is often termed a jack-knife classification, in that it successively classifies all cases but one to develop a discriminant function and then categorizes the case that was left out. The process is repeated with each case left out in turn, producing a more reliable function; the argument behind it is that one should not use the case you are trying to predict as part of the categorization process. (A sketch of this procedure appears after Table 25.12.) The classification results (Table 25.12) reveal that 91.8% of respondents were classified correctly into smoke or do not smoke groups. This overall predictive accuracy of the discriminant function is called the hit ratio. Non-smokers were classified with slightly better accuracy (92.6%) than smokers (90.6%). What is an acceptable hit ratio? You must compare the calculated hit ratio with what you could achieve by chance. If two samples are equal in size then you have a 50/50 chance anyway. Most researchers would accept a hit ratio that is 25% larger than that due to chance.
Table 25.12 Classification results table

Classification Results(b,c)
                                      Predicted Group Membership
                   smoke or not       non-smoker   smoker   Total
Original    Count  non-smoker         238          19       257
                   smoker             17           164      181
            %      non-smoker         92.6         7.4      100.0
                   smoker             9.4          90.6     100.0
Cross-validated(a)
            Count  non-smoker         238          19       257
                   smoker             17           164      181
            %      non-smoker         92.6         7.4      100.0
                   smoker             9.4          90.6     100.0

a. Cross-validation is done only for those cases in the analysis. In cross-validation, each case is classified by the functions derived from all cases other than that case.
b. 91.8% of original grouped cases correctly classified.
c. 91.8% of cross-validated grouped cases correctly classified.
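For readers who want to reproduce the jack-knife idea outside SPSS, here is a hedged sketch using scikit-learn. The data are randomly generated placeholders, so only the procedure (each case classified by a function built without it) mirrors the chapter, not the numbers.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(438, 5))      # placeholder for the five predictors
y = rng.integers(0, 2, size=438)   # placeholder smoke / no smoke codes

# priors=None infers prior probabilities from group sizes, analogous to
# the 'Compute From Group Sizes' option used above
lda = LinearDiscriminantAnalysis(priors=None)
pred = cross_val_predict(lda, X, y, cv=LeaveOneOut())

print(confusion_matrix(y, pred))   # rows observed, columns predicted
print(accuracy_score(y, pred))     # the cross-validated hit ratio
```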
Saved variables
As a result of asking the analysis to save the new groupings, two new variables can now be found at the end of your data file. dis_1 is the predicted grouping based on the discriminant analysis, coded 1 and 2, while dis1_1 contains the D scores by which the cases were coded into their categories. The average D scores for each group are of course the group centroids reported earlier. While these scores and groups can be used for other analyses, they are useful as visual demonstrations of the effectiveness of the discriminant function. As an example, histograms (Fig. 25.10) and box plots (Fig. 25.11) are alternative ways of illustrating the distribution of the discriminant function scores for each group. By reading the range of scores on the axes, and noting (from the group centroids table) the means of both groups as well as the very minimal overlap of the graphs and box plots, a substantial discrimination is revealed. This suggests that the function does discriminate well, as the previous tables indicated.
Figure 25.10 Histograms showing the distribution of discriminant scores for smokers and non-smokers.
Figure 25.11 Box plots illustrating the distribution of discriminant scores for the two groups.

New cases
Mahalanobis distance (obtained from the Method dialogue box) is used to analyse new cases, as it is the distance between a case and the centroid for each group of the dependent. A new case or cases can thus be compared with an existing set of cases. A new case will have one distance for each group and can therefore be classified as belonging to the group for which its distance is smallest. Mahalanobis distance is measured in terms of SD from the centroid; therefore a case that is more than 1.96 Mahalanobis distance units from the centroid has a less than 5% chance of belonging to that group.
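Here is a minimal sketch of this distance-based assignment (Python; the group means and pooled covariance matrix are invented placeholders, not the chapter's data):

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

pooled_cov = np.eye(5)              # placeholder pooled within-groups covariance
VI = np.linalg.inv(pooled_cov)      # mahalanobis() expects the inverse covariance

# Hypothetical group means on the five predictors
centroid_nonsmoker = np.array([35.0, 45.0, 20.0, 5.0, 30.0])
centroid_smoker = np.array([33.0, 38.0, 27.0, 7.0, 24.0])

new_case = np.array([34.0, 44.0, 21.0, 4.0, 29.0])
d_non = mahalanobis(new_case, centroid_nonsmoker, VI)
d_smk = mahalanobis(new_case, centroid_smoker, VI)
# Assign the case to the group with the smaller distance
print("non-smoker" if d_non < d_smk else "smoker", round(d_non, 2), round(d_smk, 2))
```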
How to write up the results
A discriminant analysis was conducted to predict whether an employee was a smoker or not. Predictor variables were age, number of days absent from work in the previous year, self-concept score, anxiety score, and attitude to anti-smoking workplace policy. Significant mean differences were observed for all the predictors on the DV. While the log determinants were quite similar, Box's M indicated that the assumption of equality of covariance matrices was violated; however, given the large sample, this problem is not regarded as serious. The discriminant function revealed a significant association between groups and all predictors, accounting for 64.32% of between-group variability, although closer analysis of the structure matrix revealed only two significant predictors, namely self-concept score (.706) and anxiety score (-.527), with age and absence poor predictors. The cross-validated classification showed that overall 91.8% were correctly classified.
Stepwise discriminant analysis
... amount to the canonical R squared. The criterion for adding or removing a variable is typically the setting of a critical significance level for F to remove. To undertake this example, please access SPSS Ch 25 Data File A. It is the same file we used above. On this occasion we will enter the same predictor variables one step at a time, to see which combinations are the best set of predictors, or whether all of them are retained. Only one of the SPSS screen shots will be displayed, as the others are the same as those used above (a rough Python analogue appears after Fig. 25.12).
1 Click Analyze >> Classify >> Discriminant.
2 Select the grouping variable and transfer it to the Grouping Variable box. Then click the Define Range button and enter the lowest and highest codes for your grouping variable.
3 Click Continue, then select the predictors and enter them into the Independents box. Then click on Use Stepwise Method. This is the important difference from the previous example (Fig. 25.12).
4 Statistics >> Means, Univariate ANOVAs, Box's M, Unstandardized and Within-Groups Correlation.
5 Click Classify. Select Compute From Group Sizes, Summary Table, Leave One Out Classification, Within-Groups, and all Plots.
6 Continue >> Save and select Predicted Group Membership and Discriminant Scores.
7 Click OK.
Figure 25.12 Discriminant analysis dialogue box selected for stepwise method.
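There is no exact scikit-learn equivalent of SPSS's Wilks-lambda stepwise criterion, but forward sequential selection gives a rough analogue: predictors are added while they improve cross-validated accuracy. This sketch uses placeholder data and hypothetical names.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SequentialFeatureSelector

rng = np.random.default_rng(1)
X = rng.normal(size=(438, 5))      # five candidate predictors (placeholder)
y = rng.integers(0, 2, size=438)   # group membership (placeholder)

# Forward selection: keep adding predictors while cross-validated accuracy
# improves by more than tol (a stand-in for SPSS's F-to-enter/remove test)
sfs = SequentialFeatureSelector(LinearDiscriminantAnalysis(),
                                n_features_to_select="auto", tol=0.01,
                                direction="forward")
sfs.fit(X, y)
print(sfs.get_support())           # boolean mask of retained predictors
```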
Interpretation of printout: Tables 25.13 and 25.14
Many of the tables in this stepwise discriminant analysis are the same as those for the basic analysis, and we will therefore only comment on the extra stepwise statistics tables.

Stepwise statistics tables
The stepwise statistics table (Table 25.13) shows that four steps were taken, each one including another variable; these four variables therefore appear in the Variables in the Analysis and Wilks' Lambda tables, because each was adding some predictive power to the function. In some stepwise analyses only the first one or two steps might be taken, even though there are more variables, because succeeding additional variables do not add to the predictive power of the discriminant function.
Table 25.13 Variables in the analysis table

Variables in the Analysis
Step                                           Tolerance   F to Remove   Rao's V
1  self concept score                          1.000       392.672
2  self concept score                          .998        277.966       218.439
   anxiety score                               .998        128.061       392.672
3  self concept score                          .996        255.631       309.665
   anxiety score                               .979        138.725       461.872
   total anti-smoking policies subtest B       .979        45.415        636.626
4  self concept score                          .982        264.525       320.877
   anxiety score                               .976        139.844       485.614
   total anti-smoking policies subtest B       .977        41.295        677.108
   age                                         .980        12.569        748.870
Wilks' lambda table
This table (25.14) reveals that all the retained predictors add some predictive power to the discriminant function, as all are significant with p < .001. The remaining tables, providing the discriminant function coefficients, structure matrix, group centroids and the classification, are the same as above.
Table 25.14 Wilks' lambda table

Wilks' Lambda
                                                          Exact F
Step   Number of Variables   Lambda   df1   df2   df3    Statistic   df1   df2       Sig.
1      1                     .526     1     1     436    392.672     1     436.000   .000
2      2                     .406     2     1     436    317.583     2     435.000   .000
3      3                     .368     3     1     436    248.478     3     434.000   .000
4      4                     .358     4     1     436    194.468     4     433.000   .000
SPSS Activity. Please access SPSS Chapter 25 Data File B on the Web page and conduct both a normal DA and a stepwise DA using all the variables in both analyses. Discuss your results in class. The dependent or grouping variable is whether the workplace is seen as a beneficial or unpleasant environment. The predictors are mean opinion scale scores on dimensions of workplace perceptions.
Review questions
Qu. 25.1
The technique used to develop an equation for predicting the value of a qualitative DV based on a set of IVs that are interval and categorical is:
(a) cluster analysis
(b) discriminant regression
(c) logistic regression
(d) multivariate analysis
(e) factor analysis
Qu. 25.2
The number of correctly classified cases in discriminant analysis is given by:
(a) the cut-off score
(b) the hit rate
(c) the discriminant score
(d) the F statistic
(e) none of these
Qu. 25.3
If there are more than 2 DV categories:
(a) you can use either discriminant analysis or logistic regression
(b) you cannot use logistic regression
(c) you cannot use discriminant analysis
(d) you should use logistic regression
(e) you should use discriminant analysis
Qu. 25.4
Why would you use discriminant analysis rather than regression analysis? Check your answers in the information above.
Now access the Web page for Chapter 25 and check your answers to the above questions. You should also attempt the SPSS activity located there.
Further reading
Agresti, A. (1996). An Introduction to Categorical Data Analysis. New York: John Wiley and Sons.