W. J. Krzanowski
University of Exeter
Abstract: Recent research into graphical association models has focussed interest
on the conditional Gaussian distribution for analyzing mixtures of categorical and
continuous variables. A special case of such models, utilizing the homogeneous
conditional Gaussian distribution, has in fact been known since 1961 as the loca-
tion model, and for the past 30 years has provided a basis for the multivariate
analysis of mixed categorical and continuous variables. Extensive development of
this model took place throughout the 1970's and 1980's in the context of discrimi-
nation and classification, and comprehensive methodology is now available for
such analysis of mixed variables. This paper surveys these developments and sum-
marizes current capabilities in the area. Topics include distances between groups,
discriminant analysis, error rates and their estimation, model and feature selection,
and the handling of missing data.
1. Introduction
Three obvious approaches to the analysis of such data sets are possible: arbitrary categorization of all the con-
tinuous variables followed by analysis using standard methods for multivari-
ate categorical data, or arbitrarily scoring all the categorical variables and
then using standard methods for multivariate continuous data, or analyzing
the categorical variables and the continuous variables separately (each by
standard methods) and then attempting to synthesize the two sets of results.
None of these options seems satisfactory for comprehensive analysis of the
data, however. The first approach loses information in the categorization of
continuous variables, the second introduces considerable subjectivity in the
numerical scoring adopted, while the third ignores any associations existing
between the categorical and the continuous variables.
A much more satisfactory general approach is first to specify a
parametric model for mixed variables, then to fit the model to the data at hand
and finally to use the parameter estimates for drawing inferences. By
parametric model here is meant a suitable joint probability distribution for a
set of q categorical variables and c continuous variables. Standard probabil-
ity theory tells us that a joint distribution of p variables can be expressed as
the conditional distribution of any subset of these variables given the values
of the remainder, times the marginal distribution of these remaining variables.
Thus if we want to specify the joint distribution of q categorical and c con-
tinuous variables then there appear to be two routes that we could take: as the
conditional distribution of the categorical variables given the values of the
continuous variables, times the marginal distribution of the latter; or as the
conditional distribution of the continuous variables given the values of the
categorical variables, times the marginal distribution of the latter.
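Stated symbolically (this restatement is not in the original text; f(.) here is a generic density or probability function used only for this aside, with x standing for the categorical variables and y for the continuous ones), the two routes correspond to the two factorizations

f(x, y) = f(x | y) f(y)   and   f(x, y) = f(y | x) f(x).

The first possibility, discussed next, takes f(x | y) to be logistic and f(y) multivariate normal; the second, which leads to the conditional Gaussian distribution, takes f(y | x) to be multivariate normal within each categorical state and f(x) multinomial.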
The first possibility was briefly raised by Cox (1972), who suggested
that the joint distribution of a mixture of binary and continuous variables
could be written as a logistic conditional distribution of the binary variables
for given values of the continuous variables, times a marginal multivariate
normal distribution for the latter. However, this idea appears not to have
been pursued any further in the analysis of mixed data sets, almost all work in
the area focussing on the second route outlined above. Here it is assumed
that the continuous variables have a different multivariate normal distribution
for each possible setting of categorical variable values, while the categorical
variables have an arbitrary marginal multinomial distribution. This model
has been termed the "conditional Gaussian distribution" (CGD), and it forms
the central plank of graphical association models for the analysis of mixed
categorical and continuous variables. There has been a great deal of interest
recently in these models, and full details can be found in the work of Lau-
ritzen and Wermuth (1989), Edwards (1990), Wermuth and Lauritzen (1990)
and Whittaker (1990, Chapter 11). We briefly summarize here the relevant
technical results for our subsequent purposes.
Suppose that the q categorical variables jointly define s distinct states, or cells, and write j for a typical cell and y for the vector of c continuous variables. The conditional Gaussian distribution then has joint density

f(j, y) = p_j (2π)^{-c/2} |Σ_j|^{-1/2} exp{ -(1/2) (y - μ_j)^T Σ_j^{-1} (y - μ_j) },   (1)

or, equivalently, in canonical form,

f(j, y) = exp{ α_j + β_j^T y - (1/2) y^T Ω_j y }.   (2)
The parameters in (1) are called the "moment" parameters of the CGD, the triple (p_j, μ_j, Σ_j) comprising, respectively, the cell probability, the cell mean and the cell dispersion matrix for the j-th state, while the parameters in (2) are the "canonical" parameters of the CGD. Here the α_j are scalars (the discrete canonical parameters), the β_j are c-element vectors (the linear canonical parameters) and the Ω_j are (c × c) positive-definite symmetric matrices (the cell precision matrices). Expanding (2) in terms of vector and matrix elements yields the form

f(j, y) = exp{ α_j + Σ_{k=1}^{c} β_{jk} y_k - (1/2) Σ_{k=1}^{c} Σ_{l=1}^{c} γ_{jkl} y_k y_l }.   (3)
Since the values of α_j, β_jk and γ_jkl depend on the state j of the discrete variables, and the latter can be viewed as "factors" in the terminology of design of experiments, each of α_j, β_jk and γ_jkl can be expressed as a sum of main effects of the relevant individual discrete variables and interactions of all orders between them. This yields an expansion into terms resembling ANOVA or log-linear models.
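As a concrete illustration of the correspondence between the moment parameters (p_j, μ_j, Σ_j) in (1) and the canonical parameters (α_j, β_j, Ω_j) in (2), the following Python fragment (an illustrative sketch only; the function and variable names do not come from the paper) converts one set into the other and checks that both give the same density value. The relations used are Ω_j = Σ_j^{-1}, β_j = Σ_j^{-1} μ_j and α_j = log p_j - (c/2) log 2π - (1/2) log|Σ_j| - (1/2) μ_j^T Σ_j^{-1} μ_j.

import numpy as np

def moment_to_canonical(p_j, mu_j, Sigma_j):
    """Convert CGD moment parameters (p_j, mu_j, Sigma_j) for one cell j
    into canonical parameters (alpha_j, beta_j, Omega_j) of equation (2)."""
    c = len(mu_j)
    Omega_j = np.linalg.inv(Sigma_j)                      # cell precision matrix
    beta_j = Omega_j @ mu_j                               # linear canonical parameter
    alpha_j = (np.log(p_j) - 0.5 * c * np.log(2 * np.pi)
               - 0.5 * np.log(np.linalg.det(Sigma_j))
               - 0.5 * mu_j @ Omega_j @ mu_j)             # discrete canonical parameter
    return alpha_j, beta_j, Omega_j

def cgd_density(p_j, mu_j, Sigma_j, y):
    """Joint density f(j, y) of equation (1), moment form, for one cell."""
    c = len(mu_j)
    diff = y - mu_j
    quad = diff @ np.linalg.solve(Sigma_j, diff)
    norm = p_j * (2 * np.pi) ** (-c / 2) / np.sqrt(np.linalg.det(Sigma_j))
    return norm * np.exp(-0.5 * quad)

# quick check that the two parametrizations agree (illustrative numbers)
p_j, mu_j = 0.3, np.array([1.0, -0.5])
Sigma_j = np.array([[2.0, 0.3], [0.3, 1.0]])
alpha_j, beta_j, Omega_j = moment_to_canonical(p_j, mu_j, Sigma_j)
y = np.array([0.7, 0.1])
f_canonical = np.exp(alpha_j + beta_j @ y - 0.5 * y @ Omega_j @ y)
print(cgd_density(p_j, mu_j, Sigma_j, y), f_canonical)    # the two values should agree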
A graphical association model is a model with density of the form (3),
containing expansions in terms of main effects and interactions, in which all
pairs of variables in a specified set are conditionally independent given the
remaining variables. (This model is "graphical" because it is a model for
multivariate random observations whose independence structure is character-
ized by a graph, so the word "graphical" should here be interpreted in the
context of mathematical graph theory; for full background details see Whit-
taker, 1990). Lauritzen and Wermuth (1989) established that two variables
are conditionally independent given the rest if and only if all interaction
terms involving the two variables are zero. Edwards (1990) defined hierarch-
ical interaction models as the most general densities of form (3) in which the
marginality principle is still respected (i.e., if a particular interaction term is
set to zero then all interaction terms that "include" it are also set to zero).
The goal of graphical modeling is then to determine the most parsimonious
such model for a given set of data; the technical aspects concerned with
fitting these models (maximum likelihood estimation of parameters with and
without constraints, likelihood ratio tests, distributional results) are covered
in the references cited earlier.
Although we will not be concerned specifically with graphical model-
ing here, it is pertinent to note that the full CGD model has appeared occa-
sionally in other contexts. One such previous occurrence was in the calcula-
tion of distance between two populations (Krzanowski, 1983a). If we sup-
pose that there are g populations, denoted π_i (i = 1, ..., g), and that a different CGD is permitted in each population, then we must introduce an extra subscript into the model parameters to allow for the different populations. Thus p_ij now denotes the probability of cell j in population π_i, while μ_ij and Σ_ij respectively denote the mean vector and dispersion matrix of Y in cell j of population π_i. The density (1) then generalizes to
f(j, y; π_i) = p_ij (2π)^{-c/2} |Σ_ij|^{-1/2} exp{ -(1/2) (y - μ_ij)^T Σ_ij^{-1} (y - μ_ij) }.   (4)

The affinity (Matusita 1956) between π_a and π_b, ρ_ab = Σ_j ∫ {f(j, y; π_a) f(j, y; π_b)}^{1/2} dy, then takes the form

ρ_ab = Σ_{j=1}^{s} (p_aj p_bj)^{1/2} 2^{c/2} |Σ_aj|^{1/4} |Σ_bj|^{-1/4} |I + Σ_aj Σ_bj^{-1}|^{-1/2} exp{ -(1/4) Σ_{k=1}^{c} (v_ajk - v_bjk)^2 / (1 + λ_kj) },   (5)

where λ_kj, l_kj (k = 1, ..., c) are the solutions of (Σ_bj - λ_kj Σ_aj) l_kj = 0 and v_ajk = l_kj^T μ_aj.
Since "affinity" is the converse of "distance", possible measures of
distance
_ between 1~,a and "l~b are Z~ab = {2(1 - Pab)l "'~ , Aab = --log Pab or
Aab = COS-I Pab. The first of these measures was used.
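For readers wishing to compute (5) and the associated distances, the following Python sketch (illustrative only; the names are not taken from the paper) evaluates the within-cell affinity using the algebraically equivalent form involving (Σ_aj + Σ_bj)/2 rather than the eigenvalue parametrization, and returns the three distance measures (i)-(iii).

import numpy as np

def normal_affinity(mu_a, Sig_a, mu_b, Sig_b):
    """Matusita affinity between N(mu_a, Sig_a) and N(mu_b, Sig_b);
    algebraically equivalent to the eigenvalue form appearing in (5)."""
    Sig_mean = 0.5 * (Sig_a + Sig_b)
    diff = mu_a - mu_b
    det_term = (np.linalg.det(Sig_a) ** 0.25 * np.linalg.det(Sig_b) ** 0.25
                / np.sqrt(np.linalg.det(Sig_mean)))
    quad = diff @ np.linalg.solve(Sig_mean, diff)
    return det_term * np.exp(-quad / 8.0)

def cgd_affinity(p_a, mu_a, Sig_a, p_b, mu_b, Sig_b):
    """rho_ab of (5): sum over the s cells of sqrt(p_aj p_bj) times the
    within-cell normal affinity; inputs are sequences indexed by cell j."""
    return sum(np.sqrt(p_a[j] * p_b[j])
               * normal_affinity(mu_a[j], Sig_a[j], mu_b[j], Sig_b[j])
               for j in range(len(p_a)))

def distances(rho):
    """The three distance measures (i)-(iii) derived from the affinity rho."""
    return np.sqrt(2.0 * (1.0 - rho)), -np.log(rho), np.arccos(rho)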
By forming the ratio of the joint probability densities in the two populations,
it readily follows (see, e.g. Krzanowski 1975) that for equal costs due to the
two types of misclassification and equal prior probabilities of group member-
ship, the Bayes classification rule is to allocate an individual with X = j and Y = y to π_1 if

(μ_1j - μ_2j)^T Σ^{-1} {y - (1/2)(μ_1j + μ_2j)} > log(p_2j / p_1j)   (7)
and to π_2 otherwise. This allocation rule is, in effect, a different linear discriminant function for each discrete variable location. It is clear that the misallocation probabilities with this rule will therefore be the weighted sums of the misallocation probabilities at each location, these misallocation probabilities being

p(2|1) = Σ_{m=1}^{s} p_1m Φ{ (log(p_2m/p_1m) - D_m^2/2) / D_m }   and   p(1|2) = Σ_{m=1}^{s} p_2m Φ{ (-log(p_2m/p_1m) - D_m^2/2) / D_m },   (8)
where D_m^2 = (μ_1m - μ_2m)^T Σ^{-1} (μ_1m - μ_2m) is the squared Mahalanobis distance between π_1 and π_2 in location m, and Φ denotes the standard normal distribution function.
If there are differential costs c_12, c_21 due to misclassification of an individual, and differential prior probabilities q_1, q_2 of observing an individual from the two populations, the net effect is to add k = log(c_12 q_2 / c_21 q_1) to log(p_2m / p_1m) in both (7) and (8). We will assume c_12 = c_21 and q_1 = q_2 for simplicity throughout.
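The allocation rule (7) and the optimal error rates are straightforward to compute once the parameters are known. The Python sketch below is illustrative only: all names are ours, and it assumes that (8) takes the standard normal-theory form of weighted normal tail areas reproduced above; the constant k allows for unequal costs or priors.

import numpy as np
from math import erf, sqrt, log

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def allocate(y, m, mu1, mu2, Sigma, p1, p2, k=0.0):
    """Rule (7): allocate an individual with x = m and continuous vector y to
    population 1 if the location-m linear discriminant exceeds
    log(p2[m]/p1[m]) + k (k = 0 for equal costs and priors)."""
    d = mu1[m] - mu2[m]
    w = d @ np.linalg.solve(Sigma, y - 0.5 * (mu1[m] + mu2[m]))
    return 1 if w > log(p2[m] / p1[m]) + k else 2

def optimal_error_rates(mu1, mu2, Sigma, p1, p2):
    """Misallocation probabilities (8): weighted sums over locations of normal
    tail areas, using the squared Mahalanobis distance in each location."""
    e12 = e21 = 0.0                     # e12 = p(2|1), e21 = p(1|2)
    for m in range(len(p1)):
        d = mu1[m] - mu2[m]
        D2 = d @ np.linalg.solve(Sigma, d)
        D = sqrt(D2)
        cut = log(p2[m] / p1[m])
        e12 += p1[m] * Phi((cut - 0.5 * D2) / D)
        e21 += p2[m] * Phi((-cut - 0.5 * D2) / D)
    return e12, e21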
In practice, of course, the population parameters will be unknown but
random samples ("training sets") are generally available from nl and n2. The
simplest approach that has been adopted in such cases is to estimate the popu-
lation parameters from the training sets, and to replace the parameters in (7)
and (8) by these estimates. Chang and Afifi (1974) considered the special
case of q = 1, i.e., one binary variable, and assumed that there was at least
one observation in each of the two binary variable locations in each popula-
tion. Let there be n_ij observations in the j-th location of the training set from π_i, and let y_ijk be the k-th continuous variable vector in this location. The situation then corresponds exactly to a 2 × 2 (location × population) MANOVA, whence estimators of the population parameters are

p̂_ij = n_ij / n_i,   μ̂_ij = ȳ_ij = (1/n_ij) Σ_{k=1}^{n_ij} y_ijk,   Σ̂ = S (the pooled within-location sample covariance matrix),

where n_i = n_i1 + n_i2 and n = n_1 + n_2. Chang and Afifi called the resulting allocation rule the "double discriminant function"; Tu and Han (1982) studied
this rule further, in particular discussing an "inverse sampling" procedure to
ensure non-singularity of matrices.
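In computational terms the naive estimation amounts to nothing more than cell proportions, cell means and a pooled within-cell covariance matrix, as in the following sketch (illustrative only; "cells" is assumed to be an integer array of location indices for one training set, and the degrees-of-freedom divisor is an assumption rather than a prescription taken from the paper).

import numpy as np

def cell_estimates(cells, Y, s):
    """Naive estimates for one training set: cell proportions n_ij / n_i, cell
    mean vectors, and the within-cell sums-of-squares-and-products matrix."""
    n, c = Y.shape
    counts = np.array([(cells == j).sum() for j in range(s)])
    p_hat = counts / n
    ybar = np.full((s, c), np.nan)          # NaN marks empty locations
    W = np.zeros((c, c))
    for j in range(s):
        Yj = Y[cells == j]
        if len(Yj) > 0:
            ybar[j] = Yj.mean(axis=0)
            R = Yj - ybar[j]
            W += R.T @ R
    return p_hat, ybar, W

def pooled_dispersion(W1, W2, n1, n2, s):
    """Pooled within-location dispersion estimate S; the divisor n - 2s is an
    assumed degrees-of-freedom choice, not taken from the paper."""
    return (W1 + W2) / (n1 + n2 - 2 * s)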
As Chang and Afifi pointed out, there is no bar in principle to extension
of the above approach for the case q > 1. However, it is evident that if
sample sizes are small, or if q (and hence s) is at all large, then there are
bound to be locations for which no data are present in the training sets. What
strategy is then to be followed when this location occurs in an individual to
be classified? Also, there will be some locations with only one or two indivi-
duals present in the training sets, so the parameters for these locations will be
very poorly estimated. It is therefore clear that an alternative to the naive
estimation method given above is needed if this model is to have widespread
practical utility. A second problem is that the misclassification probabilities
(8) are derived under the assumption of conditional normality on the continu-
ous variables. How can the performance of an allocation rule derived from
(7) be assessed if this assumption is not satisfied?
Krzanowski (1975) tackled both of these problems, proposing a scheme
for obtaining smoothed parameter estimates and outlining steps to make
data-based error rate estimation feasible. For the parameter estimation we
first note that the binary variables can be treated as if they were factors in a
MANOVA context, the 2^q locations being the possible categories of a q-factor experiment where each factor has two possible levels and the multivariate response is y. Then, denoting by ν_i the overall mean of y in population π_i, by α_iu the main effect of X_u, by β_i,uv the interaction between X_u and X_v, and so on for interactions of all orders between the X_u, we can express the μ_ij as the linear model

μ_ij = ν_i + Σ_{u=1}^{q} α_iu x_u + Σ_{u<v} β_i,uv x_u x_v + ... + γ_i,12...q x_1 x_2 ... x_q,   (9)

where x_u denotes the value taken by the binary variable X_u in location j. The cell probabilities p_ij can similarly be expressed through the log-linear model

log p_ij = ω_i + Σ_{u=1}^{q} δ_iu x_u + Σ_{u<v} φ_i,uv x_u x_v + ... + ψ_i,12...q x_1 x_2 ... x_q,   (10)

where x_u is as before.
Such expansions in terms of the main effects of the individual x_u and the interactions of all orders between them link up with the expansion of the canonical parameters into main effects and interactions described in the Introduction. Setting all interaction terms above first order in (9) and (10) to zero, and estimating only the remaining parameters from the training data, yields the smoothed ("second-order") estimates of the μ_ij and p_ij referred to below.
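The following Python fragment sketches one way such second-order smoothing of the cell means might be set up within a single population (this is a rough illustration using weighted least squares on a main-effects-plus-two-way-interactions design matrix; it is not the estimation procedure of the papers cited, and the cell probabilities would be smoothed analogously by fitting the second-order log-linear model (10) to the cell counts).

import numpy as np
from itertools import combinations, product

def second_order_design(q):
    """Design matrix over the 2^q locations with a constant, q main effects and
    the q(q-1)/2 two-variable interactions (higher-order terms dropped).
    Rows are ordered as in the returned 'cells' array."""
    cells = np.array(list(product([0, 1], repeat=q)))
    cols = [np.ones(len(cells))]
    cols += [cells[:, u] for u in range(q)]                    # main effects
    cols += [cells[:, u] * cells[:, v] for u, v in combinations(range(q), 2)]
    return cells, np.column_stack(cols)

def smooth_cell_means(ybar, counts, design):
    """Weighted least-squares fit of a second-order model for the cell means:
    regress the naive means on the design matrix, weighting by cell counts,
    then use the fitted values as smoothed estimates for every location,
    including locations that were empty in the training set."""
    obs = counts > 0
    W = np.sqrt(counts[obs])[:, None]
    coef, *_ = np.linalg.lstsq(design[obs] * W, ybar[obs] * W, rcond=None)
    return design @ coef                  # smoothed means for all 2^q cells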
Let us suppose that the training sets consist of n_1, n_2 individuals from π_1, π_2, respectively, and denote the i-th individual in the training set from π_j by v_i^(j) (i = 1, ..., n_j; j = 1, 2). Then the hypothesis-testing approach says that, to allocate an individual z^T = (x^T, y^T), we use the test statistic for the null hypothesis that all the v_i^(1) and z belong to π_1 while all the v_i^(2) belong to π_2, versus the alternative that all the v_i^(1) belong to π_1 while all the v_i^(2) and z belong to π_2.
Now the likelihood-ratio test statistic in this case is T = sup(L_1m × L) / sup(L_2m × L), where L is the joint likelihood for all the v_i^(j) and L_jm is the likelihood for z in π_j given x = m. Using the joint density from model (6), it is easy to show (see Krzanowski 1982) that T can be expressed explicitly in terms of Σ̂^(j) and p̂_im^(j), the estimates of Σ and p_im respectively when z has been included with the training set from π_j (j = 1, 2). For stability, smoothed
parameter estimates using second-order linear and log-linear models are
again recommended. Krzanowski (1982) showed that simplified estimation
of the parameters is obtained if all parameters are estimated for the training
set data only, and then some simple algebraic identities are used to update
inverses and determinants on including z successively with the two training
sets. The final allocation rule is to classify z to π_1 if T > 1 and otherwise to π_2.
Error rates can again be estimated using the leave-one-out procedure,
and this requires one initial estimation of all parameters using the training
data only together with a re-estimation of all parameters when each indivi-
dual is removed from its own training set and placed in the other one. Once
again, some useful matrix and vector identities are available to enable the
latter estimates to be obtained easily from the former ones; full details are
given by Krzanowski (1982).
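Schematically, the leave-one-out calculation has the following structure (a naive refit-from-scratch sketch with hypothetical fit and classify functions supplied by the user; in the actual procedure the updating identities mentioned above avoid the refit, and the removed individual is also added to the opposite training set when the statistic T is recomputed).

def leave_one_out_error(train1, train2, fit, classify):
    """Leave-one-out error-rate estimation: each training individual is removed
    from its own sample in turn, the rule is re-fitted (or updated), and the
    held-out individual is classified. train1 and train2 are Python lists;
    classify should return the label 1 or 2."""
    errors = 0
    for i in range(len(train1)):
        rule = fit(train1[:i] + train1[i + 1:], train2)
        errors += classify(rule, train1[i]) != 1
    for i in range(len(train2)):
        rule = fit(train1, train2[:i] + train2[i + 1:])
        errors += classify(rule, train2[i]) != 2
    return errors / (len(train1) + len(train2))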
Vlachonikolis (1990) developed a predictive (Bayesian) approach, assuming prior densities for the p_ij of the Dirichlet form h({p_ij} | π_i) ∝ Π_{j=1}^{s} p_ij^{α_ij - 1}, where the α_ij are positive constants reflecting prior knowledge about the discrete variable locations. When no such prior information exists, he suggested setting α_ij = α_i for all j = 1, ..., s and i = 1, 2. He then obtained expressions for the predictive densities of z in π_1 and π_2, both when the parameters μ_ij, Σ, and p_ij are estimated by the "naive" quantities ȳ_ij, S and n_ij/n_i, and also when the
second-order models (9) and (10) are employed. As all the resulting expres-
sions are rather complicated they are not given here; for full details the reader
is referred to Vlachonikolis (1990).
In addition to the basic allocation rules and their error rates, described
in the previous section, various extra features of the location model have been
developed and are now available for use by the practitioner.
Krzanowski (1976) proposed a simple graphical procedure for investi-
gating the worth of the location model discrimination procedure over and
above the use of a simple linear discriminant function between two popula-
tions. The parameter-replacement version of Bayes rule (7) requires esti-
mates of the continuous variable means μ_ij and dispersion matrix Σ. If μ̂_ij and Σ̂ are the estimates obtained in a particular application (whether by using the naive estimators ȳ_ij, S or the smoothed second-order ones), then it is a simple matter to obtain the matrix of Mahalanobis D^2 values between every pair of states in the two populations. This (2s × 2s) symmetric matrix has entries (μ̂_ij - μ̂_kl)^T Σ̂^{-1} (μ̂_ij - μ̂_kl), where j, l take all values from 1 to s and i, k take values 1 or 2. Use of (metric) scaling on this matrix thus produces a
low-dimensional representation of the 2s states which (through the ordering
of the principal axes) gives an impression of the relative importance of
differences between states and between populations. The more compactly
clustered are the states within populations, the less difference there is
between them in respect of the continuous variable parameters and hence the
less benefit will be derived from use of the location model in preference to a
simple linear discriminant function. A detailed illustrative example in this
paper showed that the major axis of the two-dimensional metric scaling
configuration split off all the even-numbered states from the odd-numbered
ones, while the minor axis split the populations. Since the even-numbered
states were those for which the first binary variable X1 took the value zero
while the odd-numbered ones were those for which it took the value one, this
demonstration showed that the main effect of X_1 was the biggest source of differences in the data. Use of the location model will allow different linear discriminant functions for the two values of X_1, but a simple linear discrim-
inant function will involve averaging over this difference and so will not give
as good a final result in this particular example.
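The computations behind this graphical procedure are elementary, as the following sketch indicates (illustrative only; mu_hat is assumed to stack the 2s estimated state means as rows).

import numpy as np

def states_d2_matrix(mu_hat, Sigma_hat):
    """Pairwise Mahalanobis D^2 values between the 2s state means (rows of
    mu_hat, the two populations stacked), using the common dispersion estimate."""
    k = len(mu_hat)
    Sinv = np.linalg.inv(Sigma_hat)
    D2 = np.zeros((k, k))
    for a in range(k):
        for b in range(a + 1, k):
            d = mu_hat[a] - mu_hat[b]
            D2[a, b] = D2[b, a] = d @ Sinv @ d
    return D2

def classical_scaling(D2, dims=2):
    """Classical (metric) scaling of a squared-distance matrix: double-centre,
    eigendecompose, and return the leading principal coordinates."""
    k = len(D2)
    J = np.eye(k) - np.ones((k, k)) / k
    B = -0.5 * J @ D2 @ J
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:dims]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))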
This graphical idea was taken one step further and formalized into a
hypothesis-testing procedure by Krzanowski (1979). Since the location
model methodology will show greatest improvement over a simple linear
discriminant function when there is large variability among the cells in
respect of the continuous variable means μ_ij, a first stage is to look for linear
transformations of the continuous variables such that there is as little varia-
tion as possible among the cell means in each population for the transformed
data. Krzanowski (1979) gave several alternative ways of deriving such
linear transformations, and then went on to derive a likelihood-ratio test for
equality of the (true) cell means in each population. This procedure is thus a
likelihood-ratio test for the adequacy of a simple linear discriminant function
in place of the Bayes rule (7) based on the location model (but note that con-
ditional normality of the continuous variables is now a critical assumption).
There is also the possibility of using fewer transformed variables than there
are original variables in future applications, and this aspect was investigated
further by Krusinska (1988b).
One annoying feature of many practical applications of discriminant
analysis is the presence of missing values in the data. In a comprehensive
and important contribution, Little and Schluchter (1985) provided maximum
likelihood estimation schemes for parameters of the location model when
some data are missing. Their procedure uses the EM algorithm, embraces
both the "naive" and "smoothed" approaches to parameter estimation, and
allows constraints to be imposed on some of the parameters if so desired. The
authors also discussed general aspects of imputation and discrimination as
applications of the technique.
Up to this point all developments had been in terms of two-group
discriminant analysis but Krzanowski (1986) extended the location model to
multiple-group discrimination. The connection here was made by noting that,
in general, the Bayes classification rule with equal costs and equal prior pro-
babilities is identical to the maximum likelihood classification rule while for
the continuous-variable-only case where z ~ N(μ_i, Σ) in π_i, the maximum likelihood rule is identical to the minimum distance rule (i.e., allocating z to that population π_i for which (z - μ_i)^T Σ^{-1} (z - μ_i) is smallest). For the special case of homogeneous CGD's, and treating z^T = (x^T, y^T) as a degenerate
"population" in which unit probability is ascribed to the categorical state
defined by x and zero probability to all other states, and whose continuous
component has probability mass unity at the observed value y and zero else-
where, Krzanowski (1986) showed that the affinity (5) between z and π_i reduced to

ρ_i = {(2π)^c |Σ|}^{-1/4} p_im^{1/2} exp{ -(1/4) (y - μ_im)^T Σ^{-1} (y - μ_im) }   if x = m.   (12)
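Since the leading constant in (12) is common to all populations, comparing affinities reduces to comparing p_im^{1/2} exp{ -(1/4)(y - μ_im)^T Σ^{-1}(y - μ_im) } across i. The sketch below makes this explicit (illustrative only; it assumes, as the minimum-distance argument above implies, that z is assigned to the population with the largest affinity, and that p_hat and mu_hat are arrays indexed by population and location).

import numpy as np

def allocate_multigroup(x_state, y, p_hat, mu_hat, Sigma_hat):
    """Allocate z = (x, y) to the population with the largest affinity (12),
    equivalently the smallest distance; p_hat[i, m] and mu_hat[i, m] are the
    (smoothed) cell probability and mean for population i and location m.
    The constant {(2 pi)^c |Sigma|}^{-1/4} is common to all i and is omitted."""
    m = x_state
    quad = lambda d: d @ np.linalg.solve(Sigma_hat, d)
    scores = [np.sqrt(p_hat[i, m]) * np.exp(-0.25 * quad(y - mu_hat[i, m]))
              for i in range(len(p_hat))]
    return int(np.argmax(scores))         # index of the chosen population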
4. Feature Selection
and hence of corresponding likelihoods. For this step, therefore, Daudin pro-
posed the maximization of a modified AIC which, in effect, is the increase in
AIC for a given number of continuous variables due to the presence of the
population factor Z. Backward elimination or forward selection was advo-
cated in place of a global search, and an illustrative example was considered
in some detail.
Further selection strategies were advocated in a series of papers by
Krusinska (1988a, 1989a, 1989b). The first of these papers focussed on the
two-group case and discussed the selection of those features (i.e. those vari-
ables from the complete set of categorical and continuous) that minimize an
estimate of p(π_1 | π_2) + p(π_2 | π_1). Various different estimates of this quantity
were considered: replacement of parameters in expressions (8) using either
naive estimates, smoothed estimates or the U-method (Lachenbruch and
Mickey 1968); or empirical estimates via either resubstitution or leave-one-
out (again encompassing either naive or smoothed estimation). The second
paper allowed multiple-group situations and considered selection of the
(minimum number of) features that give a significant value of the discriminatory measure T^2 = trace(H G^{-1}), where H is the between-states-and-populations sum of
squares and products (SSP) matrix while G is the within-states-and-
populations SSP matrix. Some distributional results were provided to check
on significance of T^2 and thereby to provide a stopping rule. In both papers,
backward elimination was advocated, and both approaches require at least
one continuous variable to be present at each stage of the process. Note that
both of these approaches involve strictly "discriminatory" criteria, by con-
trast with Daudin's "adequacy of model" criterion, but normality still plays
an important role in definitions (8) and in the T^2 distribution results. (However, the latter may be slightly questionable as the appropriate distribution should be that of the maximum T^2 among g > 1 values at each step.) The
third paper of the set provided a two-step (sub-optimal) branch-and-bound
algorithm in place of the backward-elimination process using T^2.
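As an illustration of the T^2 criterion and of one backward-elimination step (the helper names and the elimination step shown are ours, not taken from the papers cited; a proper stopping rule would use the distributional results mentioned above):

import numpy as np

def t2_criterion(H, G):
    """Discriminatory measure T^2 = trace(H G^{-1}), where H and G are the
    between- and within-states-and-populations SSP matrices restricted to the
    currently retained continuous variables."""
    return np.trace(H @ np.linalg.inv(G))

def backward_step(H, G, keep):
    """One backward-elimination step: drop the variable whose removal leaves
    the largest T^2 for the remaining subset; 'keep' is a list of variable
    indices into the full H and G."""
    best = None
    for v in keep:
        idx = [u for u in keep if u != v]
        t2 = t2_criterion(H[np.ix_(idx, idx)], G[np.ix_(idx, idx)])
        if best is None or t2 > best[1]:
            best = (v, t2)
    return best          # (variable to drop, resulting T^2)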
Although each of the papers cited above provided at least one illustra-
tive practical example of the relevant technique, no comparisons have yet
been made among the competing proposals, so it is not possible to make
recommendations. This is clearly an area that needs further research, but see
also the remarks in section 5.3 below.
For this section it will be convenient to change notation from that used
hitherto. Let us suppose that v is the individual to be allocated, and that in
the two-group case the training sets consist of a sample x_1, x_2, ..., x_n from π_1 and a sample y_1, y_2, ..., y_m from π_2. Write D(π_1, π_2), D(v, π_i), and D(x_i, x_j) for
the distances (however defined) between groups, between an element and a
group, and between elements respectively.
One of the oldest distance-based allocation rules can be formally attri-
buted to Matusita (1956), but has been used both formally and informally by
many others. This is the intuitively reasonable rule that allocates v to the
"nearer" of the two populations:
and that (13) then reduced to (7) with this distance function. (Note that in this
case, D(π_1, π_2) is given by Δ_12 from case (iii) after Equation (5), with all
population parameters replaced by their estimates from the training sets.)
A problem arises with use of (13) on multinomial data, since in this
case the maximum likelihood rule is again recovered but this rule does not
work well with sparse data. In an attempt to overcome the problem, Dillon
Pr{s ∈ π_t | x} = w_t exp(-d_t^2) / Σ_{h=1}^{g} w_h exp(-d_h^2),   (15)

where w_1, w_2, ..., w_g are weights satisfying Σ_{i=1}^{g} w_i = 1.
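Computationally (15) is simply a weighted softmax of the negative squared distances, as the following sketch shows (illustrative only; d2[t] is assumed to be the squared distance from the individual's point to the ideal point of group t, computed from fitted coordinates described in text not reproduced here).

import numpy as np

def ipda_probabilities(d2, w):
    """Classification probabilities of (15): d2 holds the squared distances to
    the g group ideal points and w the group weights summing to one."""
    num = np.asarray(w) * np.exp(-np.asarray(d2))
    return num / num.sum()

The individual is then assigned to the group with the largest estimated probability.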
Given the known group membership of individuals in the training data,
the (conditional) likelihood of the training data is multinomial with probabili-
ties (15) and observed group frequencies, so iterative approximation methods
(e.g. Fisher's scoring) can be used to provide maximum likelihood estimates
of all the unknown parameters (B and the wi) and hence an individual can be
classified to the group for which it has highest estimated probability. Takane
et al. advocated model evaluation via AIC, and showed how such additional
features as subset selection could be incorporated easily. Note the resem-
blance of the methodology to logistic discrimination (Anderson 1982), and
indeed many of the computational and sampling concerns are the same with
F_2 = (1/m) Σ_{i=1}^{m} D^2(v, y_i) - (1/(2m^2)) Σ_i Σ_j D^2(y_i, y_j)
analyzed Krzanowski's Data Set 4 and also considered a data set comprising
386 medical consultations with c = 9 and q = 5. Those authors first selected
a subset of variables using standard stepwise selection on the simple LDF, the
augmented LDF, and the logistic discriminant function, and then they
obtained leave-one-out error rates for the chosen subsets.
All the above were two-group problems. Daudin (1986) provided a
three-group problem in discriminating between the categories "bad", "acceptable" and "good" for 632 melons with c = 6 and q = 5. He quoted
both resubstitution and leave-one-out error rates for the location model, the
simple LDF and the augmented LDF, both with and without prior selection of
variables.
Krusinska (1988a, 1989a) used a data set consisting of 164 bronchial
asthma sufferers with c = 6, q = 8 and she reported leave-one-out error rates
and T 2 values for various selected subsets and selection strategies based on
the location model. Finally, Takane et al. (1987) re-analyzed Krzanowski's
Data Set 4 by ideal point discriminant analysis, with and without prior selec-
tion of variables.
Nearly all the above comparisons were ones contrasting the location
model with some variant of the LDF. In the majority of cases, the location
model (without prior variable selection) did as well as or better than the sim-
ple LDF (also without prior variable selection). Where the simple LDF did
badly compared to the location model, the augmented LDF (including squares
and cross-products of variables) had a performance much closer to that of the
location model. Prior selection of variables generally improved perfor-
mances. Daudin's results are the only ones where all methods underwent
prior selection of variables, and here the location model still performed much
better than the other methods (41.9% misclassification as against 44.6% with
the augmented LDF and 49.9% with the simple LDF; 7.1% of the "bad" melons allocated to the "good" group as against 8.5% with the augmented
LDF and 10.5% with the simple LDF). Thus the above results suggest that
the location model is, in general, the best method for mixed variables fol-
lowed by the augmented LDF and then the simple LDF. Where such com-
parisons were made, logistic discrimination seemed to be comparable to the
simple LDF on mixed data and no particular benefit was derived from a qua-
dratic discriminant function as against the simple LDF.
However, some contradictory results were obtained in those comparis-
ons where some methods had prior selection of variables while other methods
did not. For example, the performance of ideal point discriminant analysis
was no better than that of the location model in the data set on which they
were compared if the full set of variables was used in both methods, but it did
do better if prior selection was made before ideal point analysis and not
before location model analysis. It is the present author's view that all results
involving prior variable selection must be treated with caution for several rea-
sons. Although the leave-one-out procedure guards against bias in the error-
rate estimation process, additional bias is being introduced by the variable
selection since by definition it is those variables that are "best" for the train-
ing data which are being selected. Thus comparison of a method that has not
had selection with one that has had prior selection is unfair. Also different
methods may react differently to the selection process (for example in
Daudin's data, prior selection reduced the error rate for the simple LDF only
from 50.3% to 49.9% but for the augmented LDF from 50.0% to 44.6%).
Thus unfair comparisons may result even when all methods have undergone
prior selection. The whole area of assessing performances of allocation rules
with and without variable selection has received very little attention to date,
and considerably more needs to be done. A start has been made in the simple
LDF context (see Ganeshanandam and Krzanowski 1989), and work on the
mixed-variable case is currently in progress.
6. Future Prospects
References
AFIFI, A. A., and ELASHOFF, R. M. (1969), "Multivariate Two-sample Tests with Dichoto-
mous and Continuous Variables. 1. The Location Model," Annals of Mathematical
Statistics, 40, 290-298.
AKAIKE, H. (1973), "Information Theory and an Extension of the Maximum Likelihood
Principle," in Second International Symposium on Information Theory, Eds., B.N.
Petrov and F. Csaki, Budapest: Akademiai Kiado, 267-281.
ANDERSON, J. A. (1982), "Logistic Discrimination," In Handbook of Statistics 2,
Classification, Pattern Recognition and Reduction of Dimensionality, Eds., P.R. Krish-
naiah and L.N. Kanal, Amsterdam: North Holland, 169-191.
ANDERSON, T. W. (1973), "An Asymptotic Expansion of the Distribution of the Studentized
Classification Statistic W," Annals of Statistics, 1, 964-972.
BALAKRISHNAN, N., and TIKU, M. L. (1988), "Robust Classification Procedures Based on
Dichotomous and Continuous Variables," Journal of Classification, 5, 53-80.
CHANG, P. C., and AFIFI, A. A. (1974), "Classification Based on Dichotomous and Continu-
ous Variables," Journal of the American Statistical Association, 69, 336-339.
COX, D. R. (1972), "The Analysis of Multivariate Binary Data," Applied Statistics, 21, 113-
120.
CUADRAS, C. M. (1989), "Distance Analysis in Discrimination and Classification Using
Both Continuous and Categorical Variables," in Statistical Data Analysis and Inference,
Ed., Y. Dodge, Amsterdam: North Holland, 459-473.
CUADRAS, C. M. (1991), "A Distance-based Approach to Discriminant Analysis and Its Pro-
perties," Mathematics preprint series no. 90, Barcelona University.
DAUDIN, J. J. (1986), "Selection of Variables in Mixed-variable Discriminant Analysis,"
Biometrics, 42, 473-481.
DILLON, W. R., and GOLDSTEIN, M. (1978), "On the Performance of Some Multinomial
Classification Rules," Journal of the American Statistical Association, 73, 305-313.
EDWARDS, D. (1990), "Hierarchical Interaction Models," Journal of the Royal Statistical
Society, Series B, 52, 3-20.
GANESHANANDAM, S., and KRZANOWSKI, W. J. (1989), "On Selecting Variables and
Assessing Their Performance in Linear Discriminant Analysis," Australian Journal of
Statistics, 31, 433-447.
GOWER, J. C. (1971), "A General Coefficient of Similarity and Some of Its Properties,"
Biometrics, 27, 857-871.
HAN, C.-P. (1979), "Alternative Methods of Estimating the Likelihood Ratio in Classification
of Multivariate Normal Observations," American Statistician, 33, 204-206.
KNOKE, J. D. (1982), "Discriminant Analysis with Discrete and Continuous Variables,"
Biometrics, 38, 191-200.