
Journal of Classification 10:25-49 (1993)

The Location Model for Mixtures of Categorical and Continuous Variables

W. J. Krzanowski

University of Exeter

Abstract: Recent research into graphical association models has focussed interest
on the conditional Gaussian distribution for analyzing mixtures of categorical and
continuous variables. A special case of such models, utilizing the homogeneous
conditional Gaussian distribution, has in fact been known since 1961 as the loca-
tion model, and for the past 30 years has provided a basis for the multivariate
analysis of mixed categorical and continuous variables. Extensive development of
this model took place throughout the 1970's and 1980's in the context of discrimi-
nation and classification, and comprehensive methodology is now available for
such analysis of mixed variables. This paper surveys these developments and sum-
marizes current capabilities in the area. Topics include distances between groups,
discriminant analysis, error rates and their estimation, model and feature selection,
and the handling of missing data.

Keywords: Classification; Discrimination; Distances; Error rates; Feature selection.

Constructive comments from the anonymous referees are gratefully acknowledged.

Author's Address: Mathematical Statistics and Operational Research Department, University of Exeter, Laver Building, North Park Road, Exeter EX4 4QE, UK. E-mail: wjk@uk.ac.exeter.msor0 (JANET) or wjk@msor0.exeter.ac.uk (BITNET).

1. Introduction

Multivariate data sets containing mixtures of categorical and continuous variables arise frequently in practice. Various simple approaches to the

analysis of such data sets are possible: arbitrary categorization of all the con-
tinuous variables followed by analysis using standard methods for multivari-
ate categorical data, or arbitrarily scoring all the categorical variables and
then using standard methods for multivariate continuous data, or analyzing
the categorical variables and the continuous variables separately (each by
standard methods) and then attempting to synthesize the two sets of results.
None of these options seems satisfactory for comprehensive analysis of the
data, however. The first approach loses information in the categorization of
continuous variables, the second introduces considerable subjectivity in the
numerical scoring adopted, while the third ignores any associations existing
between the categorical and the continuous variables.
A much more satisfactory general approach is first to specify a
parametric model for mixed variables, then to fit the model to the data at hand
and finally to use the parameter estimates for drawing inferences. By
parametric model here is meant a suitable joint probability distribution for a
set of q categorical variables and c continuous variables. Standard probabil-
ity theory tells us that a joint distribution of p variables can be expressed as
the conditional distribution of any subset of these variables given the values
of the remainder, times the marginal distribution of these remaining variables.
Thus if we want to specify the joint distribution of q categorical and c con-
tinuous variables then there appear to be two routes that we could take: as the
conditional distribution of the categorical variables given the values of the
continuous variables, times the marginal distribution of the latter; or as the
conditional distribution of the continuous variables given the values of the
categorical variables, times the marginal distribution of the latter.
The first possibility was briefly raised by Cox (1972), who suggested
that the joint distribution of a mixture of binary and continuous variables
could be written as a logistic conditional distribution of the binary variables
for given values of the continuous variables, times a marginal multivariate
normal distribution for the latter. However, this idea appears not to have
been pursued any further in the analysis of mixed data sets, almost all work in
the area focussing on the second route outlined above. Here it is assumed
that the continuous variables have a different multivariate normal distribution
for each possible setting of categorical variable values, while the categorical
variables have an arbitrary marginal multinomial distribution. This model
has been termed the "conditional Gaussian distribution" (CGD), and it forms
the central plank of graphical association models for the analysis of mixed
categorical and continuous variables. There has been a great deal of interest
recently in these models, and full details can be found in the work of Lau-
ritzen and Wermuth (1989), Edwards (1990), Wermuth and Lauritzen (1990)
and Whittaker (1990, Chapter 11). We briefly summarize here the relevant
technical results for our subsequent purposes.

Suppose that the q categorical variables and c continuous variables are denoted X = (X_1, X_2, ..., X_q)^T and Y = (Y_1, Y_2, ..., Y_c)^T. Furthermore, assume that the i-th variable X_i has s_i possible categories, so that overall there are s = ∏_{i=1}^{q} s_i possible states, i.e., patterns, of discrete-variable values. The above model thus implies that if X falls in state j then Y ~ N(μ_j, Σ_j), while the probability that X falls in state j is p_j (j = 1, ..., s; ∑_{j=1}^{s} p_j = 1). Hence the joint probability density of observing state j of X and value y of Y is

    f(j, y) = p_j (2\pi)^{-c/2} |\Sigma_j|^{-1/2} \exp\{ -\tfrac{1}{2} (y - \mu_j)^T \Sigma_j^{-1} (y - \mu_j) \}.    (1)

By collecting terms and redefining parameters, this density can be rewritten in the form

    f(j, y) = \exp\{ \alpha_j + \beta_j^T y - \tfrac{1}{2} y^T \Omega_j y \}.    (2)

The parameters in (1) are called the "moment" parameters of the CGD, the triple (p_j, μ_j, Σ_j) comprising, respectively, the cell probability, the cell mean and the cell dispersion matrix for the j-th state, while the parameters in (2) are the "canonical" parameters of the CGD. Here the α_j are scalars (the discrete canonical parameters), the β_j are c-element vectors (the linear canonical parameters) and the Ω_j are (c × c) positive-definite symmetric matrices (the cell precision matrices). Expanding (2) in terms of vector and matrix elements yields the form

    f(j, y) = \exp\{ \alpha_j + \sum_{k=1}^{c} \beta_{jk} y_k - \tfrac{1}{2} \sum_{k=1}^{c} \sum_{l=1}^{c} \gamma_{jkl} y_k y_l \}.    (3)

Since the values of α_j, β_jk and γ_jkl depend on the state j of the discrete variables, and the latter can be viewed as "factors" in the terminology of design of experiments, each of α_j, β_jk and γ_jkl can be expressed as a sum of main effects of the relevant individual discrete variables and interactions of all orders between them. This yields an expansion into terms resembling ANOVA or log-linear models.
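To make the moment/canonical correspondence concrete: matching (1) with (2) gives Ω_j = Σ_j^{-1}, β_j = Σ_j^{-1} μ_j and α_j = log p_j − (c/2) log 2π − ½ log|Σ_j| − ½ μ_j^T Σ_j^{-1} μ_j. The following minimal sketch (illustrative only; the function name is invented and NumPy is used for convenience) performs this conversion for a single cell.

```python
import numpy as np

def moment_to_canonical(p_j, mu_j, Sigma_j):
    """Convert the moment parameters (p_j, mu_j, Sigma_j) of one CGD cell
    into the canonical parameters (alpha_j, beta_j, Omega_j) of (2),
    obtained by expanding the logarithm of density (1)."""
    c = len(mu_j)
    Omega_j = np.linalg.inv(Sigma_j)              # cell precision matrix
    beta_j = Omega_j @ mu_j                       # linear canonical parameter
    alpha_j = (np.log(p_j)
               - 0.5 * c * np.log(2.0 * np.pi)
               - 0.5 * np.linalg.slogdet(Sigma_j)[1]
               - 0.5 * mu_j @ beta_j)             # discrete canonical parameter
    return alpha_j, beta_j, Omega_j

# Example for one cell with c = 2 continuous variables:
alpha, beta, Omega = moment_to_canonical(
    p_j=0.3,
    mu_j=np.array([1.0, -0.5]),
    Sigma_j=np.array([[2.0, 0.3], [0.3, 1.0]]))
```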
A graphical association model is a model with density of the form (3),
containing expansions in terms of main effects and interactions, in which all
pairs of variables in a specified set are conditionally independent given the
remaining variables. (This model is "graphical" because it is a model for
multivariate random observations whose independence structure is character-
ized by a graph, so the word "graphical" should here be interpreted in the
context of mathematical graph theory; for full background details see Whit-
taker, 1990). Lauritzen and Wermuth (1989) established that two variables
are conditionally independent given the rest if and only if all interaction
terms involving the two variables are zero. Edwards (1990) defined hierarch-
ical interaction models as the most general densities of form (3) in which the
marginality principle is still respected (i.e., if a particular interaction term is
set to zero then all interaction terms that "include" it are also set to zero).
The goal of graphical modeling is then to determine the most parsimonious
such model for a given set of data; the technical aspects concerned with
fitting these models (maximum likelihood estimation of parameters with and
without constraints, likelihood ratio tests, distributional results) are covered
in the references cited earlier.
Although we will not be concerned specifically with graphical model-
ing here, it is pertinent to note that the full CGD model has appeared occa-
sionally in other contexts. One such previous occurrence was in the calcula-
tion of distance between two populations (Krzanowski, 1983a). If we sup-
pose that there are g populations, denoted π_i (i = 1, ..., g), and that a different CGD is permitted in each population, then we must introduce an extra subscript into the model parameters to allow for the different populations. Thus p_ij now denotes the probability of cell j in population π_i, while μ_ij and Σ_ij respectively denote the mean vector and dispersion matrix of Y in cell j of population π_i. The density (1) then generalizes to

    f(j, y; \pi_i) = p_{ij} (2\pi)^{-c/2} |\Sigma_{ij}|^{-1/2} \exp\{ -\tfrac{1}{2} (y - \mu_{ij})^T \Sigma_{ij}^{-1} (y - \mu_{ij}) \}.    (4)

Krzanowski (1983a) surveyed the various possible general definitions of the distance Δ_ab between π_a and π_b, and chose to work with the Matusita (1956) definition, also known as the Hellinger distance. This definition involves calculation of the affinity ρ_ab between π_a and π_b, and Krzanowski (1983a) showed that for the case z = (j, y) and densities in (4),

    \rho_{ab} = \sum_{j=1}^{s} (p_{aj} p_{bj})^{1/2} \, 2^{c/2} \, |\Sigma_{aj}|^{1/4} \, |\Sigma_{bj}|^{-1/4} \, |I + \Sigma_{aj} \Sigma_{bj}^{-1}|^{-1/2} \exp\Big\{ -\tfrac{1}{4} \sum_{k=1}^{c} \frac{(\nu_{ajk} - \nu_{bjk})^2}{1 + \lambda_{kj}} \Big\},    (5)

where λ_kj and l_kj are the solutions of (Σ_bj − λ_kj Σ_aj) l_kj = 0 (with l_kj scaled so that l_kj^T Σ_aj l_kj = 1) and ν_ijk = l_kj^T μ_ij.
Since "affinity" is the converse of "distance", possible measures of distance between π_a and π_b are Δ_ab = {2(1 − ρ_ab)}^{1/2}, Δ_ab = −log ρ_ab or Δ_ab = cos^{-1} ρ_ab. The first of these measures was used.

For practical applications, the parameters in (5) may be estimated from data by maximum likelihood, yielding intuitively reasonable estimates: p̂_ij is given by the proportion of individuals falling in state j of population π_i, while μ̂_ij and Σ̂_ij are given by the mean vector ȳ_ij and covariance matrix S_ij of the continuous variable values for these individuals. However, if s is at all large or if sample sizes are small, many of the states will have few observations and some Σ_ij will be poorly estimated. In this case it is possible to constrain the model, which will lead to pooled estimates. Various levels of pooling are possible:
(i) pool within states for each population (equivalent to assuming that the dispersion matrix is constant over cells in each population separately, i.e., that Σ_ij ≡ Σ_i for i = 1, ..., g);
(ii) pool within populations for each state (equivalent to assuming that the dispersion matrix is constant over populations in each cell separately, i.e., that Σ_ij ≡ Σ_j for j = 1, ..., s);
(iii) pool within populations and states (equivalent to assuming that the dispersion matrix is constant over cells and populations, i.e., that Σ_ij ≡ Σ for all i, j).
Krzanowski (1984) provided a Monte Carlo estimation scheme for the null distribution of the distance between π_a and π_b in case (iii), a result which enables some inferential procedures to be applied to the analyses of data sets in practice.
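Under the homogeneity of case (iii) the normal-theory factor in the affinity simplifies to exp(−D_j²/8), with D_j² the Mahalanobis distance between the two cell means, so that ρ_ab = ∑_j (p_aj p_bj)^{1/2} exp(−D_j²/8). The sketch below (illustrative code, not the author's implementation) computes the corresponding Matusita distance Δ_ab = {2(1 − ρ_ab)}^{1/2} from supplied (for example, naive) parameter estimates.

```python
import numpy as np

def matusita_distance_homogeneous(p_a, p_b, mu_a, mu_b, Sigma):
    """Matusita (Hellinger) distance between two populations under the
    homogeneous location model of case (iii).
    p_a, p_b   : cell probabilities, arrays of length s
    mu_a, mu_b : (s, c) arrays of cell means
    Sigma      : (c, c) common dispersion matrix"""
    Sigma_inv = np.linalg.inv(Sigma)
    rho = 0.0
    for j in range(len(p_a)):
        d = mu_a[j] - mu_b[j]
        D2 = d @ Sigma_inv @ d                    # Mahalanobis distance, cell j
        rho += np.sqrt(p_a[j] * p_b[j]) * np.exp(-D2 / 8.0)
    return np.sqrt(2.0 * (1.0 - rho))             # Delta_ab = {2(1 - rho_ab)}^(1/2)
```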
Case (iii) above, where the same dispersion matrix Σ is assumed for each combination of categorical variable values (i.e., at each discrete "location"), is known as the homogeneous CGD case (in which the mixed interaction components of the canonical parameters α_j, β_jk and γ_jkl are all set to zero). This case was first introduced by Olkin and Tate (1961) under the name "location model" for analysis of mixed binary and continuous variables. These authors looked at canonical correlations between binary and continuous variables for various possibilities involving c and q, established population results connecting these canonical correlations and the continuous variable means μ_j, and investigated the distribution theory for their estimates. Afifi and Elashoff (1969) extended the study of the model to the two-sample case. They investigated the effect of ignoring the binary nature of the x_i in calculating the usual two-sample Hotelling's T², and showed that the test was not consistent but that the distribution of T² depended on nuisance parameters. They then went on to derive an information-theoretic test of difference between groups and established the null distribution of the test statistic. In this work, they assumed that the parameter estimates p̂_ij, μ̂_ij and Σ̂_ij given above would be available for all binary-variable locations; otherwise the test could not be done.

The major practical developments of the location model that have


taken place since these two pioneering papers have been almost exclusively
in the context of discriminant analysis, and it is with this aspect that the
current survey is concerned. In Section 2 we set up the basic location model
formulation and summarize the different approaches adopted in practice,
while in Section 3 we consider possible extensions of the basic ideas. Section
4 is concerned with model and feature selection aspects and problems, while
Section 5 surveys alternative ways of tackling mixed-variable discrimination.
In Section 6 we indicate how the graphical modeling ideas considered at the
start can point the way to future developments.

2. Discriminant Analysis Methodology

2.1 Bayes Rule

We assume that there are two populations π_1 and π_2, discrimination between which is required. Historically, the location model methodology was developed from the starting point of a mixture of c continuous and q binary variables, and it is convenient to follow this line of development here. In this case we have s_i = 2 discrete variable categories for each i, and hence s = 2^q states, or cells, altogether. If we denote the two possible 'values' of each binary variable as 0 and 1, then the s cells can be logically arranged in the order determined by ∑_{i=1}^{q} x_i 2^{i−1}, where x_i is the value of the i-th binary variable. The location model thus specifies:

    \Pr(X = j \mid \pi_i) = p_{ij} \quad \text{and} \quad (Y \mid X = j, \pi_i) \sim N(\mu_{ij}, \Sigma) \quad \text{for } i = 1, 2 \text{ and } j = 1, \ldots, s.    (6)

By forming the ratio of the joint probability densities in the two populations, it readily follows (see, e.g., Krzanowski 1975) that for equal costs due to the two types of misclassification and equal prior probabilities of group membership the Bayes classification rule is to allocate an individual with X = j and Y = y to π_1 if

    (\mu_{1j} - \mu_{2j})^T \Sigma^{-1} \{ y - \tfrac{1}{2} (\mu_{1j} + \mu_{2j}) \} > \log (p_{2j}/p_{1j})    (7)

and to π_2 otherwise. This allocation rule is, in effect, a different linear discriminant function for each discrete-variable location. It is clear that the misallocation probabilities with this rule will therefore be the weighted sums of the misallocation probabilities at each location, these misallocation probabilities being obtainable from standard linear discriminant theory and the weights being the location probabilities p_ij. Denoting by p(π_i | π_j) the probability of allocating to π_i an individual that came from π_j, we have

    p(\pi_i \mid \pi_j) = \sum_{m=1}^{s} p_{jm} \, \Phi\{ (\log[p_{im}/p_{jm}] - \tfrac{1}{2} D_m^2)/D_m \} \quad \text{for } i \neq j,    (8)

where Φ denotes the standard normal distribution function and D_m² = (μ_1m − μ_2m)^T Σ^{-1} (μ_1m − μ_2m) is the squared Mahalanobis distance between π_1 and π_2 in location m.
If there are differential costs c_12, c_21 due to misclassification of an individual, and differential prior probabilities q_1, q_2 of observing an individual from the two populations, the net effect is to add k = log(c_12 q_2 / c_21 q_1) to log(p_im / p_jm) in both (7) and (8). We will assume c_12 = c_21 and q_1 = q_2 for simplicity throughout.
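As a concrete illustration, the following minimal sketch (assuming the population parameters, or estimates of them, are available; function names and the use of NumPy/SciPy are illustrative) applies rule (7) to a single observation and evaluates the weighted misallocation probabilities (8).

```python
import numpy as np
from scipy.stats import norm

def location_bayes_allocate(j, y, p1, p2, mu1, mu2, Sigma_inv):
    """Bayes rule (7): allocate an individual with X = j, Y = y to
    population 1 or 2 (equal costs and prior probabilities)."""
    d = mu1[j] - mu2[j]
    lhs = d @ Sigma_inv @ (y - 0.5 * (mu1[j] + mu2[j]))
    return 1 if lhs > np.log(p2[j] / p1[j]) else 2

def optimal_error_rates(p1, p2, mu1, mu2, Sigma_inv):
    """Misallocation probabilities (8), weighted over the s locations."""
    p = {1: p1, 2: p2}
    rates = {}
    for i, j in [(1, 2), (2, 1)]:                 # p(pi_i | pi_j), i != j
        total = 0.0
        for m in range(len(p1)):
            d = mu1[m] - mu2[m]
            Dm = np.sqrt(d @ Sigma_inv @ d)       # Mahalanobis distance, cell m
            if Dm > 0:                            # cells with coincident means skipped
                z = (np.log(p[i][m] / p[j][m]) - 0.5 * Dm**2) / Dm
                total += p[j][m] * norm.cdf(z)
        rates[(i, j)] = total
    return rates
```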
In practice, of course, the population parameters will be unknown, but random samples ("training sets") are generally available from π_1 and π_2. The simplest approach that has been adopted in such cases is to estimate the population parameters from the training sets, and to replace the parameters in (7) and (8) by these estimates. Chang and Afifi (1974) considered the special case of q = 1, i.e., one binary variable, and assumed that there was at least one observation in each of the two binary variable locations in each population. Let there be n_ij observations in the j-th location of the training set from π_i, and let y_ijk be the k-th continuous variable vector in this location. The situation then corresponds exactly to a 2 × 2 (location × population) MANOVA, whence estimators of the population parameters are

    \hat{\mu}_{ij} = \bar{y}_{ij} = \frac{1}{n_{ij}} \sum_{k=1}^{n_{ij}} y_{ijk},

    \hat{\Sigma} = S = \frac{1}{n-4} \sum_{i=1}^{2} \sum_{j=1}^{2} \sum_{k=1}^{n_{ij}} (y_{ijk} - \bar{y}_{ij})(y_{ijk} - \bar{y}_{ij})^T,

and

    \hat{p}_{ij} = n_{ij}/n_i,

where n_i = ∑_{j=1}^{2} n_ij and n = n_1 + n_2. Chang and Afifi called the resulting allocation rule the "double discriminant function"; Tu and Han (1982) studied this rule further, in particular discussing an "inverse sampling" procedure to ensure non-singularity of matrices.
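The naive estimation generalizes directly to q > 1 binary variables (s = 2^q cells); a sketch of the computations for one training sample is given below (illustrative; under the homogeneous model the pooled dispersion matrix is then S = (W_1 + W_2)/(df_1 + df_2) over the two samples).

```python
import numpy as np

def naive_location_estimates(cells, ys, s):
    """Naive estimates from one training sample: cell proportions, cell
    means, and the within-cell sums of squares and products (SSP) matrix.
    cells : length-n integer array giving the cell 0..s-1 of each individual
    ys    : (n, c) array of the continuous observations
    Returns (p_hat, ybar, W, df), where W is the within-cell SSP matrix
    and df its degrees of freedom."""
    cells = np.asarray(cells)
    n, c = ys.shape
    p_hat = np.zeros(s)
    ybar = np.zeros((s, c))
    W = np.zeros((c, c))
    df = 0
    for j in range(s):
        yj = ys[cells == j]
        p_hat[j] = len(yj) / n
        if len(yj) > 0:
            ybar[j] = yj.mean(axis=0)             # cell mean (left at zero if empty)
            R = yj - ybar[j]
            W += R.T @ R
            df += len(yj) - 1
    return p_hat, ybar, W, df
```

Cells without training observations receive p̂_ij = 0 and an arbitrary mean, which is precisely the difficulty that the smoothed estimation described below is designed to overcome.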
As Chang and Afifi pointed out, there is no bar in principle to extension
of the above approach for the case q > 1. However, it is evident that if
sample sizes are small, or if q (and hence s) is at all large, then there are
bound to be locations for which no data are present in the training sets. What
strategy is then to be followed when this location occurs in an individual to
be classified? Also, there will be some locations with only one or two indivi-
duals present in the training sets, so the parameters for these locations will be
very poorly estimated. It is therefore clear that an alternative to the naive
estimation method given above is needed if this model is to have widespread
practical utility. A second problem is that the misclassification probabilities
(8) are derived under the assumption of conditional normality on the continu-
ous variables. How can the performance of an allocation rule derived from
(7) be assessed if this assumption is not satisfied?
Krzanowski (1975) tackled both of these problems, proposing a scheme
for obtaining smoothed parameter estimates and outlining steps to make
data-based error rate estimation feasible. For the parameter estimation we
first note that the binary variables can be treated as if they were factors in a
MANOVA context, the 2^q locations being the possible categories of a q-factor experiment where each factor has two possible levels and the multivariate response is y. Then if we denote by ν_i the overall mean of y in population π_i, by α_ij the main effect of X_j, by β_i,jk the interaction between X_j and X_k, and so on for interactions between the X_j of all orders, then we can express the μ_ij as the linear model

    \mu_{ij} = \nu_i + \sum_{u=1}^{q} \alpha_{iu} x_u + \sum_{u<v} \beta_{i,uv} x_u x_v + \cdots + \gamma_{i,12\ldots q} \, x_1 x_2 \cdots x_q,    (9)

where x_u is the observed value of X_u in location j.


The above provides a MANOVA structure for the conditional means of the continuous variables. Moving on to the marginal distributions of the binary variables, we now have contingency tables of numbers of occurrences in each of the 2^q locations for the two populations. Thus we again have a 2^q factorial structure defined by the levels of the X_i, but now the responses at each location are incidences rather than realizations of a continuous vector y. A standard approach for the analysis of such data is by formulating an analogous log-linear model for the expected values η_ij = n_i p_ij in each location, so in our case we have the model

    \log \eta_{ij} = \omega_i + \sum_{u=1}^{q} \delta_{iu} x_u + \sum_{u<v} \phi_{i,uv} x_u x_v + \cdots + \psi_{i,12\ldots q} \, x_1 x_2 \cdots x_q,    (10)

where x_u is as before.
Such expansions in terms of the main effects of the individual x_i and the interactions of all orders between them link up with the expansions discussed in the introduction to graphical modeling above. A current concern of graphical modeling would be to determine which terms of (9) and (10) to retain and which to delete in forming the most parsimonious model that fitted a given set of data. Krzanowski (1975), however, adopted the pragmatic approach of retaining only (and all) main effects and first-order interactions in both (9) and (10); he proposed fitting the resulting second-order models to the continuous variable parameters by multivariate regression and to the discrete variable parameters by iterative proportional fitting. This scheme involves 2q(q + 1) + 4 parameters altogether. If the data are too sparse to admit such second-order models, then it should be possible to fit first-order models in which just the main effects are retained (giving 4q + 4 parameters to be estimated); a possible intermediate stage is one in which separate main effects are fitted in the two populations, but the interactions are constrained to be equal across populations (i.e., β_1,uv = β_2,uv and φ_1,uv = φ_2,uv for all u, v, giving q² + 3q + 4 parameters to be estimated).
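The continuous-variable part of this smoothing can be pictured as an ordinary multivariate regression of the observed y vectors on a design matrix containing the intercept, the main effects and the first-order interactions of the binary variables. The sketch below (illustrative only; it uses unweighted least squares and is not the author's Fortran implementation) fits such a second-order model within one population and returns a smoothed mean for every one of the 2^q locations, including those with no training observations.

```python
import numpy as np
from itertools import combinations, product

def second_order_design(X):
    """Design matrix with intercept, main effects and first-order
    interactions for an (n, q) array of 0/1 binary variables."""
    q = X.shape[1]
    cols = [np.ones(len(X))] + [X[:, u] for u in range(q)]
    cols += [X[:, u] * X[:, v] for u, v in combinations(range(q), 2)]
    return np.column_stack(cols)

def smoothed_cell_means(X, Y):
    """Least-squares fit of the second-order model (9) within one
    population; returns a smoothed mean vector for each of the 2^q cells."""
    D = second_order_design(X)
    B, *_ = np.linalg.lstsq(D, Y, rcond=None)     # multivariate regression
    q = X.shape[1]
    all_cells = np.array(list(product([0, 1], repeat=q)), dtype=float)
    return second_order_design(all_cells) @ B     # (2^q, c) smoothed means
```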
This approach ensures that estimates μ̂_ij and p̂_ij are available even for
those discrete-variable locations that have no observations in the training
sets, so that the classification rule (7) can be estimated in all eventualities.
What of the estimation of error rates induced by this rule? As mentioned
above, using parameter estimates obtained from the second-order models in
equation (8) will not give accurate assessment if the continuous variables are
not normally distributed at each location, so a data-based method was sought.
A suitable such method had earlier been proposed by Lachenbruch and
Mickey (1968) in the now familiar leave-one-out method, for which each data
point is omitted from the training sets in turn and classified on the basis of the
allocation rule computed from the remaining observations; the proportion of
individuals misallocated in each of the two training samples gives the two
estimated error rates. Naive application of this procedure to large data sets
may be feasible with modern computers, but at the time it would have been computationally prohibitive with the location model. However, Krzanowski (1975) showed that various matrix identities could be employed advantageously in the multivariate regression, and that the iterative scaling computations could be arranged in a sufficiently effective manner for the whole process
to be carried out relatively simply and quickly. Various examples, both real
and simulated, demonstrated both the efficiency and efficacy of the methodol-
ogy.
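Generically, the leave-one-out assessment can be written as below (a plain re-fit for each omitted observation; the matrix identities that avoid re-fitting from scratch are not reproduced here, and the fit and allocate arguments are placeholders for whichever estimation scheme and allocation rule are in use).

```python
def leave_one_out_error_rates(data1, data2, fit, allocate):
    """Lachenbruch-Mickey leave-one-out estimates of the two error rates.
    data1, data2 : lists of observations (j, y) from the two training sets
    fit          : function(train1, train2) -> fitted parameters
    allocate     : function(params, obs) -> 1 or 2"""
    errors = [0, 0]
    for i, obs in enumerate(data1):
        params = fit(data1[:i] + data1[i + 1:], data2)   # omit obs from set 1
        if allocate(params, obs) != 1:
            errors[0] += 1
    for i, obs in enumerate(data2):
        params = fit(data1, data2[:i] + data2[i + 1:])   # omit obs from set 2
        if allocate(params, obs) != 2:
            errors[1] += 1
    return errors[0] / len(data1), errors[1] / len(data2)
```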
Once a methodology was available for mixtures of binary and continuous variables, extension to general mixtures of categorical and continuous variables was extremely simple and was effected essentially by replacing an m-state categorical variable with (m − 1) dummy binary variables and proceeding as before. Suppose the categorical variable with m states is replaced by the (m − 1) dummy variables X_1, ..., X_{m−1}. Then state j of the categorical variable can be indicated by setting X_j = 1 and X_i = 0 for all i ≠ j (j = 1, ..., m − 1), in which case state m would be indicated by setting all X_i to zero. Note, however, that no more than one such dummy binary variable can have value 1 at any location, so models (9) and (10) will be over-parameterized. Two extra features therefore had to be incorporated into the estimation scheme: (i) all interaction terms within each group of dummy binary variables had to be excluded from the linear model (9) for the μ_ij (to avoid break-down of the multivariate regression estimation procedure), and (ii) all multinomial states corresponding to joint incidences x_u = 1, x_v = 1 within each group of dummy binary variables had to be fixed at zero (to ensure correct iterative scaling estimates in the log-linear model (10)). Full details of this generalization were provided by Krzanowski (1980).
The Bayes allocation procedure (7) derives from the ratio of the two
probability densities in the two populations, i.e. the ratio of the likelihoods for
the observation to be classified. The problem in practice is to estimate this
ratio, and the replacing of parameters of (7) by their estimates from the train-
ing data is the simplest and most commonly used way of doing so. However,
two other general procedures have also been proposed: the hypothesis-testing
method and the Bayesian predictive method. These approaches have been
discussed in the context of multivariate normal data, and compared with the
parameter-replacement approach for such data by Han (1979). We outline
their implementation with the location model in the two following sections.

2.2 Hypothesis-testing Rule

Let us suppose that the training sets consist of n_1, n_2 individuals from π_1, π_2, respectively, and denote the i-th individual in the training set from π_j by v_i^{(j)} (i = 1, ..., n_j; j = 1, 2). Then the hypothesis-testing approach says that, to allocate an individual z^T = (x^T, y^T), we use the test statistic for the null hypothesis that all the v_i^{(1)} and z belong to π_1 while all the v_i^{(2)} belong to π_2, versus the alternative that all the v_i^{(1)} belong to π_1 while all the v_i^{(2)} and z belong to π_2.
Now the likelihood-ratio test statistic in this case is

    T = \frac{\sup (L_{1m} \times L)}{\sup (L_{2m} \times L)},

where L is the joint likelihood for all the v_i^{(j)} and L_jm is the likelihood for z in π_j given x = m. Using the joint density from model (6), it is easy to show (see Krzanowski 1982) that T reduces to an explicit function of the estimated cell probabilities p̂_im^{(j)} and the determinants |Σ̂^{(j)}| (equation (11); the expression is rather complicated and is not reproduced here), where Σ̂^{(j)}, p̂_im^{(j)} are the estimates of Σ, p_im respectively when z has been included with the training set from π_j (j = 1, 2). For stability, smoothed parameter estimates using second-order linear and log-linear models are again recommended. Krzanowski (1982) showed that simplified estimation of the parameters is obtained if all parameters are estimated for the training set data only, and then some simple algebraic identities are used to update inverses and determinants on including z successively with the two training sets. The final allocation rule is to classify z to π_1 if T > 1 and otherwise to π_2.
Error rates can again be estimated using the leave-one-out procedure,
and this requires one initial estimation of all parameters using the training
data only together with a re-estimation of all parameters when each indivi-
dual is removed from its own training set and placed in the other one. Once
again, some useful matrix and vector identities are available to enable the
latter estimates to be obtained easily from the former ones; full details are
given by Krzanowski (1982).

2.3 Bayesian Predictive Rule

The Bayesian approach to the problem is to postulate prior distributions


for all the unknown parameters (μ_ij, Σ and p_ij for all i, j), use the likelihood of the training data under the location model to obtain posterior distributions of these parameters, multiply the joint density of z in each population by these posterior distributions and then integrate the resulting products with respect to the unknown parameters to obtain predictive densities of z in π_1 and π_2. The allocation of z is to the population in which it has the higher predictive density.
Vlachonikolis (1990) adopted the vague prior density g({μ_ij}, Σ) ∝ |Σ|^{−(c+1)/2} for the continuous variable parameters, and prior densities for the p_ij of the Dirichlet form h({p_ij} | π_i) ∝ ∏_{j=1}^{s} p_ij^{α_ij − 1}, where the α_ij are positive constants reflecting prior knowledge about the discrete variable locations. When no such prior information exists, he suggested setting α_ij = α_i for all j = 1, ..., s and i = 1, 2. He then obtained expressions for the predictive densities of z in π_1 and π_2, both when the parameters μ_ij, Σ, and p_ij are estimated by the "naive" quantities ȳ_ij, S and n_ij/n_i, and also when the second-order models (9) and (10) are employed. As all the resulting expressions are rather complicated they are not given here; for full details the reader is referred to Vlachonikolis (1990).

2.4 Assessment and Comparison of the Rules

Various studies, both empirical and theoretical, have been conducted to


establish the features of these three allocation rules and to compare their per-
formances. Here we summarize the main findings.
Average optimal error rates incurred by the Bayes rule (7) (i.e. error
rates assuming all population parameters to be known) have been tabulated
for the cases c = 1 continuous variable and q = 2, 3,4 binary variables over a
range of parameter values in the relatively simple case of independent
binaries by Krzanowski (1975) and Knoke (1982). More general situations
(correlated binaries and c > 1) were considered by Krzanowski (1977).
Asymptotic expansions of the parameter-replacement classification rule
(using "naive" estimators of parameters) and corresponding expected actual
error rates were obtained for the case of one binary variable by Tu and Han
(1982) and for the general case of mixed binary and continuous variables by
Vlachonikolis (1985), who also provided tabulations for various sample sizes
and parameter combinations. These asymptotic expansions depend heavily
on the normal-case expansions derived by Okamoto (1963).
For small-sample behavior, only Monte Carlo simulation results are so
far available. Krzanowski (1975) conducted a very small and limited study to
check on the performance of the location model. Much more extensive inves-
tigations were conducted by Vlachonikolis (1986), who obtained estimates of
the expected actual error rates for which he had previously derived asymp-
totic expansions, and by Vlachonikolis (1990) to investigate performance of
the Bayesian predictive rule. The parameter ranges and combinations in
these two studies were the same as in Vlachonikolis (1985) but this time both
"naive" and "smoothed" estimators of parameters were investigated.
Finally, empirical assessment of performance of the various allocation rules
(by either leave-one-out, resubstitution or test-set estimation of error rates on
various real data sets) can be found in Chang and Afifi (1974), Krzanowski
(1975, 1980, 1982), Knoke (1982), Tu and Han (1982), Vlachonikolis and
Marriott (1982) and Leung (1989). It should be noted that the majority of
tabulations, such as those cited above, have various practical drawbacks.
They only cater for known population parameters, so can only be used as a
general guide on the performance of an allocation rule or to set baselines for
the expected level of error rates, and they are very dependent on the situa-
tions considered. Later authors very often follow the precedent set by previ-
ous ones in terms of situation, parameter settings and combinations, etc., and
important cases can be easily missed.
Nonetheless, such tabulations do provide useful information, and the
above studies seem to point up the following conclusions. Average expected
error rates with the parameter-replacement Bayes allocation rule: (a) increase as the number of continuous variables increases; (b) decrease in large samples as the number of binary (categorical) variables increases; (c) decrease as the within-location Mahalanobis distances D_m² between π_1 and π_2 increase; (d) decrease as the difference in binary incidence probabilities between π_1 and π_2 increases; and (e) increase as the correlation between binary variables increases.
Generally, expected actual error rates are slightly higher than the
corresponding optimal error rates (approximately 5% - 30% in magnitude),
but the estimated actual error rates in the Monte Carlo studies were nearly
always smaller than their asymptotic expansion counterparts. However, the
difference between the two was rarely significant and the asymptotic expan-
sion seems to be a good approximation even for sample sizes as small as 50
per group. Virtually no difference was detected between the parameter-
replacement Bayes procedure and each of the hypothesis-testing and Baye-
sian predictive rules respectively.

3. Useful Practical Extensions

In addition to the basic allocation rules and their error rates, described
in the previous section, various extra features of the location model have been
developed and are now available for use by the practitioner.
Krzanowski (1976) proposed a simple graphical procedure for investigating the worth of the location model discrimination procedure over and above the use of a simple linear discriminant function between two populations. The parameter-replacement version of Bayes rule (7) requires estimates of the continuous variable means μ_ij and dispersion matrix Σ. If μ̂_ij and Σ̂ are the estimates obtained in a particular application (whether by using the naive estimators ȳ_ij, S or the smoothed second-order ones), then it is a simple matter to obtain the matrix of Mahalanobis D² values between every pair of states in the two populations. This (2s × 2s symmetric) matrix has entries (μ̂_ij − μ̂_kl)^T Σ̂^{-1} (μ̂_ij − μ̂_kl), where j, l take all values from 1 to s and i, k take values 1 or 2. Use of (metric) scaling on this matrix thus produces a low-dimensional representation of the 2s states which (through the ordering of the principal axes) gives an impression of the relative importance of differences between states and between populations. The more compactly clustered the states are within populations, the less difference there is between them in respect of the continuous variable parameters, and hence the less benefit will be derived from use of the location model in preference to a simple linear discriminant function. A detailed illustrative example in this
paper showed that the major axis of the two-dimensional metric scaling configuration split off all the even-numbered states from the odd-numbered ones, while the minor axis split the populations. Since the even-numbered states were those for which the first binary variable X_1 took the value zero while the odd-numbered ones were those for which it took the value one, this demonstration showed that the main effect of X_1 was the biggest source of differences in the data. Use of the location model will allow different linear discriminant functions for the two values of X_1, but a simple linear discriminant function will involve averaging over this difference and so will not give as good a final result in this particular example.
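The procedure amounts to classical (metric) scaling of the 2s × 2s matrix of estimated Mahalanobis D² values; a minimal sketch (illustrative, treating the D² values as squared Euclidean distances) is given below.

```python
import numpy as np

def state_population_map(mu_hat, Sigma_hat, k=2):
    """Metric scaling of the 2s x 2s matrix of Mahalanobis D^2 values
    between all (population, state) combinations.
    mu_hat    : (2, s, c) array of estimated cell means
    Sigma_hat : (c, c) estimated common dispersion matrix
    Returns a (2s, k) array of coordinates for plotting."""
    Sinv = np.linalg.inv(Sigma_hat)
    M = mu_hat.reshape(-1, mu_hat.shape[-1])              # stack the 2s means
    diff = M[:, None, :] - M[None, :, :]
    D2 = np.einsum('ijk,kl,ijl->ij', diff, Sinv, diff)    # squared distances
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D2 @ J                                 # double-centred matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:k]                    # leading principal axes
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))
```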
This graphical idea was taken one step further and formalized into a
hypothesis-testing procedure by Krzanowski (1979). Since the location
model methodology will show greatest improvement over a simple linear
discriminant function when there is large variability among the cells in
respect of the continuous variable means μ_ij, a first stage is to look for linear
transformations of the continuous variables such that there is as little varia-
tion as possible among the cell means in each population for the transformed
data. Krzanowski (1979) gave several alternative ways of deriving such
linear transformations, and then went on to derive a likelihood-ratio test for
equality of the (true) cell means in each population. This procedure is thus a
likelihood-ratio test for the adequacy of a simple linear discriminant function
in place of the Bayes rule (7) based on the location model (but note that con-
ditional normality of the continuous variables is now a critical assumption).
There is also the possibility of using fewer transformed variables than there
are original variables in future applications, and this aspect was investigated
further by Krusinska (1988b).
One annoying feature of many practical applications of discriminant
analysis is the presence of missing values in the data. In a comprehensive
and important contribution, Little and Schluchter (1985) provided maximum
likelihood estimation schemes for parameters of the location model when
some data are missing. Their procedure uses the EM algorithm, embraces
both the "naive" and "smoothed" approaches to parameter estimation, and
allows constraints to be imposed on some of the parameters if so desired. The
authors also discussed general aspects of imputation and discrimination as
applications of the technique.
Up to this point all developments had been in terms of two-group discriminant analysis, but Krzanowski (1986) extended the location model to multiple-group discrimination. The connection here was made by noting that, in general, the Bayes classification rule with equal costs and equal prior probabilities is identical to the maximum likelihood classification rule, while for the continuous-variable-only case where z ~ N(μ_i, Σ) in π_i, the maximum likelihood rule is identical to the minimum distance rule (i.e., allocating z to that population π_i for which (z − μ_i)^T Σ^{-1} (z − μ_i) is smallest). For the special case of homogeneous CGD's, and treating z^T = (x^T, y^T) as a degenerate "population" in which unit probability is ascribed to the categorical state defined by x and zero probability to all other states, and whose continuous component has probability mass unity at the observed value y and zero elsewhere, Krzanowski (1986) showed that the affinity (5) between z and π_i reduced to

    \rho_i = \{(2\pi)^c |\Sigma|\}^{-1/4} \, p_{im}^{1/2} \exp\{ -\tfrac{1}{4} (y - \mu_{im})^T \Sigma^{-1} (y - \mu_{im}) \} \quad \text{if } x = m.    (12)

Since affinity is the converse of distance, a "minimum distance" rule is the same as a "maximum affinity" rule. The multiple-group allocation rule is thus to allocate z to the population π_j for which ρ_j is greatest; with two groups some simple algebraic manipulation shows that this rule reduces to (7). All the usual features (smoothed parameter estimates, leave-one-out error rates, etc.) are easily implemented. For details, see Krzanowski (1986).
One aspect of the location model that has been tacitly accepted without
question in all the developments is the conditional normality of the continu-
ous variables, but what can we do if this assumption is not warranted? A start
on answering this question was made by Balakrishnan and Tiku (1988), who
developed robust classification procedures for the special cases of one binary
and either one or two continuous variables. They used Tiku's modified max-
imum likelihood estimators (Tiku and Balakrishnan 1984) in which the r
smallest and r largest observations are censored and the resulting (normal)
likelihood is approximated in a simple fashion, obtained asymptotic error
rates for various symmetric non-normal populations and conducted Monte
Carlo studies for small nij. In general, the error rates were shown to be
equivalent to the usual ones if normality is appropriate, but they are better
and more stable under non-normality of y.
Finally, Leung (1989) has provided an asymptotic expansion of the studentized parameter-replacement Bayes allocation rule (7). The asymptotic expansions provided by Vlachonikolis (1985) require knowledge of the true values of p_ij and D_m² = (μ_1m − μ_2m)^T Σ^{-1} (μ_1m − μ_2m), so that the only use that could be made of them was in the tabulations already described in Section 2. Leung, however, used Anderson's (1973) approach to generalize these expansions by accommodating estimates of p_ij and D_m². Thus, it is now possible to calculate an asymptotic expected actual error rate in any practical application. Leung illustrated the calculation by obtaining this expected actual error rate for Chang and Afifi's (1974) example, and comparing the result with their empirical estimate. Note, however, that this expansion assumes large samples and normality of y.

4. Feature Selection

One major shortcoming of the location model methodology is that the


training data becomes very sparsely distributed among the categorical states
when the total number of such states s becomes large (either because q is
large or because each si is large). With sparse data, low-order models have to
be fitted in order to obtain smoothed parameter estimates and this might not
be satisfactory. It seems better to restrict the number of categorical variables
and to fit higher-order models. This point was first made by Krzanowski
(1983b), who provided a mechanism for selecting the "most effective" subset
of categorical variables for the model. For a given number of categorical
variables, he argued that the "most effective" choice comprises those
categorical variables that yield the largest estimated distance Δ̂_12 between π_1 and π_2 (according to the special case (iii) of Equation (5)). Ideally one would conduct an "all subsets" search with Δ_12 as the objective function, but this
might not be computationally feasible, so a backward elimination procedure
was described instead. Also, since selection is based just on the training data,
"naive" estimators can be used instead of "smoothed" ones. Overall the
procedure is very fast and easily implemented.
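A minimal sketch of the backward elimination step is given below (illustrative; distance_12 stands for any estimate of Δ_12 computed from the training data restricted to a candidate subset of categorical variables, for example the case (iii) form of (5) with naive estimates).

```python
def backward_eliminate(categorical_vars, distance_12, target_size):
    """Backward elimination of categorical variables: repeatedly drop the
    variable whose removal leaves the largest estimated distance Delta_12.
    categorical_vars : list of variable labels
    distance_12      : function(subset) -> estimated distance between groups
    target_size      : number of categorical variables to retain"""
    current = list(categorical_vars)
    while len(current) > target_size:
        # evaluate the retained-subset distance after dropping each variable
        candidates = [(distance_12([v for v in current if v != drop]), drop)
                      for drop in current]
        best_distance, drop = max(candidates)
        current.remove(drop)
    return current
```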
The idea was taken up and extended to more general situations, involv-
ing selection of models as well as of features, by a number of authors. The
first was Daudin (1986), who extended the conditional distribution of Y to
include "populations" as an extra categorical variable, Z say. Thus if we
treat nl as the "base-line" population, then all individuals in rh are assigned
the value z = 0 while all individuals in x 2 are assigned the value z = 1. The
linear and log-linear models (9) and (10) are then extended by including
terms such as txz, f3xiz, yxixjz, and so on. Daudin kept to the second-order res-
triction previously suggested for these models, and hence included only main
effects (terms x 1,x2 . . . . . Xq,Z) and first-order interactions (terms
XlZ,X2Z . . . . . XqZ,XlXz,XlX3 . . . . . Xq-lXq). He distinguished two types of
model parameters in the linear model (9) for the ~tij, namely those terms that
involved the variable Z (cz1) and those that involved only the xi (cz2), and his
aim was to discard in turn the discrete variables, the continuous variables, and
the model parameters, that contribute least to discrimination. Selection was
to be made on the basis of the Akaike information criterion (AIC: log-
likelihood minus the number of independent parameters, Akaike (1973))
thereby making the assumption of normality of Y an important requirement.
He proposed a three-step selection procedure: (i) selection among the con-
tinuous variables and oq terms, (ii) selection among the cz2 terms, (iii) selec-
tion among log-linear terms.
Maximization of AIC was the objective, but there is a problem in step (i) because deleting continuous variables implies non-compatibility of the |Σ|'s and hence of corresponding likelihoods. For this step, therefore, Daudin pro-
posed the maximization of a modified AIC which, in effect, is the increase in
AIC for a given number of continuous variables due to the presence of the
population factor Z. Backward elimination or forward selection was advo-
cated in place of a global search, and an illustrative example was considered
in some detail.
Further selection strategies were advocated in a series of papers by
Krusinska (1988a, 1989a, 1989b). The first of these papers focussed on the
two-group case and discussed the selection of those features (i.e. those vari-
ables from the complete set of categorical and continuous) that minimize an
estimate of p(π_1 | π_2) + p(π_2 | π_1). Various different estimates of this quantity were considered: replacement of parameters in expressions (8) using either naive estimates, smoothed estimates or the U-method (Lachenbruch and Mickey 1968); or empirical estimates via either resubstitution or leave-one-out (again encompassing either naive or smoothed estimation). The second paper allowed multiple-group situations and considered selection of the (minimum number of) features that give a significant discriminatory measure T² = trace(HG^{−1}), where H is the between-states-and-populations sum of squares and products (SSP) matrix while G is the within-states-and-populations SSP matrix. Some distributional results were provided to check on the significance of T² and thereby to provide a stopping rule. In both papers, backward elimination was advocated, and both approaches require at least one continuous variable to be present at each stage of the process. Note that both of these approaches involve strictly "discriminatory" criteria, by contrast with Daudin's "adequacy of model" criterion, but normality still plays an important role in definitions (8) and in the T² distribution results. (However, the latter may be slightly questionable, as the appropriate distribution should be that of the maximum T² among g > 1 values at each step.) The third paper of the set provided a two-step (sub-optimal) branch-and-bound algorithm in place of the backward-elimination process using T².
Although each of the papers cited above provided at least one illustra-
tive practical example of the relevant technique, no comparisons have yet
been made among the competing proposals, so it is not possible to make
recommendations. This is clearly an area that needs further research, but see
also the remarks in section 5.3 below.

5. Other Possible Approaches with Mixed Variables

5.1 Linear Discriminant Analysis

The simplest possible practical approach is to ignore the categorical


nature of some of the variables by replacing all m(> 2)-state categorical variables by (m − 1) dummy binary variables, scoring all binary variables


zero and one and using the ordinary linear discriminant function (LDF) as if
all the variables were continuous. This procedure was investigated by Krza-
nowski (1977), who showed that often it will give satisfactory results but
clearly will become poorer the more diversity there is among the separate
location LDF's (7). Worst results will occur when individual LDF's become
'reversed' between locations. In addition to the techniques already men-
tioned in Section 3 above, changes in binary/continuous correlations between populations provide a useful diagnostic of potentially poor performance with a simple LDF. Knoke (1982) and Vlachonikolis and Marriott (1982) independently showed that considerable improvement could be achieved by including squares of variables and cross-products between them (particularly the mixed products x_i y_j) in the LDF. With the widespread availability of the
LDF and associated variable selection procedures in standard statistical
software, these "modified linear discriminant functions" obviously carry
considerable practical appeal.
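The augmentation itself is straightforward; the sketch below (illustrative, using scikit-learn's ordinary linear discriminant analysis purely for convenience) appends squared continuous terms and the binary-by-continuous cross-products to the feature matrix before fitting the LDF.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def augment(X_binary, Y_continuous):
    """Append squared continuous terms and all binary-by-continuous
    cross-products x_i * y_j to the original mixed feature matrix."""
    squares = Y_continuous ** 2
    cross = np.einsum('ni,nj->nij', X_binary, Y_continuous)
    cross = cross.reshape(len(X_binary), -1)
    return np.hstack([X_binary, Y_continuous, squares, cross])

# Fit an ordinary LDF on the augmented features (group labels in `groups`):
# ldf = LinearDiscriminantAnalysis().fit(augment(Xb, Yc), groups)
# predictions = ldf.predict(augment(Xb_new, Yc_new))
```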

5.2 Distance-based Discrimination

For this section it will be convenient to change notation from that used
hitherto. Let us suppose that v is the individual to be allocated, and that in
the two-group case the training sets consist of a sample x_1, x_2, ..., x_n from π_1 and a sample y_1, y_2, ..., y_m from π_2. Write D(π_1, π_2), D(v, π_i), D(x_i, x_j) for the distances (however defined) between groups, between an element and a group, and between elements, respectively.
One of the oldest distance-based allocation rules can be formally attributed to Matusita (1956), but has been used both formally and informally by many others. This is the intuitively reasonable rule that allocates v to the "nearer" of the two populations:

    allocate v to π_j if D(v, π_j) = min[D(v, π_1), D(v, π_2)].    (13)

If populations are multivariate normal with common dispersion matrices and Mahalanobis distances are used, then (13) reduces to the usual simple linear discriminant function. In the mixed-variable case using the location model, Krzanowski (1986) showed that D(v, π_i) = {2(1 − ρ_i)}^{1/2} with ρ_i given by (12), and that (13) then reduced to (7) with this distance function. (Note that in this case, D(π_1, π_2) is given by Δ_12 from case (iii) after Equation (5), with all population parameters replaced by their estimates from the training sets.)
A problem arises with use of (13) on multinomial data, since in this case the maximum likelihood rule is again recovered but this rule does not work well with sparse data. In an attempt to overcome the problem, Dillon and Goldstein (1978) introduced a new distance-based discrimination procedure. If we let D_(i)(π_1, π_2) denote the distance between π_1 and π_2 when v is included with the sample from π_i, then this new procedure is to allocate v to that group which yields the greatest separation between π_1 and π_2:

    allocate v to π_j if D_(j)(π_1, π_2) = max[D_(1)(π_1, π_2), D_(2)(π_1, π_2)].    (14)

Krzanowski (1987) studied this procedure theoretically with the help of influence functions, and showed that for mixed variables with the location model and distance Δ_12, (14) produced a rule equivalent to (7). Thus neither of these two distance-based approaches seems to offer anything more than the Bayes classification rule for mixed variables with the location model, at least when (5) is used as the basis for distance calculation.
Takane, Bozdogan and Shibayama (1987) adopted a different approach, which they called "ideal point discriminant analysis." They allowed g ≥ 2 groups, and assumed that the complete set of training data was contained in the (n × p) matrix X (where n is the total number of individuals and p is the total number of variables measured on each individual). The starting point is to suppose that the n individuals can be represented as n points in k-dimensional space, and that the (n × k) matrix Y of coordinates in this space is connected to X by the linear relationship Y = XB for parameters B. Let M be a (g × k) matrix of "group ideal points" (typically the group centroids obtained from Y). Then Takane et al. defined the distance from subject s to group t by d_st = { Σ_{j=1}^{k} (y_sj − m_tj)² }^{1/2} and postulated the model

    \Pr\{ s \in \pi_t \mid X \} = \frac{w_t \exp(-d_{st}^2)}{\sum_{h=1}^{g} w_h \exp(-d_{sh}^2)},    (15)

where w_1, w_2, ..., w_g are weights satisfying ∑_{i=1}^{g} w_i = 1.
i=1
Given the known group membership of individuals in the training data,
the (conditional) likelihood of the training data is multinomial with probabili-
ties (15) and observed group frequencies, so iterative approximation methods
(e.g. Fisher's scoring) can be used to provide maximum likelihood estimates
of all the unknown parameters (B and the wi) and hence an individual can be
classified to the group for which it has highest estimated probability. Takane
et al. advocated model evaluation via AIC, and showed how such additional
features as subset selection could be incorporated easily. Note the resem-
blance of the methodology to logistic discrimination (Anderson 1982), and
indeed many of the computational and sampling concerns are the same with
both approaches. However, the authors highlighted what they considered to


be the main distinguishing features of ideal point discrimination, namely the
multidimensional scaling connection, the more natural parameterization, and
the possibility of dimension reduction.
The most recent distance-based discrimination approach is that due to Cuadras (1989, 1991), who builds on Rao's (1982) diversity indices. Cuadras defines (for the two-group case) the two discriminant functions

    F_1 = \frac{1}{n} \sum_i D^2(v, x_i) - \frac{1}{2n^2} \sum_i \sum_j D^2(x_i, x_j);

    F_2 = \frac{1}{m} \sum_i D^2(v, y_i) - \frac{1}{2m^2} \sum_i \sum_j D^2(y_i, y_j)

and allocates v to π_j if F_j = min(F_1, F_2).
The benefit of this approach is that it operates exclusively with distances between elements rather than groups, so in the mixed-variable case we can use any of the standard distance measures from cluster analysis that will cope not only with mixtures of variables but also with obstacles such as missing values. A good choice of distance would be the one derived from Gower's (1971) general coefficient of similarity (see also Lerman 1987).
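A sketch combining the two ingredients is given below (illustrative; the Gower-based squared distance is computed here for a simple mix of binary and continuous variables with no missing values, taking D² = 1 − similarity as one common convention).

```python
import numpy as np

def gower_d2(a, b, is_binary, ranges):
    """Squared distance D^2 = 1 - Gower similarity between two mixed
    observations: simple matching for binary variables, range-scaled
    absolute difference for continuous ones.
    ranges : variable ranges (use 1 at binary positions)."""
    sims = np.where(is_binary,
                    (a == b).astype(float),
                    1.0 - np.abs(a - b) / ranges)
    return 1.0 - sims.mean()

def cuadras_allocate(v, sample1, sample2, d2):
    """Allocate v to the group with the smaller proximity function
    F_i = mean d2(v, .) - 0.5 * mean within-group d2."""
    def F(sample):
        to_v = np.mean([d2(v, x) for x in sample])
        within = np.mean([d2(x, y) for x in sample for y in sample])
        return to_v - 0.5 * within
    return 1 if F(sample1) <= F(sample2) else 2

# Example use with the Gower-based distance:
# d2 = lambda a, b: gower_d2(a, b, is_binary, ranges)
# group = cuadras_allocate(v, list(sample1), list(sample2), d2)
```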

5.3 Empirical Comparison of Results

A limited number of empirical comparisons of different approaches to


mixed-variable discrimination has been reported in the literature, and these
are first briefly summarized before conclusions are drawn.
Chang and Afifi (1974) reported a study of 43 suicide attempts with
q = 1 and c = 2. They quoted parameter-replacement error rates from (8) for
the location model (using naive estimators with both separate and pooled
covariance matrices in cells) and corresponding parameter-replacement error
rates for the simple LDF. Leung (1989) re-estimated the location model error
rates for this data set by means of the asymptotic studentized expansion.
Krzanowski (1975) gave five data sets, all with a medical background,
ranging over various values of c and q. He quoted leave-one-out error rates
for the location model Bayes rule (7) (with smoothed parameter estimates),
the simple LDF, logistic discrimination, and a classification rule based on
dichotomized variables.
Knoke (1982) reported a data set comprising 137 patients who had pre-
viously recovered from myocardial infarction, with c = 2 and q = 3. He gave
resubstitution, leave-one-out and test set (105 extra patients) error rates for
the usual location model rule, the simple LDF, the augmented LDF, and the
quadratic discriminant function. Vlachonikolis and Marriott (1982) re-
analyzed Krzanowski's Data Set 4 and also considered a data set comprising
386 medical consultations with c = 9 and q = 5. Those authors first selected
a subset of variables using standard stepwise selection on the simple LDF, the
augmented LDF, and the logistic discriminant function, and then they
obtained leave-one-out error rates for the chosen subsets.
All the above were two-group problems. Daudin (1986) provided a
three-group problem in discriminating between the categories "bad", "acceptable" and "good" for 632 melons with c = 6 and q = 5. He quoted
both resubstitution and leave-one-out error rates for the location model, the
simple LDF and the augmented LDF, both with and without prior selection of
variables.
Krusinska (1988a, 1989a) used a data set consisting of 164 bronchial
asthma sufferers with c = 6, q = 8, and she reported leave-one-out error rates and T² values for various selected subsets and selection strategies based on
the location model. Finally, Takane et al. (1987) re-analyzed Krzanowski's
Data Set 4 by ideal point discriminant analysis, with and without prior selec-
tion of variables.
Nearly all the above comparisons were ones contrasting the location
model with some variant of the LDF. In the majority of cases, the location
model (without prior variable selection) did as well as or better than the sim-
ple LDF (also without prior variable selection). Where the simple LDF did
badly compared to the location model, the augmented LDF (including squares
and cross-products of variables) had a performance much closer to that of the
location model. Prior selection of variables generally improved perfor-
mances. Daudin's results are the only ones where all methods underwent
prior selection of variables, and here the location model still performed much
better than the other methods (41.9% misclassification as against 44.6% with
the augmented LDF and 49.9% with the simple LDF; 7.1% of the "bad" melons allocated to the "good" group as against 8.5% with the augmented
LDF and 10.5% with the simple LDF). Thus the above results suggest that
the location model is, in general, the best method for mixed variables fol-
lowed by the augmented LDF and then the simple LDF. Where such com-
parisons were made, logistic discrimination seemed to be comparable to the
simple LDF on mixed data and no particular benefit was derived from a qua-
dratic discriminant function as against the simple LDF.
However, some contradictory results were obtained in those comparis-
ons where some methods had prior selection of variables while other methods
did not. For example, the performance of ideal point discriminant analysis
was no better than that of the location model in the data set on which they
were compared if the full set of variables was used in both methods, but it did
do better if prior selection was made before ideal point analysis and not
before location model analysis. It is the present author's view that all results
involving prior variable selection must be treated with caution for several rea-
sons. Although the leave-one-out procedure guards against bias in the error-
rate estimation process, additional bias is being introduced by the variable
selection since by definition it is those variables that are "best" for the train-
ing data which are being selected. Thus comparison of a method that has not
had selection with one that has had prior selection is unfair. Also different
methods may react differently to the selection process (for example in
Daudin's data, prior selection reduced the error rate for the simple LDF only
from 50.3% to 49.9% but for the augmented LDF from 50.0% to 44.6%).
Thus unfair comparisons may result even when all methods have undergone
prior selection. The whole area of assessing performances of allocation rules
with and without variable selection has received very little attention to date,
and considerably more needs to be done. A start has been made in the simple
LDF context (see Ganeshanandam and Krzanowski 1989), and work on the
mixed-variable case is currently in progress.
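As a purely illustrative sketch of this point (the univariate F-test selector, the value of k and the use of scikit-learn below are assumptions of the example, not features of any study surveyed here), the two functions below compute leave-one-out error rates with the variable selection performed once on the full data and with the selection repeated inside every leave-one-out training set; only the second arrangement keeps the held-out case from influencing which variables are chosen.

# Illustrative sketch only: leave-one-out error estimation with variable
# selection outside the loop (optimistically biased) versus inside the loop.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import LeaveOneOut, cross_val_score

def loo_error_selection_outside(X, y, k=5):
    # Biased: the k "best" variables are chosen using all n cases, including
    # the case that will subsequently be held out at each step.
    X_selected = SelectKBest(f_classif, k=k).fit_transform(X, y)
    acc = cross_val_score(LinearDiscriminantAnalysis(), X_selected, y,
                          cv=LeaveOneOut())
    return 1.0 - acc.mean()

def loo_error_selection_inside(X, y, k=5):
    # Honest: selection is refitted on each training set of n - 1 cases, so the
    # held-out case never influences which variables enter the allocation rule.
    rule = make_pipeline(SelectKBest(f_classif, k=k),
                         LinearDiscriminantAnalysis())
    acc = cross_val_score(rule, X, y, cv=LeaveOneOut())
    return 1.0 - acc.mean()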

6. Future Prospects

In addition to the variable selection problem outlined above, where else
should effort be concentrated in the mixed-variable discrimination area? It is
evident that there is considerable scope for investigating the effect of relaxing
assumptions inherent in the location model, and developing suitable
modifications of the model in such circumstances. For instance, much
remains to be done on the robustness of allocation rule (7) to departures from
normality and from constant within-cell dispersion matrices. If such departures
do cause poor performance, then development of robust discriminant functions
for the mixed-variable case would be essential. Similarly, is there a call for
quadratic discriminant functions within the location model to cater for various
types of dispersion inhomogeneity?
A second possible direction of progress brings us back to our starting
point, the use of graphical modeling. The whole development of location
model methodology to date has assumed a fairly rigid second-order structure
for obtaining smoothed parameter estimates via (9) and (10), but now a much
wider horizon has opened up with the advent of graphical modeling tech-
niques. Tailoring the best model to each individual data set is clearly the
next step, with development of appropriate software also a top priority.
Krusinska (1990) seems to be pointing the way in this direction, but there is
clearly still much to be achieved.
Finally we consider the question of availability of software for carrying
out the techniques discussed in this paper. Unfortunately, despite the time
that has now elapsed since the methods were first proposed, none of the pro-
cedures based on the location model has yet found its way into any of the
widely available general statistical software packages. Attempts are being
made to interest at least one of the producers in adding suitable routines to a
future release, but until such efforts bear fruit potential users will have to be
content with acquiring private software. The author has a number of Fortran
routines for carrying out many of the location-model-based techniques
described above. Although these are not in the tidiest or most efficient form
(and some are still at a developmental stage), he will be happy to send them
by e-mail to anyone on request.

References

AFIFI, A. A., and ELASHOFF, R. M. (1969), "Multivariate Two-sample Tests with Dichoto-
mous and Continuous Variables. 1. The Location Model," Annals of Mathematical
Statistics, 40, 290-298.
AKAIKE, H. (1973), "Information Theory and an Extension of the Maximum Likelihood
Principle," in Second International Symposium on Information Theory, Eds., B.N.
Petrov and F. Csaki, Budapest: Akademia Kiado, 267-281.
ANDERSON, J. A. (1982), "Logistic Discrimination," In Handbook of Statistics 2,
Classification, Pattern Recognition and Reduction of Dimensionality, Eds., P.R. Krish-
naiah and L.N. Kanal, Amsterdam: North Holland, 169-191.
ANDERSON, T. W. (1973), "An Asymptotic Expansion of the Distribution of the Studentized
Classification Statistic W," Annals of Statistics, 1, 964-972.
BALAKRISHNAN, N., and TIKU, M. L. (1988), "Robust Classification Procedures Based on
Dichotomous and Continuous Variables," Journal of Classification, 5, 53-80.
CHANG, P. C., and AFIFI, A. A. (1974), "Classification Based on Dichotomous and Continu-
ous Variables," Journal of the American Statistical Association, 69, 336-339.
COX, D. R. (1972), "The Analysis of Multivariate Binary Data," Applied Statistics, 21, 113-
120.
CUADRAS, C. M. (1989), "Distance Analysis in Discrimination and Classification Using
Both Continuous and Categorical Variables," in Statistical Data Analysis and Inference,
Ed., Y. Dodge, Amsterdam: North Holland, 459-473.
CUADRAS, C. M. (1991), "A Distance-based Approach to Discriminant Analysis and Its Pro-
perties," Mathematics preprint series no. 90, Barcelona University.
DAUDIN, J. J. (1986), "Selection of Variables in Mixed-variable Discriminant Analysis,"
Biometrics, 42, 473-481.
DILLON, W. R., and GOLDSTEIN, M. (1978), "On the Performance of Some Multinomial
Classification Rules," Journal of the American Statistical Association, 73, 305-313.
EDWARDS, D. (1990), "Hierarchical Interaction Models," Journal of the Royal Statistical
Society, Series B, 52, 3-20.
GANESHANANDAM, S., and KRZANOWSKI, W. J. (1989), "On Selecting Variables and
Assessing Their Performance in Linear Discriminant Analysis," Australian Journal of
Statistics, 31, 433-447.
GOWER, J. C. (1971), "A General Coefficient of Similarity and Some of Its Properties,"
Biometrics, 27, 857-871.
HAN, C.-P. (1979), "Alternative Methods of Estimating the Likelihood Ratio in Classification
of Multivariate Normal Observations," American Statistician, 33, 204-206.
KNOKE, J. D. (1982), "Discriminant Analysis with Discrete and Continuous Variables,"
Biometrics, 38, 191-200.
KRUSINSKA, E. (1988a), "Variable Selection in Location Model for Mixed Variable
Discrimination: A Procedure Based on Total Probability of Misclassification," EDV in
Medizin und Biologie, 19, 14-18.
KRUSINSKA, E. (1988b), "Linear Transformations in Location Model and Their Influence
on Classification Results in Mixed Variable Discrimination," EDV in Medizin und
Biologie, 19, 110-114.
KRUSINSKA, E. (1989a), "New Procedure for Selection of Variables in Location Model for
Mixed Variable Discrimination," Biometrical Journal, 31, 511-523.
KRUSINSKA, E. (1989b), "Two Step Semi-optimal Branch and Bound Algorithm for Feature
Selection in Mixed Variable Discrimination," Pattern Recognition, 22, 455-459.
KRUSINSKA, E. (1990), "Suitable Location Model Selection in the Terminology of Graphi-
cal Models," Biometrical Journal, 32, 817-826.
KRZANOWSKI, W. J. (1975), "Discrimination and Classification Using Both Binary and
Continuous Variables," Journal of the American Statistical Association, 70, 782-790.
KRZANOWSKI, W. J. (1976), "Canonical Representation of the Location Model for Discrim-
ination or Classification," Journal of the American Statistical Association, 71, 845-848.
KRZANOWSKI, W. J. (1977), "The Performance of Fisher's Linear Discriminant Function
Under Non-optimal Conditions," Technometrics, 19, 191-200.
KRZANOWSKI, W. J. (1979), "Some Linear Transformations for Mixtures of Binary and
Continuous Variables, With Particular Reference to Linear Discriminant Analysis,"
Biometrika, 66, 33-39.
KRZANOWSKI, W. J. (1980), "Mixtures of Continuous and Categorical Variables in
Discriminant Analysis," Biometrics, 36, 493-499.
KRZANOWSKI, W. J. (1982), "Mixtures of Continuous and Categorical Variables in
Discriminant Analysis: A Hypothesis-testing Approach," Biometrics, 38, 991-1002.
KRZANOWSKI, W. J. (1983a), "Distance Between Populations Using Mixed Continuous and
Categorical Variables," Biometrika, 70, 235-243.
KRZANOWSKI, W. J. (1983b), "Stepwise Location Model Choice in Mixed-variable
Discrimination," Applied Statistics, 32, 260-266.
KRZANOWSKI, W. J. (1984), "On the Null Distribution of Distance Between Two Groups,
Using Mixed Continuous and Categorical Variables," Journal of Classification, 1, 243-253.
KRZANOWSKI, W. J. (1986), "Multiple Discriminant Analysis in the Presence of Mixed
Continuous and Categorical Data," Computers and Mathematics with Applications,
12A(2), 179-185.
KRZANOWSKI, W. J. (1987), "A Comparison Between Two Distance-based Discriminant
Principles," Journal of Classification, 4, 73-84.
LACHENBRUCH, P. A., and MICKEY, M. R. (1968), "Estimation of Error Rates in Discrim-
inant Analysis," Technometrics, 10, 1-11.
LAURITZEN, S. L., and WERMUTH, N. (1989), "Graphical Models for Association
Between Variables, Some of Which Are Qualitative and Some Quantitative," Annals of
Statistics, 17, 31-54.
LERMAN, I. C. (1987), "Construction d'un indice de similarité entre objets décrits par des
variables d'un type quelconque. Application au problème du consensus en classification
(1)," Revue de Statistique Appliquée, 35, 39-60.
LEUNG, C. Y. (1989), "The Studentized Location Linear Discriminant Function," Communi-
cations in Statistics, Theory and Methods, 18, 3977-3990.
LITTLE, R. J. A., and SCHLUCHTER, M. D. (1985), "Maximum Likelihood Estimation for
Mixed Continuous and Categorical Data with Missing Values," Biometrika, 72, 497-
512.
MATUSITA, K. (1956), "Decision Rule, Based on the Distance, for the Classification Prob-
lem," Annals of Mathematical Statistics, 8, 67-77.
OKAMOTO, M. (1963), "An Asymptotic Expansion for the Distribution of the Linear
Discriminant Function," Annals of Mathematical Statistics, 34, 1286-1301 (with correc-
tion in 39, 1358-1359).
OLKIN, I., and TATE, R. F. (1961), "Multivariate Correlation Models with Mixed Discrete
and Continuous Variables," Annals of Mathematical Statistics, 32, 448-465 (with
correction in 36, 343-344).
RAO, C. R. (1982), "Diversity and Dissimilarity Coefficients: A Unified Approach," Theoret-
ical Population Biology, 21, 24-43.
TAKANE, Y., BOZDOGAN, H., and SHIBAYAMA, T. (1987), "Ideal Point Discriminant
Analysis," Psychometrika, 52, 371-392.
TIKU, M. L., and BALAKRISHNAN, N. (1984), "Robust Multivariate Classification Pro-
cedures Based on the MML Estimators," Communications in Statistics - Theory and
Methods, 13, 967-986.
TU, C. T., and HAN, C. P. (1982), "Discriminant Analysis Based on Binary and Continuous
Variables," Journal of the American Statistical Association, 77, 447-454.
VLACHONIKOLIS, I. G. (1985), "On the Asymptotic Distribution of the Location Linear
Discriminant Function," Journal of the Royal Statistical Society, Series B, 47, 498-509.
VLACHONIKOLIS, I. G. (1986), "On the Estimation of the Expected Probability of
Misclassification in Discriminant Analysis with Mixed Binary and Continuous Vari-
ables," Computers and Mathematics with Applications, 12A(2), 187-195.
VLACHONIKOLIS, I. G. (1990), "Predictive Discrimination and Classification with Mixed
Binary and Continuous Variables," Biometrika, 77, 657-662.
VLACHONIKOLIS, I. G., and MARRIOTT, F. H. C. (1982), "Discrimination with Mixed
Binary and Continuous Data," Applied Statistics, 31, 23-31.
WERMUTH, N., and LAURITZEN, S. L. (1990), "On Substantive Research Hypotheses,
Conditional Independence Graphs and Graphical Chain Models," Journal of the Royal
Statistical Society, Series B, 52, 21-50.
WHITTAKER, J. (1990), Graphical Models in Applied Multivariate Statistics, Chichester:
Wiley.
