
Subclass Discriminant Analysis

2006, IEEE Transactions on Pattern Analysis and Machine Intelligence

Manli Zhu, Student Member, IEEE, and Aleix M. Martinez, Member, IEEE

Abstract—Over the years, many Discriminant Analysis (DA) algorithms have been proposed for the study of high-dimensional data in a large variety of problems. Each of these algorithms is tuned to a specific type of data distribution (that which best models the problem at hand). Unfortunately, in most problems the form of each class pdf is a priori unknown, and the selection of the DA algorithm that best fits the data is done by trial and error. Ideally, one would like to have a single formulation that can be used for most distribution types. This can be achieved by approximating the underlying distribution of each class with a mixture of Gaussians. In this approach, the major problem to be addressed is that of determining the optimal number of Gaussians per class, i.e., the number of subclasses. In this paper, two criteria able to find the most convenient division of each class into a set of subclasses are derived. Extensive experimental results are shown using five databases. Comparisons are given against Linear Discriminant Analysis (LDA), Direct LDA (DLDA), Heteroscedastic LDA (HLDA), Nonparametric DA (NDA), and Kernel-Based LDA (K-LDA). We show that our method is always the best or comparable to the best.

Index Terms—Feature extraction, discriminant analysis, pattern recognition, classification, eigenvalue decomposition, stability criterion, mixture of Gaussians.

(The authors are with the Department of Electrical and Computer Engineering, The Ohio State University, 205 Dreese Lab, 2015 Neil Ave., Columbus, OH 43210. E-mail: {zhum, aleix}@ece.osu.edu. Manuscript received 8 June 2005; revised 4 Nov. 2005; accepted 9 Jan. 2006; published online 13 June 2006. Recommended for acceptance by T. Tan.)

1 INTRODUCTION

Feature extraction via Discriminant Analysis (DA) is one of the most sought-after approaches in computer vision applications [9], [24], [7], [3], [19]. The main advantage of these algorithms is that they can automatically extract a low-dimensional feature representation in which the data can be easily separated according to their class labels [8], [23], [9], [22]. Unfortunately, to date, these techniques have only found success in a limited number of applications. Usually, one technique will work on some problems, while distinct algorithms will be preferred in others. And there still exist problems for which no DA method will successfully find an adequate representation of the data [22], [20].

The reason why each algorithm is only applicable to a limited number of applications lies mainly in the assumptions embedded in each method. For example, the well-known Linear Discriminant Analysis (LDA) algorithm assumes the sample vectors of each class are generated from underlying multivariate Normal distributions of common covariance matrix but different means (i.e., homoscedastic data) [8], [23], [9]. This assumption has restricted the use of LDA considerably. Over the years, authors have defined several extensions to the basic formulation of LDA [9], [11], [10], [2], [16]. One such method is Nonparametric Discriminant Analysis (NDA), where the Normal assumption is relaxed in one of the two metrics [9]. However, NDA still assumes the data in each class can be grouped into a single (nondisjoint) cluster. Hence, if the data of a class is divided into two or more clusters, NDA will not generally work. Another alternative is to use a weighted version of LDA, such as the approximate Pairwise Accuracy Criterion (aPAC) [16] or Penalized DA (PDA) [11].
Here, weights are introduced in the definition of the metrics to reduce (i.e., penalize) the role of the least stable samples (or features). Such weights allow us to deviate slightly from the Normal assumption by making the metrics of DA a bit more flexible. These methods generally outperform LDA in practice because the data is not usually Gaussian. However, these algorithms also assume each class is represented by a single cluster and, therefore, none of the methods defined in this paragraph can solve the problem posed by nonlinearly separable classes. Fig. 1 shows an example. In this example, none of the methods described above would be able to find the one-dimensional feature vector that minimizes the Bayes error.

Fig. 1. In this example, the first class is represented by a single Gaussian distribution, while the second is represented by two separated Gaussians. Linear DA algorithms will not be able to successfully reduce the dimensionality of the original feature space to one because the second class corresponds to two disjoint distributions. One can solve this problem by dividing the second class into two subclasses. Here, $\Sigma_X$ and $\Sigma_B$ represent the sample covariance matrix and the between-subclass scatter matrix (see text).

To solve this, we could use nonlinear methods. Flexible Discriminant Analysis (FDA) [10] attempts to include the idea of nonlinear fitting in DA. To do that, FDA reformulates the DA problem as a regression problem and then uses a nonlinear function (e.g., a polynomial of order larger than two) to fit the data. Alternatively, we could search for a nonlinear function that maps the data into a space of higher dimensionality where the classes are linearly separable. This is the well-known idea behind the use of kernels. Generalized Discriminant Analysis (GDA) [2] is one such method. In GDA, a kernel is first used to embed the data into a space where the data is (hopefully) linearly separable, and then LDA is used to search for those features that best divide the classes. Unfortunately, three main problems prevent the efficient use of such techniques. The first is that of finding the appropriate kernel for each particular problem (i.e., for each set of data distributions). The second problem posed by nonlinear approaches (such as the kernel trick and FDA) is that one generally requires a very large number of samples to successfully apply those algorithms. And, third, nonlinear methods usually have an associated high computational cost, both in training and testing.

Our goal is to propose an algorithm that can adapt to the data at hand. This means we need to find a way to describe a large number of data distributions, regardless of whether these correspond to compact sets or not. One way to do this is to approximate the underlying distribution of each class as a mixture of Gaussians. Such mixtures are flexible enough to represent a large number of those distributions typically encountered in real applications [21], [13].
For example, in Fig. 1, this means we will need one Gaussian to successfully describe the distribution of the first class and two Gaussians to represent the data of the second. Once the data distribution of each class has been approximated using a mixture of Gaussians, it is easy to use the following generalized eigenvalue decomposition to find those discriminant vectors that best (linearly) classify the data,

$$\Sigma_B \mathbf{V} = \Sigma_X \mathbf{V} \Lambda, \qquad (1)$$

where $\Sigma_B$ is the between-subclass scatter matrix, $\Sigma_X$ is the covariance matrix of the data, $\mathbf{V}$ is a matrix whose columns correspond to the discriminant vectors, and $\Lambda$ is a diagonal matrix of corresponding eigenvalues. Since $\Sigma_B$ measures the scatter of the subclass means, we will refer to this method as Subclass Discriminant Analysis (SDA).

It is important to note that the difficulty in (1) is not given by the way we compute the discriminant vectors. The real challenge (and our goal) is to find that division of the classes into a set of subclasses so that the classification in the reduced space of $\mathbf{V}_q$ (where $\mathbf{V}_q$ is a $p \times q$ matrix which maps the original space of $p$ dimensions to one of $q$, for some $q \le p$) is maximized. This paper defines simple criteria that can be used to determine those subclass divisions that optimize classification when used in a linear discriminant setting such as that defined in (1). In particular, we present two criteria. Our first solution uses the leave-one-out test to find the optimal division. Here, the samples left out are used to determine which subdivision works best. Although this is a reasonably good approach, it comes with an associated high computational cost. To solve this problem, we will introduce a second method which takes advantage of the fact that (1) is not guaranteed to work when the smallest angle between the $i$th eigenvector given by the metric to be maximized and the first $i$ eigenvectors given by the metric to be minimized is close to zero [20]. We will show how we can use this known fact to define a fast, easy-to-use criterion.

Note that our approach (defined in the preceding paragraphs) differs from previous algorithms where the EM (Expectation-Maximization) algorithm [5] is first used to estimate the true underlying distribution of each class before using LDA [26], [12], [19]. Our goal is to optimize classification, not to recover the true underlying (but unknown) distribution of the data. Moreover, the methods presented in [26], [12], [19] can only be applied when the number of samples is very large. Otherwise, one cannot efficiently use the EM algorithm to fit a mixture of Gaussian distributions.

In Section 2, we show that SDA can successfully resolve all the problems addressed by the other methods reported in the literature thus far. In that section, we also justify our definition of $\Sigma_B$. Section 3 defines the two criteria used to determine the optimal number of subclasses. Experimental results are in Section 4. We conclude in Section 5.

2 THE METRICS OF DISCRIMINANT ANALYSIS

Most DA algorithms defined thus far are based on the Fisher-Rao criterion [8], [23], which is given by

$$\frac{|\mathbf{v}^T A \mathbf{v}|}{|\mathbf{v}^T B \mathbf{v}|}, \qquad (2)$$

where the matrices $A$ and $B$ are assumed to be symmetric and positive-definite, so that they define a metric.
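Computationally, (1) and (2) reduce to a generalized symmetric-definite eigenvalue problem. As a minimal sketch (our own, not the authors' code), and assuming $\Sigma_B$ and $\Sigma_X$ have already been estimated, the projection matrix $\mathbf{V}_q$ can be obtained as follows; the function name and the small ridge term are our additions:

```python
import numpy as np
from scipy.linalg import eigh

def discriminant_projection(Sigma_B, Sigma_X, q, reg=1e-6):
    """Solve Sigma_B v = lambda * Sigma_X v and return the q leading eigenvectors.

    Sigma_B : (p, p) symmetric matrix to be maximized (e.g., between-subclass scatter).
    Sigma_X : (p, p) symmetric matrix to be minimized (e.g., sample covariance).
    reg     : small ridge added to Sigma_X so it is safely positive-definite.
    """
    p = Sigma_X.shape[0]
    # eigh(A, B) solves the generalized symmetric-definite problem A v = w B v;
    # eigenvalues are returned in ascending order.
    w, V = eigh(Sigma_B, Sigma_X + reg * np.eye(p))
    order = np.argsort(w)[::-1]        # re-sort eigenvalues in descending order
    return V[:, order[:q]]             # columns are the q discriminant vectors
```

In this sketch the eigenvectors are normalized with respect to $\Sigma_X$ (i.e., $\mathbf{V}^T \Sigma_X \mathbf{V} = I$), which is the usual convention for (1).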
It is well known that LDA uses the between- and within-class scatter matrices, $A = S_B$ and $B = S_W$, respectively, in (2), where

$$S_B = \sum_{i=1}^{C} (\mu_i - \mu)(\mu_i - \mu)^T,$$

$C$ is the number of classes, $\mu_i$ the sample mean of class $i$, $\mu$ the global mean (including the samples of all classes),

$$S_W = \frac{1}{n} \sum_{i=1}^{C} \sum_{j=1}^{n_i} (\mathbf{x}_{ij} - \mu_i)(\mathbf{x}_{ij} - \mu_i)^T,$$

$\mathbf{x}_{ij}$ is the $j$th sample of class $i$, and $n_i$ the number of samples in that class.

In the section above, we introduced several extensions to the classical LDA algorithm as defined by Fisher and Rao. Such modifications usually redefine one of these two metrics, $A$ or $B$. For example, NDA uses the following nonparametric version of the between-class scatter matrix for $A$,

$$S_B = \frac{1}{n} \sum_{i=1}^{C} \sum_{j=1}^{n_i} \sum_{\substack{l=1 \\ l \neq i}}^{C} \alpha_{ij}^{l} \left(\mathbf{x}_{ij} - M_{ij}^{l}\right)\left(\mathbf{x}_{ij} - M_{ij}^{l}\right)^T,$$

where $M_{ij}^{l}$ is the mean of the $k$ nearest samples to $\mathbf{x}_{ij}$ belonging to class $l$ ($l \neq i$), and $\alpha_{ij}^{l}$ is a scale factor which prevents the results from being affected by isolated samples located far from the boundary.

Other DA algorithms also rework the definition of the between-class scatter matrix. aPAC is such an algorithm, where the new scatter matrix is a weighted version of $S_B$, i.e., $\sum_{i=1}^{C-1} \sum_{j=i+1}^{C} \omega(d_{ij}) S_{ij}$, where $S_{ij} = (\mu_i - \mu_j)(\mu_i - \mu_j)^T$, $d_{ij}$ is the Mahalanobis distance between classes $i$ and $j$, and $\omega(\cdot)$ is a weight function which makes the contribution of each class pair equal to the classification accuracy (one minus the Bayes error) between these two classes up to an additive constant.

Other methods modify the definition of $B$. For instance, PDA includes a penalizing matrix in the definition of the within-class scatter matrix, $B = S_W + \Omega$. Here, $\Omega$ penalizes those eigenvectors of (2) that are noisy (e.g., by assigning weights that are proportional to the second derivative over the components in each eigenvector).

Fig. 2. (a) A two-class example where the data is embedded in a three-dimensional space; a two-dimensional representation suffices to optimally discriminate between the samples of the two classes. (b) LDA cannot find such a two-dimensional solution because the rank of $S_B$ is one. Here, the horizontal axis corresponds to the one-dimensional space found by LDA; for ease of viewing, the data has also been uniformly distributed about the vertical axis. (c) SDA can recover this two-dimensional discriminant subspace by dividing each of the classes into two subclasses.

Each of these methods solves a specific problem where LDA is known to fail and, therefore, each method will be best when applied to a specific problem. Earlier, we mentioned how SDA can be used to efficiently solve all the problems addressed by each of the methods just defined. For example, SDA can represent each class as a mixture of Gaussians to solve the problem targeted by NDA. And, as illustrated in Fig. 1, SDA can also be used to classify nonlinearly separable classes if one divides each class into a set of disjoint clusters.
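To make the classical definitions given above concrete, here is a small sketch (our own, not part of the paper) that computes $S_B$ and $S_W$ exactly as written; X is an $n \times p$ data matrix and y holds integer class labels:

```python
import numpy as np

def lda_scatter_matrices(X, y):
    """Between-class scatter S_B and within-class scatter S_W as defined in the text.

    X : (n, p) array of samples (one sample per row).
    y : (n,) array of integer class labels.
    """
    n, p = X.shape
    mu = X.mean(axis=0)                       # global mean
    S_B = np.zeros((p, p))
    S_W = np.zeros((p, p))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        d = (mu_c - mu)[:, None]
        S_B += d @ d.T                        # (mu_i - mu)(mu_i - mu)^T
        R = Xc - mu_c
        S_W += R.T @ R                        # sum_j (x_ij - mu_i)(x_ij - mu_i)^T
    S_W /= n                                  # the 1/n factor in the definition of S_W
    return S_B, S_W
```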
Another problem that was not discussed above, but is common to most DA methods, is the deficiency in the rank of $A$. For example, $\mathrm{rank}(S_B) \le C - 1$ and, therefore, LDA, PDA, aPAC, and other DA methods can only extract $C-1$ features from the original feature space. If the classes are linearly separable in the original space, $C-1$ features are guaranteed to be sufficient to discriminate the $C$ classes (assuming that the data distributions are correctly represented by our metrics $A$ and $B$) [9]. In practice, however, the data is rarely linearly separable, and the $C-1$ features obtained are consequently not the optimal ones. To address this problem, some researchers have defined alternative metrics that can be used to extract more than $C-1$ features. These can then be used to define a subspace of larger dimensionality where the data is separable. An example of this is shown in Fig. 2a. In this example, LDA cannot recover the two-dimensional space necessary to successfully discriminate both classes because $\mathrm{rank}(S_B) = 1$; see Fig. 2b. This problem can be solved by defining a metric for $A$ that considers not only the difference between class means but also between their covariances. Heteroscedastic LDA (HLDA) [17] is such a method, where the authors use the Chernoff distance to estimate class similarity based on both means and covariances, i.e., $A$ is redefined as

$$S_C = \sum_{i=1}^{C-1} \sum_{j=i+1}^{C} \left[ \Sigma_{ij}^{-1/2} (\mu_i - \mu_j)(\mu_i - \mu_j)^T \Sigma_{ij}^{-1/2} + 4\left( \log \Sigma_{ij} - \tfrac{1}{2}\log \Sigma_i - \tfrac{1}{2}\log \Sigma_j \right) \right], \qquad (3)$$

where $\Sigma_i$ is the covariance matrix of the samples in class $i$, $\Sigma_{ij}$ is the average of $\Sigma_i$ and $\Sigma_j$, and equal priors have been assumed. This algorithm can now be used to find the two-dimensional subspace where the classes are linearly separable.

Note, however, that by dividing the data of each class into a set of subclasses, we can also define scatter matrices whose ranks are (in general) larger than $C-1$. For instance, following the most traditional definition of a scatter matrix in LDA [8], [23], [9], one could define the between-subclass scatter matrix as [31]

$$\hat{\Sigma}_B = \sum_{i=1}^{C} \sum_{j=1}^{H_i} p_{ij} (\mu_{ij} - \mu)(\mu_{ij} - \mu)^T, \qquad (4)$$

where $H_i$ is the number of subclass divisions in class $i$, and $p_{ij}$ and $\mu_{ij}$ are the prior and mean of the $j$th subclass in class $i$. Such a definition is consistent with the theory of DA. Indeed, further analysis shows that (4) works similarly to LDA in that it simultaneously attempts to maximize the distance between the class means and between the subclass means within the same class. This is formally summarized in the following result.

Result 1. The covariance matrix of the subclass means $\hat{\Sigma}_B$ can be decomposed as $\hat{\Sigma}_B = S_B + Q$, with $Q = \sum_{i=1}^{C} \frac{n_i}{n} Q_i$, where $Q_i$ measures the distance between the means of the subclasses in the same class only, i.e., $Q_i = \sum_{j=1}^{H_i} \frac{n_{ij}}{n_i} (\mu_{ij} - \mu_i)(\mu_{ij} - \mu_i)^T$, and $n_{ij}$ is the number of samples in the $j$th subclass of class $i$.

This result (the proof of which is in Appendix A, available at http://computer.org/tpami/archives.htm) shows that when we use (4) in (1), we will find those basis vectors that maximize the distance between both the class means and the means of the subclasses belonging to the same class. If we have the correct division of classes into their optimal number of subclasses, this definition will work well. To see this, note that the classes only need to be divided into subclasses when these also correspond to distinct clusters in the reduced space. An example of this was illustrated in Fig. 1. In practice, however, the data may sometimes be oversegmented into too many subclasses. This may be due to complex nonlinear distributions, outliers, or other factors. When this happens, it is convenient to define a $\Sigma_B$ that favors the separability of those subclasses that correspond to different classes. Formally,

$$\Sigma_B = \sum_{i=1}^{C-1} \sum_{j=1}^{H_i} \sum_{k=i+1}^{C} \sum_{l=1}^{H_k} p_{ij}\, p_{kl}\, (\mu_{ij} - \mu_{kl})(\mu_{ij} - \mu_{kl})^T, \qquad (5)$$

where $p_{ij} = n_{ij}/n$ is the prior of the $j$th subclass of class $i$.
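For concreteness, the following is a small sketch of how the between-subclass scatter in (5) can be computed (our own naming; `sub` is assumed to assign each sample to a subclass within its class, e.g., via the NN clustering of Section 3.3):

```python
import numpy as np

def between_subclass_scatter(X, y, sub):
    """Between-subclass scatter matrix of (5).

    X   : (n, p) data matrix.
    y   : (n,) class labels.
    sub : (n,) subclass index of each sample within its own class.
    """
    n, p = X.shape
    Sigma_B = np.zeros((p, p))
    # collect (class, prior, mean) for every subclass
    stats = []
    for c in np.unique(y):
        for h in np.unique(sub[y == c]):
            sel = (y == c) & (sub == h)
            stats.append((c, sel.sum() / n, X[sel].mean(axis=0)))
    # accumulate p_ij * p_kl * (mu_ij - mu_kl)(mu_ij - mu_kl)^T over pairs of
    # subclasses that belong to *different* classes
    for a in range(len(stats)):
        for b in range(a + 1, len(stats)):
            if stats[a][0] == stats[b][0]:
                continue
            d = (stats[a][2] - stats[b][2])[:, None]
            Sigma_B += stats[a][1] * stats[b][1] * (d @ d.T)
    return Sigma_B
```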
The advantage of this new definition of between-subclass scatter is that it emphasizes the role of class separability over that of intra-subclass scatter. However, it is important to note that the information of the latter is not lost. Knowing the distances from a distribution $f_1$ to two others, $f_2$ and $f_3$, also provides information on the distance between $f_2$ and $f_3$. Therefore, although indirectly, (5) also measures the distance between the subclasses in the same class. This means that if some subclasses (belonging to the same class) need to be separated, they will be (because the required information has not been lost). However, since (5) favors class separability, only those subclasses that are essential to boost the classification result will actually be separated in the reduced space.

We also note that (5) is equal to $S_B$ when $H_i = 1$ for all $i = 1, \ldots, C$. And, since it can be readily shown that $\mathrm{rank}(\Sigma_B) \le \min(H-1, p)$ (where $H = \sum_{i=1}^{C} H_i$ is the total number of subclass divisions and $p$ is the dimensionality of the range of the covariance matrix), SDA can also solve the problem posed by the deficiency of the rank of $S_B$. An example of this is shown in Fig. 2. In this example, the range of $\Sigma_X$ is three, $p = 3$, the number of classes is two, $C = 2$, and the number of samples is $n = 100$. When using SDA, we can vary the value of $H$ from a low of 2 to a high of 100, which produces a set of solutions with $1 \le \mathrm{rank}(\Sigma_B) \le 3$. One such solution, for $H = 4$, is shown in Fig. 2c, in which case the result is optimal. Therefore, as stated earlier, the goal is now reduced to finding a way to automatically select the optimal number of subclass divisions.

3 OPTIMAL SUBCLASS DIVISIONS

The most convenient criterion to use for the selection of the optimal number of subclasses is the leave-one-out test (LOOT). In this case, we use all but one sample for training ($n-1$ samples) and then test whether our results generalize to the sample left out. Although convenient, this solution is computationally expensive. In this section, we will show that we can also use a recent result which shows that the basis vectors obtained by any generalized eigenvalue decomposition equation are not guaranteed to be correct when the smallest angle between the $i$th eigenvector given by the metric to be maximized (i.e., $A$) and the first $i$ eigenvectors given by the metric to be minimized ($B$) is close to zero. We will conclude this section with a discussion on how to cluster the data into the set of possible subclass divisions we want to analyze with our criteria.

3.1 Leave-One-Out-Test Criterion

For each possible value of $H$ (i.e., number of subclasses), we work as follows. We first divide the data into $H$ subclasses and then calculate the projection matrix $\mathbf{V}_q$ from (1) using all but the $i$th sample of the training data, $X_i = \{\mathbf{x}_1, \ldots, \mathbf{x}_{i-1}, \mathbf{x}_{i+1}, \ldots, \mathbf{x}_n\}$. The sample left out, $\mathbf{x}_i$, is used for testing. This means we will need to repeat this procedure $n$ times, once for each of the samples we can leave out. Now, we let $r_{i,H} = 1$ if we correctly classify $\mathbf{x}_i$ when we use $X_i$ for training; otherwise, we let $r_{i,H} = 0$. The recognition rate for a fixed value of $H$ is then given by

$$R_H = \frac{1}{n} \sum_{i=1}^{n} r_{i,H}.$$

The optimal value of $H$, $H_o$, is

$$H_o = \arg\max_H R_H. \qquad (6)$$

This allows us to compute the optimal $\Sigma_B$ with $H_o$. The resulting projection matrix $\mathbf{V}_q$ is a $p \times q$ matrix whose columns are the eigenvectors associated with the $q$ largest eigenvalues of (1) when $H = H_o$. This is summarized in Algorithm 1.
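As a complement to the pseudocode of Algorithm 1 below, here is a compact sketch of the leave-one-out selection of $H$. It is an illustration under our own assumptions, not the authors' implementation: it reuses `between_subclass_scatter()` and `discriminant_projection()` from the sketches above, takes any user-supplied `split_into_subclasses(X, y, H)` routine for the clustering step of Section 3.3, and uses a 1-NN rule in the reduced space, as in the experiments of Section 4:

```python
import numpy as np

def loot_select_H(X, y, H_values, split_into_subclasses, q):
    """Pick the number of subclasses H that maximizes the leave-one-out rate R_H."""
    n = len(y)
    best_H, best_R = None, -1.0
    for H in H_values:
        correct = 0
        for i in range(n):
            mask = np.arange(n) != i                  # leave sample i out
            Xi, yi = X[mask], y[mask]
            sub = split_into_subclasses(Xi, yi, H)    # H subclasses in total
            Sigma_B = between_subclass_scatter(Xi, yi, sub)
            Sigma_X = np.cov(Xi, rowvar=False)
            Vq = discriminant_projection(Sigma_B, Sigma_X, q)
            Z, z = Xi @ Vq, X[i] @ Vq                 # project onto the reduced space
            nearest = np.argmin(np.linalg.norm(Z - z, axis=1))
            correct += int(yi[nearest] == y[i])       # 1-NN classification of x_i
        R_H = correct / n
        if R_H > best_R:
            best_H, best_R = H, R_H
    return best_H
```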
Since $H_o$ is chosen according to how well our classifier generalizes to new samples, this leave-one-out-test criterion generally gives an appropriate value for $H$.

Algorithm 1 SDA_loot
  Initialization: $R_H = 0$ for all $H$.
  for $i = 1$ to $n$ do
    Generate the training set $X_i = \{\mathbf{x}_1, \ldots, \mathbf{x}_{i-1}, \mathbf{x}_{i+1}, \ldots, \mathbf{x}_n\}$ using the NN clustering defined in Algorithm 3.
    Calculate $\Sigma_X$ using $X_i$.
    for $H = C$ to $tC$ (*) do
      Calculate $\Sigma_B$ using (5).
      Perform the eigenvalue decomposition $\Sigma_X^{-1} \Sigma_B \mathbf{V} = \mathbf{V} \Lambda$.
      Project the data onto the subspace of $\mathbf{V}_q$, i.e., $Y_i = \mathbf{V}_q^T X_i$.
      if the sample $\mathbf{x}_i$ is classified correctly then increment $R_H$.
    end for
  end for
  $H_o = \arg\max_H R_H$.
  Calculate $\Sigma_B$ and $\Sigma_X$ using all the training data and $H_o$. The final projection matrix $\mathbf{V}_q$ is given by the first $q$ columns of $\mathbf{V}$, where $\Sigma_X^{-1} \Sigma_B \mathbf{V} = \mathbf{V} \Lambda$.
  (*) The value of $t$ can be specified by the user or be set so as to guarantee that the minimum number of samples per subclass is sufficiently large.

3.2 Stability Criterion

Our second approach is based on a stability criterion which we recently introduced in [20]. In particular, we showed that the success of any linear method, such as that given in (1), does not depend on whether the data is homoscedastic or not. We demonstrated that the results of the linear methods based on (2) are not guaranteed to be correct when the smallest angle between the $i$th eigenvector of $A$ and the first $i$ eigenvectors of $B$ is close to zero. We will now show how we can use this result to find the optimal value of $H$. This will lead to the design of a low-cost algorithm that will facilitate the use of SDA in a larger number of applications.

Fig. 3. (a) In this example, the first basis vector obtained with (1) does not minimize the classification error. Note that $\mathbf{u}_1$ (which is orthogonal to $\mathbf{v}_1$) would be the optimal and desirable solution. Close analysis of this figure shows that the angle between the first eigenvectors of $\Sigma_B$ and $\Sigma_X$ is zero; this means that it is not possible to maximize the metric given by $\Sigma_B$ and minimize that of $\Sigma_X$ simultaneously. (b) A classical example where most linear equations perform as expected. Note that in this case the angle between the first eigenvectors of $\Sigma_B$ and $\Sigma_X$ is large (i.e., $90^\circ$). This is formally computed by (7).

To understand why an algorithm based on a generalized eigenvalue decomposition equation (such as (1)) may not produce the most desirable result, we need to go back to the Fisher-Rao criterion given in (2). Note that in this equation we want to simultaneously maximize the measure given by the scatter matrix $A$ (which in our case is $\Sigma_B$) and minimize that computed by $B$ (in our case, $B = \Sigma_X$). Unfortunately, there may be cases where this cannot be accomplished in parallel. This problem is illustrated in Fig. 3. In this example, $\mathbf{u}_i$ ($\mathbf{w}_i$, respectively) is the eigenvector of $\Sigma_X$ ($\Sigma_B$, respectively) associated with the $i$th largest eigenvalue. In Fig. 3a, we show an example where (1) fails. This is so because, in this particular case, it is not possible to maximize $\Sigma_B$ and minimize $\Sigma_X$ simultaneously. As seen in the figure, one would want to select $\mathbf{u}_2$ as a solution to minimize $\Sigma_X$ but $\mathbf{w}_1$ to maximize $\Sigma_B$. Since it is impossible to satisfy both, (1) will choose the direction $\mathbf{v}_1$ associated with the largest relative variance as given by (2). In Fig. 3a, this means that we will select $\mathbf{u}_2$ as our solution, which leads to an incorrect classification.
Fortunately, this problem can be easily detected, because when it happens the angle between the first eigenvectors of $\Sigma_B$ and $\Sigma_X$ is small (e.g., in Fig. 3a it is zero). This can be formally computed as [20]

$$K = \sum_{i=1}^{m} \sum_{j=1}^{i} \left( \mathbf{u}_j^T \mathbf{w}_i \right)^2, \qquad (7)$$

where $m < \mathrm{rank}(\Sigma_B)$. As argued above, and as shown in Fig. 3, one wants the value of $K$ defined above to be as small as possible. For example, in Fig. 3b we illustrate a case where the cosine of the angle between $\mathbf{w}_1$ and $\mathbf{u}_1$ is zero and, therefore, the result $\mathbf{v}_1$ is correct. This is so because now the first eigenvector of $\Sigma_B$ and the last of $\Sigma_X$ agree, i.e., $\mathbf{w}_1 = \mathbf{u}_2$. That is to say, one would always want the ideal case where the two metrics (that to be maximized and that to be minimized) agree. To achieve this, we can define a method that divides each class into that number of subclasses which minimizes (7).

To use this criterion efficiently, we need to compute the value of (7) for each of the possible values of $H$ and then select the minimum. This means we will first calculate

$$K_H = \sum_{i=1}^{m} \sum_{j=1}^{i} \left( \mathbf{u}_j^T \mathbf{w}_{H,i} \right)^2, \qquad (8)$$

where $\mathbf{w}_{H,i}$ is the $i$th eigenvector of $\Sigma_B$ when the data is divided into a total of $H$ subclasses. The problem of selecting $H$ now reduces to finding the $H_o$ which minimizes (8).

It is important to note that the larger the value of $H$, the larger the number of eigenvectors in $\Sigma_B$ (or, equivalently, the larger $m$ can be). Unfortunately, when this happens, the number of summands in (8) increases. This means that $K_H$ will generally grow with $m$, which calls for a normalization step. From elementary geometry, we know that

$$\sum_{j=1}^{p} \left( \mathbf{u}_j^T \mathbf{w}_{H,i} \right)^2 = \cos^2 \theta_{i,1} + \cos^2 \theta_{i,2} + \cdots + \cos^2 \theta_{i,p} = 1,$$

where $\theta_{i,j}$ is the angle between $\mathbf{u}_j$ and $\mathbf{w}_{H,i}$ and $p = \mathrm{rank}(\Sigma_X)$. In other words, the inner summation in (8) is always guaranteed to be bounded by zero and one, $\sum_{j=1}^{i} (\mathbf{u}_j^T \mathbf{w}_{H,i})^2 \in [0, 1]$.

The above argument shows that (8) depends on the value of $m$: the larger $m$ is, the larger (8) can be. To prevent our algorithm from being biased toward small values of $H$, we need to normalize (8) by $1/m$. Our optimal number of subclasses is therefore given by

$$H_o = \arg\min_H \frac{1}{m} K_H. \qquad (9)$$

This second criterion is summarized in Algorithm 2. The complexity of this algorithm is of the order of $n^2 p + l p^3$, while that of our LOOT-based SDA was larger than $n^3 p + t n p^3$ (recall that $n$ is the number of samples and $p$ is the dimensionality of the range of $\Sigma_X$). Note that, regardless of the values of $t$ and $l$, the complexity of SDA using the stability criterion is much smaller than that of the leave-one-out test. This complexity can be further constrained by the clustering algorithm used to separate the classes into subclasses, which is defined next.

Algorithm 2 SDA_stability
  Sort the training samples using the NN clustering defined in Algorithm 3.
  Calculate $\Sigma_X$. Calculate $\Sigma_X \mathbf{U} = \mathbf{U} \Lambda_X$.
  for $H = C$ to $lC$ (*) do
    Calculate $\Sigma_B$. Compute $\Sigma_B \mathbf{W} = \mathbf{W} \Lambda_B$.
    Calculate $K_H$ using (8).
  end for
  $H_o = \arg\min_H \frac{1}{m} K_H$.
  Calculate $\Sigma_B$ with $H = H_o$. The final projection matrix $\mathbf{V}_q$ is given by the first $q$ columns of $\mathbf{V}$, where $\Sigma_X^{-1} \Sigma_B \mathbf{V} = \mathbf{V} \Lambda$.
  (*) Again, the value of $l$ can be specified by the user or be set so as to guarantee that the minimum number of samples per subclass is sufficiently large.
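A minimal sketch of the normalized stability score in (8)-(9) follows (our own helper names, not the authors' code; `eigvecs_desc` simply returns eigenvectors sorted by decreasing eigenvalue):

```python
import numpy as np

def eigvecs_desc(S):
    """Eigenvectors of a symmetric matrix, sorted by decreasing eigenvalue."""
    w, V = np.linalg.eigh(S)
    return V[:, np.argsort(w)[::-1]]

def stability_score(Sigma_B, Sigma_X, m):
    """Normalized K_H of (8)-(9): smaller is better (the two metrics 'agree')."""
    U = eigvecs_desc(Sigma_X)          # u_j: eigenvectors of the metric to minimize
    W = eigvecs_desc(Sigma_B)          # w_i: eigenvectors of the metric to maximize
    K = 0.0
    for i in range(m):                 # i = 1..m, with m < rank(Sigma_B)
        for j in range(i + 1):         # j = 1..i
            K += float(U[:, j] @ W[:, i]) ** 2
    return K / m                       # the 1/m normalization of (9)

# Selecting H then amounts to computing Sigma_B for each candidate H and
# keeping the value of H with the smallest stability_score.
```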
3.3 Dividing Classes into Subclasses

Thus far, we have defined two criteria that can be used to determine the optimal number of subclass divisions. We now turn to another related problem: that of deciding how to divide each class into a set of possible subclasses.

Fig. 4. (a) The two original Gaussian distributions of the data, as given by the mean and covariance matrix of the samples in each of the two classes. (b) Each class is now described using two Gaussian distributions, estimated using the NN-based approach summarized in Algorithm 3.

For computational reasons, it is quite obvious that we cannot test every possible division of $n$ samples into $H$ subclasses for every possible value of $H$. However, when these subclasses are assumed to be Gaussian, their samples will cluster together. This is easy to prove from the known fact that Gaussian distributions are convex. This means we can use a clustering approach to divide the samples in each class into a set of subclasses (i.e., clusters). And, although the number of clustering methods available is very large, we want one that can work efficiently when the number of samples in each class is either large or small. This means that the K-means algorithm or the estimation of a Gaussian mixture is not adequate, because such methods do not work properly when the number of samples is small. When this happens, one is usually left to use nonparametric methods such as those based on the Nearest Neighbor (NN) rule [6]. In this section, we define an NN-based algorithm to be used as the clustering method in SDA. We will nonetheless compare our results to those obtained with K-means and Gaussian mixtures. We will see that the classification results obtained when one uses our NN-clustering algorithm are always superior or equivalent to the ones obtained when we use K-means or the mixture of Gaussians instead.

The reader may have already noticed that our clustering method is nonparametric, whereas the other two algorithms mentioned above (K-means and Gaussian mixture) are parametric. Hence, we will also compare our results to those obtained with the nonparametric clustering (valley-seeking) algorithm of Koontz and Fukunaga [15]. In this case, we will show that the classification results obtained with our method and that of Koontz and Fukunaga are equivalent. However, our method presents two important advantages: 1) while in the clustering approach of Koontz and Fukunaga one needs to optimize a parameter (which controls the size of the local area where computations are carried out), our method is parameter-free, and 2) the algorithm of Koontz and Fukunaga is iterative in nature and, therefore, associated with a large computational cost, and it is not guaranteed to converge.

3.3.1 NN Clustering

Our Nearest Neighbor-based clustering algorithm first sorts the vectors in class $i$ so that the set $\{\mathbf{x}_{i1}, \ldots, \mathbf{x}_{i n_i}\}$ is constructed as follows: $\mathbf{x}_{i1}$ and $\mathbf{x}_{i n_i}$ are the two most distant feature vectors of class $i$, i.e., the Euclidean distance between these two vectors is the largest, $\arg\max_{j,k} \|\mathbf{x}_{ij} - \mathbf{x}_{ik}\|_2$, where $\|\mathbf{x}\|_2$ is the 2-norm of $\mathbf{x}$. $\mathbf{x}_{i2}$ is the feature vector closest to $\mathbf{x}_{i1}$, and $\mathbf{x}_{i(n_i-1)}$ is the closest to $\mathbf{x}_{i n_i}$. In general, $\mathbf{x}_{ij}$ is the $(j-1)$th feature vector closest to $\mathbf{x}_{i1}$, and $\mathbf{x}_{i(n_i-j)}$ is the $j$th closest to $\mathbf{x}_{i n_i}$. We can now readily use this sorted set $\{\mathbf{x}_{i1}, \ldots, \mathbf{x}_{i n_i}\}$ to divide the data in each class into $H_i$ subclasses. For example, we can divide the data into two equally balanced clusters (in the sense of having the same number of samples), $H_i = 2$, by simply partitioning this set into two parts, $\{\mathbf{x}_{i1}, \ldots, \mathbf{x}_{i\lfloor n_i/2 \rfloor}\}$ and $\{\mathbf{x}_{i(\lfloor n_i/2 \rfloor+1)}, \ldots, \mathbf{x}_{i n_i}\}$. Or, more generally, we can divide each class into $h$ (equally balanced) subclasses, i.e., $H_i = h$ for all $i$. This is suitable for those cases where 1) the underlying distribution of each of the classes is not Gaussian but can be represented as a combination of two or more Gaussians, or 2) the classes are not separable but the subclasses are. A sketch of this sorting-and-splitting procedure is given below.
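The following is a minimal implementation sketch of the NN sorting and equal-size splitting described above (our own code, corresponding to Algorithm 3 but not taken from the paper; it assumes at least two samples per class):

```python
import numpy as np

def nn_sort_class(Xc):
    """Sort the samples of one class as described in Section 3.3.1.

    The two most distant samples become the first and last elements; the sorted
    set is then grown by repeatedly taking the nearest unused neighbor of each
    of the two extreme samples.
    """
    D = np.linalg.norm(Xc[:, None, :] - Xc[None, :, :], axis=2)   # pairwise distances
    s, b = np.unravel_index(np.argmax(D), D.shape)                # most distant pair
    used = {s, b}
    front, back = [s], [b]
    while len(used) < len(Xc):
        for endpoint, side in ((s, front), (b, back)):
            cand = [j for j in range(len(Xc)) if j not in used]
            if not cand:
                break
            j = cand[int(np.argmin(D[endpoint, cand]))]           # nearest unused sample
            side.append(j)
            used.add(j)
    return Xc[front + back[::-1]]                                  # sorted set {x_i1, ..., x_ini}

def subclass_labels(n_samples, h):
    """Label the n (sorted) samples of one class with h balanced subclass indices."""
    labels = np.empty(n_samples, dtype=int)
    for k, idx in enumerate(np.array_split(np.arange(n_samples), h)):
        labels[idx] = k
    return labels
```

Applying `nn_sort_class` and `subclass_labels` class by class yields the per-sample subclass assignment assumed by the earlier scatter-matrix sketch.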
Fig. 4 shows an example. In Fig. 4a, we show the original clustering given by the labels associated with each of the samples. Then, after sorting the data with the algorithm just described (and summarized in Algorithm 3) and dividing the data in each class into two clusters (i.e., Gaussians), we obtain the result shown in Fig. 4b.

Algorithm 3 NN clustering
  for $i = 1$ to $C$ do
    Let $\hat{X}_i = \{\hat{\mathbf{x}}_{i1}, \ldots, \hat{\mathbf{x}}_{i n_i}\}$ be the samples in class $i$.
    Construct the matrix $D = \{d_{jk}\}$, where $d_{jk}$ is the Euclidean distance between samples $j$ and $k$, i.e., $d_{jk} = \|\hat{\mathbf{x}}_{ij} - \hat{\mathbf{x}}_{ik}\|_2$.
    Let $\mathbf{x}_{i1} = \hat{\mathbf{x}}_{is}$ and $\mathbf{x}_{i n_i} = \hat{\mathbf{x}}_{ib}$, where $[s, b] = \arg\max_{j,k} d_{jk}$.
    for $g = 1$ to $\lfloor n_i/2 \rfloor$ (*) do
      Let $\mathbf{x}_{i(g+1)} = \hat{\mathbf{x}}_{im}$, where $m = \arg\min_{j \ne s} d_{sj}$. Set $d_{sm} = \infty$.
      Let $\mathbf{x}_{i(n_i - g)} = \hat{\mathbf{x}}_{ik}$, where $k = \arg\min_{j \ne b} d_{bj}$. Set $d_{bk} = \infty$.
    end for
    Divide the resulting sorted set $\{\mathbf{x}_{i1}, \ldots, \mathbf{x}_{i n_i}\}$ into $H_i$ subclasses. In doing this, try to keep the same number of samples in each subclass.
  end for
  (*) When $n_i \bmod 2 \ne 0$, there will be one sample left to be sorted; this obviously corresponds to $\mathbf{x}_{i, \lfloor n_i/2 \rfloor + 1}$.

In our method, we have assumed that every class is divided into the same number of subclasses $h$ at any given time. Although this may seem simplistic, it is based on the fact that we can use two Gaussian distributions to represent the data generated by a single Gaussian (but not vice versa). Moreover, recall that we already took care of the problems caused by oversegmentation when we defined $\Sigma_B$ in Section 2; our definition ensures that the between-subclass scatter is only used when needed.

4 EXPERIMENTAL RESULTS

DA algorithms have been applied to a large variety of problems, especially in computer vision. In this section, we will first show results on three data sets of the UCI repository available at [1]. We will then show how our algorithm applies to two particular problems in computer vision: 1) categorization of images of objects and 2) face recognition. We compare our results to LDA [23], [8], Direct LDA (DLDA) [28], Nonparametric DA (NDA) [9], Heteroscedastic LDA (HLDA) [17], Kernel LDA (K-LDA) [2], [27], and Complete Kernel Fisher Discriminant analysis (CKFD) [25]. In K-LDA, we have used a second-degree polynomial kernel, $k(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T \mathbf{y})^2$, which we will refer to as the KP-LDA method, and a Gaussian kernel, $k(\mathbf{x}, \mathbf{y}) = \exp(-\|\mathbf{x} - \mathbf{y}\|^2 / \sigma)$, which we name KG-LDA. Note that in KG-LDA we need to specify the value of $\sigma$; this value is optimized for each of the data sets by means of cross-validation. The polynomial kernel used in CKFD is $k(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T \mathbf{y} + 1)^a$, where $a$ is also estimated through cross-validation. We refer to this method as P-CKFD. Similarly, the Gaussian kernel defined above is used to construct G-CKFD.
We also note that both NDA and HLDA have a free parameter to be specified by the user. In NDA, this is the number of nearest neighbors $K$. In HLDA, it is the number of dimensions of the reduced space. Since the results vary depending on the value of such parameters, we have decided to calculate the results of NDA and HLDA for a typical range of values and then report the mean and standard deviation of these. For the NDA case, this means we vary $K$ from a low of 5 to a maximum of $n_0 - 1$, where $n_0 = \min_i n_i$ and $n_i$ is the number of samples in class $i$. In HLDA, we vary the number of dimensions from a low of $C-1$ (which is the value used by LDA) to $\min(C + 40, p - 1)$.

4.1 UCI Data Sets

The first of the databases used was the "Wisconsin Diagnostic Breast Cancer" (WDBC) set. In this database, we have 30 parameters defining the shape and texture of the nucleus of the cells to be analyzed. The task is to discriminate between benign and malignant cells. The 569 samples available are randomly divided into a training set of 285 samples and a testing set of 284 instances. This guarantees that the sample-to-dimension ratio is appropriate. The discriminant methods mentioned above are first used to find a low-dimensional representation of the data and, then, the nearest neighbor algorithm is used to classify each of the testing samples according to the class label of the closest training vector. These results are summarized in Table 1. The results shown for NDA and HLDA correspond to the average and standard deviation (in parentheses) over the range of values of their parameters, as described above.

TABLE 1. Results on the UCI data sets: classification results (in percent) obtained using the nearest neighbor rule in the reduced spaces given by each of the methods. Results for NDA and HLDA are obtained by averaging over a large set of possible parameters; values shown in parentheses correspond to the standard deviation from such means.

The second data set used from the UCI database was the "Landsat Satellite Data" (LSD). In this case, the data is already divided into a training set of 4,435 samples and a testing set of 2,000 instances. Each sample is a feature vector of 36 dimensions which encodes the spectral information of the sensors located on a satellite. Here, the goal is to classify the data into six different land types. The results are also summarized in Table 1.

The final test shown in Table 1 was obtained using the "Multifeature Digit Data set" (MDD). In MDD, the task is to classify each of the images of handwritten digits into one of the 10 possible classes. These digits can be represented using a variety of feature sets. The first one, referred to as "MDD-fou," represents the digits using 76 Fourier coefficients of the image. The second, "MDD-fac," uses a set of 216 profile correlations to describe each of the images. The third, "MDD-kar," is based on a PCA representation of the image pixels. The fourth, "MDD-pix," uses the averages of a window of two by three pixels scanned over all the possible image locations, which generates feature vectors of 240 dimensions. And, fifth, Zernike moments are used to obtain a feature representation of 47 dimensions, referred to as "MDD-zer."

We see that the results produced by the two criteria for SDA defined in this paper are equivalent. Furthermore, these are among the best. In particular, KG-LDA is probably the most comparable to SDA.
However, the computational time required to train and test the KG-LDA algorithm on the UCI data sets defined above was about one order of magnitude (or more) over that of SDA_stability. In Table 1, we have also shown the results obtained with CKFD (with a polynomial and a Gaussian kernel). The reason for showing these results is that this method uses cross-validation to optimize the three parameters of the algorithm: the kernel variable, the dimensionality of the reduced space generated by the algorithm, and a fusion coefficient whose task is to combine the results obtained in the range and null spaces of $S_W$. Again, we see that both implementations of SDA are generally equivalent to those of CKFD. However, the computational cost of CKFD is over an order of magnitude larger than the complexity of SDA_stability.

Our implementations of SDA automatically divide the data into a set of subclasses. Note that each algorithm, SDA_stability and SDA_loot, may divide the data differently. In Table 2, we show the number of subclass divisions per class automatically obtained by each of the two criteria defined in this paper, i.e., the value of $H_o/C$. Although each criterion may result in slightly distinct subclass divisions, the classification accuracy does not vary much, as shown in Table 1. Recall that our goal was not to find the division of the data which best approximates the true underlying distribution of each class. Rather, we want to find the division which best classifies the data in the reduced space. We may hence conclude that each criterion is able to find a division (albeit a different one) that generates good results.

TABLE 2. Number of subclass divisions per class obtained by each of the two implementations of SDA, i.e., $H_o/C$.

4.2 Object Categorization

A classical problem in computer vision is to classify a set of objects into a group of known categories (e.g., apples, cars, etc.). In our experiments, we have used the ETH-80 database, which contains images of the following categories: apples, pears, cars, cows, horses, dogs, tomatoes, and cups [14]. Examples of these are shown in Fig. 5a. Each category includes the images of 10 objects (e.g., 10 different cows, as shown in Fig. 5b) photographed at a total of 41 orientations. This gives us a total of 410 images per category.

Fig. 5. (a) An example image for each of the eight categories in the ETH-80 database [14]. (b) A sample image for each of the 10 different cows in this database.

4.2.1 Feature-Based Classification

In this approach, we first obtain the silhouette of the object (i.e., the contour that separates the object from the background), Fig. 6b. Then, we compute the centroid of this silhouette and, finally, calculate the length of equally separated lines that connect the centroid with the silhouette; see Fig. 6c. In our experiments, we obtain a total of 300 distances, which generates a feature space of 300 dimensions.

Fig. 6. (a) A cow seen from one of the 41 different orientations. (b) The silhouette of this cow. (c) Each line measures the distance from the centroid to several points on the silhouette.

To test the classification accuracy of each method, we use the leave-one-object-out test. This means that we use 79 of the 80 objects for training and one for testing. Since there are 80 ways in which we can leave one object out, we repeat the test eighty times and compute the average. These results are summarized in Fig. 7a. Since we used the leave-one-object-out test, the value of $H_o$ varied each time. For the SDA_stability algorithm, the average value of $H_o$ was 79, which means we would have an average of about 42 samples per subclass. For SDA_loot, this average was 25.
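As an illustration of the feature-based representation just described, the centroid-to-contour distances can be sketched as follows. This is our own sketch, not the authors' code; it assumes a binary silhouette mask as input and takes the farthest contour point in each angular bin as one simple reading of "equally separated lines":

```python
import numpy as np

def radial_signature(mask, n_rays=300):
    """Distances from the silhouette centroid to the contour at equally spaced angles.

    mask : 2D boolean array, True inside the object silhouette.
    Returns an (n_rays,) feature vector, as in the 300-dimensional descriptor above.
    """
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()                       # centroid of the silhouette
    # contour pixels: object pixels with at least one background 4-neighbor
    padded = np.pad(mask, 1)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:])
    cy_c, cx_c = np.nonzero(mask & ~interior)
    ang = np.arctan2(cy_c - cy, cx_c - cx)              # angle of each contour pixel
    dist = np.hypot(cy_c - cy, cx_c - cx)               # distance to the centroid
    # for each of n_rays equally spaced directions, keep the farthest contour point
    bins = ((ang + np.pi) / (2 * np.pi) * n_rays).astype(int) % n_rays
    sig = np.zeros(n_rays)
    np.maximum.at(sig, bins, dist)
    return sig
```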
4.2.2 Appearance-Based Classification

For comparison purposes, we have also considered the classification of the objects in the ETH-80 database using a simple (pixel) appearance-based approach. Here, we first crop the subimage corresponding to the smallest rectangle that circumscribes the silhouette of the object. Since distinct images will generate croppings of different sizes, we resize each of the resulting subimages to a standard size of 25 by 30 pixels, i.e., a 750-dimensional feature space. Because we have a total of 3,239 samples for training, the samples-to-features ratio is still adequate. Again, we use the leave-one-object-out test to obtain our results, which are summarized in Fig. 7b. In this case, the average value of $H_o$ was 80 when we used SDA_stability and 27 when using SDA_loot.

In the two experiments shown above using the ETH-80 database, we see that SDA always yields the highest recognition rate. Furthermore, SDA gives very similar results for both the feature- and appearance-based methods. Previous results reporting the superiority of one method over the other might have been due to the DA approach used, rather than to the discriminant information contained in each feature space.

4.3 Face Recognition

In our last experiment, we used the AR-face database [18], which consists of frontal-view face images of over 100 people. Each person was photographed under different lighting conditions and with distinct facial expressions, and some images have partial occlusions (sunglasses or scarf). A total of 13 images were taken in each of two sessions, i.e., 26 images per person; the sessions were separated by an interval of two weeks. We used the 13 images of the first session (shown in Fig. 8a) of 100 randomly selected individuals for training, i.e., 1,300 training images. The 13 images of the second session (shown in Fig. 8b) were used for testing. In this case, we first crop the face part of the image (without including hair or background) and then resize all images to a standard size of 29 by 21 pixels. This yields a 609-dimensional feature space. During the cropping process, we also align all faces to have a common position for the eyes and mouth (since this is known to improve accuracy). The results are shown in Fig. 7c. In this test, both methods resulted in $H_o = 100$.

4.4 Comparison of the Results on Different Clustering Methods

As mentioned earlier, there are many clustering algorithms that could have been used to divide the data in each class into a set of subclasses. We now show that the classification results obtained using our NN-based approach (defined in Section 3.3) are comparable to those obtained with the K-means and the valley-seeking algorithms. To show this, we have rerun all the experiments described above, but now using the K-means and the valley-seeking algorithms to divide the data into subclasses. These algorithms are referred to as SDA_Kmeans and SDA_valley. The classification accuracy for each of these implementations of SDA is shown in Table 3.

Note that while our NN-based clustering approach always generates the same partition of the data, this is not the case for the K-means and valley-seeking algorithms. On the one hand, K-means is an iterative approach that requires an initial guess (which is usually given at random); different initializations result in distinct partitions.
On the other hand, the valley-seeking algorithm requires that we manually choose a window size over which local computations are to be conducted; different window sizes will produce distinct clustering results. To make a fair comparison, we have randomly initialized the K-means 10 times and then computed the average and standard deviation (shown in parentheses in the table). Similarly, the valley-seeking algorithm was run for several window sizes, and the average and standard deviation are shown. Finally, the results labeled SDA_NN in Table 3 are those obtained with our NN-clustering method and the stability criterion derived earlier. The results of the SDA_Kmeans method were also obtained using our stability criterion; this means we tried several values of $H$ for clustering and then selected the optimal one according to (9). The valley-seeking algorithm works differently: for a fixed window size, the value of $H_o$ is automatically given by the clustering method, and therefore there is no need for the optimization step. Nonetheless, one may want to optimize the window size, because different sizes will result in very different partitions. We can do this by means of the LOOT or stability criteria defined in this paper. For example, we could define a new algorithm which uses our stability criterion to optimize the window size of the valley-seeking method; we refer to this as SDA_VS, for Valley-Stability (VS). The results obtained with this new approach are summarized in Table 4. We see that this new algorithm yields classification rates similar to those of our previous methods, SDA_loot and SDA_stability. This illustrates how the results described in this paper can be extended or used to define novel algorithms. For example, we could use our results to efficiently find the optimal division in the methods defined in [4], [29].

Fig. 7. (a) Results of the categorization problem using the feature-based approach. (b) Results of the categorization problem using the appearance-based approach. (c) Face recognition results using an appearance-based approach.

TABLE 4. Successful classification rates as given by the VS (Valley-Stability) implementation of SDA.

We would now like to return to the possibility of estimating the underlying distribution of the data using a mixture of Gaussians. Unfortunately, and as already anticipated, we were unable to obtain such results because the EM algorithm requires inverting every sample covariance matrix. Of the databases used above, it is only possible to calculate such inverses on the WDBC set. In this case, the result obtained using a mixture of Gaussians (estimated with the EM algorithm) and our stability criterion was 95 percent, which is comparable to those reported above. This method would, however, be restricted to applications where the number of samples per class is much larger than the dimensionality of the data.

Fig. 8. Facial images from the AR-face database with different expressions and different occlusions.

TABLE 3. SDA_Kmeans and SDA_valley: classification results when clustering the data in each class with the K-means and the valley-seeking algorithms.

As a final note, we should mention that the algorithms presented in this paper assume there is a sufficiently large number of samples per class. When the number of samples per class is small, the classes cannot be reliably divided into subclasses. In this case, SDA will not generally divide the data into subclasses and, hence, will be equivalent to LDA.
To show this, we ran SDA_stability using the images of the first session of the AR-face database, but using only four randomly chosen images per person for training; the remaining nine were used for testing. This process was repeated five times. In all cases, SDA_stability was equivalent to LDA, i.e., $H_o = 100$.

5 CONCLUSIONS

Over the years, many discriminant analysis algorithms have been proposed, each targeted to a specific type of data distribution. Since the type of distribution is usually unknown in practice, one was left to test each of the existing methods before choosing the one that worked best on the data at hand. This is, however, very time consuming and does not guarantee that we find the algorithm that best adapts to our problem. To remedy this, we have proposed a new method whose goal is to adapt to a large variety of data distributions. Rather than working with complex nonlinear methods (which require a large number of training samples to work properly and usually have a large computational cost), we have shown how we can solve this by dividing each class into a set of subclasses.

In Section 2, we showed how such subclass divisions can be used to adapt to different types of class distributions and how this allows us to find the most adequate subspace representation. This led us to define the new between-subclass scatter matrix given in (5). The main problem that needed to be addressed next was to define an easy-to-use criterion that could provide a good estimate of the number of subclasses needed to represent each class pdf. In Section 3, we defined two such criteria. Our first criterion is based on the idea of the leave-one-sample-out estimate: the best partition is estimated using a cross-validation test on the training set. Although this was shown to be a reasonable approach, it carries an associated high computational cost. To solve this, we introduced a second criterion which takes advantage of the fact that any generalized eigenvalue decomposition equation (such as that used by our method) is not guaranteed to work when the smallest angle between the $i$th eigenvector given by the metric to be maximized and the first $i$ eigenvectors given by the metric to be minimized is close to zero.

We have shown experimental results on several data sets from the UCI repository (each with features of very distinct sorts) and on two computer vision applications (object categorization and face recognition). Our results show that the two implementations of SDA yield the highest classification rates.

ACKNOWLEDGMENTS

The authors would like to thank the referees for their comments. They also thank Bastian Leibe and Bernt Schiele for making the ETH-80 database available to them. This research was partially supported by the US National Institutes of Health under grant R01 DC 005241.

REFERENCES

[1] C.L. Blake and C.J. Merz, "UCI Repository of Machine Learning Databases," Dept. of Information and Computer Sciences, Univ. of California, Irvine, http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
[2] G. Baudat and F. Anouar, "Generalized Discriminant Analysis Using a Kernel Approach," Neural Computation, vol. 12, pp. 2385-2404, 2000.
[3] P.N. Belhumeur, J.P. Hespanha, and D.J. Kriegman, "Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711-720, July 1997.
[4] F. De la Torre and T. Kanade, "Oriented Discriminant Analysis," Proc. British Machine Vision Conf., Sept. 2004.
[5] A.P. Dempster, N.M. Laird, and D.B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," J. Royal Statistical Soc., vol. 30, no. 1, pp. 1-38, 1977.
[6] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. Springer-Verlag, 1996.
[7] K. Etemad and R. Chellapa, "Discriminant Analysis for Recognition of Human Face Images," J. Optical Soc. Am. A, vol. 14, no. 8, pp. 1724-1733, 1997.
[8] R.A. Fisher, "The Statistical Utilization of Multiple Measurements," Annals of Eugenics, vol. 8, pp. 376-386, 1938.
[9] K. Fukunaga, Introduction to Statistical Pattern Recognition, second ed. Academic Press, 1990.
[10] T. Hastie, R. Tibshirani, and A. Buja, "Flexible Discriminant Analysis by Optimal Scoring," J. Am. Statistical Assoc., vol. 89, pp. 1255-1270, 1994.
[11] T. Hastie, A. Buja, and R. Tibshirani, "Penalized Discriminant Analysis," Annals of Statistics, vol. 23, pp. 73-102, 1995.
[12] T. Hastie and R. Tibshirani, "Discriminant Analysis by Gaussian Mixtures," J. Royal Statistical Soc. B, vol. 58, pp. 155-176, 1996.
[13] A.K. Jain, R.P.W. Duin, and J. Mao, "Statistical Pattern Recognition: A Review," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 4-37, Jan. 2000.
[14] B. Leibe and B. Schiele, "Analyzing Appearance and Contour Based Methods for Object Categorization," Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2003.
[15] W.L.G. Koontz and K. Fukunaga, "A Nonparametric Valley-Seeking Technique for Cluster Analysis," IEEE Trans. Computers, vol. 21, pp. 171-178, 1972.
[16] M. Loog, R.P.W. Duin, and T. Haeb-Umbach, "Multiclass Linear Dimension Reduction by Weighted Pairwise Fisher Criteria," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 7, pp. 762-766, July 2001.
[17] M. Loog and R.P.W. Duin, "Linear Dimensionality Reduction via a Heteroscedastic Extension of LDA: The Chernoff Criterion," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 6, pp. 732-739, June 2004.
[18] A.M. Martínez and A.C. Kak, "PCA versus LDA," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 228-233, Feb. 2001.
[19] A.M. Martínez and J. Vitrià, "Clustering in Image Space for Place Recognition and Visual Annotations for Human-Robot Interaction," IEEE Trans. Systems, Man, and Cybernetics B, vol. 31, no. 5, pp. 669-682, 2001.
[20] A.M. Martínez and M. Zhu, "Where Are Linear Feature Extraction Methods Applicable?" IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 12, pp. 1934-1944, Dec. 2005.
[21] G.J. McLachlan and K.E. Basford, Mixture Models: Inference and Applications to Clustering. New York: Marcel Dekker, 1988.
[22] G.J. McLachlan, Discriminant Analysis and Statistical Pattern Recognition. New York: Wiley, 2004.
[23] C.R. Rao, Linear Statistical Inference and Its Applications, second ed. Wiley Interscience, 2002.
[24] D.L. Swets and J.J. Weng, "Using Discriminant Eigenfeatures for Image Retrieval," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 18, no. 8, pp. 831-836, Aug. 1996.
[25] J. Yang, A.F. Frangi, J. Yang, D. Zhang, and Z. Jin, "KPCA Plus LDA: A Complete Kernel Fisher Discriminant Framework for Feature Extraction and Recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 2, pp. 230-244, Feb. 2005.
[26] M. Yang, D. Kriegman, and N. Ahuja, "Face Detection Using Multimodal Density Models," Computer Vision and Image Understanding, vol. 84, no. 2, pp. 264-284, 2001.
[27] M. Yang, "Kernel Eigenfaces vs. Kernel Fisherfaces: Face Recognition Using Kernel Methods," Proc. Fifth Int'l Conf. Automatic Face and Gesture Recognition, pp. 215-220, May 2002.
[28] H. Yu and J. Yang, "A Direct LDA Algorithm for High-Dimensional Data with Applications to Face Recognition," Pattern Recognition, vol. 34, pp. 2067-2070, 2001.
[29] S.K. Zhou and R. Chellappa, "Multiple-Exemplar Discriminant Analysis for Face Recognition," Proc. Int'l Conf. Pattern Recognition, pp. 191-194, 2004.
[30] M. Zhu, A.M. Martínez, and H. Tan, "Template-Based Recognition of Sitting Postures," Proc. IEEE Workshop Computer Vision and Pattern Recognition for Human-Computer Interaction, 2003.
[31] M. Zhu and A.M. Martínez, "Optimal Subclass Discovery for Discriminant Analysis," Proc. IEEE Workshop Learning in Computer Vision and Pattern Recognition (LCVPR), 2004.

Manli Zhu received the MS degree in electrical engineering in 2002 from Zhejiang University, China. She is currently a PhD candidate in the Department of Electrical and Computer Engineering at The Ohio State University. Her research interests are in the field of statistical pattern recognition, especially feature extraction. She is also working on object classification and face recognition. She is a student member of the IEEE.

Aleix M. Martinez is an assistant professor in the Department of Electrical and Computer Engineering at The Ohio State University (OSU). He is also affiliated with the Department of Biomedical Engineering and with the Center for Cognitive Science, and is the founding director of the Computational Biology and Cognitive Science Lab. Prior to joining OSU, he was affiliated with the Electrical and Computer Engineering Department at Purdue University and with the Sony Computer Science Lab. He co-organized the First IEEE Workshop on Computer Vision and Pattern Recognition for Human-Computer Interaction (2003) and served as a cochair for the Second IEEE Workshop on Face Processing in Video (2005). In 2003, he was a coguest editor of the special issue on face recognition in Computer Vision and Image Understanding. He is a coauthor of the AR-face database and the Purdue ASL database. His areas of interest are learning, vision, linguistics, and their interaction. He is a member of the IEEE.