
Graph Embedding Exploiting Subclasses

2015 IEEE Symposium Series on Computational Intelligence, 2015
Maronidis, A., Tefas, A., & Pitas, I. (2016). Graph Embedding Exploiting Subclasses. In 2015 IEEE Symposium Series on Computational Intelligence (SSCI 2015): Proceedings of a meeting held 7-10 December 2015, Cape Town, South Africa (pp. 1452-1459). Institute of Electrical and Electronics Engineers (IEEE). https://doi.org/10.1109/SSCI.2015.206
Graph Embedding Exploiting Subclasses

Anastasios Maronidis, Anastasios Tefas and Ioannis Pitas
Department of Informatics, Aristotle University of Thessaloniki, P.O. Box 451, 54124 Thessaloniki, Greece
Email: amaronidis@iti.gr, tefas@aiia.csd.auth.gr, pitas@aiia.csd.auth.gr

Abstract—Recently, subspace learning methods for Dimensionality Reduction (DR), like Subclass Discriminant Analysis (SDA) and Clustering-based Discriminant Analysis (CDA), which use subclass information for the discrimination between the data classes, have attracted much attention. In parallel, important work has been accomplished on Graph Embedding (GE), which is a general framework unifying several subspace learning techniques. In this paper, GE is extended to integrate subclass discriminant information, resulting in the novel Subclass Graph Embedding (SGE) framework. The kernelization of SGE is also presented. It is shown that SGE constitutes a generalization of the typical GE that includes subclass DR methods. In this vein, the theoretical link of the SDA and CDA methods with SGE is established. The efficacy and power of SGE are substantiated by comparing subclass DR methods against a diversity of unimodal methods, all pertaining to the SGE framework, via a series of experiments on various real-world data.

I. INTRODUCTION

In recent years, a variety of subspace learning algorithms for dimensionality reduction (DR) has been developed. Locality Preserving Projections (LPP) [1], [2] and Principal Component Analysis (PCA) [3] are two of the most popular unsupervised linear DR algorithms, with a wide range of applications. Moreover, supervised methods like Linear Discriminant Analysis (LDA) [4] have shown superior performance in many classification problems, since through the DR process they aim at achieving data class discrimination. In practice, it is often the case that many data clusters appear inside the same class, imposing the need to integrate this information into the DR approach. Along these lines, techniques such as Clustering-based Discriminant Analysis (CDA) [5] and Subclass Discriminant Analysis (SDA) [6] have been proposed. Both of them utilize a specific objective criterion that incorporates the data subclass information in an attempt to discriminate subclasses that belong to different classes, while placing no constraints on subclasses within the same class.

In parallel to the development of subspace learning techniques, a lot of work has been carried out in DR from a graph-theoretic perspective. In this direction, Graph Embedding (GE) has been introduced as a generalized framework, which unifies several existing DR methods and furthermore serves as a platform for developing novel algorithms [7].
In [2], [7] the connection of LPP, PCA and LDA with the GE framework has been illustrated, and in [7], employing GE, the authors propose Marginal Fisher Analysis (MFA). In addition, the ISOMAP [8], Locally Linear Embedding (LLE) [9] and Laplacian Eigenmaps (LE) [10] algorithms have also been interpreted within the GE framework [7].

From the perspective of GE, the data are considered as vertices of a graph, which is accompanied by two matrices, the intrinsic and the penalty matrix, weighing the edges among vertices. The intrinsic matrix encodes the similarity relationships, while the penalty matrix encodes the undesirable connections among the data. In this context, the DR task is translated into the problem of transforming the initial graph into a new one in such a way that the weights of the intrinsic matrix are reinforced, while the weights of the penalty matrix are suppressed.

Apart from the core ideas on GE presented in [7], other interesting works have also been published recently. A graph-based supervised DR method has been proposed in [11] for circumventing the problem of non-Gaussian distributed data. The importance degrees of the same-class and not-same-class vertices are encoded by the intrinsic and extrinsic graphs, respectively, based on a strictly monotonically decreasing function. Moreover, the kernel extension of that approach is also presented. In [12], the selection of the neighborhood parameters of the intrinsic and extrinsic graph matrices is performed adaptively, based on the different local manifold structure of different samples, enhancing in this way the intra-class similarity and inter-class separability.

Methodologies that convert a set of graphs into a vector space have also been presented. For instance, a novel prototype selection method from a class-labeled set of graphs has been proposed in [13]. A dissimilarity metric between a pair of graphs is established and the dissimilarities of a graph from a set of prototypes are calculated, providing an n-dimensional feature vector. Several deterministic algorithms are used to select the prototypes with the most discriminative power [13]. The flexibility of GE has also been combined with the generalization ability of the support vector machine classifier, resulting in improved classification performance. In [14], the authors propose substituting the support vector machine kernel with subspace or submanifold kernels that are constructed based on the GE framework.

Despite the intense activity around GE, no extension of GE has been proposed so far that integrates subclass information. In this paper, such an extension is proposed, leading to the novel Subclass Graph Embedding (SGE) framework, which is the main contribution of our work. Using a subclass block form in both the intrinsic and penalty graph matrices, SGE optimizes a criterion which preserves the subclass structure and, simultaneously, the local geometry of the data. The local geometry may be modelled by any similarity or distance measure, while the subclass structure may be extracted by any clustering algorithm. By choosing the appropriate parameters, SGE reduces to any of the well-known aforementioned algorithms. Along these lines, it is shown in this paper that a variety of unimodal DR algorithms are encapsulated within SGE. Furthermore, the theoretical link between SGE and the CDA and SDA methods is also established, which is another novelty of our work. Finally, the kernelization of SGE (K-SGE) is also presented.
The efficacy of SGE and K-SGE is demonstrated through a comparison between subclass DR methods and a diversity of unimodal ones – all pertaining to the SGE framework – via a series of experiments on various datasets.

The remainder of this paper is organized as follows. The subspace learning algorithms CDA and SDA are presented in Section II in order to pave the way for their connection with SGE. The novel SGE framework, along with its kernelization, is presented in Section III. The connection between the SGE framework and the several subspace learning techniques is given in Section IV. A comparison of the aforementioned methods on real-world datasets is presented in Section V. Finally, conclusions are drawn in Section VI.

II. SUBSPACE LEARNING TECHNIQUES

In this section, we provide the mathematical formulation of the subspace learning techniques CDA and SDA in order to allow their connection with the SGE framework. The other methods mentioned in the Introduction are encapsulated in the proposed SGE framework as well; however, their detailed description is omitted, as they have already been described in [7].

In the following analysis, we consider that each data sample, denoted by $\mathbf{x}$, is an $m$-dimensional real vector, i.e., $\mathbf{x} \in \mathbb{R}^m$. We also denote by $\mathbf{y} \in \mathbb{R}^{m'}$ its projection $\mathbf{y} = \mathbf{V}^T \mathbf{x}$ to a new $m'$-dimensional space using a projection matrix $\mathbf{V} \in \mathbb{R}^{m \times m'}$. CDA and SDA attempt to minimize

$$J(\mathbf{v}) = \frac{\mathbf{v}^T \mathbf{S}_W \mathbf{v}}{\mathbf{v}^T \mathbf{S}_B \mathbf{v}} \, , \quad (1)$$

where $\mathbf{S}_W$ is called the within and $\mathbf{S}_B$ the between scatter matrix [15]. These matrices are symmetric and positive semi-definite. The minimization of the ratio (1) leads to the following generalized eigenvalue decomposition problem for finding the optimal discriminant projection eigenvectors:

$$\mathbf{S}_W \mathbf{v} = \lambda \mathbf{S}_B \mathbf{v} \, . \quad (2)$$

The eigenvalues $\lambda_i$ of the above eigenproblem are by definition positive or zero:

$$0 \le \lambda_1 \le \lambda_2 \le \cdots \le \lambda_m \, . \quad (3)$$

Let $\mathbf{v}_1, \mathbf{v}_2, \cdots, \mathbf{v}_m$ be the corresponding eigenvectors. Then the projection $\mathbf{y} = \mathbf{V}^T \mathbf{x}$ from the initial space to the new space of reduced dimensionality employs the projection matrix $\mathbf{V} = [\mathbf{v}_1, \mathbf{v}_2, \cdots, \mathbf{v}_{m'}]$, whose columns are the eigenvectors $\mathbf{v}_i$, $i = 1, \ldots, m'$, with $m' \ll m$.

Looking for a linear transform that effectively separates the projected data of each class, CDA makes use of potential subclass structure. Let us denote the total number of subclasses inside the $i$-th class by $d_i$ and, for the $j$-th subclass of the $i$-th class, the number of its samples by $n_{ij}$, its $q$-th sample by $\mathbf{x}_q^{ij}$ and its mean vector by $\boldsymbol{\mu}^{ij}$. CDA attempts to minimize (1), where $\mathbf{S}_W^{(CDA)}$ is the within-subclass and $\mathbf{S}_B^{(CDA)}$ the between-subclass scatter matrix, defined in [5]:

$$\mathbf{S}_W^{(CDA)} = \sum_{i=1}^{c} \sum_{j=1}^{d_i} \sum_{q=1}^{n_{ij}} \left( \mathbf{x}_q^{ij} - \boldsymbol{\mu}^{ij} \right) \left( \mathbf{x}_q^{ij} - \boldsymbol{\mu}^{ij} \right)^T \, , \quad (4)$$

$$\mathbf{S}_B^{(CDA)} = \sum_{i=1}^{c-1} \sum_{l=i+1}^{c} \sum_{j=1}^{d_i} \sum_{h=1}^{d_l} \left( \boldsymbol{\mu}^{ij} - \boldsymbol{\mu}^{lh} \right) \left( \boldsymbol{\mu}^{ij} - \boldsymbol{\mu}^{lh} \right)^T \, . \quad (5)$$

The difference between SDA and CDA mainly lies in the definition of the within scatter matrix, while the between scatter matrix of SDA is a modified version of that of CDA. The exact definitions of the two matrices are:

$$\mathbf{S}_W^{(SDA)} = \sum_{q=1}^{n} \left( \mathbf{x}_q - \boldsymbol{\mu} \right) \left( \mathbf{x}_q - \boldsymbol{\mu} \right)^T \, , \quad (6)$$

$$\mathbf{S}_B^{(SDA)} = \sum_{i=1}^{c-1} \sum_{l=i+1}^{c} \sum_{j=1}^{d_i} \sum_{h=1}^{d_l} p_{ij} \, p_{lh} \left( \boldsymbol{\mu}^{ij} - \boldsymbol{\mu}^{lh} \right) \left( \boldsymbol{\mu}^{ij} - \boldsymbol{\mu}^{lh} \right)^T \, , \quad (7)$$

where $p_{ij} = \frac{n_{ij}}{n}$ is the relative frequency of the $j$-th cluster of the $i$-th class [6]. It is worth mentioning that $\mathbf{S}_W^{(SDA)}$ is actually the total covariance matrix of the data.
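As an illustration of the CDA criterion above, the following is a minimal sketch (not the authors' code) that builds the scatter matrices of eqs. (4)-(5) and solves the generalized eigenproblem (2) with NumPy/SciPy. The data layout (samples as columns of X), the integer class/subclass labels and the small ridge on S_B are assumptions made for the sake of a runnable example.

```python
# Minimal sketch: CDA scatter matrices, eqs. (4)-(5), and eigenproblem (2).
# X has shape (m, n) with samples as columns; y holds class labels, z subclass labels.
import numpy as np
from scipy.linalg import eigh

def cda_projection(X, y, z, m_prime, eps=1e-6):
    m, n = X.shape
    Sw = np.zeros((m, m))
    Sb = np.zeros((m, m))
    # subclass means mu^{ij}
    means = {(c, s): X[:, (y == c) & (z == s)].mean(axis=1)
             for c in np.unique(y) for s in np.unique(z[y == c])}
    # within-subclass scatter, eq. (4)
    for (c, s), mu in means.items():
        D = X[:, (y == c) & (z == s)] - mu[:, None]
        Sw += D @ D.T
    # between-subclass scatter over subclasses of *different* classes, eq. (5)
    keys = list(means)
    for a in range(len(keys)):
        for b in range(a + 1, len(keys)):
            if keys[a][0] != keys[b][0]:
                d = (means[keys[a]] - means[keys[b]])[:, None]
                Sb += d @ d.T
    # S_W v = lambda S_B v, eq. (2): keep the eigenvectors of the smallest ratios.
    # The ridge on S_B is an assumption to keep it positive definite in practice.
    w, V = eigh(Sw, Sb + eps * np.eye(m))
    return V[:, :m_prime]          # projection matrix V, columns v_1 .. v_m'
```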
The previously described DR methods, along with LPP, PCA and LDA, can be seen under a common prism, since their basic calculation element towards the construction of the corresponding optimization criteria is the similarity among the samples. Thus, we can unify them in a common framework if we consider that the samples form a graph and we set criteria on the similarities between the nodes of this graph. In the following section we describe this approach in detail.

III. SUBCLASS GRAPH EMBEDDING

In this section, the problem of dimensionality reduction is described from a graph-theoretic perspective. Before presenting the novel SGE, let us first briefly provide the main ideas of the core GE framework.

A. Graph Embedding

In the GE framework, the set of data samples to be projected into a low-dimensional space is represented by two graphs, namely the intrinsic graph $G_{int} = \{\mathcal{X}, \mathbf{W}_{int}\}$ and the penalty graph $G_{pen} = \{\mathcal{X}, \mathbf{W}_{pen}\}$, where $\mathcal{X} = \{\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_n\}$ is the set of data samples in both graphs. The intrinsic graph models the similarity connections between every pair of data samples that have to be reinforced after the projection. The penalty graph contains the connections between the data samples that must be suppressed after the projection. In both of the above matrices, these connections might have negative values, imposing the opposite effect. The choice of the values of the intrinsic and the penalty graph matrices may lead to supervised, unsupervised or semi-supervised DR algorithms.

The problem of DR can now be interpreted in another way: it is desirable to project the initial data to the new low-dimensional space such that the geometrical structure of the data is preserved. The corresponding objective function for optimization is

$$\arg\min_{tr\{\mathbf{Y} \mathbf{B} \mathbf{Y}^T\} = d} J(\mathbf{Y}) \, , \quad (8)$$

$$J(\mathbf{Y}) = \frac{1}{2} tr\Big\{ \sum_q \sum_p \left( \mathbf{y}_q - \mathbf{y}_p \right) \mathbf{W}_{int}(q, p) \left( \mathbf{y}_q - \mathbf{y}_p \right)^T \Big\} \, , \quad (9)$$

where $\mathbf{Y} = [\mathbf{y}_1, \mathbf{y}_2, \cdots, \mathbf{y}_n]$ are the projected vectors, $d$ is a constant, $\mathbf{B}$ is a constraint matrix defined to remove an arbitrary scaling factor in the embedding, and $\mathbf{W}_{int}(q, p)$ is the value of $\mathbf{W}_{int}$ at position $(q, p)$ [7]. The structure of the objective function (9) postulates that the larger the value $\mathbf{W}_{int}(q, p)$ is, the smaller the distance between the projections of the data samples $\mathbf{x}_q$ and $\mathbf{x}_p$ has to be. By using some simple algebraic manipulations, equation (9) becomes

$$J(\mathbf{Y}) = tr\{\mathbf{Y} \mathbf{L}_{int} \mathbf{Y}^T\} \, , \quad (10)$$

where $\mathbf{L}_{int} = \mathbf{D}_{int} - \mathbf{W}_{int}$ is the intrinsic Laplacian matrix and $\mathbf{D}_{int}$ is the degree matrix, defined as the diagonal matrix which has at position $(q, q)$ the value $\mathbf{D}_{int}(q, q) = \sum_p \mathbf{W}_{int}(q, p)$. The Laplacian matrix $\mathbf{L}_{pen} = \mathbf{D}_{pen} - \mathbf{W}_{pen}$ of the penalty graph is often used as the constraint matrix $\mathbf{B}$. Thus, the above optimization problem becomes

$$\arg\min_{tr\{\mathbf{Y} \mathbf{L}_{pen} \mathbf{Y}^T\} = d} tr\{\mathbf{Y} \mathbf{L}_{int} \mathbf{Y}^T\} \, . \quad (11)$$

The optimization of the above objective function is achieved by solving the generalized eigenproblem

$$\mathbf{L}_{int} \mathbf{v} = \lambda \mathbf{L}_{pen} \mathbf{v} \, , \quad (12)$$

keeping the eigenvectors which correspond to the smallest eigenvalues. This approach leads to the optimal projection of the given data samples. In order to achieve out-of-sample projection, the linearization [7] of the above approach should be used. If we employ $\mathbf{y} = \mathbf{V}^T \mathbf{x}$, the objective function (9) becomes

$$\arg\min_{tr\{\mathbf{V}^T \mathbf{X} \mathbf{L}_{pen} \mathbf{X}^T \mathbf{V}\} = d} J(\mathbf{V}) \, , \quad (13)$$

where $J(\mathbf{V})$ is defined as

$$J(\mathbf{V}) = \frac{1}{2} tr\Big\{ \mathbf{V}^T \Big( \sum_q \sum_p \left( \mathbf{x}_q - \mathbf{x}_p \right) \mathbf{W}_{int}(q, p) \left( \mathbf{x}_q - \mathbf{x}_p \right)^T \Big) \mathbf{V} \Big\} \, , \quad (14)$$

where $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n]$. By using simple algebraic manipulations, we have

$$J(\mathbf{V}) = tr\{\mathbf{V}^T \mathbf{X} \mathbf{L}_{int} \mathbf{X}^T \mathbf{V}\} \, . \quad (15)$$

Similarly to the straight approach, the optimal eigenvectors are given by solving the generalized eigenproblem

$$\mathbf{X} \mathbf{L}_{int} \mathbf{X}^T \mathbf{v} = \lambda \mathbf{X} \mathbf{L}_{pen} \mathbf{X}^T \mathbf{v} \, . \quad (16)$$
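To make the linearized GE recipe concrete, here is a minimal sketch (not from the paper) that forms the two Laplacians from given weight matrices and solves (16). NumPy/SciPy, the column-wise data layout and the small ridge on the penalty term are assumptions; the helper names are hypothetical.

```python
# Minimal sketch: linearized Graph Embedding, eqs. (10)-(16).
# X: (m, n) data matrix with samples as columns; W_int, W_pen: (n, n) weight matrices.
import numpy as np
from scipy.linalg import eigh

def laplacian(W):
    return np.diag(W.sum(axis=1)) - W              # L = D - W

def linear_graph_embedding(X, W_int, W_pen, m_prime, eps=1e-6):
    L_int, L_pen = laplacian(W_int), laplacian(W_pen)
    A = X @ L_int @ X.T                            # X L_int X^T
    B = X @ L_pen @ X.T                            # X L_pen X^T (constraint)
    # keep the eigenvectors with the smallest eigenvalues of eq. (16);
    # the ridge is an assumption to keep B positive definite.
    w, V = eigh(A, B + eps * np.eye(B.shape[0]))
    return V[:, :m_prime]                          # projection y = V^T x
```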
B. Linear Subclass Graph Embedding

In this section, we propose a GE framework that allows the exploitation of subclass information. In the following analysis, it is assumed that the subclass labels are known. We attempt to minimize the scatter of the data samples within the same subclass, while separating data samples from subclasses that belong to different classes. Finally, we are not concerned about samples that belong to different subclasses of the same class.

Usually, in real-world problems, the local geometry of the data is related to the global supervised structure: samples that belong to the same class or subclass should be "sufficiently close" to each other. SGE actually exploits this fact. It simultaneously handles supervised and unsupervised information and, as a consequence, it combines the global labeling information with the local geometrical characteristics of the data samples. This is achieved by weighing the above connections with the similarities of the data samples. The Gaussian similarity function (17) has been used in this paper for this purpose:

$$S_{qp} = S(\mathbf{x}_q, \mathbf{x}_p) = \exp\left( -\frac{d^2(\mathbf{x}_q, \mathbf{x}_p)}{\sigma^2} \right) \, , \quad (17)$$

where $d(\mathbf{x}_q, \mathbf{x}_p)$ is a distance metric (e.g., Euclidean) and $\sigma^2$ is a parameter (variance) that determines the distance scale.

Let us denote by $\mathbf{P}$ an affinity matrix. Without limiting the generality, we assume that this matrix has block form, depending on the subclass and the class of the data samples. Using the linearized approach, we attempt to optimize a more general discrimination criterion. We consider again that $\mathbf{y} = \mathbf{V}^T \mathbf{x}$ is the projection of $\mathbf{x}$ to the new subspace. Let $\mathbf{P}^{ij}(q, p)$ be the value of $\mathbf{P}$ at position $(q, p)$ of the submatrix that contains the $j$-th subclass of the $i$-th class. Then, the proposed criterion is

$$\arg\min \, J(\mathbf{Y}) \, , \quad (18)$$

$$J(\mathbf{Y}) = \frac{1}{2} tr\Big\{ \sum_{i=1}^{c} \sum_{j=1}^{d_i} \sum_{q=1}^{n_{ij}} \sum_{p=1}^{n_{ij}} \left( \mathbf{y}_q^{ij} - \mathbf{y}_p^{ij} \right) \mathbf{P}^{ij}(q, p) \left( \mathbf{y}_q^{ij} - \mathbf{y}_p^{ij} \right)^T \Big\} \quad (19)$$

$$= \frac{1}{2} tr\Big\{ \mathbf{V}^T \Big( \sum_{i=1}^{c} \sum_{j=1}^{d_i} \sum_{q=1}^{n_{ij}} \sum_{p=1}^{n_{ij}} \left( \mathbf{x}_q^{ij} - \mathbf{x}_p^{ij} \right) \mathbf{P}^{ij}(q, p) \left( \mathbf{x}_q^{ij} - \mathbf{x}_p^{ij} \right)^T \Big) \mathbf{V} \Big\} \quad (20)$$

$$= tr\{ \mathbf{V}^T \mathbf{X} \left( \mathbf{D}_{int} - \mathbf{W}_{int} \right) \mathbf{X}^T \mathbf{V} \} \quad (21)$$

$$= tr\{ \mathbf{V}^T \mathbf{X} \mathbf{L}_{int} \mathbf{X}^T \mathbf{V} \} \, . \quad (22)$$

The derivation of (22) is omitted due to lack of space. The matrix $\mathbf{W}_{int}$ is block diagonal, with blocks that correspond to each class, and is given by

$$\mathbf{W}_{int} = \begin{bmatrix} \mathbf{W}_{int}^{1} & & & \mathbf{0} \\ & \mathbf{W}_{int}^{2} & & \\ & & \ddots & \\ \mathbf{0} & & & \mathbf{W}_{int}^{c} \end{bmatrix} \, . \quad (23)$$

$\mathbf{W}_{int}^{i}$ are block-diagonal submatrices, with blocks that correspond to the subclasses, and are given by

$$\mathbf{W}_{int}^{i} = \begin{bmatrix} \mathbf{P}^{i1} & & & \mathbf{0} \\ & \mathbf{P}^{i2} & & \\ & & \ddots & \\ \mathbf{0} & & & \mathbf{P}^{id_i} \end{bmatrix} \, . \quad (24)$$

$\mathbf{P}^{ij}$ is the submatrix of $\mathbf{P}$ that corresponds to the data of the $j$-th cluster of the $i$-th class. By looking carefully at the form of $\mathbf{W}_{int}$, it is clear that the intrinsic degree matrix $\mathbf{D}_{int}$ has values

$$\mathbf{D}_{int}\Big( \sum_{s=0}^{i-1} \sum_{t=0}^{j-1} n_{st} + q \, , \; \sum_{s=0}^{i-1} \sum_{t=0}^{j-1} n_{st} + q \Big) = \sum_p \mathbf{P}^{ij}(q, p) \, , \quad (25)$$

where $p$ runs over the indices of the $j$-th cluster of the $i$-th class.
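A minimal sketch of the intrinsic graph construction just described (assumed helper names, not the authors' code): the Gaussian affinity of eq. (17) is masked so that only pairs from the same subclass of the same class are connected. When the samples are ordered by class and subclass this yields exactly the block-diagonal W_int of (23)-(24); for arbitrary ordering it gives the same matrix up to a permutation.

```python
# Minimal sketch: SGE intrinsic graph matrix, eqs. (17), (23)-(24).
# X: (m, n) data with samples as columns; y: class labels; z: subclass labels.
import numpy as np

def gaussian_affinity(X, sigma2):
    # S(x_q, x_p) = exp(-d^2(x_q, x_p) / sigma^2), eq. (17), Euclidean distance
    sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    return np.exp(-sq / sigma2)

def sge_intrinsic(X, y, z, sigma2):
    P = gaussian_affinity(X, sigma2)
    same_subclass = (y[:, None] == y[None, :]) & (z[:, None] == z[None, :])
    return np.where(same_subclass, P, 0.0)        # W_int of eqs. (23)-(24)
```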
In parallel, we demand to maximize a criterion which encodes the similarities among the centroid vectors of the subclasses. Let the value $Q_{ij}^{lh}$ express the similarity between the centroid vectors $\boldsymbol{\mu}^{ij}$ and $\boldsymbol{\mu}^{lh}$. The more similar two centroids that belong to different classes are, the further apart their projections $\mathbf{m}^{ij} = \mathbf{V}^T \boldsymbol{\mu}^{ij}$ have to be from each other:

$$\arg\max \, G(\mathbf{m}^{ij}) \, , \quad (26)$$

$$G(\mathbf{m}^{ij}) = tr\Big\{ \sum_{i=1}^{c-1} \sum_{l=i+1}^{c} \sum_{j=1}^{d_i} \sum_{h=1}^{d_l} Q_{ij}^{lh} \left( \mathbf{m}^{ij} - \mathbf{m}^{lh} \right) \left( \mathbf{m}^{ij} - \mathbf{m}^{lh} \right)^T \Big\} \quad (27)$$

$$= tr\Big\{ \mathbf{V}^T \Big( \sum_{i=1}^{c-1} \sum_{l=i+1}^{c} \sum_{j=1}^{d_i} \sum_{h=1}^{d_l} Q_{ij}^{lh} \left( \boldsymbol{\mu}^{ij} - \boldsymbol{\mu}^{lh} \right) \left( \boldsymbol{\mu}^{ij} - \boldsymbol{\mu}^{lh} \right)^T \Big) \mathbf{V} \Big\} \quad (28)$$

$$= tr\{ \mathbf{V}^T \mathbf{X} \left( \mathbf{D}_{pen} - \mathbf{W}_{pen} \right) \mathbf{X}^T \mathbf{V} \} \quad (29)$$

$$= tr\{ \mathbf{V}^T \mathbf{X} \mathbf{L}_{pen} \mathbf{X}^T \mathbf{V} \} \, . \quad (30)$$

Again, the derivation of (30) is omitted due to lack of space. The block matrix $\mathbf{W}_{pen}$ in (29) consists of block submatrices:

$$\mathbf{W}_{pen} = \begin{bmatrix} \mathbf{W}_{pen}^{1,1} & \mathbf{W}_{pen}^{1,2} & \cdots & \mathbf{W}_{pen}^{1,c} \\ \mathbf{W}_{pen}^{2,1} & \mathbf{W}_{pen}^{2,2} & \cdots & \mathbf{W}_{pen}^{2,c} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{W}_{pen}^{c,1} & \mathbf{W}_{pen}^{c,2} & \cdots & \mathbf{W}_{pen}^{c,c} \end{bmatrix} \, . \quad (31)$$

The submatrices $\mathbf{W}_{pen}^{i,i}$ lying on the main block diagonal are given by

$$\mathbf{W}_{pen}^{i,i} = \begin{bmatrix} \mathbf{W}^{i1} & & & \mathbf{0} \\ & \mathbf{W}^{i2} & & \\ & & \ddots & \\ \mathbf{0} & & & \mathbf{W}^{id_i} \end{bmatrix} \, , \quad (32)$$

where $\mathbf{W}^{ij}$ corresponds to the $j$-th subclass of the $i$-th class and is given by

$$\mathbf{W}^{ij} = -\frac{\sum_{\omega \neq i} \sum_{t=1}^{d_\omega} Q_{ij}^{\omega t}}{(n_{ij})^2} \, \mathbf{e}^{n_{ij}} \left( \mathbf{e}^{n_{ij}} \right)^T \, , \quad (33)$$

where $\mathbf{e}^{n_{ij}} = [1 \, 1 \cdots 1]^T$ is the vector of $n_{ij}$ ones. Respectively, the off-diagonal submatrices of $\mathbf{W}_{pen}$ are given by

$$\mathbf{W}_{pen}^{i,l} = \begin{bmatrix} \mathbf{W}_{i1}^{l1} & \mathbf{W}_{i1}^{l2} & \cdots & \mathbf{W}_{i1}^{ld_l} \\ \mathbf{W}_{i2}^{l1} & \mathbf{W}_{i2}^{l2} & \cdots & \mathbf{W}_{i2}^{ld_l} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{W}_{id_i}^{l1} & \mathbf{W}_{id_i}^{l2} & \cdots & \mathbf{W}_{id_i}^{ld_l} \end{bmatrix} \, , \quad i \neq l \, , \quad (34)$$

where

$$\mathbf{W}_{ij}^{lh} = \frac{Q_{ij}^{lh}}{n_{ij} \, n_{lh}} \, \mathbf{e}^{n_{ij}} \left( \mathbf{e}^{n_{lh}} \right)^T \, . \quad (35)$$

It can be easily shown that $\mathbf{D}_{pen} = \mathbf{0}$, so that $\mathbf{L}_{pen} = -\mathbf{W}_{pen}$.

C. Kernel Subclass Graph Embedding

In this section, the kernelization of SGE is presented. Let us denote by $\mathcal{X}$ the initial data space, by $\mathcal{F}$ a Hilbert space and by $f$ the non-linear mapping function from $\mathcal{X}$ to $\mathcal{F}$. The main idea is to first map the original data from the initial space into another high-dimensional Hilbert space and then perform linear subspace analysis in that space. If we denote by $m_F$ the dimensionality of the Hilbert space, then the above procedure is described as

$$\mathcal{X} \ni \mathbf{x}_q \rightarrow \mathbf{y}_q = f(\mathbf{x}_q) = \begin{bmatrix} \sum_{p=1}^{n} a_{1p} \, k(\mathbf{x}_q, \mathbf{x}_p) \\ \vdots \\ \sum_{p=1}^{n} a_{m_F p} \, k(\mathbf{x}_q, \mathbf{x}_p) \end{bmatrix} \in \mathcal{F} \, , \quad (36)$$

where $k$ is the kernel function. From the above equation it is obvious that

$$\mathbf{Y} = \mathbf{A}^T \mathbf{K} \, , \quad (37)$$

where $\mathbf{K}$ is the Gram matrix, which has at position $(q, p)$ the value $K_{qp} = k(\mathbf{x}_q, \mathbf{x}_p)$, and

$$\mathbf{A} = [\mathbf{a}_1 \cdots \mathbf{a}_{m_F}] = \begin{bmatrix} a_{11} & \cdots & a_{m_F 1} \\ \vdots & \ddots & \vdots \\ a_{1n} & \cdots & a_{m_F n} \end{bmatrix} \quad (38)$$

is the map coefficient matrix. Consequently, the final SGE optimization becomes

$$\arg\min_{tr\{\mathbf{A}^T \mathbf{K} \mathbf{L}_{pen} \mathbf{K} \mathbf{A}\} = d} tr\{\mathbf{A}^T \mathbf{K} \mathbf{L}_{int} \mathbf{K} \mathbf{A}\} \, . \quad (39)$$

Similarly to the linear case, in order to find the optimal projections, we solve the generalized eigenproblem

$$\mathbf{K} \mathbf{L}_{int} \mathbf{K} \mathbf{a} = \lambda \mathbf{K} \mathbf{L}_{pen} \mathbf{K} \mathbf{a} \, , \quad (40)$$

keeping the eigenvectors that correspond to the smallest eigenvalues.
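As a rough illustration of the kernel variant just described, the following sketch (assumed details, not the authors' code) forms an RBF Gram matrix and solves (40); the RBF choice mirrors the kernel parameter discussed in the experiments, and rbf_gram / kernel_sge are hypothetical helper names.

```python
# Minimal sketch: Kernel SGE, eqs. (36)-(40), with an RBF kernel Gram matrix.
import numpy as np
from scipy.linalg import eigh

def rbf_gram(X, sigma2):
    # K_qp = exp(-||x_q - x_p||^2 / sigma^2)
    sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    return np.exp(-sq / sigma2)

def kernel_sge(X, L_int, L_pen, m_prime, sigma2, eps=1e-6):
    K = rbf_gram(X, sigma2)
    A_mat = K @ L_int @ K
    B_mat = K @ L_pen @ K
    # eq. (40): keep eigenvectors with the smallest eigenvalues; the ridge is
    # an assumption to keep the right-hand side positive definite.
    w, A = eigh(A_mat, B_mat + eps * np.eye(B_mat.shape[0]))
    A = A[:, :m_prime]                 # coefficient matrix of eq. (38)
    return A.T @ K                     # projected training data Y = A^T K, eq. (37)
```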
IV. SGE AS A GENERAL DIMENSIONALITY REDUCTION FRAMEWORK

In this section, it is shown that SGE is a generalized framework that can be used for subspace learning, since all the standard approaches are specific cases of SGE. Let us use the Gaussian similarity function (17) in order to construct the affinity matrix. In the following analysis, we initially let the variance $\sigma^2$ of the Gaussian tend to infinity. Hence, $S(\mathbf{x}_q, \mathbf{x}_p) = 1$, $\forall (q, p) \in \{1, 2, \cdots, n\}^2$.

Let the intrinsic matrix elements be

$$\mathbf{P}^{ij}(q, p) = \begin{cases} \frac{S(\mathbf{x}_q, \mathbf{x}_p)}{n_{ij}} = \frac{1}{n_{ij}} \, , & \text{if } \mathbf{x}_q, \mathbf{x}_p \in C_{ij} \, , \\ 0 \, , & \text{otherwise} \, , \end{cases} \quad (41)$$

where $C_{ij}$ is the set of the samples that belong to the $j$-th subclass of the $i$-th class. Obviously, (20) becomes the within-subclass criterion of CDA (also see eq. (4)). Thus, in this case, $\mathbf{W}_{int}$ is the intrinsic graph matrix of CDA. Let also

$$Q_{ij}^{lh} = S(\boldsymbol{\mu}^{ij}, \boldsymbol{\mu}^{lh}) = 1 \, , \quad \forall \, i, j, h, l \quad (42)$$

be the penalty matrix elements. Then, (28) becomes the between-subclass criterion of CDA (also see eq. (5)). Thus, $\mathbf{W}_{pen}$ is the penalty graph matrix of CDA, and the connection between CDA and GE has been established.

Let us consider that each data sample constitutes its own class, i.e., $c = n$, $d_i = 1$ and $n_i = 1$, $\forall i \in \{1, 2, \cdots, c\}$. Thus, each class-block of the penalty graph matrix reduces to a single element of the matrix. Obviously, each data sample coincides with the mean of its class. By setting

$$Q_{i1}^{l1} = \frac{S(\boldsymbol{\mu}^i, \boldsymbol{\mu}^l)}{n} = \frac{1}{n} \, , \quad \forall (i, l) \in \{1, 2, \cdots, c\}^2 \, , \quad (43)$$

we obtain

$$-\frac{\sum_{\omega \neq i} \sum_{t=1}^{d_\omega} Q_{i1}^{\omega t}}{(n_i)^2} = -\frac{n - 1}{n} = \frac{1}{n} - 1 \, . \quad (44)$$

These values lie on the main diagonal of the penalty graph matrix. Regarding the off-diagonal elements we have

$$\frac{Q_{i1}^{l1}}{n_i \, n_l} = \frac{1}{n} \, . \quad (45)$$

It can be easily shown that the penalty degree matrix is $\mathbf{D}_{pen} = \mathbf{0}$, so that $\mathbf{L}_{pen} = -\mathbf{W}_{pen}$. Obviously, $\mathbf{L}_{pen} = \mathbf{I} - \frac{1}{n} \mathbf{e}^n (\mathbf{e}^n)^T$ and $\mathbf{X} \mathbf{L}_{pen} \mathbf{X}^T$ becomes the covariance matrix $\mathbf{C}$ of the data. By using the identity matrix as intrinsic graph matrix, SGE becomes identical to PCA:

$$\arg\min \frac{tr\{\mathbf{V}^T \mathbf{X} \mathbf{L}_{int} \mathbf{X}^T \mathbf{V}\}}{tr\{\mathbf{V}^T \mathbf{X} \mathbf{L}_{pen} \mathbf{X}^T \mathbf{V}\}} = \arg\min \frac{tr\{\mathbf{V}^T \mathbf{I} \mathbf{V}\}}{tr\{\mathbf{V}^T \mathbf{C} \mathbf{V}\}} \, , \quad (46)$$

leading to the following generalized eigenproblem:

$$\mathbf{I} \mathbf{v} = \lambda \mathbf{C} \mathbf{v} \, , \quad (47)$$

solved by keeping the smallest eigenvalues, or, by setting $\mu = \frac{1}{\lambda}$ (since $\lambda \neq 0$), this leads to

$$\mathbf{C} \mathbf{v} = \mu \mathbf{I} \mathbf{v} \, , \quad (48)$$

solved by keeping the greatest eigenvalues, which is obviously the PCA solution.

Now, consider that every class consists of a unique subclass, thus $d_i = 1$, $\forall i \in \{1, 2, \ldots, c\}$. If we set

$$\mathbf{P}(q, p) = \begin{cases} \frac{S(\mathbf{x}_q, \mathbf{x}_p)}{n_i} = \frac{1}{n_i} \, , & \text{if } \mathbf{x}_q, \mathbf{x}_p \in C_i \, , \\ 0 \, , & \text{otherwise} \, , \end{cases} \quad (49)$$

then the intrinsic graph matrix becomes that of LDA. Furthermore, if we set

$$Q_{i1}^{l1} = \frac{n_i \, n_l}{n} \, , \quad \forall (i, l) \in \{1, \ldots, c\}^2 \, , \quad (50)$$

then

$$-\frac{\sum_{\omega \neq i} \sum_{t=1}^{d_\omega} Q_{i1}^{\omega t}}{(n_i)^2} = \frac{n_i - n}{n \, n_i} \quad (51)$$

and

$$\frac{Q_{i1}^{l1}}{n_i \, n_l} = \frac{1}{n} \, . \quad (52)$$

These are the values of the penalty graph matrix of LDA. So, by taking the Laplacians of the above matrices, we end up with the LDA algorithm.

Let us now reject the assumption that the variance of the Gaussian tends to infinity. Consider that there is only one class which contains the whole set of the data, i.e., $c = 1$. Also consider that there are no subclasses within this unique class, i.e., $d_1 = 1$. In this case the intrinsic graph matrix becomes equal to $\mathbf{P}$. Thus, by setting $\mathbf{P}$ equal to the affinity matrix $\mathbf{S}$, the intrinsic Laplacian matrix becomes that of LPP, and by utilizing the identity matrix as the penalty Laplacian matrix we obviously get the LPP algorithm. Since we consider a unique class which contains a unique subclass, from (31) and (32) we have that $\mathbf{W}_{pen} = \mathbf{W}^{11}$. The values of $\mathbf{W}^{11}$ are given by (33), which in this case reduces to

$$\mathbf{W}^{11} = -\frac{Q_{11}^{11}}{n^2} \, \mathbf{e}^n (\mathbf{e}^n)^T \, . \quad (53)$$

If we set

$$Q_{11}^{11} = \frac{n^2}{1 - n} \, , \quad (54)$$

then $\mathbf{W}_{pen} = \mathbf{W}^{11} = \frac{1}{n-1} \mathbf{e}^n (\mathbf{e}^n)^T$. Consequently,

$$\mathbf{L}_{pen} = \begin{bmatrix} 1 & \frac{1}{1-n} & \cdots & \frac{1}{1-n} \\ \frac{1}{1-n} & 1 & \cdots & \frac{1}{1-n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{1}{1-n} & \frac{1}{1-n} & \cdots & 1 \end{bmatrix} \, . \quad (55)$$

Thus, if we make the assumption that the number of data samples becomes very large, then asymptotically we have $\mathbf{L}_{pen} = \mathbf{I}$.

Finally, to complete the analysis, if we consider as the intrinsic Laplacian matrix the matrix

$$\mathbf{L}_{int} = \mathbf{I} - \frac{1}{n} \mathbf{e}^n (\mathbf{e}^n)^T \quad (56)$$

and if we set

$$Q_{ij}^{lh} = \frac{n_{ij} \, n_{lh}}{n} \quad (57)$$

in (33) and (35), SGE becomes identical to SDA. The parameters that determine the connection of the several methods with SGE are summarized in Table I.

TABLE I: Dimensionality Reduction Using the SGE Framework

Method | P (L_int)                                              | Q (L_pen)                              | sigma^2 | c | d_i | d
LPP    | P^{11}(q,p) = exp(-d^2(x_q, x_p)/sigma^2), forall x_q, x_p | Q^{11}_{11} = n^2/(1-n)  (L_pen = I)   | sigma^2 | 1 | 1   | 1
PCA    | L_int = I                                              | Q^{l1}_{i1} = 1/n                      | inf     | n | 1   | n
LDA    | P^{i1}(q,p) = 1/n_i, x_q, x_p in c_i                   | Q^{l1}_{i1} = n_i n_l / n              | inf     | c | 1   | c
CDA    | P^{ij}(q,p) = 1/n_{ij}, x_q, x_p in c_{ij}             | Q^{lh}_{ij} = 1                        | inf     | c | d_i | d
SDA    | L_int = I - (1/n) e^n (e^n)^T                          | Q^{lh}_{ij} = n_{ij} n_{lh} / n        | inf     | c | d_i | d

V. EXPERIMENTAL RESULTS

We conducted 5-fold cross-validation classification experiments on several real-world datasets using the proposed linear and kernel SGE framework. For extracting the subclass structure automatically, we have utilized the multiple Spectral Clustering technique [16], keeping the most plausible partition for each dataset. For classifying the data, the Nearest Centroid (NC) classifier has been used with the LPP, PCA and LDA algorithms, while the Nearest Cluster Centroid (NCC) classifier [17] has been used with the CDA and SDA algorithms. In NCC, the cluster centroids are calculated and the test sample is assigned to the class of the nearest cluster centroid.
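The NCC rule just described can be sketched as follows (a minimal illustration, not the implementation used in the paper); ncc_fit and ncc_predict are hypothetical helpers operating on data already projected by one of the DR methods.

```python
# Minimal sketch: Nearest Cluster Centroid (NCC) classification.
# Y: projected training data of shape (m', n); y: class labels; z: subclass labels.
import numpy as np

def ncc_fit(Y, y, z):
    cents, labels = [], []
    for c in np.unique(y):
        for s in np.unique(z[y == c]):
            cents.append(Y[:, (y == c) & (z == s)].mean(axis=1))  # cluster centroid
            labels.append(c)                                      # class of that cluster
    return np.array(cents), np.array(labels)

def ncc_predict(cents, labels, y_test):
    # assign the test sample to the class of the nearest cluster centroid
    d = np.linalg.norm(cents - y_test[None, :], axis=1)
    return labels[np.argmin(d)]
```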
NC and NCC were selected because they provide the optimal classification solutions in Bayesian terms, thus showing whether the DR methods have reached the goal described by their specific criterion.

A. Classification experiments

For the classification experiments, we have used diverse publicly available datasets offered for various classification problems. More specifically, FER-AIIA, BU, JAFFE and KANADE were used for facial expression recognition and XM2VTS for face frontal-view recognition, while MNIST and SEMEION were used for optical digit recognition. Finally, IONOSPHERE, MONK and PIMA were used in order to further extend our experimental study to diverse data classification problems.

The cross-validation classification accuracy rates of the several subspace learning methods over the utilized datasets are summarized in Table II. The optimal dimensionality of the projected space that returned these results is given in parentheses. For each dataset, the best performance rate among the linear and the kernel methods separately is highlighted in bold, while the best overall performance rate among all methods, both linear and kernel, is surrounded by a rectangle. The classification performance rank of each method is also reported in the last two rows of Table II. Specific Rank denotes the rank of each method among the linear and the kernel methods, independently. Overall Rank refers to the rank of each method among both the linear and the kernel methods. The ranking has been obtained through a post-hoc Bonferroni test [18].

TABLE II: Cross-Validation Classification Accuracies (%) of the Linear and Kernel Methods on Several Real-World Datasets

DATASET       | LPP       | PCA       | LDA     | CDA      | SDA      | KLPP      | KPCA      | KDA     | KCDA     | KSDA
FER-AIIA      | 40.9(3)   | 31.0(120) | 64.6(6) | 73.2     | 75.5(11) | 50.2(252) | 41.5(29)  | 54.9(6) | 56.1(12) | 53.5(12)
BU            | 39.4(298) | 38.1(49)  | 51.6(6) | 49.1(16) | 52.3(15) | 52.7(317) | 35.9(290) | 46.6(6) | 41.0(13) | 48.0(14)
JAFFE         | 46.8(18)  | 37.6(39)  | 53.2(6) | 40.0(15) | 54.1(6)  | 28.8(98)  | 25.9(58)  | 42.4(6) | 36.1(18) | 46.3(5)
KANADE        | 34.2(92)  | 43.3(46)  | 67.1(6) | 59.7(7)  | 67.1(5)  | 32.7(99)  | 33.2(88)  | 44.3(6) | 40.0(6)  | 38.5(6)
MNIST         | 71.1(259) | 79.9(135) | 84.6(9) | 84.8(15) | 85.1(14) | 81.4(299) | 64.5(155) | 86.0(9) | 83.4(19) | 85.2(15)
SEMEION       | 53.6(99)  | 83.2(55)  | 88.2(9) | 89.2(19) | 89.4(19) | 83.8(99)  | 77.4(77)  | 95.3(9) | 94.1(19) | 95.9(19)
XM2VTS        | 95.7(54)  | 92.0(86)  | 70.5(1) | 98.1(3)  | 97.4(2)  | 71.3(297) | 74.7(56)  | 61.3(1) | 71.5(3)  | 57.3(4)
IONOSPHERE    | 84.6(23)  | 72.3(15)  | 78.9(1) | 80.6(2)  | 83.4(2)  | 83.7(23)  | 70.3(2)   | 92.9(1) | 93.1(1)  | 92.9(1)
MONK 1        | 66.7(3)   | 68.3(5)   | 50.8(1) | 70.0(4)  | 74.2(3)  | 63.3(2)   | 72.5(1)   | 55.8(1) | 58.3(4)  | 61.7(3)
MONK 2        | 56.0(1)   | 53.3(4)   | 52.0(1) | 54.2(1)  | 54.0(2)  | 54.8(1)   | 59.8(3)   | 69.7(1) | 78.7(1)  | 54.5(1)
MONK 3        | 77.2(5)   | 80.9(4)   | 49.4(1) | 74.6(2)  | 66.3(2)  | 62.5(2)   | 79.2(5)   | 51.7(1) | 67.5(2)  | 58.3(1)
PIMA          | 61.8(1)   | 63.5(6)   | 56.5(1) | 60.5(3)  | 73.5(3)  | 50.7(3)   | 67.5(4)   | 48.9(1) | 52.5(3)  | 52.9(1)
SPECIFIC RANK | 3.3       | 3.8       | 3.6     | 2.5      | 1.6      | 3.5       | 3.4       | 2.9     | 2.4      | 2.7
OVERALL RANK  | 5.8       | 6.4       | 6.0     | 4.2      | 3.0      | 6.7       | 6.7       | 5.4     | 5.2      | 5.5

An immediate remark from Table II is that, in both the linear and the kernel case, multimodal methods exhibit better classification performance than the unimodal ones. In particular, the top overall performance is shown by SDA, followed by CDA, while the worst performance is shown by KLPP and KPCA. This result clearly shows that the inclusion of subclass information in the DR process offers a strong potential for improving the performance of the state-of-the-art in many classification domains.

In comparing linear with kernel methods, a simple calculation yields a mean overall rank of 5.08 for the linear methods and 5.90 for the kernel ones. Although the average performance of the linear methods is clearly better than that of the kernel ones, we must admit that there is ample space for improving the kernel results by varying the RBF parameter, as the selection of this parameter is not trivial and may easily lead to over-fitting. In fact, the top performance rates presented in this paper have been obtained by testing indicative values of this parameter. As a matter of fact, it is interesting to observe that the use of kernels proves to be beneficial for some methods on certain datasets, while it deteriorates the performance of others.
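For completeness, the evaluation protocol of this section can be sketched as a plain 5-fold cross-validation loop. Everything beyond "5-fold cross-validation accuracy" is an assumption: fit_projection and classify are placeholders standing in for any of the compared DR methods and for the NC/NCC rules, respectively.

```python
# Minimal sketch: 5-fold cross-validation accuracy for a DR method + centroid classifier.
# X: (m, n) data with samples as columns; y: class labels.
import numpy as np

def five_fold_accuracy(X, y, fit_projection, classify, seed=0):
    n = X.shape[1]
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, 5)
    accs = []
    for k in range(5):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(5) if j != k])
        V = fit_projection(X[:, train], y[train])        # e.g. an SGE/CDA/SDA projection
        Ytr, Yte = V.T @ X[:, train], V.T @ X[:, test]   # project train and test data
        preds = classify(Ytr, y[train], Yte)             # e.g. the NC or NCC rule
        accs.append(np.mean(preds == y[test]))
    return float(np.mean(accs))
```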
VI. CONCLUSIONS

In this paper, data subclass information has been incorporated within Graph Embedding (GE), leading to a novel Subclass Graph Embedding (SGE) framework, which constitutes the main contribution of our work. In particular, it has been shown that SGE constitutes a generalization of GE, encapsulating a number of state-of-the-art unimodal subspace learning techniques already integrated within GE. Besides, the connection of SGE with subspace learning algorithms that use subclass information in the embedding process has also been analytically proven. The kernelization of SGE has also been presented. Through an extensive experimental study, it has been shown that subclass learning techniques outperform a number of state-of-the-art unimodal learning methods on many real-world datasets pertaining to various classification domains. In addition, the experimental results highlight the superiority, in terms of classification performance, of the linear methods against the kernel ones.

In the near future, we intend to employ SGE as a template for designing novel DR methods. For instance, as current subclass methods are strongly dependent on the underlying distribution of the data, we anticipate that novel methods, which use neighbourhood information among the data of the several subclasses, will succeed in alleviating this sort of limitation.

REFERENCES

[1] X. He and P. Niyogi, "Locality preserving projections," in NIPS, S. Thrun, L. K. Saul, and B. Schölkopf, Eds. MIT Press, 2003.
[2] X. He, S. Yan, Y. Hu, P. Niyogi, and H. Zhang, "Face recognition using laplacianfaces," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 3, pp. 328–340, 2005.
[3] I. Jolliffe, Principal Component Analysis. Springer Verlag, 1986.
[4] D. J. Kriegman, J. P. Hespanha, and P. N. Belhumeur, "Eigenfaces vs. fisherfaces: Recognition using class-specific linear projection," in ECCV, 1996, pp. I:43–58.
[5] X. W. Chen and T. S. Huang, "Facial expression recognition: A clustering-based approach," Pattern Recognition Letters, vol. 24, no. 9-10, pp. 1295–1302, Jun. 2003.
[6] M. L. Zhu and A. M. Martinez, "Subclass discriminant analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 8, pp. 1274–1286, Aug. 2006.
[7] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin, "Graph embedding and extensions: A general framework for dimensionality reduction," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 1, pp. 40–51, 2007.
[8] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, no. 5500, pp. 2319–2323, Dec. 2000.
[9] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, Dec. 2000.
[10] M. Belkin and P. Niyogi, "Laplacian eigenmaps and spectral techniques for embedding and clustering," Advances in Neural Information Processing Systems (NIPS), vol. 14, pp. 585–591, 2001.
[11] Y. Cui and L. Fan, "A novel supervised dimensionality reduction algorithm: Graph-based fisher analysis," Pattern Recognition, vol. 45, no. 4, pp. 1471–1481, 2012.
[12] J. Shi, Z. Jiang, and H. Feng, "Adaptive graph embedding discriminant projections," Neural Processing Letters, pp. 1–16, 2013.
[13] E. Zare Borzeshi, M. Piccardi, K. Riesen, and H. Bunke, "Discriminative prototype selection methods for graph embedding," Pattern Recognition, 2012.
[14] G. Arvanitidis and A. Tefas, "Exploiting graph embedding in support vector machines," in Machine Learning for Signal Processing (MLSP), 2012 IEEE International Workshop on. IEEE, 2012, pp. 1–6.
[15] R. A. Fisher, "The statistical utilization of multiple measurements," Annals of Eugenics, vol. 8, pp. 376–386, 1938.
[16] A. Azran and Z. Ghahramani, "Spectral methods for automatic multiscale data clustering," in IEEE Computer Vision and Pattern Recognition (CVPR) (1). IEEE Computer Society, 2006, pp. 190–197.
[17] A. Maronidis, A. Tefas, and I. Pitas, "Frontal view recognition using spectral clustering and subspace learning methods," in ICANN (1), ser. Lecture Notes in Computer Science, K. I. Diamantaras, W. Duch, and L. S. Iliadis, Eds., vol. 6352. Springer, 2010, pp. 460–469.
[18] O. J. Dunn, "Multiple comparisons among means," Journal of the American Statistical Association, vol. 56, no. 293, pp. 52–64, 1961.