Maronidis, A., Tefas, A., & Pitas, I. (2016). Graph Embedding Exploiting
Subclasses. In 2015 IEEE Symposium Series on Computational Intelligence
(SSCI 2015): Proceedings of a meeting held 7-10 December 2015, Cape
Town, South Africa (pp. 1452-1459). Institute of Electrical and Electronics
Engineers (IEEE). https://doi.org/10.1109/SSCI.2015.206
Graph Embedding Exploiting Subclasses
Anastasios Maronidis, Anastasios Tefas and Ioannis Pitas
Department of Informatics,
Aristotle University of Thessaloniki,
P.O.Box 451, 54124
Thessaloniki, Greece
Email: amaronidis@iti.gr, tefas@aiia.csd.auth.gr, pitas@aiia.csd.auth.gr
Abstract—Recently, subspace learning methods for Dimensionality Reduction (DR), like Subclass Discriminant Analysis (SDA) and Clustering-based Discriminant Analysis (CDA), which use subclass information for the discrimination between the data classes, have attracted much attention. In parallel, important work has been accomplished on Graph Embedding (GE), which is a general framework unifying several subspace learning techniques. In this paper, GE has been extended in order to integrate subclass discriminant information, resulting in the novel Subclass Graph Embedding (SGE) framework. The kernelization of SGE is also presented. It is shown that SGE comprises a generalization of the typical GE, including subclass DR methods. In this vein, the theoretical link of the SDA and CDA methods with SGE is established. The efficacy and power of SGE have been substantiated by comparing subclass DR methods with a diversity of unimodal methods, all pertaining to the SGE framework, via a series of experiments on various real-world data.
I. INTRODUCTION
In recent years, a variety of subspace learning algorithms for dimensionality reduction (DR)
has been developed. Locality Preserving Projections
(LPP) [1], [2] and Principal Component Analysis
(PCA) [3] are two of the most popular unsupervised
linear DR algorithms with a wide range of applications. Besides, supervised methods like Linear Discriminant Analysis (LDA) [4] have shown superior
performance in many classification problems, since
through the DR process they aim at achieving data
class discrimination.
In practice, it is often the case that several data clusters appear inside the same class, imposing the need to integrate this information into the DR approach. Along these lines, techniques such as Clustering-based Discriminant Analysis (CDA) [5] and Subclass Discriminant Analysis (SDA) [6] have been proposed. Both of them utilize a specific objective criterion that incorporates the data subclass information in an attempt to discriminate subclasses that belong to different classes, while imposing no constraints on subclasses within the same class.
In parallel with the development of subspace learning techniques, considerable work has been carried out on DR from a graph-theoretic perspective. In this direction, Graph Embedding (GE) has been introduced as a generalized framework which unifies several existing DR methods and, furthermore, serves as a platform for developing novel algorithms [7]. In [2], [7] the connection of LPP, PCA and LDA with the GE framework has been illustrated and, in [7], employing GE, the authors propose Marginal Fisher Analysis (MFA). In addition, the ISOMAP [8], Locally Linear Embedding (LLE) [9] and Laplacian Eigenmaps (LE) [10] algorithms have also been interpreted within the GE framework [7].
From the perspective of GE, the data are considered as vertices of a graph, which is accompanied by
two matrices, the intrinsic and the penalty matrix,
weighing the edges among vertices. The intrinsic
matrix encodes the similarity relationships, while the
penalty matrix encodes the undesirable connections
among the data. In this context, the DR task is
translated to the problem of transforming the initial
graph into a new one in a way that the weights of
the intrinsic matrix are reinforced, while the weights
of the penalty matrix are suppressed.
Apart from the core ideas on GE presented in [7], other interesting works have also been published recently in the literature. A graph-based supervised DR method has been proposed in [11] for circumventing the problem of non-Gaussian distributed data. The importance degrees of the same-class and not-same-class vertices are encoded by the intrinsic and extrinsic graphs, respectively, based on a strictly monotonically decreasing function. Moreover, the kernel extension of the proposed approach is also presented. In [12], the selection of the neighbourhood parameters of the intrinsic and extrinsic graph matrices is performed adaptively, based on the different local manifold structure of different samples, enhancing in this way the intra-class similarity and inter-class separability.
Methodologies that convert a set of graphs into a vector space have also been presented. For instance, a novel prototype selection method from a class-labeled set of graphs has been proposed in [13]. A dissimilarity metric between a pair of graphs is established, and the dissimilarities of a graph from a set of prototypes are calculated, providing an n-dimensional feature vector. Several deterministic algorithms are used to select the prototypes with the most discriminative power [13]. The flexibility of GE has also been combined with the generalization ability of the support vector machine classifier, resulting in improved classification performance. In [14], the authors propose the substitution of the support vector machine kernel with subspace or submanifold kernels that are constructed based on the GE framework.
Despite the intense activity around GE, no extension of GE has been proposed that integrates subclass information. In this paper, such an extension is proposed, leading to the novel Subclass Graph Embedding (SGE) framework, which is the main contribution of our work. Using a subclass block form in both the intrinsic and penalty graph matrices, SGE optimizes a criterion which preserves the subclass structure and, simultaneously, the local geometry of the data. The local geometry may be modelled by any similarity or distance measure, while the subclass structure may be extracted by any clustering algorithm.

With appropriate parameter choices, SGE reduces to any of the well-known aforementioned algorithms. Along these lines, it is shown in this paper that a variety of unimodal DR algorithms are encapsulated within SGE. Furthermore, the theoretical link between SGE and the CDA and SDA methods is established, which is another novelty of our work. Finally, the kernelization of SGE (K-SGE) is also presented.
The efficacy of SGE and K-SGE is demonstrated
through a comparison between subclass DR methods
and a diversity of unimodal ones – all pertaining to
the SGE framework – via a series of experiments on
various datasets.
The remainder of this paper is organized as
follows. The subspace learning algorithms CDA and
SDA are presented in Section II in order to pave
the way for their connection with SGE. The novel
SGE framework along with its kernelization is presented in Section III. The connection between the
SGE framework and the several subspace learning
techniques is given in Section IV. A comparison of
the aforementioned methods on real-world datasets
is presented in Section V. Finally, conclusions are
drawn in Section VI.
II. SUBSPACE LEARNING TECHNIQUES
In this section, we provide the mathematical formulation of the subspace learning techniques CDA
and SDA in order to allow their connection with the
SGE framework. The other methods mentioned in
the Introduction are encapsulated in the proposed
SGE framework as well. However, their detailed
description is omitted, as they have already been
described in [7].
In the following analysis, we consider that each data sample, denoted by x, is an m-dimensional real vector, i.e., x ∈ R^m. We also denote by y ∈ R^{m′} its projection y = V^T x to a new m′-dimensional space using a projection matrix V ∈ R^{m×m′}. CDA and SDA attempt to minimize:

$$J(\mathbf{v}) = \frac{\mathbf{v}^T \mathbf{S}_W \mathbf{v}}{\mathbf{v}^T \mathbf{S}_B \mathbf{v}} \;, \quad (1)$$

where S_W is called the within and S_B the between scatter matrix [15]. These matrices are symmetric and positive semi-definite. The minimization of the ratio (1) leads to the following generalized eigenvalue decomposition problem for finding the optimal discriminant projection eigenvectors:

$$\mathbf{S}_W \mathbf{v} = \lambda \mathbf{S}_B \mathbf{v} \;. \quad (2)$$

The eigenvalues λ_i of the above eigenproblem are by definition positive or zero:

$$0 \leq \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_m \;. \quad (3)$$

Let v_1, v_2, ..., v_m be the corresponding eigenvectors. The projection y = V^T x from the initial space to the new space of reduced dimensionality then employs the projection matrix V = [v_1, v_2, ..., v_{m′}], whose columns are the eigenvectors v_i, i = 1, ..., m′, with m′ ≪ m.
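As a concrete illustration, the minimization of the ratio (1) through the generalized eigenproblem (2) can be sketched in a few lines of NumPy. The whitening reduction and all names below are our own illustrative choices, not part of the paper's formulation, and S_B is assumed strictly positive definite:

```python
import numpy as np

def discriminant_eigs(S_W, S_B, m_prime):
    """Solve S_W v = lambda S_B v (eq. 2) and keep the m' smallest
    eigenvalues, i.e. the minimizers of the ratio in eq. (1)."""
    # Whiten with S_B^{-1/2}: turns the generalized problem into an
    # ordinary symmetric eigenproblem.
    w_B, V_B = np.linalg.eigh(S_B)
    B_inv_sqrt = V_B @ np.diag(w_B ** -0.5) @ V_B.T
    M = B_inv_sqrt @ S_W @ B_inv_sqrt
    w, U = np.linalg.eigh(M)           # eigenvalues in ascending order
    V = B_inv_sqrt @ U                 # map back: v = S_B^{-1/2} u
    return w[:m_prime], V[:, :m_prime]
```

In practice, when S_B is singular a small ridge is commonly added to it before the decomposition; that regularization step is a standard numerical safeguard rather than part of the CDA/SDA criteria.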
Looking for a linear transform that effectively separates the projected data of each class, CDA makes use of potential subclass structure. Let us denote the total number of subclasses inside the i-th class by d_i and, for the j-th subclass of the i-th class, the number of its samples by n_{ij}, its q-th sample by x_q^{ij} and its mean vector by μ^{ij}. CDA attempts to minimize (1), where S_W^{(CDA)} is the within-subclass and S_B^{(CDA)} the between-subclass scatter matrix, defined in [5]:

$$\mathbf{S}_W^{(CDA)} = \sum_{i=1}^{c} \sum_{j=1}^{d_i} \sum_{q=1}^{n_{ij}} \left( \mathbf{x}_q^{ij} - \boldsymbol{\mu}^{ij} \right) \left( \mathbf{x}_q^{ij} - \boldsymbol{\mu}^{ij} \right)^T \;, \quad (4)$$

$$\mathbf{S}_B^{(CDA)} = \sum_{i=1}^{c-1} \sum_{l=i+1}^{c} \sum_{j=1}^{d_i} \sum_{h=1}^{d_l} \left( \boldsymbol{\mu}^{ij} - \boldsymbol{\mu}^{lh} \right) \left( \boldsymbol{\mu}^{ij} - \boldsymbol{\mu}^{lh} \right)^T \;. \quad (5)$$
The difference between SDA and CDA mainly lies in the definition of the within scatter matrix, while the between scatter matrix of SDA is a modified version of that of CDA. The exact definitions of the two matrices are:

$$\mathbf{S}_W^{(SDA)} = \sum_{q=1}^{n} \left( \mathbf{x}_q - \boldsymbol{\mu} \right) \left( \mathbf{x}_q - \boldsymbol{\mu} \right)^T \;, \quad (6)$$

$$\mathbf{S}_B^{(SDA)} = \sum_{i=1}^{c-1} \sum_{l=i+1}^{c} \sum_{j=1}^{d_i} \sum_{h=1}^{d_l} p_{ij} \, p_{lh} \left( \boldsymbol{\mu}^{ij} - \boldsymbol{\mu}^{lh} \right) \left( \boldsymbol{\mu}^{ij} - \boldsymbol{\mu}^{lh} \right)^T \;, \quad (7)$$

where p_{ij} = n_{ij}/n is the relative frequency of the j-th cluster of the i-th class [6]. It is worth mentioning that S_W^{(SDA)} is actually the total covariance matrix of the data.
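The CDA scatter matrices (4) and (5) translate directly into code. The sketch below is our own illustration (function and variable names are not from the paper); it takes per-sample class and subclass labels and accumulates the two sums:

```python
import numpy as np

def cda_scatters(X, y, z):
    """Within-subclass (eq. 4) and between-subclass (eq. 5) scatter
    matrices of CDA. X is (n, m); y holds the class label and z the
    subclass label of each sample."""
    m = X.shape[1]
    S_W = np.zeros((m, m))
    S_B = np.zeros((m, m))
    keys = sorted(set(zip(y.tolist(), z.tolist())))
    mu = {k: X[(y == k[0]) & (z == k[1])].mean(axis=0) for k in keys}
    # Eq. (4): deviations of the samples from their subclass mean.
    for k in keys:
        D = X[(y == k[0]) & (z == k[1])] - mu[k]
        S_W += D.T @ D
    # Eq. (5): pairs of subclass means belonging to different classes.
    for a in keys:
        for b in keys:
            if a[0] < b[0]:
                d = (mu[a] - mu[b])[:, None]
                S_B += d @ d.T
    return S_W, S_B
```

Note that, as in (5), subclass pairs within the same class contribute nothing to S_B, which is exactly the "no constraints within a class" property discussed in the Introduction.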
The previously described DR methods, along with LPP, PCA and LDA, can be viewed through a common prism, since the basic calculation element in the construction of their optimization criteria is the similarity among the samples. We can therefore unify them in a common framework by considering that the samples form a graph and by setting criteria on the similarities between the nodes of this graph. In the following section we describe this approach in detail.
III. SUBCLASS GRAPH EMBEDDING
In this section, the problem of dimensionality
reduction is described from a graph theoretic perspective. Before we present the novel SGE, let us
first briefly provide the main ideas of the core GE
framework.
A. Graph Embedding
In the GE framework, the set of data samples to be projected into a low-dimensional space is represented by two graphs, namely the intrinsic graph G_int = {X, W_int} and the penalty graph G_pen = {X, W_pen}, where X = {x_1, x_2, ..., x_n} is the set of data samples in both graphs. The intrinsic graph models the similarity connections between pairs of data samples that have to be reinforced after the projection. The penalty graph contains the connections between the data samples that must be suppressed after the projection. In both matrices, these connections may also take negative values, imposing the opposite effect. The choice of values for the intrinsic and penalty graph matrices may lead to supervised, unsupervised or semi-supervised DR algorithms.
Now, the problem of DR could be interpreted
in another way. It is desirable to project the initial
data to the new low dimensional space, such that the
geometrical structure of the data is preserved. The
corresponding objective function for optimization is:
$$\operatorname*{argmin}_{\operatorname{tr}\{\mathbf{Y} \mathbf{B} \mathbf{Y}^T\} = d} J(\mathbf{Y}) \;, \quad (8)$$

$$J(\mathbf{Y}) = \frac{1}{2} \operatorname{tr}\left\{ \sum_q \sum_p (\mathbf{y}_q - \mathbf{y}_p) \, W_{int}(q,p) \, (\mathbf{y}_q - \mathbf{y}_p)^T \right\} \;, \quad (9)$$

where Y = [y_1, y_2, ..., y_n] are the projected vectors, d is a constant, B is a constraint matrix defined to remove an arbitrary scaling factor in the embedding, and W_int(q,p) is the value of W_int at position (q,p) [7]. The structure of the objective function (9) postulates that the larger the value W_int(q,p) is, the smaller the distance between the projections of the data samples x_q and x_p has to be. By using some simple algebraic manipulations, equation (9) becomes:

$$J(\mathbf{Y}) = \operatorname{tr}\{\mathbf{Y} \mathbf{L}_{int} \mathbf{Y}^T\} \;, \quad (10)$$

where L_int = D_int − W_int is the intrinsic Laplacian matrix and D_int is the degree matrix, defined as the diagonal matrix which has at position (q,q) the value D_int(q,q) = Σ_p W_int(q,p).

The Laplacian matrix L_pen = D_pen − W_pen of the penalty graph is often used as the constraint matrix B. Thus, the above optimization problem becomes:

$$\operatorname*{argmin} \frac{\operatorname{tr}\{\mathbf{Y} \mathbf{L}_{int} \mathbf{Y}^T\}}{\operatorname{tr}\{\mathbf{Y} \mathbf{L}_{pen} \mathbf{Y}^T\}} \;. \quad (11)$$

The optimization of the above objective function is achieved by solving the generalized eigenproblem:

$$\mathbf{L}_{int} \mathbf{v} = \lambda \mathbf{L}_{pen} \mathbf{v} \;, \quad (12)$$

keeping the eigenvectors which correspond to the smallest eigenvalues.
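The algebraic step from the pairwise form (9) to the Laplacian form (10) is easy to check numerically. The following sketch (variable names are our own) builds L = D − W for a random symmetric weight matrix and confirms that the two expressions coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m_prime = 6, 2
W = rng.random((n, n)); W = (W + W.T) / 2   # symmetric intrinsic weights
Y = rng.random((m_prime, n))                 # projected vectors y_1..y_n

L = np.diag(W.sum(axis=1)) - W               # Laplacian L = D - W

# Pairwise sum of eq. (9) versus trace form of eq. (10):
pairwise = 0.5 * sum(W[q, p] * np.sum((Y[:, q] - Y[:, p]) ** 2)
                     for q in range(n) for p in range(n))
trace_form = np.trace(Y @ L @ Y.T)
```

Both quantities evaluate to the same number, which is precisely the identity used to pass from (9) to (10).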
This approach leads to the optimal projection of
the given data samples. In order to achieve the out of
sample projection, the linearization [7] of the above
approach should be used. If we employ y = VT x,
the objective function (9) becomes:
$$\operatorname*{argmin}_{\operatorname{tr}\{\mathbf{V}^T \mathbf{X} \mathbf{L}_{pen} \mathbf{X}^T \mathbf{V}\} = d} J(\mathbf{V}) \;, \quad (13)$$

where J(V) is defined as:

$$J(\mathbf{V}) = \frac{1}{2} \operatorname{tr}\left\{ \mathbf{V}^T \left( \sum_q \sum_p (\mathbf{x}_q - \mathbf{x}_p) \, W_{int}(q,p) \, (\mathbf{x}_q - \mathbf{x}_p)^T \right) \mathbf{V} \right\} \;, \quad (14)$$

where X = [x_1, x_2, ..., x_n]. By using simple algebraic manipulations, we have:

$$J(\mathbf{V}) = \operatorname{tr}\{\mathbf{V}^T \mathbf{X} \mathbf{L}_{int} \mathbf{X}^T \mathbf{V}\} \;. \quad (15)$$
Similarly to the straight approach, the optimal eigenvectors are given by solving the generalized eigenproblem:

$$\mathbf{X} \mathbf{L}_{int} \mathbf{X}^T \mathbf{v} = \lambda \mathbf{X} \mathbf{L}_{pen} \mathbf{X}^T \mathbf{v} \;. \quad (16)$$

B. Linear Subclass Graph Embedding

In this section, we propose a GE framework that allows the exploitation of subclass information. In the following analysis, it is assumed that the subclass labels are known. We attempt to minimize the scatter of the data samples within the same subclass, while separating data samples from subclasses that belong to different classes. Finally, we impose no constraints on samples that belong to different subclasses of the same class.

Usually, in real-world problems, the local geometry of the data is related to the global supervised structure: samples that belong to the same class or subclass should be "sufficiently close" to each other. SGE exploits this fact. It simultaneously handles supervised and unsupervised information and, as a consequence, combines the global labeling information with the local geometrical characteristics of the data samples. This is achieved by weighing the above connections with the similarities of the data samples. The Gaussian similarity function has been used in this paper for this purpose:

$$S_{qp} = S(\mathbf{x}_q, \mathbf{x}_p) = \exp\left( -\frac{d^2(\mathbf{x}_q, \mathbf{x}_p)}{\sigma^2} \right) \;, \quad (17)$$

where d(x_q, x_p) is a distance metric (e.g., Euclidean) and σ² is a parameter (variance) that determines the distance scale.

Let us denote by P an affinity matrix. Without limiting the generality, we assume that this matrix has block form, depending on the subclass and the class of the data samples. Using the linearized approach, we attempt to optimize a more general discrimination criterion. We consider again that y = V^T x is the projection of x to the new subspace. Let P^{ij}(q,p) be the value of P at position (q,p) of the submatrix that corresponds to the j-th subclass of the i-th class. Then, the proposed criterion is:

$$\operatorname*{argmin} J(\mathbf{Y}) \;, \quad (18)$$

$$J(\mathbf{Y}) = \frac{1}{2} \operatorname{tr}\left\{ \sum_{i=1}^{c} \sum_{j=1}^{d_i} \sum_{q=1}^{n_{ij}} \sum_{p=1}^{n_{ij}} \left( \mathbf{y}_q^{ij} - \mathbf{y}_p^{ij} \right) P^{ij}(q,p) \left( \mathbf{y}_q^{ij} - \mathbf{y}_p^{ij} \right)^T \right\} \quad (19)$$

$$= \frac{1}{2} \operatorname{tr}\left\{ \mathbf{V}^T \left( \sum_{i=1}^{c} \sum_{j=1}^{d_i} \sum_{q=1}^{n_{ij}} \sum_{p=1}^{n_{ij}} \left( \mathbf{x}_q^{ij} - \mathbf{x}_p^{ij} \right) P^{ij}(q,p) \left( \mathbf{x}_q^{ij} - \mathbf{x}_p^{ij} \right)^T \right) \mathbf{V} \right\} \quad (20)$$

$$= \operatorname{tr}\left\{ \mathbf{V}^T \mathbf{X} \left( \mathbf{D}_{int} - \mathbf{W}_{int} \right) \mathbf{X}^T \mathbf{V} \right\} \quad (21)$$

$$= \operatorname{tr}\left\{ \mathbf{V}^T \mathbf{X} \mathbf{L}_{int} \mathbf{X}^T \mathbf{V} \right\} \;. \quad (22)$$

The derivation of (22) is omitted due to lack of space. The matrix W_int is block diagonal, with blocks that correspond to the classes, and is given by:

$$\mathbf{W}_{int} = \begin{pmatrix} \mathbf{W}_{int}^1 & & \mathbf{0} \\ & \ddots & \\ \mathbf{0} & & \mathbf{W}_{int}^c \end{pmatrix} \;. \quad (23)$$

The W_int^i are block diagonal submatrices, with blocks that correspond to the subclasses, given by:

$$\mathbf{W}_{int}^i = \begin{pmatrix} \mathbf{P}^{i1} & & \mathbf{0} \\ & \ddots & \\ \mathbf{0} & & \mathbf{P}^{id_i} \end{pmatrix} \;, \quad (24)$$

where P^{ij} is the submatrix of P that corresponds to the data of the j-th cluster of the i-th class. By looking carefully at the form of W_int, it is clear that the degree intrinsic matrix D_int has values:

$$D_{int}\left( \sum_{s=0}^{i-1} \sum_{t=0}^{j-1} n_{st} + q, \; \sum_{s=0}^{i-1} \sum_{t=0}^{j-1} n_{st} + q \right) = \sum_p P^{ij}(q,p) \;, \quad (25)$$

where p runs over the indices of the j-th cluster of the i-th class.

In parallel, we demand to maximize a criterion which encodes the similarities among the centroid vectors of the subclasses. Let the value Q_{ij}^{lh} express the similarity between the centroid vectors μ^{ij} and μ^{lh}. The more similar two centroids that belong to different classes are, the further apart their projections m^{ij} = V^T μ^{ij} have to be from each other:

$$\operatorname*{argmax} G(\mathbf{m}^{ij}) \;, \quad (26)$$

$$G(\mathbf{m}^{ij}) = \operatorname{tr}\left\{ \sum_{i=1}^{c-1} \sum_{l=i+1}^{c} \sum_{j=1}^{d_i} \sum_{h=1}^{d_l} \left( \mathbf{m}^{ij} - \mathbf{m}^{lh} \right) Q_{ij}^{lh} \left( \mathbf{m}^{ij} - \mathbf{m}^{lh} \right)^T \right\} \quad (27)$$

$$= \operatorname{tr}\left\{ \mathbf{V}^T \left( \sum_{i=1}^{c-1} \sum_{l=i+1}^{c} \sum_{j=1}^{d_i} \sum_{h=1}^{d_l} Q_{ij}^{lh} \left( \boldsymbol{\mu}^{ij} - \boldsymbol{\mu}^{lh} \right) \left( \boldsymbol{\mu}^{ij} - \boldsymbol{\mu}^{lh} \right)^T \right) \mathbf{V} \right\} \quad (28)$$

$$= \operatorname{tr}\left\{ \mathbf{V}^T \mathbf{X} \left( \mathbf{D}_{pen} - \mathbf{W}_{pen} \right) \mathbf{X}^T \mathbf{V} \right\} \quad (29)$$

$$= \operatorname{tr}\left\{ \mathbf{V}^T \mathbf{X} \mathbf{L}_{pen} \mathbf{X}^T \mathbf{V} \right\} \;. \quad (30)$$

Again, the derivation of (30) is omitted due to lack of space. The block matrix W_pen in (29) consists of block submatrices:

$$\mathbf{W}_{pen} = \begin{pmatrix} \mathbf{W}_{pen}^{1,1} & \mathbf{W}_{pen}^{1,2} & \cdots & \mathbf{W}_{pen}^{1,c} \\ \mathbf{W}_{pen}^{2,1} & \mathbf{W}_{pen}^{2,2} & \cdots & \mathbf{W}_{pen}^{2,c} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{W}_{pen}^{c,1} & \mathbf{W}_{pen}^{c,2} & \cdots & \mathbf{W}_{pen}^{c,c} \end{pmatrix} \;. \quad (31)$$

The submatrices W_pen^{i,i} lying on the main block diagonal are given by:

$$\mathbf{W}_{pen}^{i,i} = \begin{pmatrix} \mathbf{W}^{i1} & & \mathbf{0} \\ & \ddots & \\ \mathbf{0} & & \mathbf{W}^{id_i} \end{pmatrix} \;, \quad (32)$$

where W^{ij} corresponds to the j-th subclass of the i-th class and is given by:

$$\mathbf{W}^{ij} = -\frac{\sum_{\omega \neq i} \sum_{t=1}^{d_\omega} Q_{ij}^{\omega t}}{(n_{ij})^2} \, \mathbf{e}^{n_{ij}} \left( \mathbf{e}^{n_{ij}} \right)^T \;, \quad (33)$$

where e^{n_{ij}} = [1, 1, ..., 1]^T is the all-ones vector of length n_{ij}. Respectively, the off-diagonal submatrices of W_pen are given by:

$$\mathbf{W}_{pen}^{i,l} = \begin{pmatrix} \mathbf{W}_{i1}^{l1} & \mathbf{W}_{i1}^{l2} & \cdots & \mathbf{W}_{i1}^{ld_l} \\ \mathbf{W}_{i2}^{l1} & \mathbf{W}_{i2}^{l2} & \cdots & \mathbf{W}_{i2}^{ld_l} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{W}_{id_i}^{l1} & \mathbf{W}_{id_i}^{l2} & \cdots & \mathbf{W}_{id_i}^{ld_l} \end{pmatrix} \;, \quad i \neq l \;, \quad (34)$$

where:

$$\mathbf{W}_{ij}^{lh} = \frac{Q_{ij}^{lh}}{n_{ij} n_{lh}} \, \mathbf{e}^{n_{ij}} \left( \mathbf{e}^{n_{lh}} \right)^T \;. \quad (35)$$

It can be easily shown that D_pen = 0, so that L_pen = −W_pen.
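As a sketch of how the block-structured intrinsic matrix of (23)-(25) can be assembled in practice, the snippet below keeps the Gaussian similarities (17) only between samples that share both class and subclass label; when the samples are ordered by class and subclass, this mask yields exactly the block-diagonal form. Function and variable names are our own, not from the paper:

```python
import numpy as np

def sge_intrinsic(X, y, z, sigma2=1.0):
    """W_int for SGE: Gaussian similarity (eq. 17) inside each
    subclass, zero elsewhere (cf. eqs. 23-24)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    S = np.exp(-sq / sigma2)                              # eq. (17)
    same = (y[:, None] == y[None, :]) & (z[:, None] == z[None, :])
    return S * same

def degree_matrix(W):
    """D_int of eq. (25): row sums placed on the diagonal."""
    return np.diag(W.sum(axis=1))
```

The intrinsic Laplacian then follows as L_int = degree_matrix(W_int) − W_int, whose rows sum to zero by construction.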
C. Kernel Subclass Graph Embedding
In this section, the kernelization of SGE is presented. Let us denote by X the initial data space, by F a Hilbert space and by f the non-linear mapping function from X to F. The main idea is to first map the original data from the initial space into a high-dimensional Hilbert space and then perform linear subspace analysis in that space. If we denote by m_F the dimensionality of the Hilbert space, then the above procedure is described as:
$$\mathcal{X} \ni \mathbf{x}_q \rightarrow \mathbf{y}_q = f(\mathbf{x}_q) = \begin{pmatrix} \sum_{p=1}^{n} a_{1p} \, k(\mathbf{x}_q, \mathbf{x}_p) \\ \vdots \\ \sum_{p=1}^{n} a_{m_F p} \, k(\mathbf{x}_q, \mathbf{x}_p) \end{pmatrix} \in \mathcal{F} \;, \quad (36)$$

where k is the kernel function. From the above equation it is obvious that

$$\mathbf{Y} = \mathbf{A}^T \mathbf{K} \;, \quad (37)$$

where K is the Gram matrix, which has at position (q,p) the value K_{qp} = k(x_q, x_p), and

$$\mathbf{A} = [\mathbf{a}_1 \cdots \mathbf{a}_{m_F}] = \begin{pmatrix} a_{11} & \cdots & a_{m_F 1} \\ \vdots & \ddots & \vdots \\ a_{1n} & \cdots & a_{m_F n} \end{pmatrix} \quad (38)$$

is the map coefficient matrix. Consequently, the final SGE optimization becomes:

$$\operatorname*{argmin} \frac{\operatorname{tr}\{\mathbf{A}^T \mathbf{K} \mathbf{L}_{int} \mathbf{K} \mathbf{A}\}}{\operatorname{tr}\{\mathbf{A}^T \mathbf{K} \mathbf{L}_{pen} \mathbf{K} \mathbf{A}\}} \;. \quad (39)$$

Similarly to the linear case, in order to find the optimal projections, we resolve the generalized eigenproblem:

$$\mathbf{K} \mathbf{L}_{int} \mathbf{K} \mathbf{a} = \lambda \mathbf{K} \mathbf{L}_{pen} \mathbf{K} \mathbf{a} \;, \quad (40)$$
keeping the eigenvectors that correspond to the
smallest eigenvalues.
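A minimal sketch of the K-SGE eigenproblem (40) follows; the small ridge added to K L_pen K is our own numerical safeguard (to keep the matrix positive definite), not part of the formulation, and all names are illustrative:

```python
import numpy as np

def ksge(K, L_int, L_pen, m_prime, eps=1e-8):
    """Solve K L_int K a = lambda K L_pen K a (eq. 40), keeping the
    eigenvectors of the m' smallest eigenvalues."""
    A = K @ L_int @ K
    B = K @ L_pen @ K + eps * np.eye(len(K))   # ridge for stability
    w_B, V_B = np.linalg.eigh(B)
    B_is = V_B @ np.diag(w_B ** -0.5) @ V_B.T  # B^{-1/2}
    w, U = np.linalg.eigh(B_is @ A @ B_is)     # ordinary symmetric problem
    return w[:m_prime], B_is @ U[:, :m_prime]
```

The same whitening reduction as in the linear case applies here, since K L_int K and K L_pen K are symmetric whenever K and the two Laplacians are.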
IV. SGE AS A GENERAL DIMENSIONALITY REDUCTION FRAMEWORK
In this section, it is shown that SGE is a generalized framework that can be used for subspace learning, since all the standard approaches are specific cases of SGE. Let us use the Gaussian similarity function (17) in order to construct the affinity matrix. In the following analysis, we initially let the variance σ² of the Gaussian tend to infinity. Hence, S(x_q, x_p) = 1, ∀(q,p) ∈ {1, 2, ..., n}².
Let the intrinsic matrix elements be:

$$P^{ij}(q,p) = \begin{cases} \dfrac{S(\mathbf{x}_q, \mathbf{x}_p)}{n_{ij}} = \dfrac{1}{n_{ij}} \;, & \text{if } \mathbf{x}_q, \mathbf{x}_p \in C_{ij} \\[4pt] 0 \;, & \text{otherwise} \end{cases} \quad (41)$$

where C_{ij} is the set of the samples that belong to the j-th subclass of the i-th class. Obviously, (20) becomes the within-subclass criterion of CDA (see also eq. 4). Thus, in this case, W_int is the intrinsic graph matrix of CDA. Let also the penalty matrix elements be:

$$Q_{ij}^{lh} = S(\boldsymbol{\mu}^{ij}, \boldsymbol{\mu}^{lh}) = 1 \;, \quad \forall \, i, j, h, l \;. \quad (42)$$

Then, (28) becomes the between-subclass criterion of CDA (see also eq. 5). Thus, W_pen is the penalty graph matrix of CDA, and the connection between CDA and GE has been established.
Let us consider that each data sample constitutes its own class, i.e., c = n, d_i = 1 and n_i = 1, ∀i ∈ {1, 2, ..., c}. Thus, each class-block of the penalty graph matrix reduces to a single element of the matrix. Obviously, each data sample coincides with the mean of its class. By setting:

$$Q_{i1}^{l1} = \frac{S(\boldsymbol{\mu}^i, \boldsymbol{\mu}^l)}{n} = \frac{1}{n} \;, \quad \forall \, (i,l) \in \{1, 2, \ldots, c\}^2 \;, \quad (43)$$

then:

$$-\frac{\sum_{\omega \neq i} \sum_{t=1}^{d_\omega} Q_{i1}^{\omega t}}{(n_i)^2} = \frac{1}{n} - 1 \;. \quad (44)$$

These values lie on the main diagonal of the penalty graph matrix. Regarding the off-diagonal elements, we have:

$$\frac{Q_{i1}^{l1}}{n_i n_l} = \frac{1}{n} \;. \quad (45)$$

It can be easily shown that the degree penalty matrix is D_pen = 0, so that L_pen = −W_pen. Obviously, L_pen = I − (1/n) e^n (e^n)^T, and X L_pen X^T becomes the covariance matrix C of the data. By using the identity matrix as intrinsic graph matrix, SGE becomes identical to PCA:

$$\operatorname*{argmin} \frac{\operatorname{tr}\{\mathbf{V}^T \mathbf{X} \mathbf{L}_{int} \mathbf{X}^T \mathbf{V}\}}{\operatorname{tr}\{\mathbf{V}^T \mathbf{X} \mathbf{L}_{pen} \mathbf{X}^T \mathbf{V}\}} = \operatorname*{argmin} \frac{\operatorname{tr}\{\mathbf{V}^T \mathbf{I} \mathbf{V}\}}{\operatorname{tr}\{\mathbf{V}^T \mathbf{C} \mathbf{V}\}} \;, \quad (46)$$

leading to the following generalized eigenproblem:

$$\mathbf{I} \mathbf{v} = \lambda \mathbf{C} \mathbf{v} \;, \quad (47)$$

solved by keeping the smallest eigenvalues; or, by setting μ = 1/λ, since λ ≠ 0, this leads to:

$$\mathbf{C} \mathbf{v} = \mu \mathbf{I} \mathbf{v} \;, \quad (48)$$

solved by keeping the greatest eigenvalues, which is obviously the PCA solution.

Now, consider that every class consists of a unique subclass, thus d_i = 1, ∀i ∈ {1, 2, ..., c}. If we set:

$$P^{i1}(q,p) = \begin{cases} \dfrac{S(\mathbf{x}_q, \mathbf{x}_p)}{n_i} = \dfrac{1}{n_i} \;, & \text{if } \mathbf{x}_q, \mathbf{x}_p \in C_i \\[4pt] 0 \;, & \text{otherwise} \end{cases} \quad (49)$$

then the intrinsic graph matrix becomes that of LDA. Furthermore, if we set:

$$Q_{i1}^{l1} = \frac{n_i n_l}{n} \;, \quad \forall \, (i,l) \in \{1, \ldots, c\}^2 \;, \quad (50)$$

then:

$$-\frac{\sum_{\omega \neq i} \sum_{t=1}^{d_\omega} Q_{i1}^{\omega t}}{(n_i)^2} = \frac{n_i - n}{n n_i} \quad (51)$$

and

$$\frac{Q_{i1}^{l1}}{n_i n_l} = \frac{1}{n} \;. \quad (52)$$

These are the values of the penalty graph matrix of LDA. So, by taking the Laplacians of the above matrices, we end up with the LDA algorithm.

Let us now reject the assumption that the variance of the Gaussian tends to infinity. Consider that there is only one class which contains the whole set of the data, i.e., c = 1. Also consider that there are no subclasses within this unique class, i.e., d_1 = 1. In this case the intrinsic graph matrix becomes equal to P. Thus, by setting P equal to the affinity matrix S, the intrinsic Laplacian matrix becomes that of LPP.

We observe that by utilizing the identity matrix as the penalty Laplacian matrix, we obviously get the LPP algorithm. Since we consider a unique class which contains a unique subclass, from (31) and (32) we have that W_pen = W^{11}. The values of W^{11} are given by (33), which in this case reduces to:

$$\mathbf{W}^{11} = -\frac{Q_{11}^{11}}{n^2} \, \mathbf{e}^n \left( \mathbf{e}^n \right)^T \;. \quad (53)$$

If we set:

$$Q_{11}^{11} = \frac{n^2}{1-n} \;, \quad (54)$$

then W_pen = W^{11} = (1/(n−1)) e^n (e^n)^T. Consequently,

$$\mathbf{L}_{pen} = \begin{pmatrix} 1 & \frac{1}{1-n} & \cdots & \frac{1}{1-n} \\ \frac{1}{1-n} & 1 & \cdots & \frac{1}{1-n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{1}{1-n} & \frac{1}{1-n} & \cdots & 1 \end{pmatrix} \;. \quad (55)$$

Thus, if we make the assumption that the number of data samples becomes very large, then asymptotically we have L_pen = I.

Finally, to complete the analysis, if we consider as the intrinsic Laplacian matrix the matrix

$$\mathbf{L}_{int} = \mathbf{I} - \frac{1}{n} \mathbf{e}^n \left( \mathbf{e}^n \right)^T \quad (56)$$

and if we set:

$$Q_{ij}^{lh} = \frac{n_{ij} n_{lh}}{n} \quad (57)$$

in (33) and (35), SGE becomes identical to SDA. The parameters that determine the connection of the several methods with SGE are summarized in Table I.

TABLE I: Dimensionality Reduction Using SGE Framework

Method | P (intrinsic, L_int)                            | Q (penalty, L_pen)                   | σ²  | c | d_i
LPP    | P^{11}(q,p) = exp(−d²(x_q,x_p)/σ²), ∀ x_q, x_p  | Q_{11}^{11} = n²/(1−n)  (L_pen = I)  | σ²  | 1 | 1
PCA    | L_int = I                                       | Q_{i1}^{l1} = 1/n                    | ∞   | n | 1
LDA    | P^{i1}(q,p) = 1/n_i,  x_q, x_p ∈ C_i            | Q_{i1}^{l1} = n_i n_l / n            | ∞   | c | 1
CDA    | P^{ij}(q,p) = 1/n_{ij},  x_q, x_p ∈ C_{ij}      | Q_{ij}^{lh} = 1                      | ∞   | c | d_i
SDA    | L_int = I − (1/n) e^n (e^n)^T                   | Q_{ij}^{lh} = n_{ij} n_{lh} / n      | ∞   | c | d_i
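The claim that X L_pen X^T reduces to the scatter of the data under the PCA parameter choices above is easy to verify numerically. In the sketch below (variable names ours), the scatter equals the covariance matrix up to a 1/n factor:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 20, 4
X = rng.random((m, n))                  # columns are the samples

e = np.ones((n, 1))
L_pen = np.eye(n) - (e @ e.T) / n       # L_pen = I - (1/n) e^n (e^n)^T

mu = X.mean(axis=1, keepdims=True)
scatter = (X - mu) @ (X - mu).T         # sum of (x_q - mu)(x_q - mu)^T
```

Here X L_pen X^T and the centered scatter coincide exactly, since X L_pen X^T = X X^T − n μ μ^T.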
V. EXPERIMENTAL RESULTS
We conducted 5-fold cross-validation classification experiments on several real-world datasets using the proposed linear and kernel SGE framework. For automatically extracting the subclass structure, we utilized the multiscale Spectral Clustering technique [16], keeping the most plausible partition for each dataset. For classifying the data, the Nearest Centroid (NC) classifier has been used with the LPP, PCA and LDA algorithms, while the Nearest Cluster Centroid (NCC) classifier [17] has been used with the CDA and SDA algorithms. In NCC, the cluster centroids are calculated and the test sample is assigned to the class of the nearest cluster centroid. NC and NCC were selected because they provide the optimal classification solutions in Bayesian terms, thus showing whether the DR methods have reached the goal described by their specific criterion.
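For concreteness, the NCC decision rule can be sketched as follows. This is a toy illustration under our own naming; the paper's actual experimental code is not reproduced here:

```python
import numpy as np

def ncc_predict(x, centroids, centroid_class):
    """Nearest Cluster Centroid: assign x to the class of its
    nearest subclass centroid."""
    dists = np.linalg.norm(centroids - x, axis=1)
    return centroid_class[int(np.argmin(dists))]
```

NC is the special case where each class contributes a single centroid.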
A. Classification experiments

For the classification experiments, we used diverse publicly available datasets offered for various classification problems. More specifically, FER-AIIA, BU, JAFFE and KANADE were used for facial expression recognition, XM2VTS for face frontal view recognition, and MNIST and SEMEION for optical digit recognition. Finally, IONOSPHERE, MONK and PIMA were used in order to further extend our experimental study to diverse data classification problems.
The cross-validation classification accuracy rates for the several subspace learning methods over the utilized datasets are summarized in Table II. The optimal dimensionality of the projected space that returned these results is shown in parentheses. For each dataset, the best performance rate among the linear and the kernel methods separately is highlighted in bold, while the best overall performance rate
among all methods, both linear and kernel, is surrounded by a rectangle. The classification performance rank of each method is also reported in the last two rows of Table II. Specific Rank denotes the rank of each method among the linear or the kernel methods independently, while Overall Rank refers to the rank of each method among both the linear and the kernel methods. The ranking has been obtained through a post-hoc Bonferroni test [18].
An immediate remark from Table II is that, in both the linear and the kernel case, multimodal methods exhibit better classification performance than the unimodal ones. In particular, the top overall performance is attained by SDA, followed by CDA, while the worst performance is shown by KLPP and KPCA. This result clearly shows that the inclusion of subclass information in the DR process offers a strong potential to improve the performance of the state-of-the-art in many classification domains.
Comparing linear with kernel methods, a simple calculation yields a mean overall rank of 5.08 for the linear methods and 5.90 for the kernel ones. Although the average performance of the linear methods is clearly better than that of the kernel ones, we must admit that there is ample space for improving the kernel results by varying the RBF parameter, as the selection of this parameter is not trivial and may easily lead to over-fitting. Actually, the top performance rates presented in this paper have been obtained by testing indicative values of this parameter. As a matter of fact, it is interesting to observe that the use of kernels proves beneficial for some methods on certain datasets, while it deteriorates the performance of others.
TABLE II: Cross Validation Classification Accuracies (%) of Linear and Kernel Methods on Several Real-World Datasets

DATASET       | LPP       | PCA       | LDA     | CDA      | SDA      | KLPP      | KPCA      | KDA     | KCDA     | KSDA
FER-AIIA      | 40.9(3)   | 31.0(120) | 64.6(6) | 73.2     | 75.5(11) | 50.2(252) | 41.5(29)  | 54.9(6) | 56.1(12) | 53.5(12)
BU            | 39.4(298) | 38.1(49)  | 51.6(6) | 49.1(16) | 52.3(15) | 52.7(317) | 35.9(290) | 46.6(6) | 41.0(13) | 48.0(14)
JAFFE         | 46.8(18)  | 37.6(39)  | 53.2(6) | 40.0(15) | 54.1(6)  | 28.8(98)  | 25.9(58)  | 42.4(6) | 36.1(18) | 46.3(5)
KANADE        | 34.2(92)  | 43.3(46)  | 67.1(6) | 59.7(7)  | 67.1(5)  | 32.7(99)  | 33.2(88)  | 44.3(6) | 40.0(6)  | 38.5(6)
MNIST         | 71.1(259) | 79.9(135) | 84.6(9) | 84.8(15) | 85.1(14) | 81.4(299) | 64.5(155) | 86.0(9) | 83.4(19) | 85.2(15)
SEMEION       | 53.6(99)  | 83.2(55)  | 88.2(9) | 89.2(19) | 89.4(19) | 83.8(99)  | 77.4(77)  | 95.3(9) | 94.1(19) | 95.9(19)
XM2VTS        | 95.7(54)  | 92.0(86)  | 70.5(1) | 98.1(3)  | 97.4(2)  | 71.3(297) | 74.7(56)  | 61.3(1) | 71.5(3)  | 57.3(4)
IONOSPHERE    | 84.6(23)  | 72.3(15)  | 78.9(1) | 80.6(2)  | 83.4(2)  | 83.7(23)  | 70.3(2)   | 92.9(1) | 93.1(1)  | 92.9(1)
MONK 1        | 66.7(3)   | 68.3(5)   | 50.8(1) | 70.0(4)  | 74.2(3)  | 63.3(2)   | 72.5(1)   | 55.8(1) | 58.3(4)  | 61.7(3)
MONK 2        | 56.0(1)   | 53.3(4)   | 52.0(1) | 54.2(1)  | 54.0(2)  | 54.8(1)   | 59.8(3)   | 69.7(1) | 78.7(1)  | 54.5(1)
MONK 3        | 77.2(5)   | 80.9(4)   | 49.4(1) | 74.6(2)  | 66.3(2)  | 62.5(2)   | 79.2(5)   | 51.7(1) | 67.5(2)  | 58.3(1)
PIMA          | 61.8(1)   | 63.5(6)   | 56.5(1) | 60.5(3)  | 73.5(3)  | 50.7(3)   | 67.5(4)   | 48.9(1) | 52.5(3)  | 52.9(1)
SPECIFIC RANK | 3.3       | 3.8       | 3.6     | 2.5      | 1.6      | 3.5       | 3.4       | 2.9     | 2.4      | 2.7
OVERALL RANK  | 5.8       | 6.4       | 6.0     | 4.2      | 3.0      | 6.7       | 6.7       | 5.4     | 5.2      | 5.5

VI. CONCLUSIONS

In this paper, data subclass information has been incorporated within Graph Embedding (GE), leading to a novel Subclass Graph Embedding (SGE) framework, which constitutes the main contribution of our work. In particular, it has been shown that SGE comprises a generalization of GE, encapsulating a number of state-of-the-art unimodal subspace learning techniques already integrated within GE. Besides, the connection of SGE with subspace learning algorithms that use subclass information in the embedding process has also been analytically proven. The kernelization of SGE has also been presented.

Through an extensive experimental study, it has been shown that subclass learning techniques outperform a number of state-of-the-art unimodal learning methods on many real-world datasets pertaining to various classification domains. In addition, the experimental results highlight the superiority, in terms of classification performance, of linear methods over kernel ones.

In the near future, we intend to employ SGE as a template for designing novel DR methods. For instance, as current subclass methods are strongly dependent on the underlying distribution of the data, we anticipate that novel methods which use neighbourhood information among the data of the several subclasses will succeed in alleviating this sort of limitation.
REFERENCES

[1] X. He and P. Niyogi, "Locality preserving projections," in NIPS, S. Thrun, L. K. Saul, and B. Schölkopf, Eds. MIT Press, 2003.
[2] X. He, S. Yan, Y. Hu, P. Niyogi, and H. Zhang, "Face recognition using laplacianfaces," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 3, pp. 328-340, 2005.
[3] I. Jolliffe, Principal Component Analysis. Springer Verlag, 1986.
[4] D. J. Kriegman, J. P. Hespanha, and P. N. Belhumeur, "Eigenfaces vs. fisherfaces: Recognition using class-specific linear projection," in ECCV, 1996, pp. I:43-58.
[5] X. W. Chen and T. S. Huang, "Facial expression recognition: A clustering-based approach," Pattern Recognition Letters, vol. 24, no. 9-10, pp. 1295-1302, Jun. 2003.
[6] M. L. Zhu and A. M. Martinez, "Subclass discriminant analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 8, pp. 1274-1286, Aug. 2006.
[7] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin, "Graph embedding and extensions: A general framework for dimensionality reduction," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 1, pp. 40-51, 2007.
[8] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, no. 5500, pp. 2319-2323, Dec. 2000.
[9] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323-2326, Dec. 2000.
[10] M. Belkin and P. Niyogi, "Laplacian eigenmaps and spectral techniques for embedding and clustering," Advances in Neural Information Processing Systems (NIPS), vol. 14, pp. 585-591, 2001.
[11] Y. Cui and L. Fan, "A novel supervised dimensionality reduction algorithm: Graph-based fisher analysis," Pattern Recognition, vol. 45, no. 4, pp. 1471-1481, 2012.
[12] J. Shi, Z. Jiang, and H. Feng, "Adaptive graph embedding discriminant projections," Neural Processing Letters, pp. 1-16, 2013.
[13] E. Zare Borzeshi, M. Piccardi, K. Riesen, and H. Bunke, "Discriminative prototype selection methods for graph embedding," Pattern Recognition, 2012.
[14] G. Arvanitidis and A. Tefas, "Exploiting graph embedding in support vector machines," in Machine Learning for Signal Processing (MLSP), 2012 IEEE International Workshop on. IEEE, 2012, pp. 1-6.
[15] R. A. Fisher, "The statistical utilization of multiple measurements," Annals of Eugenics, vol. 8, pp. 376-386, 1938.
[16] A. Azran and Z. Ghahramani, "Spectral methods for automatic multiscale data clustering," in IEEE Computer Vision and Pattern Recognition (CVPR) (1). IEEE Computer Society, 2006, pp. 190-197.
[17] A. Maronidis, A. Tefas, and I. Pitas, "Frontal view recognition using spectral clustering and subspace learning methods," in ICANN (1), ser. Lecture Notes in Computer Science, K. I. Diamantaras, W. Duch, and L. S. Iliadis, Eds., vol. 6352. Springer, 2010, pp. 460-469.
[18] O. J. Dunn, "Multiple comparisons among means," Journal of the American Statistical Association, vol. 56, no. 293, pp. 52-64, 1961.