Speed Up Kernel Discriminant Analysis

Deng Cai · Xiaofei He
State Key Lab of CAD&CG, College of Computer Science, Zhejiang University
Tel.: +86-571-88206681, Fax: +86-571-88206680
E-mail: {dengcai,xiaofeihe}@cad.zju.edu.cn

Jiawei Han
Department of Computer Science, University of Illinois at Urbana-Champaign
E-mail: hanj@cs.uiuc.edu

Abstract Linear Discriminant Analysis (LDA) has been a popular method for dimensionality reduction which preserves class separability. The projection vectors are commonly obtained by maximizing the between-class covariance and simultaneously minimizing the within-class covariance. LDA can be performed either in the original input space or in the reproducing kernel Hilbert space (RKHS) into which data points are mapped, which leads to Kernel Discriminant Analysis (KDA). When the data are highly nonlinearly distributed, KDA can achieve better performance than LDA. However, computing the projective functions in KDA involves the eigen-decomposition of the kernel matrix, which is very expensive when a large number of training samples exist. In this paper, we present a new algorithm for kernel discriminant analysis, called Spectral Regression Kernel Discriminant Analysis (SRKDA). By using spectral graph analysis, SRKDA casts discriminant analysis into a regression framework, which facilitates both efficient computation and the use of regularization techniques. Specifically, SRKDA only needs to solve a set of regularized regression problems, and no eigenvector computation is involved, which saves a huge amount of computational cost. The new formulation also makes it very easy to develop an incremental version of the algorithm which can fully utilize the computational results of the existing training samples. Moreover, it is easy to produce sparse projections (Sparse KDA) with an L1-norm regularizer. Extensive experiments on spoken letter, handwritten digit image and face image data demonstrate the effectiveness and efficiency of the proposed algorithm.

Keywords Kernel Discriminant Analysis · Regression · Subspace Learning · Dimensionality Reduction

1 Introduction

Dimensionality reduction has been a key problem in many fields of information processing, such as data mining, information retrieval, and pattern recognition. When data is represented as points in a high-dimensional space, one is often confronted with tasks like nearest neighbor search. Many methods have been proposed to index the data for fast query response, such as the K-D tree, R tree, R* tree, etc. [16]. However, these methods can only operate with small dimensionality, typically less than 100. The effectiveness and efficiency of these methods drop exponentially as the dimensionality increases, which is commonly referred to as the "curse of dimensionality".

During the last decade, with the advances in computer technologies and the advent of the World Wide Web, there has been an explosion in the amount and complexity of digital data being generated, stored, analyzed, and accessed. Much of this information is multimedia in nature, including text, image, and video data. The multimedia data are typically of very high dimensionality, ranging from several thousands to several hundreds of thousands. Learning with such high dimensionality is in many cases almost infeasible. Thus, learnability necessitates dimensionality reduction [8, 30, 33, 34].
Once the high-dimensional data is mapped into a lower-dimensional space, conventional indexing schemes can then be applied [20, 28, 29].

One of the most popular dimensionality reduction algorithms is Linear Discriminant Analysis (LDA). LDA is a supervised method that has proved successful on classification problems [9, 15]. The projection vectors are commonly obtained by maximizing the between-class covariance and simultaneously minimizing the within-class covariance. The classical LDA is a linear method and fails for nonlinear problems. To deal with this limitation, nonlinear extensions of LDA through the "kernel trick" have been proposed. The main idea of kernel-based methods is to map the input data to a feature space through a nonlinear mapping, where the inner products in the feature space can be computed by a kernel function without knowing the nonlinear mapping explicitly [27]. Kernel Fisher Discriminant Analysis (KFD) in [22] and Generalized Discriminant Analysis (GDA) in [1] are two independently developed approaches for kernel-based nonlinear extensions of LDA. They are essentially equivalent. To avoid confusion, we will refer to this approach as Kernel Discriminant Analysis (KDA) hereafter.

When solving the optimization problem of KDA, we need to handle the possible singularity of the total scatter matrix. There are two approaches that try to address this issue, either by using regularization techniques [22] or by applying singular value decomposition [1, 26, 35]. Both approaches to solving the KDA optimization problem involve the eigen-decomposition of the kernel matrix, which is computationally expensive. Moreover, due to the difficulty of designing an incremental solution for the eigen-decomposition of the kernel matrix, there has been little work on designing incremental KDA algorithms that can efficiently incorporate new data examples as they become available.

In [23], S. Mika et al. made a first attempt to speed up KDA through a greedy approximation technique. However, their algorithm was developed to handle the binary classification problem. For a multi-class problem, the authors suggested the one-against-the-rest scheme by considering all two-class problems. Recent studies [4–7, 9] show that various linear dimensionality reduction algorithms can be formulated as regression problems and thus have efficient computational solutions. Particularly, our previous work [6] has demonstrated that LDA can be formulated as a regression problem and be efficiently solved. A similar idea has been applied to unsupervised dimensionality reduction algorithms [7] and semi-supervised dimensionality reduction algorithms [5]. However, it is not clear how these techniques can be applied to nonlinear dimensionality reduction algorithms which use kernel techniques.

In this paper, we propose a new algorithm for kernel discriminant analysis, called Spectral Regression Kernel Discriminant Analysis (SRKDA). Our analysis essentially follows our previous idea for speeding up LDA [6]. By using spectral graph analysis, SRKDA casts discriminant analysis into a regression framework, which facilitates both efficient computation and the use of regularization techniques. Specifically, SRKDA only needs to solve a set of regularized regression problems, and no eigenvector computation is involved, which saves a huge amount of computational cost. Moreover, the new formulation makes it very easy to develop an incremental version of the algorithm which can fully utilize the previous computational results on the existing training samples.

The points below highlight the contributions of this paper:

– KDA in the binary-class case has been shown to be equivalent to regularized kernel regression with the class label as the output [27]. Our paper extends this relation to the multi-class case.
– We provide a new formulation of the KDA optimization problem. With this new formulation, the KDA optimization problem can be efficiently solved by avoiding the eigen-decomposition of the kernel matrix. Theoretical analysis shows that the new approach can achieve a 27-times speedup over the ordinary KDA approaches.
– Moreover, SRKDA can be naturally performed in an incremental manner. The computational results on the existing training samples can be fully utilized when new training samples are injected into the system. Theoretical analysis shows that SRKDA in the incremental mode has only quadratic-time complexity, which is a huge improvement compared to the cubic-time complexity of the ordinary KDA approaches.
– Since SRKDA uses regression as a building block, various kinds of regularization techniques can be easily incorporated (e.g., an L1-norm regularizer to produce sparse projections). Our approach opens up many possibilities for developing new variations of kernel discriminant analysis.
– A short version of this work has been published in ICDM [3]. In this journal version, we provide two new sections on theoretical analysis (Section 3.1) and sparse KDA (Section 5). Moreover, we have added a significant amount of experimental results.
Let α = [α1, · · · , αm]^T; it can be proved [1] that Eqn. (4) is equivalent to:

\alpha_{\mathrm{opt}} = \arg\max_{\alpha} \frac{\alpha^T K W K \alpha}{\alpha^T K K \alpha},   (5)

and the corresponding eigen-problem is:

K W K \alpha = \lambda K K \alpha,   (6)

where K is the kernel matrix (K_{ij} = \mathcal{K}(x_i, x_j)) and W is defined as:

W_{ij} = \begin{cases} 1/m_k, & \text{if } x_i \text{ and } x_j \text{ both belong to the } k\text{-th class;} \\ 0, & \text{otherwise.} \end{cases}   (7)

Each eigenvector α gives a projective function ν in the feature space. For a data example x, we have

\langle \nu, \phi(x) \rangle = \sum_{i=1}^{m} \alpha_i \langle \phi(x_i), \phi(x) \rangle = \sum_{i=1}^{m} \alpha_i \mathcal{K}(x_i, x) = \alpha^T K(:, x),

where K(:, x) = [\mathcal{K}(x_1, x), · · · , \mathcal{K}(x_m, x)]^T. Let {α1, · · · , α_{c−1}} be the c − 1 eigenvectors of the eigen-problem in Eqn. (6) with respect to the non-zero eigenvalues. The transformation matrix Θ = [α1, · · · , α_{c−1}] is an m × (c − 1) matrix, and a data sample x can be embedded into the (c − 1)-dimensional subspace by

x \rightarrow z = \Theta^T K(:, x).

The above approach, which extends LDA into the RKHS via the "kernel trick", was independently developed by Mika et al. [22] and Baudat et al. [1]. The algorithm was named Kernel Fisher Discriminant (KFD) in [22] and Generalized Discriminant Analysis (GDA) in [1].

2.1 Computational Analysis of KDA

To get a stable solution of the eigen-problem in Eqn. (6), the matrix KK is required to be non-singular [17]. When K is singular, there are two methods to solve this problem. The first method uses the eigen-decomposition of K, which was proposed in [1].

Suppose the rank of K is r (r ≤ m) and the eigen-decomposition of K is as follows:

K = U \Sigma U^T = U_r \Sigma_r U_r^T,

where Σ = diag(σ1, · · · , σm) is the diagonal matrix of sorted eigenvalues (σ1 ≥ · · · ≥ σm ≥ 0), U is the matrix of normalized eigenvectors associated to Σ, Σr = diag(σ1, · · · , σr) is the diagonal matrix of nonzero eigenvalues, and Ur is the first r columns of U. Thus Σr^{-1} exists and Ur^T Ur = I, where I is the identity matrix.

Substituting K in Eqn. (5), we get

\alpha_{\mathrm{opt}} = \arg\max_{\alpha} \frac{(\Sigma_r U_r^T \alpha)^T\, U_r^T W U_r\, (\Sigma_r U_r^T \alpha)}{(\Sigma_r U_r^T \alpha)^T\, U_r^T U_r\, (\Sigma_r U_r^T \alpha)}.

We proceed with the variable substitution β = Σr Ur^T α and get:

\beta_{\mathrm{opt}} = \arg\max_{\beta} \frac{\beta^T U_r^T W U_r \beta}{\beta^T \beta}.

Thus, the optimal β's are the leading eigenvectors of the matrix Ur^T W Ur. Once the β's are calculated, α can be computed as α = Ur Σr^{-1} β.

The second method uses the idea of regularization, adding constant values to the diagonal elements of KK, as KK + γI, for γ > 0. It is easy to see that KK + γI is nonsingular. This method is used in [22]. By noticing that

K K + \gamma I = U \Sigma U^T U \Sigma U^T + \gamma I = U (\Sigma^2 + \gamma I) U^T,

and defining \tilde{\Sigma} = (\Sigma^2 + \gamma I)^{1/2}, the objective function of regularized KDA can be written as:

\max_{\alpha} \frac{\alpha^T K W K \alpha}{\alpha^T (K K + \gamma I) \alpha}
= \max_{\alpha} \frac{\alpha^T U \Sigma U^T W U \Sigma U^T \alpha}{\alpha^T U \tilde{\Sigma} \tilde{\Sigma} U^T \alpha}
= \max_{\beta} \frac{\beta^T \tilde{\Sigma}^{-1} \Sigma U^T W U \Sigma \tilde{\Sigma}^{-1} \beta}{\beta^T \beta},

where β = \tilde{\Sigma} U^T α. The optimal β's are the leading eigenvectors of the matrix \tilde{\Sigma}^{-1} \Sigma U^T W U \Sigma \tilde{\Sigma}^{-1}. With this formulation, the above two methods can be computed in exactly the same way.

To reduce the computation in calculating β, we shall exploit the special structure of W. Without loss of generality, we assume that the data points are ordered according to their labels. It is easy to check that the matrix W has the block-diagonal structure

W = \begin{bmatrix} W^{(1)} & 0 & \cdots & 0 \\ 0 & W^{(2)} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & W^{(c)} \end{bmatrix},   (8)

where {W^{(k)}}_{k=1}^{c} is an m_k × m_k matrix with all the elements equal to 1/m_k.
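To make the objects above concrete, the following sketch (ours, not the authors' code) builds the Gaussian kernel matrix K, the matrix W of Eqns. (7)–(8), and solves the regularized KDA eigen-problem K W K α = λ(K K + γI)α with a generalized symmetric eigensolver. The function names and the use of scipy.linalg.eigh are our own choices.

import numpy as np
from scipy.linalg import eigh

def rbf_kernel(X, sigma):
    # K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def class_weight_matrix(labels):
    # W_ij = 1/m_k if x_i and x_j belong to the same class k, 0 otherwise (Eqn. (7))
    labels = np.asarray(labels)
    W = np.zeros((labels.size, labels.size))
    for k in np.unique(labels):
        idx = np.where(labels == k)[0]
        W[np.ix_(idx, idx)] = 1.0 / idx.size
    return W

def kda_projections(X, labels, sigma=1.0, gamma=0.01):
    # Solve K W K alpha = lambda (K K + gamma I) alpha, the regularized form of
    # Eqn. (6), and keep the c - 1 leading eigenvectors as projective functions.
    K = rbf_kernel(X, sigma)
    W = class_weight_matrix(labels)
    A = K @ W @ K
    B = K @ K + gamma * np.eye(K.shape[0])
    evals, evecs = eigh(A, B)                      # generalized symmetric eigen-problem
    c = np.unique(labels).size
    order = np.argsort(evals)[::-1][: c - 1]
    return K, evecs[:, order]                      # Theta: m x (c - 1)

A new sample x is then embedded as z = Θ^T K(:, x), i.e., by applying the returned matrix to the vector of kernel evaluations between x and the m training points.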
We partition the m × r matrix Ur as [Ur^{(1)}, · · · , Ur^{(c)}]^T, where Ur^{(k)} ∈ R^{r×m_k}. Let v_i^{(k)} be the i-th column vector of Ur^{(k)}; we have:

U_r^T W U_r = \sum_{k=1}^{c} U_r^{(k)} W^{(k)} (U_r^{(k)})^T
            = \sum_{k=1}^{c} \frac{1}{m_k} \Big( \sum_{i=1}^{m_k} v_i^{(k)} \Big) \Big( \sum_{i=1}^{m_k} v_i^{(k)} \Big)^T
            = \sum_{k=1}^{c} m_k \bar{v}^{(k)} (\bar{v}^{(k)})^T
            = H H^T,

where H = [\sqrt{m_1}\,\bar{v}^{(1)}, \cdots, \sqrt{m_c}\,\bar{v}^{(c)}] ∈ R^{r×c} and \bar{v}^{(k)} is the average of the vectors v_i^{(k)}.

To calculate the c leading eigenvectors of HH^T, it is not necessary to work on the matrix HH^T, which is of size r × r; we can use a much more efficient algorithm. Suppose the Singular Value Decomposition of H is

3 Efficient KDA via Spectral Regression

In order to solve the KDA eigen-problem in Eqn. (6) efficiently, we use the following theorem:

Theorem 1 Let y be an eigenvector of the eigen-problem

W y = \lambda y   (10)

with eigenvalue λ. If Kα = y, then α is an eigenvector of the eigen-problem in Eqn. (6) with the same eigenvalue λ.

Proof We have W y = λy. On the left side of Eqn. (6), replacing Kα by y, we have

K W K \alpha = K W y = K \lambda y = \lambda K y = \lambda K K \alpha.

Thus, α is an eigenvector of the eigen-problem in Eqn. (6) with the same eigenvalue λ.

Theorem 1 shows that, instead of solving the eigen-problem in Eqn. (6), the KDA projective functions can be obtained through two steps:

1. Solve the eigen-problem in Eqn. (10) to get y.
2. Find α which satisfies Kα = y.
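The two-step procedure can be sketched directly in code. The snippet below is our illustration, not the paper's implementation: it uses the fact, detailed below in Eqn. (13), that the class indicator vectors are eigenvectors of W with eigenvalue 1 (so step 1 needs no eigen-solver), and it adds an optional small ridge δ for the case where K is singular. The helper class_weight_matrix is the one from the previous sketch.

import numpy as np

def srkda_two_steps(K, labels, delta=0.0):
    # Step 1: take the eigenvectors y of W with eigenvalue 1; for the block-diagonal
    #         W of Eqn. (8) these are simply the class indicator vectors.
    labels = np.asarray(labels)
    Y = np.stack([(labels == k).astype(float) for k in np.unique(labels)], axis=1)
    # Step 2: find alpha with K alpha = y; the ridge delta keeps the linear system
    #         well posed when K is singular.
    alphas = np.linalg.solve(K + delta * np.eye(K.shape[0]), Y)
    return alphas

# Numerical check of Theorem 1 (delta = 0, K positive definite):
#   W = class_weight_matrix(labels)
#   a = srkda_two_steps(K, labels)[:, 0]
#   np.allclose(K @ W @ K @ a, K @ K @ a)   # lambda = 1, holds up to round-off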
It can be easily verified that the solution α* = (K + δI)^{-1} y given by the equations in Eqn. (11) is the optimal solution of the following regularized regression problem [36]:

\min_{f \in \mathcal{F}} \sum_{i=1}^{m} \big( f(x_i) - y_i \big)^2 + \delta \| f \|_{\mathcal{K}}^2,   (12)

where y_i is the i-th element of y, \mathcal{F} is the RKHS associated with the Mercer kernel \mathcal{K}, and \| \cdot \|_{\mathcal{K}} is the corresponding norm.

Now let us analyze the eigenvectors of W, which is defined in Eqns. (7) and (8). W is block-diagonal; thus, its eigenvalues and eigenvectors are the union of the eigenvalues and eigenvectors of its blocks (the latter padded appropriately with zeros). It is straightforward to show that W^{(k)} has the eigenvector e^{(k)} ∈ R^{m_k} associated with eigenvalue 1, where e^{(k)} = [1, 1, · · · , 1]^T. Also, there is only one non-zero eigenvalue of W^{(k)}, because the rank of W^{(k)} is 1. Thus, there are exactly c eigenvectors of W with the same eigenvalue 1. These eigenvectors are

y_k = [\,\underbrace{0, \cdots, 0}_{\sum_{i=1}^{k-1} m_i},\; \underbrace{1, \cdots, 1}_{m_k},\; \underbrace{0, \cdots, 0}_{\sum_{i=k+1}^{c} m_i}\,]^T, \quad k = 1, \cdots, c.   (13)

When the kernel matrix K is positive definite and δ = 0, Theorem 1 shows that the c − 1 solutions α_k = K^{-1} y_k are exactly the eigenvectors of the KDA eigen-problem in Eqn. (6) with respect to the eigenvalue 1. In this case, SRKDA is equivalent to ordinary KDA. Thus, it is interesting and important to see when the positive semi-definite kernel matrix K will be positive definite.

One of the most popular kernels is the Gaussian RBF kernel, \mathcal{K}(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / 2\sigma^2). Our discussion in this section will focus only on the Gaussian kernel. Regarding the Gaussian kernel, we have the following lemma:

Lemma 1 (Full Rank of Gaussian RBF Gram Matrices [21]) Suppose that x_1, · · · , x_m are distinct points and σ ≠ 0. The matrix K given by

K_{ij} = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)

has full rank.

Proof See [21] and Theorem 2.18 in [27].

In other words, the kernel matrix K is positive definite (provided no two x_i are the same). Thus, we have the following theorem:

Theorem 2 If all the sample vectors are different and the Gaussian RBF kernel is used, all c − 1 projective
Table 2 Computational complexity of KDA and SRKDA (operation counts, flam [31])

  Mode               Algorithm   Operation counts
  Batch mode         KDA         (9/2) m^3 + c m^2 + O(n m^2)
                     SRKDA       (1/6) m^3 + c m^2 + O(n m^2)
  Incremental mode   KDA         (9/2) m^3 + c m^2 + O(n m ∆m)
                     SRKDA       (∆m/2 + c) m^2 + O(n m ∆m)

  m: the number of data samples; n: the number of features;
  c: the number of classes; ∆m: the number of new data samples.
  ¹ Actually, we only need to store R.

We summarize our complexity analysis results in Table 2. The main conclusions include:

– The ordinary KDA needs to perform an eigen-decomposition of the kernel matrix, which is very computationally expensive. Moreover, it is difficult to develop an incremental algorithm based on the ordinary KDA formulation. In both batch and incremental modes, the dominant part of the cost of ordinary KDA is (9/2)m^3.
– SRKDA performs regression instead of eigen-decomposition. In the batch mode, its dominant cost is only (1/6)m^3, a 27-times speedup over ordinary KDA (since (9/2)/(1/6) = 27). Moreover, it is easy to develop an incremental version of SRKDA which has only quadratic-time complexity with respect to m. This computational advantage makes SRKDA much more practical in real-world applications.

5 Sparse KDA via Spectral Regression

Since SRKDA uses regression as a building block, various kinds of regularization techniques can be easily incorporated, which makes SRKDA more flexible. In this section, we discuss the use of an L1-norm regularizer to produce a sparse KDA solution.

Recently, there has been considerable interest in developing sparse subspace learning algorithms, i.e., algorithms whose projective vectors are sparse. While the traditional linear subspace learning algorithms (e.g., PCA, LDA) learn a set of combined features which are linear combinations of all the original features, sparse linear subspace learning algorithms learn combined features which are linear combinations of only part of the original features (the important ones). Such parsimony not only produces a set of projective functions that are easy to interpret but also leads to better performance [37, 25].

Zou et al. [37] proposed an elegant sparse PCA algorithm (SPCA) using their "Elastic Net" framework for L1-penalized regression on regular principal components, solved very efficiently using least angle regression (LARS) [13]. Subsequently, d'Aspremont et al. [11] relaxed the hard cardinality constraint and solved for a convex approximation using semi-definite programming. In [24, 25], Moghaddam et al. proposed a spectral bounds framework for sparse subspace learning. Particularly, they proposed both exact and greedy algorithms for sparse PCA and sparse LDA.

The projective function of a kernel subspace learning algorithm can be written as

f(x) = \alpha^T K(:, x) = \sum_{i=1}^{m} \alpha_i \mathcal{K}(x_i, x).   (15)

In the ordinary kernel subspace learning algorithms, the α_i are usually nonzero and the projective function depends on all the samples in the training set. When we aim at learning a sparse function (sparse α), many α_i will be equal to zero. Thus, the projective function will only depend on part of the training samples. In this sense, sparse kernel subspace learning algorithms share a similar idea with Support Vector Machines [36]. Those samples with non-zero α_i can also be called support vectors. One advantage of this parsimony is that it requires less storage for the model and less computational time in the testing phase.

Following [25], the objective function of Sparse Kernel Discriminant Analysis (SparseKDA) can be defined as the following cardinality-constrained optimization:

\max_{\alpha} \frac{\alpha^T K W K \alpha}{\alpha^T K K \alpha} \quad \text{subject to} \quad \mathrm{card}(\alpha) = k.   (16)

The feasible set is all sparse α ∈ R^m with k non-zero elements, card(α) being their L0-norm. Unfortunately, this optimization problem is NP-hard and generally intractable.

As mentioned above, Moghaddam et al. [24, 25] proposed both exact and greedy algorithms for sparse PCA and sparse LDA within a spectral bounds framework. Their framework is based on the following optimality condition of the sparse solution.

For simplicity, we define A = KWK and B = KK. A sparse vector α ∈ R^m with cardinality k yielding the maximum objective value in Eqn. (16) would necessarily imply that

\lambda_{\max} = \frac{\alpha^T A \alpha}{\alpha^T B \alpha} = \frac{\beta^T A_k \beta}{\beta^T B_k \beta},

where β ∈ R^k contains the k non-zero elements of α, and A_k and B_k are the k × k principal sub-matrices of A and B obtained by deleting the rows and columns corresponding to the zero indices of α. The k-dimensional quadratic form in β is equivalent to a standard unconstrained generalized Rayleigh quotient, which can be solved as a generalized eigen-problem.
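As a small illustration of this optimality condition (our sketch; the helper name and the use of scipy.linalg.eigh are ours), the value attained by a candidate support set can be evaluated by solving the generalized Rayleigh quotient on the k × k subproblem (A_k, B_k):

import numpy as np
from scipy.linalg import eigh

def subproblem_value(A, B, support):
    # A = K W K, B = K K (plus a small ridge if K is singular). `support` holds the
    # k indices where alpha may be non-zero; the returned value is
    # max_beta (beta^T A_k beta) / (beta^T B_k beta), the largest generalized eigenvalue.
    idx = np.asarray(support)
    Ak = A[np.ix_(idx, idx)]
    Bk = B[np.ix_(idx, idx)]
    return eigh(Ak, Bk, eigvals_only=True)[-1]

The exact method would compare this value over all index sets of a given cardinality, which is combinatorial; this is what motivates the greedy forward/backward selection of Moghaddam et al. [24, 25] discussed next.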
The above observation gives the exact algorithm for sparse subspace learning: a discrete search for the k indices which maximize λmax of the subproblem (A_k, B_k). However, this observation does not suggest an efficient algorithm, because an exhaustive search is still NP-hard. To address this, Moghaddam et al. proposed an efficient greedy algorithm which combines backward elimination and forward selection [24, 25]. However, there are two major drawbacks of their approach:

1. Even though their algorithm is a greedy one, the cost of backward elimination is of complexity O(m^4) [25].
2. In reality, more than one projective function is usually necessary for subspace learning. However, the optimality condition of the sparse solution only gives guidance for finding ONE sparse "eigenvector", which is the first projective function. It is unclear how to find the subsequent projective functions. Although [24] suggests using recursive deflation, the sparseness of the subsequent projective functions is not guaranteed.

The entire solution path (the solutions for all possible cardinalities of α) of the regression problem in Eqn. (17) can be computed in O(m^3). Thus, SRKDA with an L1-norm regularizer provides an efficient algorithm to compute the sparse KDA solution.
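As an illustration of this L1-regularized route (our sketch, not the authors' implementation), assume Eqn. (17) is the L1-penalized counterpart of the regression problem in Eqn. (12), i.e., a lasso regression of a class-indicator response on the columns of K. The scikit-learn lars_path routine then traces the whole solution path; the function and parameter names below, and the use of raw class indicators as responses, are our simplifications.

import numpy as np
from sklearn.linear_model import lars_path

def sparse_kda_directions(K, labels, target_sparsity=0.6):
    # For each class-indicator response, follow the lasso path of the kernel
    # regression and keep the last solution whose fraction of zero entries is
    # still at least `target_sparsity`.
    labels = np.asarray(labels)
    directions = []
    for k in np.unique(labels):
        y = (labels == k).astype(float)
        _, _, coef_path = lars_path(K, y, method="lasso")   # columns: path solutions
        chosen = coef_path[:, 0]                            # all-zero start of the path
        for coef in coef_path.T:
            if np.mean(coef == 0.0) >= target_sparsity:
                chosen = coef
            else:
                break
        directions.append(chosen)
    return np.stack(directions, axis=1)   # m x c matrix of sparse expansion coefficients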
6 Experimental Results

In this section, we investigate the performance of our proposed SRKDA algorithm in batch mode, incremental mode and sparse mode. All of our experiments have been performed on a P4 3.20GHz Windows XP machine with 2GB memory. For the purpose of reproducibility, we provide all the algorithms used in these experiments at:

http://www.zjucadcg.cn/dengcai/Data/data.html

Table 3 Statistics of the three data sets

  dataset   dim (n)   train size (m)   test size   # of classes (c)
  Isolet    617       6238             1559        26
  USPS      256       7291             2007        10
  PIE       1024      8000             3554        68

– The CMU PIE face database⁴ contains 68 subjects with 41,368 face images as a whole. The face images were captured under varying pose, illumination and expression. In our experiment, the five near-frontal poses (C05, C07, C09, C27, C29) under different illuminations and expressions are used, which leaves us 11,554 face images. All the images are manually aligned and cropped. The cropped images are 32×32 pixels, with 256 gray levels per pixel⁵. Among the 11,554 images, 8,000 images are used as the training set and the remaining 3,554 images are used for testing. We also run several cases by training all the algorithms on the first 2000, 3000, · · · , 8000 images in the training set.

6.2 Compared algorithms

The four algorithms compared in our experiments are listed below:

1. Linear Discriminant Analysis (LDA) [15], which provides a baseline performance of linear algorithms. We can examine the usefulness of kernel approaches by comparing the performance of KDA and LDA.
2. Kernel Discriminant Analysis (KDA) as discussed in Section 2. We test the regularized version and choose the regularization parameter δ by five-fold cross-validation on the training set.
3. Spectral Regression Kernel Discriminant Analysis (SRKDA), the approach proposed in this paper. The regularization parameter δ is also chosen by five-fold cross-validation on the training set.
4. Support Vector Machine (SVM) [36], which is believed to be one of the state-of-the-art classification algorithms. Specifically, we use the LibSVM system [10], which implements multi-class classification with the one-versus-one strategy. SVM is included to give a sense of how good the performance of KDA is.

We use the Gaussian RBF kernel for all the kernel-based methods. We tune the kernel width parameter σ and the large margin parameter C in SVM to achieve the best testing performance for SVM. Then, the same kernel width parameter σ is used in all the other kernel-based algorithms.

6.3 Results

The classification error rate as well as the training time (in seconds) for each method on the three data sets are reported in Tables 4–6, respectively. The main observations from the performance comparisons include:

– The Kernel Discriminant Analysis model is very effective in classification. SRKDA has the best performance in almost all the cases on all three data sets (even better than SVM). For the Isolet data set, a previous study [12] reported that the minimum error rate when training on Isolet1+2+3+4, obtained by OPT⁶ with 30-bit ECOC, is 3.27%; KDA (SRKDA) achieved better performance in our experiment for this train/test split. For the USPS data set, previous studies [27] reported error rates of 3.7% for KDA and 4.0% for SVM, slightly better than the results in our experiment. In all the cases, KDA (SRKDA) achieved significantly better performance than LDA, which suggests the effectiveness of kernel approaches.
– Since the eigen-decomposition of the kernel matrix is involved, the ordinary KDA is computationally expensive in training. SRKDA uses regression instead of eigen-decomposition to solve the optimization problem, and thus achieves a significant speedup compared to ordinary KDA. The empirical results are consistent with the theoretical estimation of the efficiency. The training time of SRKDA is comparable with that of SVM: SRKDA is faster than SVM on the Isolet and PIE data sets, while slower than SVM on the USPS data set. This is because the training time of SVM depends on the number of support vectors [2]. For some data sets with a lot of noise (e.g., USPS), the number of support vectors is far less than the number of samples; in this case, SVM can be trained very fast.

6.4 Experiments on Incremental KDA

In this experiment, we study the computational cost of SRKDA performed in the incremental manner. The USPS and PIE data sets are used. We start from a training set of size 1000 (the first 1000 samples in the whole training set) and increase the training size by 200 at each step. SRKDA is then performed in the incremental manner. It is important to note that SRKDA in the incremental manner gives exactly the same projective functions as SRKDA in the batch mode. Thus, we only care about the computational costs in this experiment.

Figures 2 and 3 show log-log plots of how the CPU time of KDA (SRKDA, incremental SRKDA) increases with the size of the training set on the USPS and PIE data sets, respectively. Lines in a log-log plot correspond to polynomial growth O(m^d), where d corresponds to the slope of the line.

⁴ http://www.ri.cmu.edu/projects/project_418.html
⁵ http://www.zjucadcg.cn/dengcai/Data/FaceData.html
⁶ Conjugate-gradient implementation of back-propagation
The ordinary KDA scales roughly as O(m^2.9), which is slightly better than the theoretical estimation. SRKDA in the batch mode has better scaling, also better than the theoretical estimation, with roughly O(m^2.6) over much of the range. This explains why SRKDA can be more than 27 times faster than ordinary KDA in the previous experiments. SRKDA in the incremental mode has the best scaling, which is (to some surprise) better than quadratic, with roughly O(m^1.8) over much of the range.

Fig. 2 Computational cost of KDA, batch SRKDA and incremental SRKDA on the USPS data set (log-log plot of computational time in seconds versus training size, with reference slopes O(m^2.9), O(m^2.6) and O(m^1.8)).
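The exponents quoted in this discussion can be estimated directly from measured timings: a least-squares fit of log t against log m gives the slope d in O(m^d). A minimal sketch, assuming the arrays sizes and times hold the training-set sizes and the corresponding CPU times:

import numpy as np

def empirical_exponent(sizes, times):
    # Fit log(t) = d * log(m) + b; the slope d is the empirical growth exponent.
    d, _ = np.polyfit(np.log(np.asarray(sizes, dtype=float)),
                      np.log(np.asarray(times, dtype=float)), deg=1)
    return d

# e.g. empirical_exponent(train_sizes, kda_times) gives roughly 2.9 for ordinary KDA
# and about 1.8 for incremental SRKDA, matching the slopes read off the figures.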
6.5 Experiments on Sparse KDA

Fig. 3 Computational cost of KDA, batch SRKDA and incremental SRKDA on the PIE data set (log-log plot of computational time in seconds versus training size, with reference slopes O(m^2.9), O(m^2.6) and O(m^1.8)).

Table 7 Classification error (%) on the Isolet data set

  Training Set     KDA     SRKDA   SRKDA(Sparse)   Sparsity
  Isolet1          11.74   12.89   11.74           60%
  Isolet1+2        3.79    3.85    3.59            60%
  Isolet1+2+3      2.99    3.08    2.82            60%
  Isolet1+2+3+4    2.82    2.89    2.82            60%

Table 8 Classification error (%) on the USPS data set

  Training Set     KDA     SRKDA   SRKDA(Sparse)   Sparsity
  1500             6.58    5.88    5.83            60%
  3000             5.53    5.38    5.13            60%
  4500             5.53    4.88    4.73            60%
  6000             5.03    4.43    4.04            60%
  7291             4.83    4.04    3.94            60%

Table 9 Classification error (%) on the PIE data set

  Training Set     KDA     SRKDA   SRKDA(Sparse)   Sparsity
  2000             5.18    4.81    4.73            60%
  3000             4.25    3.94    3.71            60%
  4000             5.53    3.24    3.12            60%
  5000             3.23    2.90    2.81            60%
  6000             2.91    2.53    2.44            60%
  7000             2.65    2.19    2.17            60%
  8000             2.41    2.17    2.14            60%

The sparsity is defined as the percentage of zero entries in a projective vector. For ordinary KDA and SRKDA, the projective functions (vectors) are dense and the sparsity is zero.

As can be seen, SRKDA(sparse) generates a much more parsimonious model. The sparsity of the projective functions in SRKDA(sparse) is 60%, which means the number of "support vectors" is less than half of the total training samples. Moreover, such parsimony leads to better performance: in all the cases, the performance of SRKDA(sparse) is better than that of the ordinary KDA and SRKDA.

7 Conclusions

In this paper, we propose a novel algorithm for kernel discriminant analysis, called Spectral Regression Kernel Discriminant Analysis (SRKDA). Our algorithm is developed from a graph embedding viewpoint of the KDA problem. It combines spectral graph analysis and regression to provide an efficient approach for kernel discriminant analysis. Specifically, SRKDA only needs to solve a set of regularized regression problems, and no eigenvector computation is involved, which saves a huge amount of computational cost. The theoretical analysis shows that SRKDA can achieve a 27-times speedup over the ordinary KDA. Moreover, the new formulation makes it very easy to develop an incremental version of the algorithm which can fully utilize the computational results on the existing training samples. With the incremental implementation, the computational cost of SRKDA reduces to quadratic-time complexity. Since SRKDA uses regression as a building block, various kinds of regularization techniques can be easily incorporated (e.g., an L1-norm regularizer to produce sparse projections). Our approach opens up many possibilities for developing new variations of kernel discriminant analysis. Extensive experimental results show that our method consistently outperforms the other state-of-the-art KDA extensions in terms of both effectiveness and efficiency.

Acknowledgements This work was supported in part by the National Natural Science Foundation of China under Grants 60905001 and 90920303, and the National Key Basic Research Foundation of China under Grant 2009CB320801. Any opinions, findings, and conclusions expressed here are those of the authors and do not necessarily reflect the views of the funding agencies.

References

1. G. Baudat and F. Anouar. Generalized discriminant analysis using a kernel approach. Neural Computation, (12):2385–2404, 2000.
2. C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.
3. D. Cai, X. He, and J. Han. Efficient kernel discriminant analysis via spectral regression. In Proc. Int. Conf. on Data Mining (ICDM'07), 2007.
4. D. Cai, X. He, and J. Han. Spectral regression: A unified approach for sparse subspace learning. In Proc. Int. Conf. on Data Mining (ICDM'07), 2007.
5. D. Cai, X. He, and J. Han. Spectral regression: A unified subspace learning framework for content-based image retrieval. In Proceedings of the 15th ACM International Conference on Multimedia, Augsburg, Germany, 2007.
6. D. Cai, X. He, and J. Han. SRDA: An efficient algorithm for large scale discriminant analysis. IEEE Transactions on Knowledge and Data Engineering, 20(1):1–12, January 2008.
7. D. Cai, X. He, W. V. Zhang, and J. Han. Regularized locality preserving indexing via spectral regression. In Proceedings of the 16th ACM Conference on Information and Knowledge Management (CIKM'07), pages 741–750, 2007.
8. K. Chakrabarti and S. Mehrotra. Local dimensionality reduction: A new approach to indexing high dimensional spaces. In Proc. 2000 Int. Conf. Very Large Data Bases (VLDB'00), 2000.
9. S. Chakrabarti, S. Roy, and M. V. Soundalgekar. Fast and accurate text classification via multiple linear discriminant projections. The VLDB Journal, 12(2):170–185, 2003.
10. C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
11. A. d'Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. In Advances in Neural Information Processing Systems 17, 2004.
12. T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, (2):263–286, 1995.
13. B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.
14. M. A. Fanty and R. Cole. Spoken letter recognition. In Advances in Neural Information Processing Systems 3, 1990.
15. K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, 2nd edition, 1990.
16. V. Gaede and O. Günther. Multidimensional access methods. ACM Computing Surveys, 30(2):170–231, 1998.
17. G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, 3rd edition, 1996.
18. T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York, 2001.
19. J. J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5), 1994.
20. H. Jin, B. C. Ooi, H. T. Shen, C. Yu, and A. Zhou. An adaptive and efficient dimensionality reduction algorithm for high-dimensional indexing. In Proc. 2003 Int. Conf. on Data Engineering (ICDE'03), 2003.
21. C. A. Micchelli. Algebraic aspects of interpolation. In Proceedings of Symposia in Applied Mathematics, volume 36, pages 81–102, 1986.
22. S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller. Fisher discriminant analysis with kernels. In Proc. of IEEE Neural Networks for Signal Processing Workshop (NNSP), 1999.
23. S. Mika, A. Smola, and B. Schölkopf. An improved training algorithm for kernel Fisher discriminants. In Proceedings of AISTATS 2001. Morgan Kaufmann, 2001.
24. B. Moghaddam, Y. Weiss, and S. Avidan. Spectral bounds for sparse PCA: Exact and greedy algorithms. In Advances in Neural Information Processing Systems 18, 2005.
25. B. Moghaddam, Y. Weiss, and S. Avidan. Generalized spectral bounds for sparse LDA. In ICML '06: Proceedings of the 23rd International Conference on Machine Learning, pages 641–648, 2006.
26. C. H. Park and H. Park. Nonlinear discriminant analysis using kernel functions and the generalized singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 27(1):87–102, 2005.
27. B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.
28. H. T. Shen, B. C. Ooi, and X. Zhou. Towards effective indexing for very large video sequence database. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pages 730–741, 2005.
29. H. T. Shen, X. Zhou, and B. Cui. Indexing text and visual features for www images. In 7th Asia Pacific Web Conference (APWeb 2005), 2005.
30. H. T. Shen, X. Zhou, and A. Zhou. An adaptive and dynamic dimensionality reduction method for high-dimensional indexing. The VLDB Journal, 16(2):219–234, 2007.
31. G. W. Stewart. Matrix Algorithms Volume I: Basic Decompositions. SIAM, 1998.
32. G. W. Stewart. Matrix Algorithms Volume II: Eigensystems. SIAM, 2001.
33. D. Tao, X. Li, X. Wu, and S. J. Maybank. General tensor discriminant analysis and Gabor features for gait recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(10):1700–1715, 2007.
34. D. Tao, X. Li, X. Wu, and S. J. Maybank. Geometric mean for subspace selection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2):260–274, 2009.
35. D. Tao, X. Tang, X. Li, and Y. Rui. Kernel direct biased discriminant analysis: A new content-based image retrieval relevance feedback algorithm. IEEE Transactions on Multimedia, 8(4):716–727, 2006.
36. V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.
37. H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Technical report, Statistics Department, Stanford University, 2004.