Kernel Ridge Regression Classification

Abstract—We present a nearest nonlinear subspace classifier that extends the ridge regression classification method to a kernel version, called Kernel Ridge Regression Classification (KRRC). The kernel method is usually considered effective in discovering the nonlinear structure of the data manifold. The basic idea of KRRC is to implicitly map the observed data into a potentially much higher dimensional feature space by using the kernel trick and to perform ridge regression classification in that feature space. In this new feature space, samples from a single-object class may lie on a linear subspace, such that a new test sample can be represented as a linear combination of class-specific galleries; the minimum distance between the new test sample and each class specific subspace is then used for classification. Our experimental studies on synthetic data sets and some UCI benchmark data sets confirm the effectiveness of the proposed method.

I. INTRODUCTION

Subspace classification methods [9][10] classify a new test sample into the class whose subspace is the closest. CLASS featuring information compression (CLAFIC) is one of the earliest and most well-known subspace methods [11], and its extension to nonlinear subspaces is Kernel CLAFIC (KCLAFIC). This method employs principal component analysis to compute the basis vectors spanning the subspace of each class. Linear Regression Classification (LRC) [12] is another popular subspace method. In LRC, classification is cast as a class specific linear regression problem: the regression coefficients are estimated by the least squares method, and the classification is then made by the minimum distance between the original sample and the projected sample. However, least squares estimation is very sensitive to outliers [13]. Therefore, the performance of LRC decreases sharply when the samples are contaminated by outliers. Due to L2,1-norm based loss
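As a point of reference for the kernel extension developed below, the following Python/NumPy sketch illustrates the LRC decision rule just described, under the usual assumptions (class-specific training matrices with samples as columns, a single test vector); the function and variable names are ours, not taken from [12].

import numpy as np

def lrc_classify(class_matrices, x):
    # class_matrices: list of (d, n_i) arrays, the columns of each array are
    # the training samples (gallery) of one class; x: (d,) test vector.
    distances = []
    for X_i in class_matrices:
        # Least squares regression coefficients: beta minimizes ||X_i beta - x||_2
        beta, *_ = np.linalg.lstsq(X_i, x, rcond=None)
        x_hat = X_i @ beta  # projection of x onto the subspace of class i
        distances.append(np.linalg.norm(x - x_hat))
    # Decision: class whose subspace is closest to the test sample
    return int(np.argmin(distances))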
$$\hat{\phi}_i(x) = \phi(X_i)\hat{\alpha}_i = \phi(X_i)\left(\phi^T(X_i)\phi(X_i) + \lambda I\right)^{-1}\phi^T(X_i)\phi(x) \qquad (10)$$

where $\hat{\phi}_i(x)$ is the projection of $\phi(x)$ onto the subspace of the $i$th class by the class specific projection matrix, which is defined as follows:

$$H_i^{\phi} = \phi(X_i)\left(\phi^T(X_i)\phi(X_i) + \lambda I\right)^{-1}\phi^T(X_i) \qquad (11)$$

If the original sample belongs to the subspace of class $i$, the predicted sample $\hat{\phi}_i(x)$ in the kernel space $F$ will be the closest sample to the original sample:

$$i^* = \arg\min_i \left\|\hat{\phi}_i(x) - \phi(x)\right\|_2^2, \quad \left\|\hat{\phi}_i(x) - \phi(x)\right\|_2^2 = \hat{\phi}_i^T(x)\hat{\phi}_i(x) - 2\hat{\phi}_i^T(x)\phi(x) + \phi^T(x)\phi(x) \qquad (12)$$

According to Mercer's theorem [19], the form of the nonlinear function $\phi(x)$ need not be known explicitly; it can be determined by a kernel function $k: X \times X \to R$ with the following property:

$$k(x_i, x_j) = \phi^T(x_i)\phi(x_j) \qquad (13)$$

There are numerous types of kernel functions [19]. In our experiments, we adopt the most popular one, the Gaussian kernel, given by

$$k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|_2^2}{t}\right) \qquad (14)$$

The parameter $t$ is empirically set to the average Euclidean distance between all training samples.

Obviously, the classification process (12) can be expressed in terms of inner products between mapped training samples in $F$. Let us define the kernel matrix $K$ whose elements are

$$K(X_i, X_j) = \begin{pmatrix} k(x_1^i, x_1^j) & k(x_1^i, x_2^j) & \cdots & k(x_1^i, x_{n_j}^j) \\ k(x_2^i, x_1^j) & k(x_2^i, x_2^j) & \cdots & k(x_2^i, x_{n_j}^j) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_{n_i}^i, x_1^j) & k(x_{n_i}^i, x_2^j) & \cdots & k(x_{n_i}^i, x_{n_j}^j) \end{pmatrix} \qquad (15)$$

Following some simple algebraic steps, we see that the first term in Equation (12) can be reformulated as

$$\hat{\phi}_i^T(x)\hat{\phi}_i(x) = \phi^T(x)\phi(X_i)\left(\phi^T(X_i)\phi(X_i) + \lambda I\right)^{-1}\phi^T(X_i)\,\phi(X_i)\left(\phi^T(X_i)\phi(X_i) + \lambda I\right)^{-1}\phi^T(X_i)\phi(x) = A^T K(X_i, X_i) A \qquad (16)$$

and the second term as

$$\hat{\phi}_i^T(x)\phi(x) = \phi^T(x)\phi(X_i)\left(\phi^T(X_i)\phi(X_i) + \lambda I\right)^{-1}\phi^T(X_i)\phi(x) = K(x, X_i)\left(K(X_i, X_i) + \lambda I\right)^{-1}K(X_i, x) = K(x, X_i)A \qquad (17)$$

Here

$$A = \left(K(X_i, X_i) + \lambda I\right)^{-1}K(X_i, x) \qquad (18)$$

Since $K(X_i, X_i) + \lambda I$ is positive definite, its Cholesky decomposition can be written as

$$K(X_i, X_i) + \lambda I = L^T L \qquad (19)$$

Then the matrix $A$ in Equation (18) can be computed efficiently by solving the following linear equation:

$$L^T L A = K(X_i, x) \qquad (20)$$

Note that the third term in Equation (12) has no effect on the classification results, since it does not depend on the class information. Therefore, after neglecting the third term, we have

$$\hat{\phi}_i^T(x)\hat{\phi}_i(x) - 2\hat{\phi}_i^T(x)\phi(x) = A^T K(X_i, X_i) A - 2K(x, X_i)A = A^T\left(K(X_i, X_i) - 2\left(K(X_i, X_i) + \lambda I\right)\right)A = -A^T\left(K(X_i, X_i) + 2\lambda I\right)A \qquad (21)$$

Equivalently, the classification process (12) can be reformulated as

$$i^* = \arg\max_i \left\{A^T\left(K(X_i, X_i) + 2\lambda I\right)A\right\} \qquad (22)$$

The KRRC algorithm is given in Algorithm 1.

Algorithm 1: Kernel Ridge Regression Classification
Input: training data matrix X_train, label vector for the training data L_train, and testing data matrix X_test.
Procedure: for each testing data sample x, predict its label as follows:
Step 1. Compute the kernel matrices K(X_i, X_i) and K(X_i, x) with the Gaussian kernel (14).
Step 2. Compute the matrix A with (20).
Step 3. Make the decision in favor of the class that maximizes the criterion in (22), i.e., the class with the minimum distance in (12).
Output: label vector for the testing data L_test.
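To make Algorithm 1 concrete, the following Python/NumPy sketch implements the KRRC decision rule using Equations (14), (15), (19), (20) and (22), with the kernel width t set to the average pairwise Euclidean distance of the training samples as described above. The function names (gaussian_kernel, krrc_classify) and the use of SciPy's Cholesky routines are our choices, and the default value of the regularization parameter lam is a placeholder, not a recommendation from the paper.

import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.spatial.distance import cdist, pdist

def gaussian_kernel(A, B, t):
    # Equation (14): k(a, b) = exp(-||a - b||^2 / t); rows of A and B are samples.
    return np.exp(-cdist(A, B, 'sqeuclidean') / t)

def krrc_classify(X_train, y_train, x, lam=1e-3):
    # X_train: (n, d) training samples (rows), y_train: (n,) labels, x: (d,) test sample.
    t = pdist(X_train).mean()                  # average pairwise Euclidean distance
    scores = {}
    for c in np.unique(y_train):
        X_i = X_train[y_train == c]            # class specific gallery X_i
        K_ii = gaussian_kernel(X_i, X_i, t)    # K(X_i, X_i), Equation (15)
        K_ix = gaussian_kernel(X_i, x[None, :], t).ravel()   # K(X_i, x)
        # Equations (19)-(20): solve (K(X_i, X_i) + lam*I) A = K(X_i, x) via Cholesky
        A = cho_solve(cho_factor(K_ii + lam * np.eye(len(X_i))), K_ix)
        # Equation (22): score of class i is A^T (K(X_i, X_i) + 2*lam*I) A
        scores[c] = A @ (K_ii + 2 * lam * np.eye(len(X_i))) @ A
    return max(scores, key=scores.get)         # class with the maximum score

Looping this function over the rows of X_test reproduces the input/output structure of Algorithm 1.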
training and the last fold is used for testing. This process is repeated 5 times, leaving a different fold out for testing each time. The average accuracy and the corresponding standard deviation over the five runs of cross validation are reported for evaluation.

B. Experiments on Synthetic Data Sets

We first conduct experiments on the two synthetic data sets displayed in Fig. 2 and Fig. 3. In these figures, data points that belong to the same class are shown with the same color and style. Obviously, they cannot be classified linearly. The performance is shown in Tables I and II. LRC performs poorly on these two synthetic data sets, while KNN, KLRC and KRRC give better results, since these synthetic data sets have a nonlinear structure and the assumption underlying the LRC method is not satisfied.

C. Experiments on UCI Data Sets

In the experiments, we choose 14 real-world data sets with varying dimensions and numbers of data points from the UCI data repository to test our algorithm. The data sets are wine, Soybean2, Soybean1, liver, heart, glass, breast, yeast, vowel, diabetes, seeds, dermatology, hepatitis and balance [22]. The details of the data description are shown in Table III. Table IV shows the classification results of the different methods; the numbers in brackets are the corresponding standard deviations. According to Table IV, our method generally shows higher performance than the other methods.

TABLE III
UCI DATA DESCRIPTIONS AND EXPERIMENTAL SETTINGS
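For completeness, the evaluation protocol described in the experimental setup (5-fold cross validation, reporting average accuracy and standard deviation) can be sketched in Python as follows; krrc_classify is the hypothetical implementation sketched after Algorithm 1, and the use of scikit-learn's KFold splitter is our choice, not something specified in the paper.

import numpy as np
from sklearn.model_selection import KFold

def evaluate_krrc(X, y, lam=1e-3, seed=0):
    # Split the data into 5 folds; each fold is used once for testing and the
    # remaining folds for training; accuracy is averaged over the five runs.
    accuracies = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=seed).split(X):
        # krrc_classify: the KRRC sketch defined after Algorithm 1
        predictions = np.array([krrc_classify(X[train_idx], y[train_idx], x, lam)
                                for x in X[test_idx]])
        accuracies.append(np.mean(predictions == y[test_idx]))
    return np.mean(accuracies), np.std(accuracies)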
IV. CONCLUSION

In this paper, we presented a kernel ridge regression classification (KRRC) algorithm based on ridge regression for classification. The KRRC algorithm first makes a nonlinear mapping of the data into a feature space and then performs ridge regression classification in that feature space, so KRRC is good at enhancing the linearity of the distribution structure underlying the samples and is able to obtain higher accuracy than LRC. We showed the effective performance of our method by comparing its results on the synthetic and UCI data sets with those of related subspace based classification methods. However, KRRC requires a matrix inversion, which can be computationally intensive for high dimensional and large data sets, including text documents, face images, and gene expression data. Therefore, developing efficient algorithms with theoretical guarantees will be an interesting direction for future research.
ACKNOWLEDGMENT
The authors would like to thank the anonymous reviewers
for their constructive comments and suggestions.
REFERENCES
[1] T. Cover and P. Hart, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, pp. 21-27, January 1967.
[2] H. A. Fayed and A. F. Atiya, "A novel template reduction approach for the k-nearest neighbor method," IEEE Transactions on Neural Networks, vol. 20, pp. 890-896, May 2009.
[3] T. Hastie and R. Tibshirani, "Discriminant adaptive nearest neighbor classification," IEEE Trans. Pattern Anal. Mach. Intell., vol. 18, no. 6, pp. 607-616, June 1996.
[4] J. Peng, D. R. Heisterkamp, and H. K. Dai, "LDA/SVM driven nearest neighbor classification," IEEE Trans. Neural Netw., vol. 14, no. 4, pp. 940-942, July 2003.
[5] H. A. Fayed and A. F. Atiya, "A novel template reduction approach for the K-nearest neighbor method," IEEE Trans. Neural Netw., vol. 20, no. 5, pp. 890-896, May 2009.
[6] Y. G. Liu, S. Z. S. Ge, C. G. Li, and Z. S. You, "K-NS: A classifier by the distance to the nearest subspace," IEEE Trans. Neural Netw., vol. 22, no. 8, pp. 1256-1268, 2011.
[7] E. Oja, Subspace Methods of Pattern Recognition, vol. 4. England: Research Studies Press, 1983.
[8] R. Cappelli, D. Maio, and D. Maltoni, "Subspace classification for face recognition," in Biometric Authentication, 2002.
[9] S. Z. Li, "Face recognition based on nearest linear combinations," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1998.
[10] S. Z. Li and J. Lu, "Face recognition using the nearest feature line method," IEEE Transactions on Neural Networks, vol. 10, pp. 439-443, February 1999.
[11] S. Watanabe, P. F. Lambert, C. A. Kulikowski, J. L. Buxton, and R. Walker, "Evaluation and selection of variables in pattern recognition," in Computer and Information Sciences II. New York: Academic, 1967.
[12] I. Naseem, R. Togneri, and M. Bennamoun, "Linear regression for face recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 32, pp. 2106-2112, 2010.
[13] P. J. Huber, Robust Statistics. Springer Berlin Heidelberg, 2011.
[14] C.-X. Ren, D.-Q. Dai, and H. Yan, "L2,1-norm based regression for classification," in 2011 First Asian Conference on Pattern Recognition (ACPR), IEEE, 2011.
[15] I. Naseem, R. Togneri, and M. Bennamoun, "Robust regression for face recognition," Pattern Recognition, vol. 45, pp. 104-118, January 2012.
[16] S.-M. Huang and J.-F. Yang, "Improved principal component regression for face recognition under illumination variations," IEEE Signal Processing Letters, vol. 19, pp. 179-182, April 2012.
[17] Y. Lu, X. Fang, and B. Xie, "Kernel linear regression for face recognition," Neural Computing and Applications [Online], pp. 1-7, June 2013. Available: http://dx.doi.org/10.1007/s00521-013-1435-6
[18] T. Hastie et al., The Elements of Statistical Learning: Data Mining, Inference and Prediction. The Mathematical Intelligencer, vol. 27, no. 2, pp. 83-85, 2005.
[19] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[20] A. E. Hoerl and R. W. Kennard, "Ridge regression: Biased estimation for nonorthogonal problems," Technometrics, vol. 12, pp. 55-67, January 1970.
[21] D. N. Gujarati and J. B. Madsen, "Basic econometrics," Journal of Applied Econometrics, vol. 13, pp. 209-212, February 1998.
[22] K. Bache and M. Lichman, UCI Machine Learning Repository, 2013. Irvine, CA: University of California, School of Information and Computer Science. Available: http://archive.ics.uci.edu/ml