
Applied Soft Computing 37 (2015) 125–141


Feedforward kernel neural networks, generalized least learning machine, and its deep learning with application to image classification

Shitong Wang a,b,∗, Yizhang Jiang a,b, Fu-Lai Chung b, Pengjiang Qian a
a School of Digital Media, Jiangnan University, Wuxi, Jiangsu, PR China
b Department of Computing, The Hong Kong Polytechnic University, Hong Kong

Article history: Received 5 February 2015; Received in revised form 12 June 2015; Accepted 29 July 2015; Available online 20 August 2015.

Keywords: Feedforward kernel neural networks; Least learning machine; Kernel principal component analysis (KPCA); Hidden-layer-tuning-free learning; Deep architecture and learning.

Abstract: In this paper, the architecture of feedforward kernel neural networks (FKNN) is proposed, which can include a considerably large family of existing feedforward neural networks and hence can meet most practical requirements. Different from the common understanding of learning, it is revealed that when the number of the hidden nodes of every hidden layer and the type of the adopted kernel based activation functions are pre-fixed, a special kernel principal component analysis (KPCA) is always implicitly executed. As a result, all the hidden layers of such networks need not be tuned, and their parameters can be randomly assigned and may even be independent of the training data. Therefore, the least learning machine (LLM) is extended into its generalized version in the sense of adopting many more error functions than the mean squared error (MSE) function alone. As an additional merit, it is also revealed that the rigorous Mercer kernel condition is not required in FKNN networks. When the proposed architecture of FKNN networks is constructed in a layer-by-layer way, i.e., the number of the hidden nodes of every hidden layer may be determined only in terms of the extracted principal components after the explicit execution of a KPCA, we can develop FKNN's deep architecture such that its deep learning framework (DLF) has a strong theoretical guarantee. Our experimental results on image classification demonstrate that the proposed FKNN's deep architecture and its DLF based learning indeed enhance the classification performance.

© 2015 Elsevier B.V. All rights reserved.

∗ Corresponding author at: School of Digital Media, Jiangnan University, Wuxi, Jiangsu, PR China. Tel.: +86 13182791468. E-mail address: wxwangst@aliyun.com (S. Wang).
http://dx.doi.org/10.1016/j.asoc.2015.07.040
1568-4946/© 2015 Elsevier B.V. All rights reserved.

1. Introduction

The wide popularity of feedforward neural networks in many fields is mainly due to two factors: (1) their strong approximation capability for complex multivariate nonlinear functions directly from input samples; (2) their strong modeling capability for a large class of natural and artificial phenomena which are very difficult to handle with classical parametric techniques. However, when applied to many application scenarios, all parameters of feedforward neural networks [1,2] need to be adjusted in a backward way, and thus a dependence relationship between different layers of parameters exists in such networks, which results in a serious bottleneck: their traditional learning algorithms, such as the BP algorithm, are usually much slower than required, for example taking several hours or even several days, and may even fall into local minima. On the other hand, cross-validation and/or early stopping are sometimes adopted to circumvent the overfitting phenomena. In addition, before training a feedforward neural network by a traditional learning algorithm like the BP algorithm in [3], we must fix the number of the hidden layers and the number of the hidden nodes of every hidden layer, and choose an appropriate error function in terms of the training task. If these choices are inappropriate, the performance of traditional learning algorithms will degrade a lot.

In this work, we first propose an architecture, called FKNN, of feedforward neural networks with infinitely differentiable kernel based activation functions, which can include a large family of existing feedforward neural networks such as radial basis function (RBF) networks, sigmoidal feedforward networks, self-organizing feature map (SOM) networks and wavelet neural networks [3], and hence can meet most practical requirements. The contributions of our work here can be highlighted in two main aspects: (1) When the number of the hidden nodes of every hidden layer and the type of the adopted kernel based activation functions are pre-fixed, different from the common understanding of learning, we prove that when all the hidden layers of such networks are tuning-free and their parameters are randomly assigned and even may be independent of the training data, such networks are universal approximators with probability one. Therefore, the least learning machine (LLM) [28], as a generalized version of the extreme learning machine (ELM) [4–13], can be further extended into its generalized version in the sense of adopting many more error functions instead of only the MSE function.
(2) When the proposed architecture of FKNN networks is constructed in a layer-by-layer way, i.e., the number of the hidden nodes of every hidden layer may be determined after the explicit execution of a KPCA, we can develop FKNN's deep architecture such that its deep learning has a strong theoretical guarantee together with enhanced performance for image classification. The more detailed contributions of this work can be summarized as follows:

(1) Given the proposed architecture of FKNN networks with the pre-fixed number of the hidden nodes of every hidden layer and the pre-fixed type of the adopted kernel based activation functions, we reveal that for any hidden layer with its randomly assigned kernel parameter vectors, a special KPCA is implicitly performed and hence all the hidden layers may be tuning free. With this special KPCA, we can justify an appropriate number of the hidden nodes in the hidden layer in terms of the rank of the covariance matrix of the kernelized transformed dataset. This is a novel method of estimating an appropriate number of the hidden nodes in a hidden layer, which is also used to justify whether the pre-fixed number of the hidden nodes is appropriate or not.
(2) We reveal that the proposed feedforward neural networks behave like kernel methods, and hence their learning can be separated into two independent parts: an implicitly executed KPCA plus a learning algorithm LA on the transformed data. However, the rigorous Mercer kernel condition is not required. Theoretical results in kernel methods can help us answer why ELM, LLM and its generalized version here outperform BP-like learning algorithms for feedforward neural networks. Unlike ELM and LLM, which are based only on the MSE error function, the generalized LLM here can adopt various error functions.
(3) When FKNN networks are constructed in a layer-by-layer way, i.e., the number of the hidden nodes of every hidden layer is only determined after the explicit execution of a KPCA, we develop a deep architecture of FKNN networks with its deep learning framework DLF. This novel deep learning framework is built in terms of KLA = multi-layer KPCAs + LA, which gives the deep learning a rigorous theoretical guarantee for the first time. We show that training this deep architecture can be finished within O(N) time complexity.
(4) When less abstract features are required, a lower level of the hierarchy in the proposed deep architecture can provide us a range of feature representations at varying levels of abstraction. Another advantage is that a lower level of the hierarchy can be used as the shared transformed data space for different tasks. Let us keep in mind that multi-layer feedforward neural networks with BP-like learning cannot have this advantage.
(5) The effectiveness of the proposed deep FKNN architecture and its DLF based learning algorithm is confirmed by our experiments on image classification.

The rest of this paper is organized as follows. In Section 2, we define the proposed FKNN network and its architecture. We prove that training such a FKNN network may be hidden-layer-tuning-free and then develop the proposed generalized LLM. In Section 3, we propose a deep FKNN architecture and its deep learning framework DLF. We also investigate the DLF-based image classification technique. In Section 4, we report our experimental results about DLF-based image classification and confirm its superiority over the classical technique in which a KPCA plus a learning algorithm is adopted. Section 5 concludes the paper, and future works are also given in this section.

2. FKNNs and generalized LLM

In this study, we consider the following feedforward kernel neural networks (FKNNs hereafter), as shown in Fig. 1. A FKNN network contains the input layer, in which the input is x = (x1, x2, . . ., xd)^T ∈ R^d; L hidden layers, in which each hidden node in each layer takes the same infinitely differentiable kernel function with different parameters as its activation function; and the output layer, in which the output y of the FKNN network can be expressed as the linear combination of the m activation functions in the last hidden layer by using the output weights α1, α2, . . ., αm.

Fig. 1. Architecture of FKNN. (Annotations in the figure: How many layers? — dependent on the abstraction levels. What is an appropriate number of hidden nodes, with the chosen type of activation functions, at every hidden layer? What is the hidden-layer-tuning-free learning algorithm with the chosen error functions?)



It is very easy for us to extend this architecture to multiple outputs by using different linear combinations between the output layer and the last hidden layer, so we only consider a single-output FKNN network for convenience here. Just as in current research on feedforward neural networks, once the architecture of a FKNN network is fixed, i.e., the number of hidden nodes at every hidden layer and the type of the activation functions are pre-fixed, the remaining work deals with the definition of its error function and the choice of its learning algorithm for the given training dataset.

As we may know well, in order to assure the universal approximation of feedforward neural networks, their activation functions may take any infinitely differentiable functions. Thus we restrict FKNN's activation functions to be infinitely differentiable kernel functions. In terms of Table 1, which summarizes the current representative feedforward neural networks, we can readily observe that such a restriction does not prevent FKNN from covering these representative networks. In other words, FKNN can be taken to meet most practical requirements for feedforward neural networks.

Now, let us first study a single hidden layer FKNN network, as shown in Fig. 2. Assume such a FKNN network has m hidden nodes in the single hidden layer whose activation functions are g(x, θ1), g(x, θ2), . . ., g(x, θm), where g(·, ·) denotes the adopted kernel function and θi (i = 1, 2, . . ., m) denotes the parameter vector of the ith kernel function. For the given training dataset {x1, x2, . . ., xN}, where xi ∈ R^d (i = 1, 2, . . ., N) and N is the number of the training patterns, its training matrix is X = [x1, x2, . . ., xN] and its centralized training matrix is X̄ = [x1 − x̄, x2 − x̄, . . ., xN − x̄], in which x̄ = (1/N) Σ_{i=1}^{N} xi. Let L = I − (1/N)11^T, where I is an N × N identity matrix and 1 is a column vector with every element equal to one; then X̄ = XL. Thus, the covariance matrix of the training dataset can be written as

C = (1/N) X̄ X̄^T.   (1)

Fig. 2. FKNN network with a single hidden layer.

KPCA extracts the principal components by calculating the eigenvectors of the kernelized covariance of a given dataset. For the above single hidden layer FKNN network, we have the following theorem.

Theorem 1. Assume the covariance matrix C = (1/N) X̄ X̄^T for the given training matrix X = [x1, x2, . . ., xN] as above, and the matrix D = (1/N) G G^T for the single hidden layer of a FKNN network, where G = [G1, G2, . . ., GN] and Gi = (g(xi − x̄, θ1), g(xi − x̄, θ2), . . ., g(xi − x̄, θm))^T (i = 1, 2, . . ., N). Then there must exist a kernel feature mapping such that D can be generated from C after kernelizing it by this kernel mapping.

Proof. Let us observe the covariance matrix C = (1/N) X̄ X̄^T. Its every element is c_ij = Σ_{k=1}^{N} x̄_ki x̄_kj (i, j = 1, 2, . . ., N), where x̄_ki, x̄_kj denote the ith and jth components of xk − x̄, respectively. For the matrix D = (1/N) G G^T for the single hidden layer of a FKNN network, its every element is D_ij = Σ_{k=1}^{N} g(xk − x̄, θi) g(xk − x̄, θj) (i, j = 1, 2, . . ., N). Therefore, this theorem holds true if we simply take the kernel feature mapping function ϕ(x) = (g(x, θ1), g(x, θ2), . . ., g(x, θm)).

In terms of KPCA, Theorem 1 essentially implies that for a given training set, once the number m of the hidden nodes and the kernel type in the hidden layer are pre-fixed, for arbitrarily assigned parameter vectors θ1, θ2, . . ., θm, which may even be independent of the training data, a KPCA has been naturally and implicitly performed by taking the kernel feature mapping function ϕ(x) = (g(x, θ1), g(x, θ2), . . ., g(x, θm)) to realize a kernel data transformation from C to D. What is more, such a KPCA has the following distinct virtue: unlike commonly used kernel methods such as support vector machines (SVMs) and support vector regressions (SVRs) in [22], the rigorous Mercer condition [22] for kernel functions is not required, since the kernel functions here are presented explicitly in the kernel feature mapping functions, and the kernel trick, as a means of implicitly defining the feature mapping as in SVMs and SVRs, is not required, due to the implicit execution of such a KPCA or, even when it is executed explicitly, because the kernel functions appear only after the inner products. For example, the sigmoidal kernel function (see also Table 1) k(x, y) = tanh(γ1 x^T y + γ2), γ1, γ2 ∈ R, is often used in feedforward neural networks. Obviously, it is not positive definite, and thus, according to Mercer's condition, it cannot be used in SVMs and SVRs. However, we can adopt it here for a FKNN network. Therefore, the sigmoidal and Gaussian functions, as two commonly used hidden layer activation functions in feedforward neural networks, are both included in our FKNN architecture.

Remark 1. For the given training dataset, with arbitrarily assigned parameter vectors θ1, θ2, . . ., θm, in terms of KPCA with the corresponding kernel feature mapping function ϕ(x) = (g(x, θ1), g(x, θ2), . . ., g(x, θm)), we can obtain the corresponding eigenvalues and eigenvectors. In other words, the rank r of the kernelized covariance of the training dataset can be used to decide the number of the hidden nodes of a single hidden layer FKNN network. When m = r, a full-rank KPCA is performed. When m < r, a low-rank KPCA is performed. As pointed out in [23], a low-rank KPCA can help us overcome the overfitting phenomenon in feedforward neural networks and remove the noise effect in the kernel feature mapping space. When m > r, redundant hidden nodes may remain in the hidden layer. Therefore, when we use the above KPCA in Theorem 1 to explain the data transformation from the input layer to the hidden layer of a single hidden layer FKNN network, we actually give a novel approach to determining an appropriate number of the hidden nodes, which can also be used to justify whether the pre-fixed number of the hidden nodes is appropriate or not.
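As a concrete illustration of Theorem 1 and Remark 1, the following minimal NumPy sketch builds the hidden-layer output matrix G for a single hidden layer with randomly assigned Gaussian kernel parameters, forms D = (1/N)GG^T, and inspects its eigenvalues to estimate the rank r used to judge the number of hidden nodes. The Gaussian width, the toy data and the way the centres are drawn are illustrative assumptions, not prescriptions from the paper.

```python
import numpy as np

def gaussian_g(x, theta, width=1.0):
    """Kernel activation g(x, theta) = exp(-||x - theta||^2 / width^2)."""
    return np.exp(-np.sum((x - theta) ** 2) / width ** 2)

rng = np.random.default_rng(0)
N, d, m = 200, 16, 50                  # training patterns, input dim, hidden nodes
X = rng.normal(size=(N, d))            # toy training data
thetas = rng.normal(size=(m, d))       # randomly assigned kernel parameter vectors
                                       # (may be independent of the training data)
x_bar = X.mean(axis=0)
# G_i = (g(x_i - x_bar, theta_1), ..., g(x_i - x_bar, theta_m))^T, stacked as columns
G = np.array([[gaussian_g(x - x_bar, t) for t in thetas] for x in X]).T   # m x N

D = G @ G.T / N                        # kernelized covariance of Theorem 1
eigvals = np.linalg.eigvalsh(D)[::-1]  # descending eigenvalues of the implicit KPCA
r = int(np.sum(eigvals > 1e-10 * eigvals.max()))   # numerical rank (Remark 1)
print("rank r of the kernelized covariance:", r, "with m =", m)
# r == m suggests a full-rank KPCA; r < m suggests redundant hidden nodes.
```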
Table 1
Representative feedforward neural networks.

Types of feedforward neural networks:
Sigmoidal feedforward neural networks, radial basis function networks, wavelet feedforward neural networks, principal component analysis networks, competitive learning networks, clustering neural networks, evolving neural networks, SOM networks, fuzzy neural networks (see [3,15–17]), positive and negative fuzzy rule systems (see [15]).

Activation functions with kernels [3,18,19]:
1) Most frequently used sigmoidal functions such as 1/(1 + e^{−x^T x_i + b}) and (1 − exp(−x^T x_i))/(1 + exp(−x^T x_i));
2) Decaying RBF functions such as the Gaussian function exp(−‖x − x_i‖²/σ²);
3) Mexican hat wavelet function (2/√3) π^{−1/4} (1 − ‖x − x_i‖²/σ²) exp(−‖x − x_i‖²/σ²);
4) B-spline basis functions such as those in [20];
5) Morlet wavelet function (2/√3) exp(−‖x − x_i‖²/σ²) cos(5‖x − x_i‖²/σ²);
6) Kernel based fuzzy basis functions which use the product inference with fuzzy membership functions for every dimension j (j = 1, 2, . . ., d), including [21]:
   6.1) Clipped-parabola (quadratic) set function with center m_j and the same width d: a_j(x) = 1 − ((x − m_j)/d)² if |x − m_j|/d < 1, and a_j(x) = 0 otherwise;
   6.2) Gaussian set function, which depends on the mean m_j and the same standard deviation d/√2: a_j(x) = exp(−((x − m_j)/d)²);
   6.3) Cauchy set function: a_j(x) = 1/(1 + ((x − m_j)/d)²).

Error functions (see [3]):
1) MSE, i.e., e = (1/N) Σ_{i=1}^{N} (y_i − ȳ_i)², where y_i is the output of a FKNN network and ȳ_i is the actual output of the training sample;
2) e = (1/2) Σ_{i=1}^{N} [(1 + y_i) ln((1 + y_i)/(1 + ȳ_i)) + (1 − y_i) ln((1 − y_i)/(1 − ȳ_i))], or e = −(1/2) Σ_{i=1}^{N} y_i ȳ_i (for binary classification where y_i ∈ {−1, +1});
3) e = (1/2) Σ_{i=1}^{N} [y_i ln(y_i/ȳ_i) + (1 − y_i) ln((1 − y_i)/(1 − ȳ_i))] for logistic regression with y_i ∈ (0, 1);
4) e = −(1/2) Σ_{i=1}^{N} [y_i ln ȳ_i + (1 − y_i) ln(1 − ȳ_i)];
5) e = (1/N) Σ_{i=1}^{N} |y_i − ȳ_i|_ε, in which |y_i − ȳ_i|_ε = 0 if |y_i − ȳ_i| ≤ ε, and |y_i − ȳ_i| − ε otherwise;
6) Logistic: e = (β/2) Σ_{i=1}^{N} ln(1 + (y_i − ȳ_i)²/β);
7) Huber's function: e = Σ_{i=1}^{N} e_i, with e_i = (1/2)(y_i − ȳ_i)² if |y_i − ȳ_i| ≤ β, and e_i = β|y_i − ȳ_i| − (1/2)β² if |y_i − ȳ_i| > β;
8) Talwar's function: e = Σ_{i=1}^{N} e_i, with e_i = (1/2)(y_i − ȳ_i)² if |y_i − ȳ_i| ≤ β, and e_i = (1/2)β² if |y_i − ȳ_i| > β;
9) Hampel's tanh estimator: e = Σ_{i=1}^{N} e_i, with
   e_i = (1/2)(y_i − ȳ_i)² if |y_i − ȳ_i| ≤ β1,
   e_i = (1/2)β1² − (2c1/c2) ln[(1 + exp(c2(β2 − |y_i − ȳ_i|)))/(1 + exp(c2(β2 − β1)))] − c1(|y_i − ȳ_i| − β1) if β1 < |y_i − ȳ_i| ≤ β2,
   e_i = (1/2)β1² − (2c1/c2) ln[2/(1 + exp(c2(β2 − β1)))] − c1(β2 − β1) if |y_i − ȳ_i| > β2;
10) e = (1/N) Σ_{p=1}^{N} min_{1≤k≤K} ‖x_p − c_k‖² for competitive neural networks, e.g., clustering neural networks;
11) KPCA network's MSE function: E[‖e_i‖²] with e_i = ‖x_t − x̂_i(t)‖, or 1^T E[h(e_i)] = E[‖h(e_i)‖²], where h(e_i) is a real function;
12) Correntropy-based error (CE) criterion function [48]: CE_{Sval}(e_i) = (1/Nval) Σ_{i=1}^{Nval} K_σ(e_i) = (1/Nval) Σ_{i=1}^{Nval} exp(−e_i²/(2σ²)), where e_i = y_i − ȳ_i.

Learning algorithms:
BP and its variants, genetic learning methods, Hebbian learning methods and so on; see [3].
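To make the error-function choices in Table 1 concrete, a hedged sketch of three of them (MSE, Vapnik's ε-insensitive loss and Huber's function, i.e., entries 1, 5 and 7) is given below; any output-layer learning algorithm that minimizes one of these empirical risks can, in principle, be plugged into the tuning-free last layer. The vectorized NumPy form and the toy values are ours.

```python
import numpy as np

def mse(y, y_bar):
    """Entry 1): e = (1/N) * sum (y_i - ybar_i)^2."""
    return np.mean((y - y_bar) ** 2)

def eps_insensitive(y, y_bar, eps=0.1):
    """Entry 5): residuals smaller than eps are ignored."""
    r = np.abs(y - y_bar)
    return np.mean(np.maximum(r - eps, 0.0))

def huber(y, y_bar, beta=1.0):
    """Entry 7): quadratic near zero, linear for residuals larger than beta."""
    r = np.abs(y - y_bar)
    quad = 0.5 * r ** 2
    lin = beta * r - 0.5 * beta ** 2
    return np.sum(np.where(r <= beta, quad, lin))

y = np.array([0.2, -1.0, 0.7])
y_bar = np.array([0.0, -0.8, 1.5])
print(mse(y, y_bar), eps_insensitive(y, y_bar), huber(y, y_bar))
```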
Remark 2. In [24], Yang et al. proved that a kernelized linear discriminant analysis (LDA) is equivalent to a KPCA plus LDA on the transformed dataset, i.e., KLDA = KPCA + LDA. In [14,25,26], Wang et al. proved that kernelized SVM and its variants are equivalent to KPCA + SVM or SVM's variants. In the same year, Zhang in [23] gave more general results. They proved that the kernelized version of a learning algorithm can be implemented by performing the learning algorithm on the data transformed by the full-rank KPCA, if the learning algorithm satisfies the following two mild conditions simultaneously: (1) the output result of the learning algorithm can be calculated solely in terms of x^T x_i (i = 1, 2, . . ., N), where x_i is a training data point and x is a newly arriving test data point; (2) transforming the input data with an arbitrary constant does not change the output result of the learning algorithm. They then pointed out that some common kernel methods, including kernelized canonical correlation analysis (KCCA), the kernelized partial least squares method (KPLS) and kernelized KNN (KKNN), indeed satisfy the above two mild conditions. In terms of the above theoretical results and Theorem 1, because a single hidden layer FKNN network here naturally and implicitly performs a KPCA, for any learning algorithm LA in the output layer satisfying the previous two mild conditions, this FKNN network naturally behaves like a kernelized LA (KLA for brevity) with the kernel feature mapping ϕ(x) = (g(x, θ1), g(x, θ2), . . ., g(x, θm)). Please note that, without a special explanation, by a LA we mean hereafter that it satisfies the previous two mild conditions. What is more, the above theoretical results reveal that the training of a single hidden layer FKNN network can be separated into two independent parts: an implicitly performed KPCA between the input layer and the hidden layer, plus a LA between the output layer and the hidden layer on the dataset transformed by the KPCA.
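A minimal sketch of the separation just described: the hidden layer acts as a fixed kernel data transformation ϕ(x) = (g(x, θ1), . . ., g(x, θm)) with randomly assigned parameters, and an ordinary LA is then trained on the transformed data. Here scikit-learn's k-nearest-neighbour classifier is used only as an example of a method satisfying the two mild conditions; the toy data, the Gaussian width and the classifier choice are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
N, d, m = 300, 8, 40
X = rng.normal(size=(N, d))
y = (X[:, 0] * X[:, 1] > 0).astype(int)          # toy labels

thetas = rng.normal(size=(m, d))                 # random, data-independent parameters
width = 2.0

def phi(X):
    """Kernel data transformation: row i is (g(x_i, theta_1), ..., g(x_i, theta_m))."""
    d2 = ((X[:, None, :] - thetas[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / width ** 2)

G = phi(X)                                       # part 1: fixed hidden-layer transform
la = KNeighborsClassifier(n_neighbors=5).fit(G[:200], y[:200])   # part 2: plain LA
print("test accuracy:", la.score(G[200:], y[200:]))
```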
Remark 3. According to the principle of Vapnik's structural risk minimization [22,27], a learning algorithm LA(α1, α2, . . ., αm) in the last layer of a single hidden layer FKNN network can assure that, with probability 1 − η [22,27]:

R(α1, α2, . . ., αm) ≤ Remp(α1, α2, . . ., αm) + √[ (h log(2N/h) − log(η/4)) / N ],   (2)

where h denotes the VC dimension [22,27]. That is to say, for the given training dataset, once m and the chosen kernel type are pre-fixed in the hidden layer, the VC confidence term (i.e., the second term) in Eq. (2) remains unchanged, whereas the empirical risk Remp(α1, α2, . . ., αm) and the actual risk R(α1, α2, . . ., αm) depend on the particular function chosen by the learning algorithm LA(α1, α2, . . ., αm). In other words, the fact that we only need to train the last layer of a FKNN imposes no limitation on the types of error functions. For example, apart from the most frequently used MSE error function, we can also choose other error functions, including those in Table 1. As we may know well, as two universal approximators with probability one [14], ELM (extreme learning machine) and its general version LLM (least learning machine) are actually based on the MSE error function to realize the fast training of a single hidden layer feedforward neural network by simply learning α1, α2, . . ., αm in the output layer with arbitrarily assigned weights in the hidden layer. Therefore, this remark actually says that for a single hidden layer FKNN network, we can adopt the same hidden-layer-tuning-free learning strategy as ELM and LLM, with many more types of error functions. Moreover, as illustrated in Fig. 3, existing learning algorithms including BP and genetic algorithms in Table 1 attempt to find appropriate kernel parameter vectors in their respective learning ways; however, in essence, their endeavors cannot change the second term in Eq. (2). This means that in order to control the upper bound of the actual risk, we only need to control the upper bound of Remp(α1, α2, . . ., αm) in the last layer of a feedforward neural network of this type, and hence we can choose other learning algorithms with other error functions to achieve this goal, as illustrated in Fig. 3.

Fig. 3. FKNN with different learning algorithms. (Because a KPCA is naturally and implicitly performed between the input and hidden layers, other tuning methods like BP or genetic algorithms can be discarded; only the output-layer LA is trained.)

Remark 4. We can get the generalized LLM by replacing the loss function used by LLM with other loss functions. That is to say, by the generalized LLM we mean a family of regressors which uses LLM's framework with loss functions other than the MSE loss function only. When a single hidden layer FKNN network is applied to function approximation problems, as stated in [14,28], if we take ridge regression as the LA in its last layer, we can obtain the least learning machine (LLM) for training on the given training dataset:

min (1/2) Σ_{j=1}^{m} α_j² + C Σ_{i=1}^{N} ξ_i²
s.t. Σ_{j=1}^{m} α_j g(x_i, θ_j) = y_i + ξ_i,  i = 1, 2, . . ., N.   (3)

With a sufficiently large regularizer C, LLM becomes ELM. In other words, when LLM and ELM are applied to the proposed FKNN networks, they are two special cases of all potentially chosen learning algorithms. In particular, due to the matrix inversion in LLM and the matrix pseudo-inverse in ELM, their running time is actually O(N³), where N is the total number of the training patterns. When N becomes large, they obviously become impracticable. Hence, by "extremely fast learning in ELM and least learning in LLM" we only mean that their learning may be hidden-layer-tuning free, while their training time still remains very high (i.e., O(N³)). Even if LLM and ELM take the same strategy as in [11], i.e., Eq. (38) in [11], for large training datasets their time complexity is still O(m³). According to the experiments in [11], m is generally taken as 1000 to meet the approximation/classification accuracy, and hence their time complexity remains considerably high, especially for application scenarios where N is huge but considerably less than the magnitude of O(1000³) (= O(10⁹)). However, in terms of the above remarks, this serious shortcoming can be readily overcome by choosing a LA with O(N) time complexity. For example, as a competitive model of the LLM in Eq. (3), we can construct an alternative LLM in Eq. (4) with Vapnik's ε-insensitive loss function to achieve this goal with strong generalization capability:

min (1/2) Σ_{j=1}^{m} α_j² + C Σ_{i=1}^{N} (ξ_i + ξ_i*)
s.t. y_i − Σ_{j=1}^{m} α_j g(x_i, θ_j) ≤ ε + ξ_i,  ξ_i > 0,  i = 1, 2, . . ., N,
     Σ_{j=1}^{m} α_j g(x_i, θ_j) − y_i ≤ ε + ξ_i*,  ξ_i* > 0.   (4)

Obviously, with arbitrarily assigned parameter vectors θ1, θ2, . . ., θm, this alternative LLM is a linear SVR on the training dataset {G1, G2, . . ., GN}, in which G_i = (g(x_i, θ1), g(x_i, θ2), . . ., g(x_i, θm))^T (i = 1, 2, . . ., N). Please note that this training dataset {G1, G2, . . ., GN} can be calculated before solving Eq. (4). According to the solution strategy in [13], such a linear SVR can obtain its optimal solution with O(N) time complexity, which makes it very applicable for a FKNN on large training datasets. To the best of our knowledge, this is to date the best theoretical result about the convergence speed of current learning algorithms for feedforward neural networks with infinitely differentiable kernel functions for application scenarios in which N is huge but N < m³. Let us keep in mind that, because the kernel parameter vectors in the hidden layer are randomly assigned, the number m of the hidden nodes is generally a comparatively big value (e.g., 1000 in [11]), and N being larger than m³ (e.g., 10⁹) may seldom appear in application scenarios (imagine having 10⁹ training patterns!); therefore, the above alternative LLM in Eq. (4) has considerable applicability in practical scenarios. When N is larger than m³, we can take LLM or ELM with the same strategy as in [11] (i.e., Eq. (38) in [11]).
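The following sketch illustrates hidden-layer-tuning-free training with Eq. (3) and Eq. (4) on pre-computed features G_i. For Eq. (3), writing the rows G_i^T into a matrix G gives the closed form α = (G^T G + I/(2C))^{-1} G^T y; for Eq. (4), scikit-learn's LinearSVR is used only as a stand-in for the O(N) solver of [13]. The closed form, the toy data and the parameter values are our own working assumptions, not the authors' code.

```python
import numpy as np
from sklearn.svm import LinearSVR

rng = np.random.default_rng(2)
N, d, m, C = 500, 5, 60, 10.0
X = rng.normal(size=(N, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=N)

thetas = rng.normal(size=(m, d))                    # randomly assigned hidden layer
G = np.exp(-((X[:, None, :] - thetas[None, :, :]) ** 2).sum(axis=2) / 2.0)  # N x m

# LLM, Eq. (3): ridge-style closed form  alpha = (G^T G + I/(2C))^{-1} G^T y
alpha = np.linalg.solve(G.T @ G + np.eye(m) / (2 * C), G.T @ y)
print("LLM training MSE:", np.mean((G @ alpha - y) ** 2))

# Alternative LLM, Eq. (4): epsilon-insensitive linear SVR on the same features
svr = LinearSVR(C=C, epsilon=0.05).fit(G, y)
print("alternative-LLM training MSE:", np.mean((svr.predict(G) - y) ** 2))
```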
Remark 5. As seen in [6,11,28], numerous experimental results indicate that ELM and LLM outperform the conventional SVM and SVR in approximation accuracy. Let us reveal the reason from a new perspective. As a competitive model of ELM and LLM, the alternative LLM in Eq. (4) can be readily cast in its dual form as

min (1/2) Σ_{i=1}^{N} Σ_{k=1}^{N} (α_i − α_i*)(α_k − α_k*) Σ_{j=1}^{m} g(x_i, θ_j) g(x_k, θ_j) − Σ_{i=1}^{N} y_i (α_i − α_i*) + ε Σ_{i=1}^{N} (α_i + α_i*)
s.t. Σ_{i=1}^{N} (α_i − α_i*) = 0,  0 ≤ α_i, α_i* ≤ C.   (5)

Let us recall the dual of the conventional SVR with the same Vapnik's ε-insensitive loss function [22,29]:

min (1/2) Σ_{i=1}^{N} Σ_{k=1}^{N} (α_i − α_i*)(α_k − α_k*) k(x_i, x_k) − Σ_{i=1}^{N} y_i (α_i − α_i*) + ε Σ_{i=1}^{N} (α_i + α_i*)
s.t. Σ_{i=1}^{N} (α_i − α_i*) = 0,  0 ≤ α_i, α_i* ≤ C.   (6)

By comparing Eq. (5) with Eq. (6), we can easily find that the alternative LLM in Eq. (4), and hence ELM and LLM, in essence employ the special kernel combination Σ_{j=1}^{m} g(x_i, θ_j) g(x_k, θ_j) rather than only a single kernel k(x_i, x_k) as in the conventional SVR. As we know very well, kernel combinations can generally enhance the performance of kernel methods. Therefore, the above observation actually helps us answer why ELM and LLM experimentally outperform the conventional SVM and SVR.
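The "special kernel combination" that distinguishes Eq. (5) from Eq. (6) can also be seen numerically: the Gram matrix entering Eq. (5) is Σ_j g(x_i, θ_j) g(x_k, θ_j), i.e., GG^T built from the random hidden-layer features, whereas Eq. (6) uses a single kernel k(x_i, x_k). A hedged sketch with Gaussian kernels, arbitrary widths and toy data of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, m = 100, 4, 30
X = rng.normal(size=(N, d))
thetas = rng.normal(size=(m, d))

# Features of the random hidden layer: G[i, j] = g(x_i, theta_j)
G = np.exp(-((X[:, None, :] - thetas[None, :, :]) ** 2).sum(axis=2) / 1.0)

# Gram matrix implicitly used by the alternative LLM / ELM / LLM, Eq. (5):
K_combined = G @ G.T        # K_combined[i, k] = sum_j g(x_i, theta_j) * g(x_k, theta_j)

# Single-kernel Gram matrix of the conventional SVR, Eq. (6):
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K_single = np.exp(-sq / 1.0)

print(K_combined.shape, K_single.shape)   # both N x N, but built very differently
```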
Now, let us observe a multi-layer FKNN network. In terms of Theorem 1, for a multi-layer FKNN network we can easily see that a KPCA implicitly operates between the input layer and the first hidden layer, and another KPCA implicitly operates between the first and second hidden layers. This process repeats until the last hidden layer of a multi-layer FKNN network. In other words, there implicitly exist multi-layer KPCAs, which may be viewed as a special KPCA with a successive nonlinear combination of the kernel feature mapping functions between the input layer and the last hidden layer. Thus, according to the above theoretical analysis, we may realize a LA between the output layer and the last hidden layer, with randomly assigned kernel parameter vectors in every hidden layer. When we understand the behavior of multi-layer FKNN networks in this way, we can see the same benefits [14,28] as LLM, which are summarized as follows:

(1) Since all the parameters in the kernel activation functions among all the hidden layers can be randomly assigned, all the patterns in the kernel activation functions among all the hidden layers can be randomly selected, and, without any iterative calculation, the input data in every hidden layer are transformed into the next hidden layer only once in a forward, layer-by-layer way, a multi-layer FKNN network has the advantages of both easy implementation and very fast learning, compared with BP-like learning algorithms, in which the parameters need to be iteratively adjusted in a backward gradient-descent way, so that BP-like learning algorithms generally converge very slowly and sometimes even converge to local minima.
(2) Since all the parameters between all the hidden nodes can be randomly assigned, no dependence of parameters between different hidden layers exists.
(3) In fact, a multi-layer FKNN network here views the behavior of a feedforward neural network between the last hidden layer and the input layer as a successive encoding procedure for the input data in a difficult-to-understand way, due to the natural existence of KPCAs. When we understand the training behavior of FKNN networks from this new perspective, as shown in Fig. 1, this architecture can, to a large extent, help us answer why feedforward neural networks behave like a black box.

Please note that, since the generalized LLM's power can be easily justified by the rich experimental results of LLM, ELM and linear SVM/SVR in [6,11,13,28,30–34], we do not report experimental results on the generalized LLM in the experimental section, for the sake of the space of the paper.
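A minimal sketch of the forward, layer-by-layer data transformation described in points (1)–(3): every hidden layer is a random kernel feature mapping, each layer's parameters are drawn independently of the others (and of the data), and the data pass through each layer exactly once with no iterative adjustment. The layer sizes and kernel widths below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

def random_kernel_layer(n_in, n_out, width=1.5):
    """One hidden layer: randomly assigned Gaussian kernel parameter vectors."""
    centers = rng.normal(size=(n_out, n_in))
    def forward(Z):
        d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / width ** 2)
    return forward

X = rng.normal(size=(200, 16))                       # input data
layers = [random_kernel_layer(16, 40),               # no dependence between the
          random_kernel_layer(40, 25),               # parameters of different layers
          random_kernel_layer(25, 10)]

Z = X
for layer in layers:                                 # single forward pass, no tuning
    Z = layer(Z)
print("last-hidden-layer representation:", Z.shape)  # only the output-layer LA is trained on Z
```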
3. Deep FKNN network and its deep learning

In this section, we will show that a deep FKNN architecture can be built in a layer-by-layer way. Although a concrete deep FKNN architecture is application-oriented and data dependent, its deep learning algorithm can always be characterized by a general deep learning framework DLF. Based on the proposed framework DLF, we explore the deep FKNN's application to image classification.

3.1. Deep FKNN network and DLF

In [14,28], based on multi-layer KPCAs, we proposed the architecture of multi-layer feedforward neural networks and its LLM/ELM based fast learning methods. In the last section, we actually extended the conclusions in [11,14,15,28] and pointed out that more learning algorithms can be adopted in the hidden-layer-tuning-free mechanism of multi-layer feedforward neural networks. In this section, we will demonstrate that multi-layer feedforward neural networks with multi-layer KPCAs + LA can be used for deep learning.

One may cheer that, in terms of the seemingly surprising conclusions in the last section and the LLM/ELM theory in [14], even if all the kernel parameter vectors in FKNN networks are tuning-free (i.e., randomly assigned) and learning is only required in the output layer of FKNN networks, FKNNs may become universal approximators with probability one. However, we should keep in mind that when learning in this way, FKNN networks actually behave like shallow architectures such as SVMs. When we consider how to assign appropriate kernel parameter vectors by explicitly executing a KPCA in every hidden layer of FKNN networks, FKNN networks will behave like deep architectures which can help us represent complex data. Deep architectures learn complex mappings by transforming their input data through multiple layers of nonlinear processing.
The advantages of deep architectures can be highlighted by the following motivations: the wide range of functions which can be parameterized by composing weakly nonlinear transformations, the appeal of hierarchical distributed representations, and the potential for combining unsupervised and supervised methods. However, at present there exists an on-going debate over deep vs. shallow architectures. Deep architectures are generally more difficult to train than shallow ones, since they involve difficult nonlinear optimizations and many heuristics. The challenges of deep learning explain the appeal of SVMs, which learn nonlinear classifiers via the kernel trick. Unlike deep architectures, SVMs are trained by solving simple QP problems. However, SVMs cannot seemingly benefit from the advantages of deep learning.

Here we will show that FKNN networks can help us enjoy both the success of deep learning and the elegance of kernel methods. Although we share a similar motivation as previous works [35,36], our study here is very different and makes two main contributions. First, FKNN networks here can be built in a layer-by-layer way in terms of KLA = multi-layer KPCAs + LA rather than the strategies in [35–40], which gives their deep learning a rigorous theoretical guarantee for the first time. Second, with the adopted fast KPCAs in the deep learning, the time complexity of training FKNN networks indeed becomes linear with the number of the training samples.

In fact, multi-layer KPCAs map the input data into the kernelized data space in the last hidden layer through the kernelized data spaces between multiple hidden layers. One may ask why we need multi-layer KPCAs. An example can help us explain the motivation and show how to build an application-specific deep FKNN. Assume we want to accomplish two tasks from a face image database: identify one male criminal with a square face and a mole on the forehead, and another male criminal with a round face and a high nose. Quite often, one first picks up all the male images from the face image database to obtain a shared data space for these two tasks. We can imagine this procedure as a nonlinear transformation between all male images and all the images. Based on the shared data space, one may then pick up square face images and round face images, respectively, from all the picked male images to construct two data sets for the two tasks, which may be viewed as another nonlinear transformation. Finally, we can finish the two tasks, respectively, by identifying a face with a mole on the forehead among the square face images (a nonlinear transformation for the first task) and a face with a high nose among the round face images (a nonlinear transformation for the second task). This example can be characterized by the deep architecture of FKNNs in Fig. 4, where multi-layer KPCAs + LA is adopted.

Fig. 4. An example of FKNN's deep architecture. (Two task outputs y1 and y2 are produced by LA 1 and LA 2 on top of stacked KPCA layers that share a transformed data space built from the input x1, x2, . . ., xd; the dotted lines indicate lower levels of the hierarchy that can serve as shared or output layers.)

By observing the learning behavior of FKNN networks in Fig. 4, we can see the following virtues:

(1) The number of the layers of a FKNN network can be increased in a layer-by-layer way, and a KPCA behaves as a data transformation method in each hidden layer. This shares the same idea as the deep learning in [37–40]: enough layers can ensure that we obtain sufficiently complex nonlinear functions to express complex data. Except for the kernel parameter vectors in every layer, no other parameter is required.
(2) If the number of the layers of a FKNN network is greater than or equal to 2, even if a linear PCA is adopted in every hidden layer, a nonlinear data transformation in the last hidden layer will always be obtained. When a nonlinear kernel PCA is adopted in the first hidden layer, a more complex nonlinear data transformation in the last hidden layer will be generated, which may be more beneficial for representing more complex data. Both linear PCAs and KPCAs in the hidden layers actually reflect successive abstractions, in a deep way, of the input data or the kernelized transformed data, respectively. In particular, Wang et al. proposed a KPCA based learning architecture for LLM or ELM based FKNN networks [28]. Mitchell et al. [35] realized deep structure learning by using multi-layer linear PCAs and pointed out that their deep structure learning framework is a fairly broad one which can be used to describe the conventional Convolutional Networks and Deep Belief Networks, whose successes in several applications have largely sparked the recent growing interest in deep learning [35]. Chao and Saul implemented deep learning by using special multi-layer kernel machines in [36]. Therefore, the proposed multi-layer FKNN framework is general in two senses: (i) it generalizes the work in [35], since besides linear PCAs, more nonlinear KPCAs and more kernel functions can be involved; (ii) it generalizes the work in [14], since besides the MSE error function adopted in LLM and ELM, more error functions can be considered within this framework.
(3) Although existing deep learning techniques have obtained impressive successes, the resulting systems are so complex that they require many heuristics without theoretical guarantee (e.g., why they can in theory be separated into two independent learning components, and what the maximal number of the hidden nodes in every hidden layer is), and it is very difficult to understand which theoretical properties are responsible for the improved performance. However, the proposed FKNN architecture has a strong theoretical guarantee, i.e., KLA = multi-layer KPCAs + LA, and rich achievements about kernel methods can help us support the suitability of FKNN networks for deep learning. Besides, the proposed FKNN architecture naturally provides a lower level of the hierarchy which can be used as "the last output layer", as shown in Fig. 4, thus resulting in a range of data representations at varying levels of abstraction.
(4) The entire FKNN network behaves like a kernelized learning algorithm KLA, since multi-layer KPCAs can be viewed as a special KPCA and KLA = this special KPCA + LA. The most distinctive advantage of such a multi-layer FKNN network is that we may choose a fast KPCA algorithm and a fast implementation of LA independently to achieve FKNN's fast deep learning. For example, in order to make the time complexity of FKNN's deep learning linear with N, we may choose the fast KPCA method in [41], which is proved to have O(kN) time complexity, where k is the number of the extracted components; we may also choose ELM and LLM in [14,15,28], linear SVM in [13], KNN [31] and support vector clustering in [42] as the fast learning method in the last layer of the deep FKNN network, depending on the training tasks.
(5) Since we can carry out KPCAs in a layer-by-layer way, as indicated by the dotted lines in Fig. 4, a potential advantage is that when less abstract features are required, a lower level of the hierarchy can be used as the "output" layer, readily providing a range of feature representations at varying levels of abstraction. The other advantage is that a lower level of the hierarchy can be used as the shared transformed data space for different tasks. Let us keep in mind that multi-layer feedforward neural networks with BP-like learning cannot have this advantage.

Obviously, deep FKNN networks are application-oriented and data dependent. However, since they can be constructed in a layer-by-layer way, we can summarize their deep learning using the general deep learning framework DLF below. In other words, when oriented to a specific application, we may design a concrete FKNN and realize its deep learning implementation by instantiating and even simplifying the proposed framework DLF. For example, we may instantiate LA with the classical SVM in DLF. We may also omit Step 4 in DLF in terms of the trade-off between the deep learning performance and the running time. This learning framework consists of four main steps: (1) application-oriented data preparation; (2) realizing the data transformation by multi-layer KPCAs, in a layer-by-layer way from the input layer to the last hidden layer; (3) carrying out a learning algorithm LA on the transformed dataset after the multi-layer KPCAs; (4) optimizing the kernel parameter vectors in the multi-layer KPCAs, in terms of the result obtained by the learning algorithm in step (3), in a layer-by-layer way from the last hidden layer to the input layer. More concretely, we summarize the detailed DLF framework as follows.

Learning framework DLF for a deep FKNN network

Input: The given dataset, e.g., an image database; the number of the hidden layers of a deep FKNN network, with their respective kernel functions used in the multi-layer KPCAs.
Output: The parameters of the output layer determined by a LA, and/or the kernel parameter vectors in every hidden layer of the deep FKNN network.
Step 1: Preprocess the given dataset to form the input dataset for the deep FKNN network; fix the number of the layers in the FKNN and the type of kernel function in each hidden layer of the deep FKNN network.
Step 2: Carry out a full-rank or low-rank KPCA with the chosen kernel function and its initial kernel parameter vector in every hidden layer, in a layer-by-layer way from the input layer to the last hidden layer. If a full-rank KPCA is taken, the number of the hidden nodes in a hidden layer is automatically determined by the full-rank KPCA. Otherwise, fix the number of the hidden nodes as the chosen top-k principal components, in terms of the principal components extracted by a low-rank KPCA.
Step 3: Carry out the chosen learning algorithm LA in the output layer on the transformed dataset after the multi-layer KPCAs, such that the parameters in the output layer are determined and the kernel parameter vectors in the last hidden layer are optimized.
Step 4: Determine an appropriate kernel parameter vector in every hidden layer by either (1) grid search in every hidden layer, in a layer-by-layer way from the last hidden layer to the input layer; or (2) a chosen optimization method with a chosen performance criterion (e.g., MSE) and the initial kernel parameter vector, optimizing the kernel parameter vector in every hidden layer in a layer-by-layer way from the last hidden layer to the input layer.

It should be pointed out that, different from the strategies used in the existing typical deep learning techniques [35–40], mature KPCA techniques can be exploited to play a crucial role in the proposed general deep learning framework DLF. Although both DLF and the existing deep learning techniques can indeed be used to build deep neural networks in a layer-by-layer way, DLF is very different from them in principle. In other words, although existing deep learning techniques have obtained impressive successes, the resulting systems are so complex that they require many heuristics without theoretical guarantee (e.g., why they can in theory be separated into two independent learning components, and what the maximal number of the hidden nodes in every hidden layer is), and it would be very difficult to understand which theoretical properties are responsible for the improved performance. However, since KLA = multi-layer KPCAs + LA, DLF has a strong theoretical guarantee.
3.2. Deep FKNN network for image classification

In order to exhibit the applicability of the proposed deep FKNN network and the proposed deep learning framework DLF, here we consider a concrete application to image classification, though there is no reason the DLF could not be applied to other types of data; we also hope to do so in the near future. Now, let us state how we can apply DLF to image classification.

For a given set of images as the training dataset, in order to do the data preparation well, we use a quad-tree decomposition to subdivide each image recursively; the leaf level of the quad-tree is a set of non-overlapping l × l (in general, we use 4 × 4) pixel patches. We present the set of all l × l patches from all images in the training dataset to a deep FKNN network as its input data. For simplicity, we adopt Gaussian kernel functions with the same kernel width in every hidden layer, and the top-k eigenvectors as a reduced-dimensionality basis. Each patch is projected into this new basis, and the reduced-dimensionality patches are then joined back together in their original order. The result is that each image has its dimensionality multiplied by k/16 (Fig. 5). These reduced-dimensionality images form the next layer up in the deep FKNN network.

This process is repeated, using the newly created layer as the data for the divide–KPCA–union process to create the next layer. At every layer after the first, the dimensionality is reduced by a fixed factor of 4. The process terminates when the remaining data is too small to split, and the entire image is represented by a single k-dimensional feature vector at the last hidden layer of the deep FKNN network.

As an illustration, if our original data is a set of m vectors, each of length n (e.g., for 200 16 × 16 images, m = 200 and n = 16 × 16 = 256), then we start with a raw data matrix D0, which is m × n. In the case of images, this means each row of the matrix is an image. The level-one divided data matrix D1 is generated by recursively dividing the vectors in D0 into l × l (4 × 4 = 16) dimensional patches, so it will be an (m · n/16) × 16 matrix. We apply a KPCA to D1, extract the top-k eigenvectors into F1, and use F1 as a basis onto which to project the vectors of D1. Applying the projection to the vectors in D1 results in P1, an (m · n/16) × k matrix. Adjacent vectors in P1 are then united (using the inverse of the splitting operator), resulting in D2, which is (m · n/(16·4)) × 4k. We can then continue to apply a KPCA to build the current hierarchy on D2. In general, Dt will be m × (n · k/(16 · 4^t)). When 16 · 4^t = n, the hierarchy (i.e., the deep FKNN architecture) is complete, giving the last hidden layer with dimensions of m × k. More concretely, based on the above, below we state the DLF-based image classification algorithm, in which the instantiation of Step 4 of DLF is not considered.

Fig. 5. An example of how the deep FKNN network works on a pair of images from a 32 × 32 image dataset.
DLF based image classification algorithm

Input: The m × n image data matrix D0 (one image per row); the number l of hidden layers in the FKNN, where l = log4(n/16); the number k of eigenvectors to keep.
Output: Feature space hierarchy F, projected data hierarchy Dl, and the classification accuracy for Dl.
Step 1: Subdivide D0 into D1 as the input data of the deep FKNN network by
    D1 = DivideQuads(D0);
Step 2: Obtain the transformed images after the multi-layer KPCAs by
    for t = 1 to l do
        Ft = KPCA(Dt, k) with a Gaussian kernel function and its kernel width;
        Pt = Ft Dt;
        Dt+1 = UnionQuads(Pt);
    end for
Step 3: Carry out a learning algorithm such as SVM, KNN or Naive Bayes on Dl to obtain the classification result for Dl.

Please note that the DivideQuads function does a quad-tree style division of an image into four equal-sized sub-images, and the UnionQuads function inverts this operation. The function KPCA(Dt, k) computes an eigen-decomposition of the kernelized covariance of Dt and then returns the top-k eigenvectors.
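A hedged Python sketch of the algorithm above. Here DivideQuads splits every image (a row of the data matrix) into non-overlapping 4 × 4 patches stacked as rows, UnionQuads concatenates each group of four adjacent projected patch vectors back into one row, and KernelPCA stands in for the KPCA(Dt, k) step. The patch ordering, the choice of scikit-learn, the kernel width and the final classifier are our assumptions, not the authors' MATLAB code.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.svm import SVC

def divide_quads(D0, patch=4):
    """Split each row (an s x s image) into non-overlapping patch x patch blocks,
    stacked as rows of a new matrix (blocks taken in raster order)."""
    m, n = D0.shape
    s = int(np.sqrt(n))
    blocks = []
    for img in D0.reshape(m, s, s):
        for r in range(0, s, patch):
            for c in range(0, s, patch):
                blocks.append(img[r:r + patch, c:c + patch].ravel())
    return np.array(blocks)                      # (m * n / patch^2) x patch^2

def union_quads(P):
    """Concatenate every 4 adjacent projected vectors into one longer vector."""
    rows, k = P.shape
    return P.reshape(rows // 4, 4 * k)

def dlf_image_classification(D0, labels, k=4, gamma=0.5):
    n = D0.shape[1]
    l = int(round(np.log(n / 16) / np.log(4)))   # l = log4(n / 16)
    D = divide_quads(D0)                         # Step 1
    hierarchy = []
    for t in range(l):                           # Step 2: divide-KPCA-union layers
        kpca = KernelPCA(n_components=k, kernel="rbf", gamma=gamma).fit(D)
        hierarchy.append(kpca)
        D = union_quads(kpca.transform(D))
    clf = SVC(C=100.0).fit(D, labels)            # Step 3 (SVM used as the LA)
    return hierarchy, D, clf

# Toy usage: 40 random 16 x 16 "images", two classes
rng = np.random.default_rng(6)
D0 = rng.random((40, 256))
labels = rng.integers(0, 2, size=40)
hierarchy, Dl, clf = dlf_image_classification(D0, labels)
print("final representation:", Dl.shape)
```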
posed deep FKNN architecture and its deep learning framework DLF
Step 2: Obtain the transformed images after multi-layer KPCA by
for t = 1 to l do by experimental results about the proposed DLF based image classi-
Ft = KPCA(Dt , k) with Gaussian kernel function with its fication algorithm on images. We organize experiments to compare
kernel width; the performance of a deep FKNN network with multi-layer KPCAs
Pt = Ft Dt plus a LA with a KPCA plus a LA (and hence a kernel method). Our
Dt+1 = UnionQuads(Pt ) ;
experimental results indicate that the proposed FKNN network can
end for
yield better image classification performance and hence confirm a
Step 3: Carry out certain learning algorithm such as SVM, KNN and
basic deep-structure hypothesis that real-world data contain deep
Naive Bayes on Dl to obtain the classification result for Dl .
structures, and that exploiting these structures will yield improved
performance on machine learning tasks.

Table 2
Example images for image classification tasks.

Task 1 (Nature) Class 1: coast

Class 2: forest

Task 2 (Buildings) Class 1: inside city

Class 2: tall building

Task 3 (Roads) Class 1: highway

Class 2: street
The experimental studies are organized as follows. In Section 4.1, the experiment settings and the adopted image datasets are described. In Section 4.2, the classification results obtained by DLF and KPCA + LA for the adopted image datasets are reported and analyzed.

4.1. Experiment settings

4.1.1. Image datasets
For the task of image classification, we extract 600 grayscale images from the classical image database [43], belonging to 6 different classes, i.e., coast, forest, inside city, tall building, highway, and street. We then uniformly divide this dataset into three different binary classification tasks (see Table 2 for example images). The resolutions used are 16 × 16 and 32 × 32 pixels. We create this dataset such that we have natural images (i.e., not artificially generated or composited) in multiple resolutions, with multiple images of each object.

Now, let us explain the input dataset, generated from the original 200 16 × 16 image datasets of task 1 to task 3, respectively, for the DLF based image classification algorithm and the classical KPCA plus LA method. For the classical KPCA plus LA method, KPCA takes the original image dataset as its input data and then generates the top-k eigenvectors for each input datum. The DLF based image classification algorithm takes 3200 16-dimensional data, generated from the original image dataset, as the input data for the multi-layer KPCAs. After the multi-layer KPCAs, 200 k-dimensional data are presented to a LA in the output layer of the deep FKNN network. Similarly, the input dataset is generated for both methods on the original 200 32 × 32 image datasets of task 1 to task 3.

For each experiment, 70% of the images are taken as the training set, and the remaining 30% of the images are used for testing.

4.1.2. Comparison algorithms and parameter settings
In our experiments, we consider two comparison algorithms. One is the above DLF based image classification algorithm (we still call it DLF here for simplicity), i.e., KPCA with Gaussian kernels in every hidden layer of the deep FKNN network plus a LA in the last hidden layer of the deep FKNN network; the other is a simple kernel method, i.e., a KPCA with a Gaussian kernel on the input data plus a LA on the transformed data after the KPCA. We take the LA from three standard classifiers: SVM [45], KNN [44] and Naive Bayes [46], trained on the transformed training data.

Since we mainly aim at observing whether the DLF based deep FKNN network can have better learning performance than KPCA + LA (i.e., a single-hidden-layer FKNN network), we list the corresponding results of the above two comparison algorithms only for the given kernel widths δ ∈ {0.1, 0.5, 0.9, 1, 5, 9, 10, 50, 100}, instead of kernel widths obtained by an optimization method such as leave-one-out cross-validation [49] in a layer-by-layer way. Let us keep in mind that, although we can fix the Gaussian kernel widths in a layer-by-layer way to keep the runtime of DLF linear with the depth of the hierarchy of the deep FKNN network, DLF is still time-consuming. Therefore, we focus our attention on the main goal of this study and leave these issues for future investigation. In addition, we fix the regularization parameter C = 100 for SVM and the number of nearest neighbors K = 5 for KNN.

Both algorithms are implemented using MATLAB on a computer with an Intel Core 2 Duo P8600 2.4 GHz CPU and 2 GB RAM.

In order to evaluate the experimental results reasonably, two traditional evaluation indices, i.e., Accuracy and F1-measure (or Acc and F1 for simplicity, respectively [47]), are adopted, in which the accuracy is the proportion of correctly classified labels and the F1-measure is the harmonic mean of precision and recall.

Table 3
Resolutions used were 16 × 16 pixels on Task 1. Each cell lists Acc / Positive F1 / Negative F1.

KPCA plus SVM vs. DLF plus SVM (Wins: 33.33% : 55.55%)
Kernel parameter    KPCA plus SVM              DLF plus SVM
δ = 0.1             0.5333 / 0.5625 / 0.5000   0.7167 / 0.7018 / 0.7302
δ = 0.5             0.6000 / 0.6000 / 0.6000   0.6167 / 0.6567 / 0.5660
δ = 0.9             0.4500 / 0.4590 / 0.4407   0.6500 / 0.7042 / 0.5714
δ = 1               0.5167 / 0.5397 / 0.4912   0.6667 / 0.7143 / 0.6000
δ = 5               0.5000 / 0.5455 / 0.4444   0.4167 / 0.4615 / 0.3636
δ = 9               0.4333 / 0.4688 / 0.3929   0.4000 / 0.4375 / 0.3571
δ = 10              0.4167 / 0.4615 / 0.3636   0.4000 / 0.4375 / 0.3571
δ = 50              0.4000 / 0.4375 / 0.3571   0.4000 / 0.4375 / 0.3571
δ = 100             0.4000 / 0.4375 / 0.3571   0.4667 / 0.5294 / 0.3846

KPCA plus KNN vs. DLF plus KNN (Wins: 0% : 55.55%)
Kernel parameter    KPCA plus KNN              DLF plus KNN
δ = 0.1             0.5333 / 0.5625 / 0.5000   0.5833 / 0.5098 / 0.6377
δ = 0.5             0.5667 / 0.5806 / 0.5517   0.6333 / 0.6207 / 0.6452
δ = 0.9             0.6000 / 0.5862 / 0.6129   0.6667 / 0.6154 / 0.7059
δ = 1               0.6833 / 0.6545 / 0.7077   0.7000 / 0.6667 / 0.7273
δ = 5               0.6500 / 0.5333 / 0.7200   0.6500 / 0.5333 / 0.7200
δ = 9               0.6333 / 0.5000 / 0.7105   0.6500 / 0.5333 / 0.7200
δ = 10              0.6500 / 0.5333 / 0.7200   0.6500 / 0.5333 / 0.7200
δ = 50              0.6333 / 0.5217 / 0.7027   0.6333 / 0.5217 / 0.7027
δ = 100             0.6333 / 0.5217 / 0.7027   0.6333 / 0.5000 / 0.7105

KPCA plus Naive Bayes vs. DLF plus Naive Bayes (Wins: 0% : 77.77%)
Kernel parameter    KPCA plus Naive Bayes      DLF plus Naive Bayes
δ = 0.1             0.3667 / 0.0952 / 0.5128   0.5333 / 0.2222 / 0.6667
δ = 0.5             0.5833 / 0.3902 / 0.6835   0.7167 / 0.6909 / 0.7385
δ = 0.9             0.5500 / 0.3077 / 0.6667   0.7333 / 0.7143 / 0.7500
δ = 1               0.5833 / 0.5283 / 0.6269   0.6833 / 0.6667 / 0.6984
δ = 5               0.5500 / 0.4255 / 0.6301   0.5500 / 0.4706 / 0.6087
δ = 9               0.5000 / 0.4000 / 0.5714   0.5500 / 0.4906 / 0.5970
δ = 10              0.5000 / 0.4000 / 0.5714   0.5500 / 0.4906 / 0.5970
δ = 50              0.5500 / 0.4906 / 0.5970   0.5500 / 0.4906 / 0.5970
δ = 100             0.5500 / 0.4906 / 0.5970   0.5833 / 0.5455 / 0.6154
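The "Wins" entry in Table 3 (e.g., 33.33% : 55.55% for the SVM pair) is consistent with counting, over the nine kernel widths, how often each method attains the strictly higher accuracy, with ties counted for neither. A small check using the accuracy columns above:

```python
import numpy as np

# Accuracy columns of the KPCA-plus-SVM / DLF-plus-SVM block of Table 3
kpca_acc = np.array([0.5333, 0.6000, 0.4500, 0.5167, 0.5000, 0.4333, 0.4167, 0.4000, 0.4000])
dlf_acc  = np.array([0.7167, 0.6167, 0.6500, 0.6667, 0.4167, 0.4000, 0.4000, 0.4000, 0.4667])

n = len(kpca_acc)
kpca_wins = np.sum(kpca_acc > dlf_acc) / n * 100     # -> 33.33...%
dlf_wins  = np.sum(dlf_acc > kpca_acc) / n * 100     # -> 55.55...%
print(f"Wins  KPCA plus SVM: {kpca_wins:.2f}%   DLF plus SVM: {dlf_wins:.2f}%")
```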
136 S. Wang et al. / Applied Soft Computing 37 (2015) 125–141

In our experiments, since we mainly aim at observing whether the DLF based deep FKNN network can have better learning performance than KPCA + LA (i.e., a single-hidden-layer FKNN network), we list the corresponding results of the above two comparison algorithms only with several given kernel widths, i.e., δ ∈ {0.1, 0.5, 0.9, 1, 5, 9, 10, 50, 100}, instead of kernel widths obtained by a certain optimization method, such as leave-one-out cross-validation [49], in a layer-by-layer way. Let us keep in mind that although we can fix the Gaussian kernel widths in a layer-by-layer way so that the runtime of DLF grows only linearly with the depth of the hierarchy of the deep FKNN network, DLF is still time-consuming. Therefore, we focus our attention on the main goal of this study and leave these issues for future investigation. In addition, we fix the regularization parameter C = 100 for SVM and the number of nearest neighbours K = 5 for KNN.
Both algorithms are implemented using MATLAB on a computer with an Intel Core 2 Duo P8600 2.4 GHz CPU and 2 GB RAM.
In order to evaluate the experimental results reasonably, two traditional evaluation indices, i.e., Accuracy and F1-measure (or Acc and F1 for simplicity, respectively [47]), are adopted, in which the accuracy is the proportion of correctly classified test samples and the F1-measure is the harmonic mean of precision and recall.
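Written out, and using TP, FP and FN for the numbers of true positives, false positives and false negatives of the class regarded as positive (a standard formulation, stated here for completeness):

    Acc = (number of correctly classified test samples) / (total number of test samples),
    Precision = TP / (TP + FP),  Recall = TP / (TP + FN),
    F1 = 2 · Precision · Recall / (Precision + Recall).

The "Positive F1" and "Negative F1" columns of Tables 3–8 are presumably the F1 values obtained by taking, in turn, each of the two classes of a task as the positive one.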
4.2. Performance analysis of both algorithms

4.2.1. Fix the dimension
In this subsection, the datasets from task 1 to task 3 are used to generate two feature spaces. The first uses the classical KPCA with Gaussian kernels to generate 16 features, and the second uses multi-layer KPCAs with Gaussian kernels to generate 16 features. The dimensionality of the resultant feature space was the same for both algorithms.
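The two feature constructions can be sketched as follows (illustrative Python/scikit-learn code rather than our MATLAB implementation; the number of layers of the multi-layer KPCAs and the per-layer widths shown here are placeholders):

    from sklearn.decomposition import KernelPCA

    def gaussian_kpca(delta, n_components=16):
        # Gaussian-kernel KPCA; delta is mapped to gamma = 1 / (2 * delta^2)
        return KernelPCA(n_components=n_components, kernel="rbf",
                         gamma=1.0 / (2.0 * delta ** 2))

    def single_kpca_features(X, delta=0.1):
        # Feature space 1: classical KPCA with a Gaussian kernel, 16 features
        return gaussian_kpca(delta).fit_transform(X)

    def multilayer_kpca_features(X, deltas=(0.1, 0.1, 0.1)):
        # Feature space 2: multi-layer KPCAs (the hidden layers of the deep FKNN);
        # each layer applies a Gaussian-kernel KPCA to the previous layer's output
        Z = X
        for delta in deltas:
            Z = gaussian_kpca(delta).fit_transform(Z)
        return Z

Since both feature spaces end up with the same fixed dimensionality (16), any difference in downstream classification accuracy can be attributed to the deeper transformation rather than to the number of features.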
The experimental results are illustrated in Tables 3–8, in which "Wins" indicates how often one technique outperforms the other; in cases where performance is the same, no winner is listed. From Tables 3–8, i.e., Task 1 to Task 3, we make the following observations.

(1) Focused on the index of "Wins" in Tables 3–8: While DLF is better on average for every kernel width, it is not always better than KPCA + LA. Table 3, for example, shows how often each algorithm beat the other for each value of the kernel width; in cases where both algorithms tie, neither is counted as winning (a short counting sketch is given after observation (2)(iii) below). We note that the multi-layer KPCAs not only win more frequently, but also win by a larger margin of accuracy and F1-measure.
(2) Compared with KPCA + LA on the best accuracy from task 1 to task 3:
(i) On Task 1 – coast & forest classification task: As shown in Table 3 (Task 1 with 16 × 16 pixels), the best accuracies of KPCA + LA with the different classifiers are 60.00% (SVM), 68.33% (KNN) and 58.33% (Naive Bayes), respectively. Compared with the KPCA + LA method, the best accuracies of our DLF with the different classifiers are 71.67% (SVM), 70.00% (KNN) and 73.33% (Naive Bayes), respectively. Similar experimental results can be observed in Table 4 (Task 1 with 32 × 32 pixels), i.e., the best accuracies of KPCA + LA vs DLF with the different classifiers are 66.67% vs 75.00% (SVM), 71.67% vs 75.00% (KNN) and 63.33% vs 78.33% (Naive Bayes), respectively.
Table 4
Resolutions used were 32 × 32 pixels on Task 1.

Kernel parameter    KPCA plus SVM (Acc / Positive F1 / Negative F1)    DLF plus SVM (Acc / Positive F1 / Negative F1)    Wins
δ = 0.1     0.5333  0.5172  0.5484     0.6333  0.6667  0.5926     0%:100%
δ = 0.5     0.6667  0.7436  0.5238     0.7500  0.7826  0.7059
δ = 0.9     0.6000  0.7073  0.3684     0.7500  0.7692  0.7273
δ = 1       0.5500  0.4706  0.6087     0.6667  0.6970  0.6296
δ = 5       0.4333  0.4688  0.3929     0.5333  0.5758  0.4815
δ = 9       0.4000  0.4375  0.3571     0.5333  0.6000  0.4400
δ = 10      0.4000  0.4375  0.3571     0.5333  0.6000  0.4400
δ = 50      0.4000  0.4375  0.3571     0.5333  0.6000  0.4400
δ = 100     0.5333  0.5172  0.5484     0.6333  0.6667  0.5926

Kernel parameter    KPCA plus KNN (Acc / Positive F1 / Negative F1)    DLF plus KNN (Acc / Positive F1 / Negative F1)    Wins
δ = 0.1     0.5333  0.3913  0.6216     0.6167  0.6349  0.5965     11.11%:88.88%
δ = 0.5     0.7167  0.6792  0.7463     0.7500  0.7692  0.7273
δ = 0.9     0.6500  0.6182  0.6769     0.7000  0.6667  0.7273
δ = 1       0.6667  0.6552  0.6774     0.7500  0.7170  0.7761
δ = 5       0.6667  0.6000  0.7143     0.6500  0.5333  0.7200
δ = 9       0.6167  0.4651  0.7013     0.6500  0.5333  0.7200
δ = 10      0.6333  0.5000  0.7105     0.6500  0.5333  0.7200
δ = 50      0.6333  0.5217  0.7027     0.6500  0.5333  0.7200
δ = 100     0.6333  0.5217  0.7027     0.6500  0.5333  0.7200

Kernel parameter    KPCA plus Naive Bayes (Acc / Positive F1 / Negative F1)    DLF plus Naive Bayes (Acc / Positive F1 / Negative F1)    Wins
δ = 0.1     0.3333  0       0.5000     0.5833  0.6914  0.3590     11.11%:55.55%
δ = 0.5     0.6333  0.5000  0.7105     0.7833  0.7347  0.8169
δ = 0.9     0.6000  0.4545  0.6842     0.7500  0.7458  0.7541
δ = 1       0.6000  0.4545  0.6842     0.7167  0.6909  0.7385
δ = 5       0.5500  0.4706  0.6087     0.6000  0.5000  0.6667
δ = 9       0.5667  0.4583  0.6389     0.5500  0.4706  0.6087
δ = 10      0.5500  0.4255  0.6301     0.5500  0.4706  0.6087
δ = 50      0.5500  0.4906  0.5970     0.5500  0.4706  0.6087
δ = 100     0.5500  0.4906  0.5970     0.5500  0.4706  0.6087
Table 5
Resolutions used were 16 × 16 pixels on Task 2.

Kernel parameter    KPCA plus SVM (Acc / Positive F1 / Negative F1)    DLF plus SVM (Acc / Positive F1 / Negative F1)    Wins
δ = 0.1     0.4000  0.4000  0.4000     0.5333  0.5882  0.4615     11.11%:77.77%
δ = 0.5     0.5000  0.5714  0.4000     0.7000  0.6897  0.7097
δ = 0.9     0.5500  0.5714  0.5263     0.7167  0.6909  0.7385
δ = 1       0.6333  0.5926  0.6667     0.6000  0.6000  0.6000
δ = 5       0.6333  0.6452  0.6207     0.7167  0.7213  0.7119
δ = 9       0.7000  0.7097  0.6897     0.7167  0.7213  0.7119
δ = 10      0.7167  0.7302  0.7018     0.7167  0.7213  0.7119
δ = 50      0.7167  0.7302  0.7018     0.7300  0.7419  0.7241
δ = 100     0.7000  0.7188  0.6786     0.7167  0.7213  0.7119

Kernel parameter    KPCA plus KNN (Acc / Positive F1 / Negative F1)    DLF plus KNN (Acc / Positive F1 / Negative F1)    Wins
δ = 0.5     0.5500  0.5091  0.5846     0.5833  0.6032  0.5614     0%:100%
δ = 0.9     0.5833  0.5614  0.6032     0.6500  0.6441  0.6557
δ = 1       0.5667  0.5517  0.5806     0.7167  0.7018  0.7302
δ = 5       0.6500  0.6667  0.6316     0.7000  0.7000  0.7000
δ = 9       0.6833  0.6780  0.6885     0.7000  0.7000  0.7000
δ = 10      0.6833  0.6780  0.6885     0.7000  0.7000  0.7000
δ = 50      0.6833  0.6780  0.6885     0.7000  0.7000  0.7000
δ = 100     0.6833  0.6780  0.6885     0.7000  0.7188  0.6786

Kernel parameter    KPCA plus Naive Bayes (Acc / Positive F1 / Negative F1)    DLF plus Naive Bayes (Acc / Positive F1 / Negative F1)    Wins
δ = 0.1     0.4500  0.1951  0.5823     0.5000  0.6250  0.2500     22.22%:66.66%
δ = 0.5     0.4667  0.3333  0.5556     0.6667  0.6875  0.6429
δ = 0.9     0.7000  0.7000  0.7000     0.7500  0.7368  0.7619
δ = 1       0.7000  0.7097  0.6897     0.7333  0.7143  0.7500
δ = 5       0.7333  0.7037  0.7576     0.7667  0.7667  0.7667
δ = 9       0.7167  0.7213  0.7119     0.7000  0.6667  0.7273
δ = 10      0.7167  0.7213  0.7119     0.7000  0.6667  0.7273
δ = 50      0.6833  0.6545  0.7077     0.7833  0.7719  0.7937
δ = 100     0.5500  0.4906  0.5970     0.5500  0.4706  0.6087
(ii) On Task 2 – inside city & tall building classification task: As shown in Table 5 (Task 2 with 16 × 16 pixels), the best accuracies of KPCA + LA with the different classifiers are 71.67% (SVM), 68.33% (KNN) and 73.33% (Naive Bayes), respectively. Compared with the KPCA + LA method, the best accuracies of our DLF with the different classifiers are 73.00% (SVM), 71.67% (KNN) and 78.33% (Naive Bayes), respectively. Similar experimental results can be observed in Table 6 (Task 2 with 32 × 32 pixels), i.e., the best accuracies of KPCA + LA vs DLF with the different classifiers are 71.67% vs 75.00% (SVM), 70.00% vs 78.33% (KNN) and 76.67% vs 80.00% (Naive Bayes), respectively.
(iii) On Task 3 – highway & street classification task: As shown in Table 7 (Task 3 with 16 × 16 pixels), the best accuracies of KPCA + LA with the different classifiers are 91.67% (SVM), 96.67% (KNN) and 90.00% (Naive Bayes), respectively. Compared with the KPCA + LA method, the best accuracies of our DLF with the different classifiers are 96.67% (SVM), 98.33% (KNN) and 100.00% (Naive Bayes), respectively. Similar experimental results can be observed in Table 8 (Task 3 with 32 × 32 pixels), i.e., the best accuracies of KPCA + LA vs DLF with the different classifiers are 90.00% vs 91.67% (SVM), 96.67% vs 100.00% (KNN) and 96.67% vs 98.33% (Naive Bayes), respectively.
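As an aside on how the "Wins" entries can be read: the reported ratios are consistent with simply counting, over the nine tested kernel widths, for how many widths each method attains the strictly higher accuracy, with ties counted for neither. The sketch below (illustrative Python, an assumption about the exact counting rule) reproduces the 33.33%:55.55% entry of Table 3 for KPCA plus SVM vs DLF plus SVM:

    def win_ratio(acc_a, acc_b):
        # Count, width by width, which method has the strictly higher accuracy;
        # ties are counted for neither method.
        wins_a = sum(a > b for a, b in zip(acc_a, acc_b))
        wins_b = sum(b > a for a, b in zip(acc_a, acc_b))
        n = len(acc_a)
        return "{:.2f}%:{:.2f}%".format(100.0 * wins_a / n, 100.0 * wins_b / n)

    # Accuracies from Table 3 for the nine kernel widths 0.1, 0.5, 0.9, 1, 5, 9, 10, 50, 100:
    kpca_svm = [0.5333, 0.6000, 0.4500, 0.5167, 0.5000, 0.4333, 0.4167, 0.4000, 0.4000]
    dlf_svm  = [0.7167, 0.6167, 0.6500, 0.6667, 0.4167, 0.4000, 0.4000, 0.4000, 0.4667]
    print(win_ratio(kpca_svm, dlf_svm))   # 3 and 5 wins out of 9 -> "33.33%:55.56%"
                                          # (the tables truncate 5/9 to 55.55%)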
Fig. 6. Performance of DLF and KPCA + LA with different feature dimensions on Task 1. (Three panels – SVM, KNN and Naive Bayes – plot Accuracy (%) against the number of extracted features for KPCA (δ = 0.1) and DLF (δ = 0.1).)
Table 6
Resolutions used were 32 × 32 pixels on Task 2.

Kernel parameter    KPCA plus SVM (Acc / Positive F1 / Negative F1)    DLF plus SVM (Acc / Positive F1 / Negative F1)    Wins
δ = 0.1     0.4000  0.4000  0.4000     0.5500  0.6582  0.3415     22.22%:66.66%
δ = 0.5     0.4833  0.5634  0.3673     0.7500  0.7761  0.7170
δ = 0.9     0.5000  0.2500  0.6250     0.6167  0.6349  0.5965
δ = 1       0.5500  0.4000  0.6400     0.6833  0.6984  0.6667
δ = 5       0.7167  0.7302  0.7018     0.6833  0.6984  0.6667
δ = 9       0.6333  0.6452  0.6207     0.7000  0.7097  0.6897
δ = 10      0.6500  0.6441  0.6557     0.7000  0.7097  0.6897
δ = 50      0.7167  0.7463  0.6792     0.7000  0.7097  0.6897
δ = 100     0.7167  0.7385  0.6909     0.7167  0.7302  0.7018

Kernel parameter    KPCA plus KNN (Acc / Positive F1 / Negative F1)    DLF plus KNN (Acc / Positive F1 / Negative F1)    Wins
δ = 0.1     0.4000  0.4000  0.4000     0.6167  0.5818  0.6462     22.22%:77.77%
δ = 0.5     0.6333  0.7027  0.5217     0.7833  0.7719  0.7937
δ = 0.9     0.5000  0.5313  0.4643     0.6500  0.6182  0.6769
δ = 1       0.5167  0.5538  0.4727     0.6333  0.6071  0.6563
δ = 5       0.6333  0.6563  0.6071     0.6833  0.6780  0.6885
δ = 9       0.6333  0.6452  0.6207     0.6667  0.6552  0.6774
δ = 10      0.6500  0.6557  0.6441     0.6667  0.6552  0.6774
δ = 50      0.7000  0.6897  0.7097     0.6667  0.6552  0.6774
δ = 100     0.7000  0.6897  0.7097     0.6833  0.6667  0.6984

Kernel parameter    KPCA plus Naive Bayes (Acc / Positive F1 / Negative F1)    DLF plus Naive Bayes (Acc / Positive F1 / Negative F1)    Wins
δ = 0.1     0.5000  0.6250  0.2500     0.5500  0.6824  0.2286     33.33%:55.55%
δ = 0.5     0.6500  0.7342  0.4878     0.6833  0.6780  0.6885
δ = 0.9     0.5667  0.6905  0.2778     0.7333  0.6923  0.7647
δ = 1       0.5833  0.6988  0.3243     0.7333  0.6923  0.7647
δ = 5       0.6667  0.6667  0.6667     0.8000  0.8000  0.8000
δ = 9       0.7500  0.7368  0.7619     0.7500  0.7368  0.7619
δ = 10      0.7667  0.7407  0.7879     0.7500  0.7368  0.7619
δ = 50      0.7500  0.7273  0.7692     0.7333  0.7143  0.7500
δ = 100     0.7500  0.7273  0.7692     0.7167  0.7018  0.7302
In summary, DLF achieves higher accuracies than the classical KPCA + LA in most cases with different kernel widths, and obtains a better best accuracy than the classical KPCA + LA in all cases on the different classification tasks. The experimental results confirm that the proposed DLF based image classification algorithm has the ability to mine deep classification knowledge from the image dataset. When we fix the kernel widths of the multi-layer KPCAs, the data obtained by DLF contain more information than the data obtained by a classical KPCA. In other words, this ingenious learning technique can concentrate richer information in the data of a fixed dimension. This is a main reason why the proposed algorithm is highly effective and outperforms the classical KPCA + LA method on the different image classification tasks.
Fig. 7. Performance of DLF and KPCA + LA with different feature dimensions on Task 2. (Three panels – SVM, KNN and Naive Bayes – plot Accuracy (%) against the number of extracted features for KPCA (δ = 0.1) and DLF (δ = 0.1).)
Table 7
Resolutions used were 16 × 16 pixels on Task 3.

Kernel parameter    KPCA plus SVM (Acc / Positive F1 / Negative F1)    DLF plus SVM (Acc / Positive F1 / Negative F1)    Wins
δ = 0.1     0.4500  0.4762  0.4211     0.7333  0.6522  0.7838     22.22%:66.66%
δ = 0.5     0.8167  0.7843  0.8406     0.8500  0.8696  0.8235
δ = 0.9     0.9167  0.9180  0.9153     0.9167  0.9123  0.9206
δ = 1       0.9500  0.9500  0.9500     0.9667  0.9655  0.9677
δ = 5       0.7667  0.7742  0.7586     0.9000  0.8966  0.9032
δ = 9       0.8500  0.8525  0.8475     0.8833  0.8814  0.8852
δ = 10      0.9167  0.9206  0.9123     0.8833  0.8814  0.8852
δ = 50      0.9167  0.9153  0.9180     0.8833  0.8814  0.8852
δ = 100     0.9000  0.8966  0.9032     0.9167  0.9153  0.9180

Kernel parameter    KPCA plus KNN (Acc / Positive F1 / Negative F1)    DLF plus KNN (Acc / Positive F1 / Negative F1)    Wins
δ = 0.1     0.5000  0.4000  0.5714     0.9000  0.9063  0.8929     0%:66.66%
δ = 0.5     0.8500  0.8525  0.8475     0.9333  0.9355  0.9310
δ = 0.9     0.9000  0.8889  0.9091     0.9667  0.9655  0.9677
δ = 1       0.9000  0.8889  0.9091     0.9833  0.9831  0.9836
δ = 5       0.9667  0.9677  0.9655     0.9833  0.9836  0.9831
δ = 9       0.9667  0.9677  0.9655     0.9667  0.9677  0.9655
δ = 10      0.9667  0.9677  0.9655     0.9667  0.9677  0.9655
δ = 50      0.9667  0.9677  0.9655     0.9667  0.9677  0.9655
δ = 100     0.9667  0.9677  0.9655     1       1       1

Kernel parameter    KPCA plus Naive Bayes (Acc / Positive F1 / Negative F1)    DLF plus Naive Bayes (Acc / Positive F1 / Negative F1)    Wins
δ = 0.1     0.5167  0.6667  0.1212     0.6833  0.7595  0.5366     0%:88.88%
δ = 0.5     0.9000  0.9091  0.8889     0.9500  0.9492  0.9508
δ = 0.9     0.8833  0.8923  0.8727     1       1       1
δ = 1       0.8833  0.8923  0.8727     0.9667  0.9677  0.9655
δ = 5       0.8333  0.8485  0.8148     0.8500  0.8475  0.8525
δ = 9       0.8667  0.8710  0.8621     0.8833  0.8852  0.8814
δ = 10      0.8667  0.8710  0.8621     0.8833  0.8852  0.8814
δ = 50      0.8667  0.8710  0.8621     0.8667  0.8710  0.8621
δ = 100     0.8667  0.8710  0.8621     0.9000  0.8966  0.9032
4.2.2. Fix the kernel width for KPCA
In this subsection, the classification performance of DLF is compared with that of KPCA + LA by extracting different numbers of features from the set {2, 4, 6, 8, 10, 12, 14, 16}. To save space, we only choose the kernel width δ = 0.1 and the original image size of 16 × 16 pixels in this experiment; the detailed results are illustrated in Figs. 6–8.
It can be seen from Figs. 6–8 that the classification performance of the proposed DLF-based image classification algorithm is better than that of the KPCA + LA method for the different extracted feature dimensions. The experimental results confirm that, with the kernel width fixed, the ingenious learning technique can concentrate richer information in the data for different dimensions. This is a main reason why the proposed DLF method is highly effective and outperforms the classical KPCA + LA method with different feature dimensions on the different image classification tasks.
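A compact sketch of this protocol is given below (again illustrative Python/scikit-learn rather than our MATLAB code; it is shown for the single-KPCA feature extractor with a 5-NN classifier, the multi-layer variant being analogous):

    from sklearn.decomposition import KernelPCA
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    def dimension_sweep(X_train, y_train, X_test, y_test, delta=0.1,
                        dims=(2, 4, 6, 8, 10, 12, 14, 16)):
        # Kernel width fixed at delta = 0.1; only the number of extracted features varies.
        gamma = 1.0 / (2.0 * delta ** 2)
        results = {}
        for d in dims:
            kpca = KernelPCA(n_components=d, kernel="rbf", gamma=gamma)
            Z_train = kpca.fit_transform(X_train)
            Z_test = kpca.transform(X_test)
            clf = KNeighborsClassifier(n_neighbors=5).fit(Z_train, y_train)
            results[d] = accuracy_score(y_test, clf.predict(Z_test))
        return results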
Fig. 8. Performance of DLF and KPCA + LA with different feature dimensions on Task 3. (Three panels – SVM, KNN and Naive Bayes – plot Accuracy (%) against the number of extracted features for KPCA (δ = 0.1) and DLF (δ = 0.1).)
Table 8
Resolutions used were 32 × 32 pixels on Task 3.

Kernel parameter    KPCA plus SVM (Acc / Positive F1 / Negative F1)    DLF plus SVM (Acc / Positive F1 / Negative F1)    Wins
δ = 0.1     0.4000  0.4000  0.4000     0.6833  0.7164  0.6415     11.11%:88.88%
δ = 0.5     0.7000  0.7692  0.5714     0.8833  0.8679  0.8955
δ = 0.9     0.8167  0.8406  0.7843     0.9000  0.8889  0.9091
δ = 1       0.8333  0.8571  0.8000     0.9167  0.9091  0.9231
δ = 5       0.8667  0.8750  0.8571     0.9000  0.8966  0.9032
δ = 9       0.8000  0.8125  0.7857     0.8833  0.8814  0.8852
δ = 10      0.7833  0.7636  0.8000     0.8833  0.8814  0.8852
δ = 50      0.9000  0.9000  0.9000     0.8833  0.8814  0.8852
δ = 100     0.8833  0.8814  0.8852     0.9000  0.8966  0.9032

Kernel parameter    KPCA plus KNN (Acc / Positive F1 / Negative F1)    DLF plus KNN (Acc / Positive F1 / Negative F1)    Wins
δ = 0.1     0.4000  0.4000  0.4000     0.7000  0.6786  0.7188     0%:100%
δ = 0.5     0.8333  0.8214  0.8438     0.9667  0.9677  0.9655
δ = 0.9     0.8167  0.8136  0.8197     0.9667  0.9677  0.9655
δ = 1       0.7833  0.8000  0.7636     0.9333  0.9333  0.9333
δ = 5       0.9667  0.9677  0.9655     0.9833  0.9836  0.9831
δ = 9       0.9667  0.9677  0.9655     1       1       1
δ = 10      0.9500  0.9524  0.9474     0.9833  0.9836  0.9831
δ = 50      0.9333  0.9375  0.9286     0.9500  0.9524  0.9474
δ = 100     0.9333  0.9375  0.9286     0.9667  0.9677  0.9655

Kernel parameter    KPCA plus Naive Bayes (Acc / Positive F1 / Negative F1)    DLF plus Naive Bayes (Acc / Positive F1 / Negative F1)    Wins
δ = 0.1     0.4500  0.6207  0          0.5000  0.6250  0.2500     11.11%:66.66%
δ = 0.5     0.8000  0.7500  0.8333     0.9167  0.9123  0.9206
δ = 0.9     0.9500  0.9492  0.9508     0.9833  0.9836  0.9831
δ = 1       0.9667  0.9655  0.9677     0.9667  0.9677  0.9655
δ = 5       0.8833  0.8852  0.8814     0.9333  0.9310  0.9355
δ = 9       0.8500  0.8657  0.8302     0.8667  0.8710  0.8621
δ = 10      0.8333  0.8485  0.8148     0.8667  0.8710  0.8621
δ = 50      0.8667  0.8710  0.8621     0.8667  0.8710  0.8621
δ = 100     0.8667  0.8710  0.8621     0.8500  0.8475  0.8525
5. Conclusions and future works

In this paper, the architecture of feedforward kernel neural networks (FKNNs) is proposed to cover a considerably large family of existing feedforward neural networks. The first central result of this study is that, with the architecture of a FKNN network fixed, a KPCA is implicitly performed and hence its learning may be hidden-layer-tuning-free; accordingly, a generalized LLM is developed for training the network on datasets. It is also revealed that the rigorous Mercer kernel condition is not required in FKNNs. The second central result of this study is that a deep FKNN architecture is developed with the explicit execution of multi-layer KPCAs, and its deep learning framework DLF, as an alternative form of deep learning, has a strong theoretical guarantee. While many authors have claimed that deep learning can be performed by data transformation plus a learning algorithm in the last layer of a deep structure, it has never been demonstrated why this property holds. Our experimental results indicate that the proposed FKNN's deep architecture and its DLF based learning algorithm indeed lead to enhanced performance in image classification.

Future works will mainly focus on the proposed deep FKNN network, including: (1) instead of the several given kernel widths used in our experiments, we should attempt to determine their appropriate values by using a certain optimization method, as pointed out in the fourth step of DLF, in the near future; (2) our preliminary experimental results demonstrate that the DLF-based learning of a deep FKNN network is time-consuming, so how to speed up its learning is an interesting topic; (3) as shown in Fig. 4, the proposed deep FKNN network can provide a wide range of feature representations at varying levels of abstraction, so how to improve the current DLF version such that its learning can synthesize feature representations at varying levels of abstraction is another interesting topic; (4) as shown in Fig. 4, another interesting topic is how to improve the current DLF version such that it can share the common transformed data space for multiple tasks; (5) although the proposed deep FKNN networks and the deep learning framework DLF have been justified by our preliminary results on image classification, their effectiveness should be further verified by exploring more of their applications and by comparative studies with existing deep learning methods.

Acknowledgements

This work was supported in part by the Hong Kong Polytechnic University under Grant G-UA3W, and by the National Natural Science Foundation of China under Grants 61272210, 2015NSFC, and by the Natural Science Foundation of Jiangsu Province under
Grant BK2011003, BK2011417, JiangSu 333 Expert Engineering under Grant BRA2011142, the Fundamental Research Funds for the Central Universities under Grant JUSRP111A38 and 2013 Postgraduate Student's Creative Research Fund of Jiangsu Province.

References

[1] K. Hornik, Approximation capabilities of feedforward networks, Neural Netw. (1991) 251–257.
[2] M. Leshno, V.Y. Lin, A. Pinkus, S. Schocken, Multilayer feedforward networks with a nonpolynomial activation function can approximate any function, Neural Netw. 6 (1993) 861–867.
[3] K.L. Du, M.N.S. Swamy, Neural Networks in a Softcomputing Framework, Springer-Verlag, 2006.
[4] G.B. Huang, X. Ding, H. Zhou, Optimization method based extreme learning machine for classification, Neurocomputing 74 (1–3) (2010) 155–163.
[5] G.B. Huang, L. Chen, C.-K. Siew, Universal approximation using incremental constructive feedforward networks with random hidden nodes, IEEE Trans. Neural Netw. 17 (4) (2006) 879–892.
[6] G.B. Huang, Q.Y. Zhu, C.-K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (2006) 489–501.
[7] H.J. Rong, G.B. Huang, P. Saratchandran, N. Sundararajan, On-line sequential fuzzy extreme learning machine for function approximation and classification problems, IEEE Trans. Syst. Man Cybern. Part B 39 (4) (2009) 1067–1072.
[8] G. Feng, G.-B. Huang, Q. Lin, R. Gay, Error minimized extreme learning machine with growth of hidden nodes and incremental learning, IEEE Trans. Neural Netw. 20 (8) (2009) 1352–1357.
[9] Q.Y. Zhu, A.K. Qin, P.N. Suganthan, G.-B. Huang, Evolutionary extreme learning machine, Pattern Recognit. 38 (1) (2005) 1759–1763.
[10] M. Yoan, A. Sorjamaa, P. Bas, O. Simula, C. Jutten, A. Lendasse, OP-ELM: optimally pruned extreme learning machine, IEEE Trans. Neural Netw. 21 (1) (2010) 158–162.
[11] G. Huang, H. Zhou, X. Ding, R. Zhang, Extreme learning machine for regression and multiclass classification, IEEE Trans. Syst. Man Cybern. Part B 42 (2) (2012) 513–529.
[12] H.T. Huynh, Y. Won, Extreme learning machine with fuzzy activation function, in: Proc. NCM'2009, 2009, pp. 303–307.
[13] H.X. Tian, Z.Z. Mao, An ensemble ELM based on modified AdaBoost.RT algorithm for predicting the temperature of molten steel in ladle furnace, IEEE Trans. Autom. Sci. Eng. 7 (1) (2008) 73–80.
[14] S. Wang, F.-L. Chung, J. Wang, J. Wu, A fast learning method for feedforward neural networks, Neurocomputing 149 (2015) 295–307.
[15] J. Wu, S.T. Wang, F.L. Chung, Positive and negative fuzzy rule systems, extreme learning machine and image classification, J. Mach. Learn. Cybern. 2 (4) (2011) 261–271.
[16] S.T. Wang, K.F.L. Chung, Z.H. Deng, D.W. Hu, Robust fuzzy clustering neural network based on epsilon-insensitive loss function, Appl. Soft Comput. 7 (2) (2007) 577–584.
[17] S.T. Wang, D. Fu, M. Xu, D.W. Hu, Advanced fuzzy cellular neural network: application to CT liver images, Artif. Intell. Med. 39 (1) (2007) 66–77.
[18] D. Achlioptas, F. McSherry, B. Schölkopf, Sampling techniques for kernel methods, in: NIPS 2001, 2001, pp. 335–342.
[19] B. Schölkopf, A. Smola, Learning with Kernels, MIT Press, Cambridge, MA, 2002.
[20] J. Friedman, Multivariate adaptive regression splines (with discussion), Ann. Stat. 19 (1) (1991) 1–141.
[21] S. Mitaim, B. Kosko, The shape of fuzzy sets in adaptive function approximation, IEEE Trans. Fuzzy Syst. 9 (2001) 637–656.
[22] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, University Press, Cambridge, 2004.
[23] C. Zhang, F. Nie, S. Xiang, A general kernelization framework for learning algorithms based on kernel PCA, Neurocomputing 73 (4–6) (2010) 959–967.
[24] J. Yang, D. Zhang, A.F. Frangi, J. Yang, Two-dimensional PCA: a new approach to appearance-based face representation and recognition, IEEE Trans. Pattern Anal. Mach. Intell. 26 (1) (2004) 131–137.
[25] X.M. Wang, F.L. Chung, S.T. Wang, On minimum class locality preserving variance support vector machine, Pattern Recognit. 3 (2010) 2753–2762.
[26] X.M. Wang, Research on Dimension-Reduction and Classification Techniques in Intelligent Computation (Ph.D. thesis), JiangNan University, 2010.
[27] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.
[28] S.T. Wang, F.L. Chung, J. Wu, J. Wang, Least learning machine and its experimental studies on regression capability, Appl. Soft Comput. 21 (2014) 677–684.
[29] I.W.H. Tsang, J.T.Y. Kwok, J.M. Zurada, Generalized core vector machines, IEEE Trans. Neural Netw. 17 (5) (2006) 1126–1140.
[30] S.T. Wang, J. Wang, F.L. Chung, Kernel density estimation, kernel methods, and fast learning in large data sets, IEEE Trans. Cybern. 44 (1) (2014) 1–20.
[31] K. Hajebi, Y. Abbasi-Yadkori, H. Shahbazi, H. Zhang, Fast approximate nearest-neighbor search with k-nearest neighbor graph, in: Twenty-Second International Joint Conference on Artificial Intelligence, 2011, pp. 1312–1317.
[32] Z.H. Deng, F.L. Chung, S.T. Wang, FRSDE: fast reduced set density estimator using minimal enclosing ball approximation, Pattern Recognit. 41 (4) (2008) 1363–1372.
[33] P.J. Qian, F.L. Chung, S.T. Wang, Z.H. Deng, Fast graph-based relaxed clustering for large data sets using minimal enclosing ball, IEEE Trans. Syst. Man Cybern. Part B 42 (3) (2012) 672–687.
[34] W.J. Hu, K.F.L. Chung, S.T. Wang, The maximum vector-angular margin classifier and its fast training on large datasets using a core vector machine, Neural Netw. 27 (2012) 60–73.
[35] B. Mitchell, J. Sheppard, Deep structure learning: beyond connectionist approaches, in: Proc. 11th International Conference on Machine Learning and Applications, 2012, pp. 162–167.
[36] Y. Cho, L. Saul, Kernel methods for deep learning, in: Y. Bengio, D. Schuurmans, C. Williams, J. Lafferty, A. Culotta (Eds.), Advances in Neural Information Processing Systems 22 (NIPS'09), 2010, pp. 342–350.
[37] G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313 (2006) 504–507.
[38] Y. Bengio, Learning deep architectures for AI, Found. Trends Mach. Learn. 2 (1) (2009) 1–127 (also published as a book, Now Publishers, 2009).
[39] Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle, Greedy layer-wise training of deep belief networks, in: Advances in Neural Information Processing Systems 19 (NIPS'06), 2007, pp. 153–160.
[40] D. Erhan, Y. Bengio, A. Courville, P. Manzagol, P. Vincent, Why does unsupervised pre-training help deep learning? J. Mach. Learn. Res. 11 (2010) 625–660.
[41] W. Liao, A. Pizurica, W. Philips, Y. Pi, A fast iterative kernel PCA feature extractor for hyperspectral images, in: Proc. 2010 IEEE 17th International Conference on Image Processing, 2010, pp. 26–29.
[42] F. Wang, B. Zhao, C.S. Zhang, Linear time maximum margin clustering, IEEE Trans. Neural Netw. 21 (2) (2010) 319–332.
[43] A. Oliva, A. Torralba, Modeling the shape of the scene: a holistic representation of the spatial envelope, Int. J. Comput. Vision 42 (3) (2001) 145–175.
[44] R. Duda, P. Hart, D. Stork, Pattern Classification, Wiley, New York, 2000.
[45] C. Cortes, V. Vapnik, Support vector networks, Mach. Learn. 20 (3) (1995) 273–297.
[46] D. Grossman, P. Domingos, Learning Bayesian network classifiers by maximizing conditional likelihood, in: Proc. of the Twenty-first International Conference on Machine Learning, ACM, 2004, p. 46.
[47] G. Li, K. Chang, S.C.H. Hoi, Multi-view semi-supervised learning with consensus, IEEE Trans. Knowl. Data Eng. 24 (11) (2012) 2040–2051.
[48] Y. Liu, J. Chen, Correntropy kernel learning for nonlinear system identification with outliers, Ind. Eng. Chem. Res. 53 (13) (2014) 5248–5260.
[49] Y. Liu, Z. Gao, P. Li, H. Wang, Just-in-time kernel learning with adaptive parameter selection for soft sensor modeling of batch processes, Ind. Eng. Chem. Res. 51 (11) (2012) 4313–4327.