Feedforward kernel neural networks, generalized least learning machine and its deep learning with application to image classification
Article history:
Received 5 February 2015
Received in revised form 12 June 2015
Accepted 29 July 2015
Available online 20 August 2015

Keywords:
Feedforward kernel neural networks
Least learning machine
Kernel principal component analysis (KPCA)
Hidden-layer-tuning-free learning
Deep architecture and learning

Abstract

In this paper, the architecture of feedforward kernel neural networks (FKNN) is proposed, which includes a considerably large family of existing feedforward neural networks and hence can meet most practical requirements. Different from the common understanding of learning, it is revealed that when the number of hidden nodes in every hidden layer and the type of the adopted kernel-based activation functions are pre-fixed, a special kernel principal component analysis (KPCA) is always implicitly executed. Consequently, all the hidden layers of such networks need not be tuned: their parameters can be randomly assigned and may even be independent of the training data. Therefore, the least learning machine (LLM) is extended into its generalized version, in the sense of adopting a much wider family of error functions than the mean squared error (MSE) function only. As an additional merit, it is also revealed that the rigorous Mercer kernel condition is not required in FKNN networks. When the proposed architecture of FKNN networks is constructed in a layer-by-layer way, i.e., the number of hidden nodes in every hidden layer is determined in terms of the principal components extracted by an explicitly executed KPCA, we can develop FKNN's deep architecture such that its deep learning framework (DLF) has a strong theoretical guarantee. Our experimental results on image classification show that the proposed FKNN deep architecture and its DLF-based learning indeed enhance classification performance.

© 2015 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.asoc.2015.07.040
126 S. Wang et al. / Applied Soft Computing 37 (2015) 125–141
approximators with probability one. Therefore, the least learning machine (LLM) [28], as a generalized version of the extreme learning machine (ELM) [4–13], can be further extended into its generalized version in the sense of adopting many more error functions instead of only the MSE function. (2) When the proposed architecture of FKNN networks is constructed in a layer-by-layer way, i.e., the number of hidden nodes in every hidden layer may be determined after the explicit execution of a KPCA, we can develop FKNN's deep architecture such that its deep learning has a strong theoretical guarantee with enhanced performance for image classification. The more detailed contributions of this work can be summarized as follows:

(1) Given the proposed architecture of FKNN networks with a pre-fixed number of hidden nodes in every hidden layer and a pre-fixed type of the adopted kernel-based activation functions, we reveal that for any hidden layer with its randomly assigned kernel parameter vectors, a special KPCA is implicitly performed, and hence all the hidden layers may be tuning-free. With this special KPCA, we can justify an appropriate number of hidden nodes in a hidden layer in terms of the rank of the covariance matrix of the kernelized transformed dataset. This is a novel method of estimating an appropriate number of hidden nodes in a hidden layer, which can also be used to judge whether a pre-fixed number of hidden nodes is appropriate or not.
(2) We reveal that the proposed feedforward neural networks behave like kernel methods, and hence their learning can be separated into two independent parts: an implicitly executed KPCA plus a learning algorithm LA on the transformed data. However, the rigorous Mercer kernel condition is not required. Theoretical results on kernel methods can help us answer why ELM, LLM and its generalized version here outperform BP-like learning algorithms for feedforward neural networks. Unlike ELM and LLM, which are based only on the MSE error function, the generalized LLM here can adopt various error functions.
(3) When FKNN networks are constructed in a layer-by-layer way, i.e., the number of hidden nodes in every hidden layer is only determined after the explicit execution of a KPCA, we develop a deep architecture of FKNN networks with its deep learning framework DLF. Since this novel deep learning framework is built in terms of KLA = multi-layer KPCAs + LA, the deep learning has a rigorous theoretical guarantee for the first time. We show that training this deep architecture can be finished within O(N) time complexity.
(4) When less abstract features are required, a lower level of the hierarchy in the proposed deep architecture can provide us with a range of feature representations at varying levels of abstraction. Another advantage is that a lower level of the hierarchy can be used as the shared transformed data space for different tasks. Let us keep in mind that multi-layer feedforward neural networks with BP-like learning cannot have this advantage.
(5) The effectiveness of the proposed deep FKNN architecture and its DLF-based learning algorithm is confirmed by our experiments on image classification.

The rest of this paper is organized as follows. In Section 2, we define the proposed FKNN network and its architecture. We prove that training such a FKNN network may be hidden-layer-tuning-free and then develop the proposed generalized LLM. In Section 3, we propose a deep FKNN architecture and its deep learning framework DLF. We also investigate the DLF-based image classification technique. In Section 4, we report our experimental results on DLF-based image classification and confirm its superiority over the classical technique in which a KPCA plus a learning algorithm is adopted. Section 5 concludes the paper; future works are also given in this section.

2. FKNNs and generalized LLM

In this study, we consider the following feedforward kernel neural networks (FKNNs hereafter), as shown in Fig. 1. A FKNN network contains the input layer, in which the input x = (x_1, x_2, ..., x_d)^T ∈ R^d; L hidden layers, in which each hidden node in each layer takes the same infinitely differentiable kernel function with different parameters as its activation function; and the output layer, in which the output y of the FKNN network is expressed as a linear combination of the m activation functions in the last hidden layer.
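As an illustrative sketch of this hidden-layer-tuning-free scheme (ours, not part of the paper's implementation; all names and settings below are hypothetical), the following builds a single-hidden-layer FKNN whose Gaussian kernel parameter vectors are randomly assigned, and learns only the output weights by regularized least squares, in the spirit of LLM/ELM with the MSE error function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: N samples in R^d.
N, d, m = 200, 3, 50
X = rng.normal(size=(N, d))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]

# Hidden layer: m Gaussian kernel activations g(x, theta_j) whose centers
# (the kernel parameter vectors) are randomly assigned and never tuned.
centers = rng.normal(size=(m, d))
sigma2 = 2.0 * d

G = np.exp(-((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1) / sigma2)  # N x m

# Only the output layer is learned: regularized least squares (the MSE criterion).
lam = 1e-3
alpha = np.linalg.solve(G.T @ G + lam * np.eye(m), G.T @ y)
train_mse = np.mean((G @ alpha - y) ** 2)
print(G.shape, train_mse < np.mean(y ** 2))
```

Note that no iteration over the hidden layer ever occurs: the only learning step is one linear solve for the output weights.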
[Fig. 1. The architecture of a FKNN network: input layer (x_1, x_2, ..., x_d), hidden layers of kernel-based activation functions, and output layer with weights α_1, α_2, ..., α_m. The figure also poses the question: what is an appropriate number of hidden nodes, with the chosen type of activation functions, at every hidden layer?]

Proof. Let us observe the covariance matrix C = (1/N) X̄ X̄^T. Its every element is c_ij = Σ_{k=1}^{N} x̄_ki x̄_kj (i, j = 1, 2, ..., N), where x̄_ki, x̄_kj denote the ith and jth components of x_k − x̄, respectively. For the matrix D = (1/N) G G^T for the single hidden layer of a FKNN network, its every element is D_ij = Σ_{k=1}^{N} g(x_k − x̄, θ_i) g(x_k − x̄, θ_j) (i, j = 1, 2, ..., N). Therefore, this theorem holds true if we simply take the kernel feature mapping function φ(x) = (g(x, θ_1), g(x, θ_2), ..., g(x, θ_m)). □
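The identification in the proof can be checked numerically. The sketch below (ours; names and settings are illustrative) builds the explicit feature map φ(x) = (g(x, θ_1), ..., g(x, θ_m)) from a random Gaussian hidden layer and verifies that the nonzero spectrum of the matrix D = (1/N) Ḡ Ḡ^T coincides with that of the covariance of the mapped data, i.e., PCA on the hidden-layer outputs is exactly a KPCA with feature map φ.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, m = 30, 4, 10
X = rng.normal(size=(N, d))
thetas = rng.normal(size=(m, d))     # randomly assigned kernel parameter vectors

def g(x, theta):
    # Gaussian kernel activation g(x, theta).
    return np.exp(-np.sum((x - theta) ** 2) / 4.0)

# Hidden-layer outputs = explicit feature map phi(x) = (g(x, theta_1), ..., g(x, theta_m)).
Phi = np.array([[g(x, t) for t in thetas] for x in X])
Phi_c = Phi - Phi.mean(axis=0)       # centered, as in the proof (x_k - x_bar)

C = Phi_c.T @ Phi_c / N              # m x m covariance of the mapped data
D = Phi_c @ Phi_c.T / N              # N x N matrix of the proof

# Nonzero eigenvalues of C and D coincide: the untuned hidden layer
# implicitly performs a KPCA with feature mapping phi.
ev_C = np.sort(np.linalg.eigvalsh(C))[::-1][:m]
ev_D = np.sort(np.linalg.eigvalsh(D))[::-1][:m]
print(np.allclose(ev_C, ev_D, atol=1e-8))
```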
Table 1
Representative feedforward neural networks.

Types of feedforward neural networks:
Sigmoidal feedforward neural networks, radial basis function networks, wavelet feedforward neural networks, principal component analysis networks, competitive learning networks, clustering neural networks, evolving neural networks, SOM networks, fuzzy neural networks (see [3,15–17]), and positive and negative fuzzy rule systems (see [15]).

Activation functions with kernels [3,18,19]:
1) Most frequently used sigmoidal functions, such as 1/(1 + e^{−x^T x_i + b}) and (1 − exp(−x^T x_i))/(1 + exp(−x^T x_i));
2) Decaying RBF functions, such as the Gaussian function e^{−||x − x_i||²/σ²};
3) Mexican hat wavelet function (2/√3) π^{−1/4} (1 − ||x − x_i||²/σ²) e^{−||x − x_i||²/σ²};
4) B-spline basis functions, such as those in [20];
5) Morlet wavelet function (2/√3) e^{−||x − x_i||²/σ²} cos(5||x − x_i||²/σ²);
6) Kernel-based fuzzy basis functions, which use the product inference with fuzzy membership functions for every dimension j (j = 1, 2, ..., d), including [21]:
   6.1) the clipped-parabola (quadratic) set function with center m_j and the same width d:
        a_j(x) = 1 − ((x − m_j)/d)²  if |(x − m_j)/d| < 1,  and a_j(x) = 0 otherwise;
   6.2) the Gaussian set function, which depends on the mean m_j and the same standard deviation d/√2:
        a_j(x) = exp(−((x − m_j)/d)²).

Error functions (see [3]):
1) MSE, i.e., e = (1/N) Σ_{i=1}^{N} (y_i − ȳ_i)², where y_i is the output of a FKNN network and ȳ_i is the actual output of the training sample;
2) e = (1/2) Σ_{i=1}^{N} [(1 + y_i) ln((1 + y_i)/(1 + ȳ_i)) + (1 − y_i) ln((1 − y_i)/(1 − ȳ_i))];
7) Huber's function: e = Σ_{i=1}^{N} e_i, with e_i = (1/2)(y_i − ȳ_i)² if |y_i − ȳ_i| ≤ β, and e_i = β|y_i − ȳ_i| − (1/2)β² if |y_i − ȳ_i| > β;
8) Talwar's function: e = Σ_{i=1}^{N} e_i, with e_i = (1/2)(y_i − ȳ_i)² if |y_i − ȳ_i| ≤ β, and e_i = (1/2)β² if |y_i − ȳ_i| > β;
9) Hampel's tanh estimator: e = Σ_{i=1}^{N} e_i, with
   e_i = (1/2)(y_i − ȳ_i)²  if |y_i − ȳ_i| ≤ β_1,
   e_i = (1/2)β_1² − (2c_1/c_2) ln[(1 + e^{c_2(β_2 − |y_i − ȳ_i|)})/(1 + e^{c_2(β_2 − β_1)})] − c_1(|y_i − ȳ_i| − β_1)  if β_1 < |y_i − ȳ_i| ≤ β_2,
   e_i = (1/2)β_1² − (2c_1/c_2) ln[2/(1 + e^{c_2(β_2 − β_1)})] − c_1(β_2 − β_1)  if |y_i − ȳ_i| > β_2;
10) e = (1/N) Σ_{p=1}^{N} min_{1≤k≤K} ||x_p − c_k||² for competitive neural networks, e.g., clustering neural networks;
11) the KPCA network's MSE function: E[||e_i||²] with e_i = ||x_t − x̂_i(t)||, or 1^T E[h(e_i)] = E[||h(e_i)||²], where h(e_i) is a real function;
12) the correntropy-based error (CE) criterion function [48]: CE_{S_val}(e_i) = (1/N_val) Σ_{i=1}^{N_val} K_σ(e_i) = (1/N_val) Σ_{i=1}^{N_val} exp(−e_i²/(2σ²)), where e_i = y_i − ȳ_i.

Learning algorithms:
BP and its variants, genetic learning methods, Hebbian-learning methods, and so on; see [3].
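To make the robust error functions of Table 1 concrete, here is a small sketch (ours, not from the paper) of the Huber and Talwar per-sample losses; β is the cutoff parameter.

```python
import numpy as np

def huber(r, beta):
    # e_i = r^2/2 if |r| <= beta, else beta*|r| - beta^2/2 (linear growth in the tails)
    r = np.abs(r)
    return np.where(r <= beta, 0.5 * r ** 2, beta * r - 0.5 * beta ** 2)

def talwar(r, beta):
    # e_i = r^2/2 if |r| <= beta, else beta^2/2 (errors beyond beta are capped)
    r = np.abs(r)
    return np.where(r <= beta, 0.5 * r ** 2, 0.5 * beta ** 2)

residuals = np.array([0.5, 1.0, 3.0])
print(huber(residuals, beta=1.0))   # [0.125 0.5   2.5  ]
print(talwar(residuals, beta=1.0))  # [0.125 0.5   0.5  ]
```

Both coincide with the MSE criterion for small residuals but down-weight outliers, which is exactly why they are candidates for the generalized LLM.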
this is the best theoretical result about the convergence speed of current learning algorithms for feedforward neural networks with infinitely differentiable kernel functions in application scenarios where N is huge but N < m³. Let us keep in mind that, because the kernel parameter vectors in the hidden layer are randomly assigned, the number m of hidden nodes is generally a comparatively big value (e.g., 1000 in [11]), and N being larger than m³ (e.g., 10⁶) may seldom appear in application scenarios (imagine having 10⁶ training patterns!); therefore, the above alternative LLM in Eq. (4) has considerable applicability in practical scenarios. When N is larger than m³, we can take LLM or ELM with the same strategy as in [11] (i.e., Eq. (38) in [11]).

Remark 5. As seen in [6,11,28], numerous experimental results indicate that ELM and LLM outperform the conventional SVM and SVR in approximation accuracy. Let us reveal the reason from a new perspective. As a competitive model of ELM and LLM, an alternative LLM in Eq. (4) can be readily derived as

min (1/2) Σ_{i=1}^{N} Σ_{k=1}^{N} (α_i − α_i^*)(α_k − α_k^*) Σ_{j=1}^{m} g(x_i, θ_j) g(x_k, θ_j) − Σ_{i=1}^{N} y_i (α_i − α_i^*) + ε Σ_{i=1}^{N} (α_i + α_i^*)    (5)

s.t. Σ_{i=1}^{N} (α_i − α_i^*) = 0,  0 ≤ α_i, α_i^* ≤ C.

Let us recall the dual of the conventional SVR with the same Vapnik ε-insensitive loss function [22,29]:

min (1/2) Σ_{i=1}^{N} Σ_{k=1}^{N} (α_i − α_i^*)(α_k − α_k^*) k(x_i, x_k) − Σ_{i=1}^{N} y_i (α_i − α_i^*) + ε Σ_{i=1}^{N} (α_i + α_i^*)    (6)

s.t. Σ_{i=1}^{N} (α_i − α_i^*) = 0,  0 ≤ α_i, α_i^* ≤ C.

By comparing Eq. (5) with Eq. (6), we can easily find that the alternative LLM in Eq. (4), and hence ELM and LLM, in essence employ the special kernel combination Σ_{j=1}^{m} g(x_i, θ_j) g(x_k, θ_j) rather than only a single kernel k(x_i, x_k) as in the conventional SVR. As we know very well, kernel combinations can generally enhance the performance of kernel methods. Therefore, the above observation actually helps us answer why ELM and LLM experimentally outperform the conventional SVM and SVR.

Now, let us observe a multi-layer FKNN network. In terms of Theorem 1, for a multi-layer FKNN network, we can easily see that a KPCA implicitly behaves between the first hidden layer and the input layer, and another KPCA implicitly behaves between the first and second hidden layers. This process repeats until the last hidden layer in a multi-layer FKNN network. In other words, there implicitly exist multi-layer KPCAs, which may be viewed as a special KPCA with a successive nonlinear combination of the kernel feature mapping functions between the input layer and the last hidden layer. Thus, according to the above theoretical analysis, we may realize a LA between the output layer and the last hidden layer, with randomly assigned kernel parameter vectors in every hidden layer. When we understand the behavior of multi-layer FKNN networks in this way, we can see the same benefits [14,28] as for LLM, which are summarized as follows:

(1) All the parameters in the kernel activation functions among all the hidden layers can be randomly assigned, all the patterns in the kernel activation functions among all the hidden layers can be randomly selected, and, without any iterative calculation, the input data in every hidden layer are transformed into the next hidden layer only once in a forward layer-by-layer way. Hence, a multi-layer FKNN network has the advantages of both easy implementation and very fast learning capability, compared with BP-like learning algorithms, where the parameters in the network need to be iteratively adjusted in a backward gradient-descent way, such that BP-like learning algorithms generally converge very slowly and sometimes even converge to local minima.
(2) Since all the parameters of all the hidden nodes can be randomly assigned, no dependence of parameters between different hidden layers exists.
(3) In fact, a multi-layer FKNN network here views the behavior of a feedforward neural network between the last hidden layer and the input layer as a successive encoding procedure for the input data in a difficult-to-understand way, due to the natural existence of KPCAs. When we understand the training behavior of FKNN networks from this new perspective, as shown in Fig. 1, to a large extent this architecture can help us answer why feedforward neural networks behave like a black box.

Please note that the generalized LLM's power can easily be justified by the rich experimental results of LLM, ELM and linear SVM/SVR in [6,11,13,28,30–34]; we therefore do not report experimental results on the generalized LLM in the experimental section, for the sake of the space of the paper.

3. Deep FKNN network and its deep learning

In this section, we will state that a deep FKNN architecture can be built in a layer-by-layer way. Although a concrete deep FKNN architecture is application-oriented and data-dependent, its deep learning algorithm can always be characterized by a general deep learning framework DLF. Based on the proposed framework DLF, we explore deep FKNN's application in image classification.

3.1. Deep FKNN network and DLF

In [14,28], based on multi-layer KPCAs, we proposed the architecture of multi-layer feedforward neural networks and its LLM/ELM-based fast learning methods. In the last section, we actually extended the conclusions in [11,14,15,28] and pointed out that more learning algorithms can be adopted in the hidden-layer-tuning-free mechanism of multi-layer feedforward neural networks. In this section, we will demonstrate that multi-layer feedforward neural networks with multi-layer KPCAs + LA can be used for deep learning.

One may cheer that, in terms of the seemingly surprising conclusions in the last section and the LLM/ELM theory in [14], even if all the kernel parameter vectors in FKNN networks are tuning-free (i.e., randomly assigned) and learning is only required in the output layer of FKNN networks, FKNNs may become universal approximators with probability one. However, we should keep in mind that when learning in this way, FKNN networks actually behave like shallow architectures such as SVMs. When we consider how to assign appropriate kernel parameter vectors by explicitly executing a KPCA in every hidden layer of FKNN networks, FKNN networks will behave like deep architectures, which can help us represent complex data. Deep architectures learn complex mappings by transforming their input data through multiple layers of nonlinear processing.
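The special kernel combination Σ_{j=1}^{m} g(x_i, θ_j) g(x_k, θ_j) contrasted with Eq. (6) is simply the Gram matrix G G^T of the hidden-layer outputs. The quick check below (ours; the sigmoidal g is only an illustration) shows why no Mercer condition on g itself is needed: as a Gram matrix of explicit features, the combined kernel matrix is positive semi-definite for any activation g.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, m = 40, 3, 15
X = rng.normal(size=(N, d))
thetas = rng.normal(size=(m, d))     # random kernel parameter vectors

def g(x, theta):
    # A sigmoidal activation; sigmoids are not Mercer kernels in general.
    return 1.0 / (1.0 + np.exp(-x @ theta))

G = np.array([[g(x, t) for t in thetas] for x in X])   # N x m hidden outputs

# Combined kernel of Eq. (5): K[i, k] = sum_j g(x_i, theta_j) g(x_k, theta_j) = (G G^T)[i, k].
K = G @ G.T
min_eig = np.linalg.eigvalsh(K).min()
print(min_eig >= -1e-8)   # PSD up to round-off, so the dual stays well posed
```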
The advantages of deep architectures can be highlighted by the following motivations: the wide range of functions that can be parameterized by composing weakly nonlinear transformations, the appeal of hierarchical distributed representations, and the potential for combining unsupervised and supervised methods. However, at present there exists an ongoing debate over deep vs. shallow architectures. Deep architectures are generally more difficult to train than shallow ones, since they involve difficult nonlinear optimizations and many heuristics. The challenges of deep learning explain the appeal of SVMs, which learn nonlinear classifiers via the kernel trick. Unlike deep architectures, SVMs are trained by solving simple QP problems. However, SVMs seemingly cannot benefit from the advantages of deep learning.

Here we will show that FKNN networks can help us enjoy both the success of deep learning and the elegance of kernel methods. Although we share a similar motivation with previous works [35,36], our study here is very different and makes two main contributions. First, FKNN networks here can be built in a layer-by-layer way in terms of KLA = multi-layer KPCAs + LA rather than the strategies in [35–40], which gives their deep learning a rigorous theoretical guarantee for the first time. Second, with the adopted fast KPCAs in the deep learning, the time complexity of training FKNN networks indeed becomes linear in the number of training samples.

In fact, multi-layer KPCAs map the input data into the kernelized data space in the last hidden layer through kernelized data spaces between multiple hidden layers. One may ask why we need multi-layer KPCAs. An example can help us explain the motivation and show how to build an application-specific deep FKNN. Assume we want to accomplish two tasks from a face image database: identify one male criminal with a square face and a mole on the forehead, and another male criminal with a round face and a high nose. Quite often, one first picks up all the male images from the face image database to obtain a shared data space for these two tasks. We can imagine this procedure as a nonlinear transformation between all male images and all the images. Based on the shared data space, one may then pick up square face images and round face images, respectively, from all the picked male images to construct two data sets for the two tasks, which may be viewed as another nonlinear transformation. Finally, we can finish these two tasks, respectively, by identifying a face with a mole on the forehead among the square face images (a nonlinear transformation for the first task) and a face with a high nose among the round face images (a nonlinear transformation for the second task). This example can be characterized by the deep architecture of FKNNs in Fig. 4, where multi-layer KPCAs + LA is adopted.

[Fig. 4. A deep FKNN architecture for two tasks y1 and y2: the input (x1, x2, ..., xd) passes through a KPCA layer into a shared transformed data space, followed by task-specific KPCA layers and two learning algorithms, LA 1 and LA 2.]

By observing the learning behavior of FKNN networks in Fig. 4, we can see the following virtues:

(1) The number of layers of a FKNN network can be increased in a layer-by-layer way, and a KPCA behaves as a data transformation method in each hidden layer. This shares the same idea as the deep learning in [37–40]: enough layers can ensure that we obtain sufficiently complex nonlinear functions to express complex data. Except for the kernel parameter vectors in every layer, no other parameter is required.
(2) If the number of layers of a FKNN network is greater than or equal to 2, then even if a linear PCA is adopted in every hidden layer, a nonlinear data transformation in the last hidden layer will always be obtained. When a nonlinear kernel PCA is adopted in the first hidden layer, a more complex nonlinear data transformation in the last hidden layer will be generated, which may be more beneficial for representing more complex data. Both linear PCAs and KPCAs in hidden layers actually realize successive abstractions, in a deep way, of the input data or kernelized transformed data, respectively. In particular, Wang et al. proposed a KPCA-based learning architecture for LLM- or ELM-based FKNN networks [28]. Mitchell et al. [35] realized deep structure learning by using multi-layer linear PCAs and pointed out that their deep structure learning framework is a fairly broad one which can be used to describe the conventional Convolutional Networks and Deep Belief Networks, whose successes in several applications have largely sparked the recent growing interest in deep learning [35]. Cho and Saul implemented deep learning by using special multi-layer kernel machines in
[36]. Therefore, the proposed multi-layer FKNN framework is general in two senses: (1) it generalizes the work in [35], since besides linear PCAs, nonlinear KPCAs and more kernel functions can be involved; (2) it generalizes the work in [14], since besides the MSE error function adopted in LLM and ELM, more error functions can be considered within this framework.
(3) Although existing deep learning techniques have obtained impressive successes, the resulting systems are so complex that they require many heuristics without theoretical guarantee (i.e., why can they in theory be separated into two independent learning components, and what is the maximal number of hidden nodes in every hidden layer?), and it is very difficult to understand which theoretical properties are responsible for the improved performance. However, the proposed FKNN architecture has a strong theoretical guarantee, i.e., KLA = multi-layer KPCAs + LA, and rich achievements in kernel methods can help us support the suitability of FKNN networks for deep learning. Besides, the proposed FKNN architecture naturally provides a lower level of the hierarchy which can be used as "the last output layer", as shown in Fig. 4, thus yielding a range of data representations at varying levels of abstraction.
(4) The entire FKNN network behaves like a kernelized learning algorithm KLA, since multi-layer KPCAs can be viewed as a special KPCA and KLA = this special KPCA + LA. The most distinctive advantage of such a multi-layer FKNN network is that we may choose a fast KPCA algorithm and a fast implementation of LA independently to achieve FKNN's fast deep learning. For example, in order to make the time complexity of FKNN's deep learning linear in N, we may choose the fast KPCA method in [41], which is proved to have O(kN) time complexity, where k is the number of the extracted components; we may also choose ELM and LLM in [14,15,28], linear SVM in [13], KNN [31] or support vector clustering in [42] as the fast learning method in the last layer of the deep FKNN network, in terms of the training tasks.
(5) Since we can carry out KPCAs in a layer-by-layer way, as seen from the dotted lines in Fig. 4, a potential advantage is that when less abstract features are required, a lower level of the hierarchy can be used as the "output" layer, readily providing a range of feature representations at varying levels of abstraction. The other advantage is that a lower level of the hierarchy can be used as the shared transformed data space for different tasks. Let us keep in mind that multi-layer feedforward neural networks with BP-like learning cannot have this advantage.

Obviously, deep FKNN networks are application-oriented and data-dependent. However, since they can be constructed in a layer-by-layer way, we can summarize their deep learning using a general deep learning framework DLF. In other words, when oriented to a specific application, we may design a concrete FKNN and realize its deep learning implementation by instantiating and even simplifying the proposed framework DLF. For example, we may instantiate LA with the classical SVM in DLF. We may also omit Step 4 in DLF as a trade-off between deep learning performance and running time. This learning framework consists of four main steps: (1) application-oriented data preparation; (2) realizing the data transformation by multi-layer KPCAs, in a layer-by-layer way from the input layer to the last hidden layer; (3) carrying out a learning algorithm LA on the transformed dataset after the multi-layer KPCAs; (4) optimizing the kernel parameter vectors in the multi-layer KPCAs, in terms of the result obtained by the learning algorithm in step (3), in a layer-by-layer way from the last hidden layer to the input layer. More concretely, we summarize the detailed DLF framework as follows.

Learning framework DLF for a deep FKNN network

Input: The given dataset, e.g., an image database; the number of hidden layers of a deep FKNN network, with their respective kernel functions used in the multi-layer KPCAs.
Output: The parameters of the output layer determined by a LA, and/or the kernel parameter vectors in every hidden layer of the deep FKNN network.
Step 1: Preprocess the given dataset to form the input dataset for the deep FKNN network; fix the number of layers in the FKNN and the type of kernel function in each hidden layer of the deep FKNN network.
Step 2: Carry out a full-rank or low-rank KPCA with the chosen kernel function and its initial kernel parameter vector in every hidden layer, in a layer-by-layer way from the input layer to the last hidden layer. If a full-rank KPCA is taken, the number of hidden nodes in a hidden layer is automatically determined by the full-rank KPCA. Otherwise, fix the number of hidden nodes as the chosen top-k principal components, in terms of the principal components extracted by a low-rank KPCA.
Step 3: Carry out the chosen learning algorithm LA in the output layer on the transformed dataset after the multi-layer KPCAs, such that the parameters in the output layer are determined and the kernel parameter vectors in the last hidden layer are optimized.
Step 4: Determine an appropriate kernel parameter vector in every hidden layer by either:
(1) grid search in every hidden layer, in a layer-by-layer way from the last hidden layer to the input layer; or
(2) choosing a certain optimization method with a certain performance criterion (e.g., MSE) and the initial kernel parameter vector, to optimize the kernel parameter vector in every hidden layer in a layer-by-layer way from the last hidden layer to the input layer.

It should be pointed out that, different from the strategies used in the existing typical deep learning techniques [35–40], the mature KPCA techniques can be exploited to play a crucial role in the proposed general deep learning framework DLF. Although both DLF and the existing deep learning techniques can indeed be used to build deep neural networks in a layer-by-layer way, DLF is very different from them in principle. In other words, although existing deep learning techniques have obtained impressive successes, the resulting systems are so complex that they require many heuristics without theoretical guarantee (i.e., why can they in theory be separated into two independent learning components, and what is the maximal number of hidden nodes in every hidden layer?), and it would be very difficult to understand which theoretical properties are responsible for the improved performance. However, since KLA = multi-layer KPCAs + LA, DLF has a strong theoretical guarantee.

3.2. Deep FKNN network for image classification

In order to exhibit the applicability of the proposed deep FKNN network and the proposed deep learning framework DLF, here we consider a concrete application to image data classification, though there is no reason the DLF could not be applied to other types of data; we also hope to do so in the near future. Now, let us state how we can apply DLF to image classification.

For a given set of images as the training dataset, in order to do the data preparation well, we use a quad-tree decomposition to subdivide each image recursively; each level of the quad-tree is a set of non-overlapping l × l (in general, we use 4 × 4) pixel patches. We present the set of all l × l patches from all images in the training dataset to a deep FKNN network as the input data for the deep FKNN
network. For simplicity, we adopt Gaussian kernel functions with the same kernel width in every hidden layer, and the top-k eigenvectors as a reduced-dimensionality basis. Each patch is projected onto this new basis, and the reduced-dimensionality patches are then joined back together in their original order. The result is that each image has its dimensionality multiplied by k/16 (Fig. 5). These reduced-dimensionality images form the next layer up in the deep FKNN network.

This process is repeated, using the newly created layer as the data for the divide-KPCA-union process that creates the next layer. At every layer after the first, the dimensionality is reduced by a fixed factor of 4. The process terminates when the remaining data are too small to split, and the entire image is then represented by a single k-dimensional feature vector at the last hidden layer of the deep FKNN network.

As an illustration, suppose our original data is a set of m vectors, each of length n (e.g., for 200 images of 16 × 16 pixels, m = 200 and n = 16 × 16 = 256). Then we start with a raw data matrix D0, which is m × n; in the case of images, this means each row of the matrix is an image. The level-one divided data matrix D1 is generated by recursively dividing the vectors in D0 into l × l (4 × 4 = 16) dimensional patches, so it will be an (m·n/16) × 16 matrix. We apply a KPCA to D1, extract the top-k eigenvectors into F1, and use F1 as a basis onto which to project the vectors of D1. Applying the projection to the vectors in D1 results in P1, an (m·n/16) × k matrix. Adjacent vectors in P1 are then united (using the inverse of the splitting operator), resulting in D2, which is (m·n/(16·4)) × 4k. We can then continue to apply a KPCA to build the current level of the hierarchy on D2. In general, Dt will be an (m·n/(16·4^{t−1})) × 4k matrix. When 16·4^{t−1} = n, the hierarchy (i.e., the deep FKNN architecture) is complete, giving the last hidden layer with dimensions m × k. More concretely, based on the above, below we state the DLF-based image classification algorithm, in which the instantiation of Step 4 of DLF is not considered.
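The dimension bookkeeping of this worked example (m = 200, n = 256, k = 4) can be reproduced directly. In the sketch below (ours), random matrices stand in for the KPCA eigenbases purely to track the shapes of D1, P1, D2, and so on.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, k = 200, 256, 4                  # 200 images of 16x16 = 256 pixels; keep k = 4 components

D0 = rng.normal(size=(m, n))           # raw data matrix: one image per row
D1 = D0.reshape(-1, 16)                # divide each row into 16-dimensional patches
P1 = D1 @ rng.normal(size=(16, k))     # stand-in for projecting onto the top-k KPCA basis F1
D2 = P1.reshape(-1, 4 * k)             # union: rejoin 4 adjacent reduced patches
P2 = D2 @ rng.normal(size=(4 * k, k))  # next-level KPCA projection (stand-in basis)
D3 = P2.reshape(-1, 4 * k)
P3 = D3 @ rng.normal(size=(4 * k, k))  # last hidden layer: one k-vector per image
print(D1.shape, D2.shape, D3.shape, P3.shape)
# (3200, 16) (800, 16) (200, 16) (200, 4)
```

The shapes match the text: D1 is (m·n/16) × 16 = 3200 × 16, D2 is (m·n/64) × 4k = 800 × 16, and the hierarchy terminates with an m × k = 200 × 4 representation.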
Fig. 5. An example of how the deep FKNN network works on a pair of images from a 32 × 32 image dataset.
DLF-based image classification algorithm

Input: The m × n data matrix D0 (m images of n pixels each); the number l of hidden layers in the FKNN, where l = log4(n/16); the number k of eigenvectors to keep.
Output: The feature space hierarchy F, the projected data hierarchy Dl, and the classification accuracy for Dl.

Please note that the DivideQuads function does a quad-tree-style division of an image into four equal-sized sub-images, and the UnionQuads function inverts this operation. The function KPCA(Dt, k) computes an eigen-decomposition of the kernelized covariance of Dt and then returns the top-k eigenvectors.
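The DivideQuads/UnionQuads pair can be sketched as follows (ours; the paper gives no code, so the implementation details are illustrative). DivideQuads splits an image into its four equal quadrants, and UnionQuads inverts the operation exactly.

```python
import numpy as np

def divide_quads(img):
    # Quad-tree style split of an h x w image into 4 equal-sized sub-images.
    h, w = img.shape
    return [img[:h // 2, :w // 2], img[:h // 2, w // 2:],
            img[h // 2:, :w // 2], img[h // 2:, w // 2:]]

def union_quads(quads):
    # Inverse of divide_quads: stitch the 4 quadrants back together.
    a, b, c, d = quads
    return np.vstack([np.hstack([a, b]), np.hstack([c, d])])

img = np.arange(16.0).reshape(4, 4)
assert np.array_equal(union_quads(divide_quads(img)), img)   # exact inverse
print(divide_quads(img)[0])   # top-left quadrant: [[0. 1.] [4. 5.]]
```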
[Table 2. Example images for the image classification tasks; the example classes include forest and street.]
The experimental studies are organized as follows. In Section 4.1, the experiment settings and the adopted image datasets are described. In Section 4.2, the classification results obtained by DLF and KPCA + LA on the adopted image datasets are reported and analyzed.

4.1. Experiment settings

data, generated from the original image dataset, as the input data for the multi-layer KPCAs. After the multi-layer KPCAs, 200 k-dimensional data points are presented to a LA in the output layer of the deep FKNN network. Similarly, the input dataset is generated for both methods on the original 200 32 × 32 image datasets from task 1 to task 3. For each experiment, 70% of the images are taken as the training set, and the remaining 30% are used for testing.
Table 3
Resolutions used were 16 × 16 pixels on Task 1.
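As a concrete reference for the two evaluation indices adopted in this section, Accuracy and F1, the following plain-Python sketch computes both from lists of true and predicted labels; the function names are ours, not from the paper:

```python
def accuracy(y_true, y_pred):
    """Fraction of correctly classified labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_measure(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```

For example, with y_true = [1, 1, 0, 0] and y_pred = [1, 0, 0, 0], the accuracy is 0.75 (precision 1.0, recall 0.5, F1 = 2/3).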
The Gaussian kernel widths are fixed on a given set {…, 9, 10, 50, 100}, instead of the kernel widths obtained by a certain optimization method, such as leave-one-out cross-validation and so on [49], in a layer-by-layer way. Let us keep in mind that although we can fix the Gaussian kernel widths in a layer-by-layer way so that the runtime of DLF scales linearly with the depth of the hierarchy of the deep FKNN network, DLF is still time-consuming. Therefore, we focus our attention on the main goal of this study and leave these issues for future investigation. In addition, we fix the regularization parameter C = 100 for SVM and the number of nearest neighbors K = 5 for KNN.
Both algorithms are implemented in MATLAB on a computer with an Intel Core 2 Duo P8600 2.4 GHz CPU and 2 GB RAM.
In order to evaluate the experimental results reasonably, two traditional evaluation indices, i.e., Accuracy and F1-measure (Acc and F1 for simplicity [47]), are adopted, in which the accuracy is the fraction of correctly classified labels and the F1-measure is the harmonic mean of precision and recall.

4.2. Performance analysis of both algorithms

4.2.1. Fix the dimension
In this subsection, the datasets from Task 1 to Task 3 are used to generate two feature spaces. The first uses the classical KPCA with Gaussian kernels to generate 16 features, and the second uses multi-layer KPCAs with Gaussian kernels to generate 16 features. The dimensionality of the resultant feature space was therefore the same for both algorithms. The experimental results are illustrated in Tables 3–8, in which "Wins" indicates that one technique outperforms the other; in cases where performance is the same, no winner is listed. From Tables 3–8, i.e., Task 1 to Task 3, we make the following observations.

(1) Focused on the "Wins" index in Tables 3–8: while DLF is better on average for every kernel width, it is not always better than KPCA + LA. Table 3 shows how often each algorithm beat the other for each value of the kernel width; in cases where both algorithms tie, neither is counted as winning. We note that the multi-layer KPCAs not only win more frequently, but also win by a larger margin in accuracy and F1-measure.
(2) Compared with KPCA + LA on the best accuracy from Task 1 to Task 3:
(i) On Task 1, the coast & forest classification task: as shown in Table 3 (Task 1 with 16 × 16 pixels), the best accuracies of KPCA + LA with different classifiers are 60.00% (SVM), 68.33% (KNN) and 58.33% (Naive Bayes), respectively. Compared with the KPCA + LA method, the best accuracies of our DLF with different classifiers are 71.67% (SVM), 70.00% (KNN) and 73.33% (Naive Bayes), respectively. Similar experimental results can be observed in Table 4 (Task 1 with 32 × 32 pixels), i.e., the best accuracies of KPCA + LA vs
Table 4
Resolutions used were 32 × 32 pixels on Task 1.
Table 5
Resolutions used were 16 × 16 pixels on Task 2.
DLF with different classifiers are 66.67% vs 75.00% (SVM), 71.67% vs 75.00% (KNN) and 63.33% vs 78.33% (Naive Bayes), respectively.
(ii) On Task 2, the inside city & tall building classification task: as shown in Table 5 (Task 2 with 16 × 16 pixels), the best accuracies of KPCA + LA with different classifiers are 71.67% (SVM), 68.33% (KNN) and 73.33% (Naive Bayes), respectively. Compared with the KPCA + LA method, the best accuracies of our DLF with different classifiers are 73.00% (SVM), 71.67% (KNN) and 78.33% (Naive Bayes), respectively. Similar experimental results can be observed in Table 6 (Task 2 with 32 × 32 pixels), i.e., the best accuracies of KPCA + LA vs DLF with different classifiers are 71.67% vs 75.00% (SVM), 70.00% vs 78.33% (KNN) and 76.67% vs 80.00% (Naive Bayes), respectively.
(iii) On Task 3, the highway & street classification task: as shown in Table 7 (Task 3 with 16 × 16 pixels), the best accuracies of KPCA + LA with different classifiers are 91.67% (SVM),
Fig. 6. Performance of DLF and KPCA + LA (δ = 0.1) with different feature dimensions on Task 1; the three panels plot classification accuracy against feature dimension for the SVM, KNN and Naive Bayes classifiers.
Table 6
Resolutions used were 32 × 32 pixels on Task 2.
96.67% (KNN) and 90.00% (Naive Bayes), respectively. Compared with the KPCA + LA method, the best accuracies of our DLF with different classifiers are 96.67% (SVM), 98.33% (KNN) and 100.00% (Naive Bayes), respectively. Similar experimental results can be observed in Table 8 (Task 3 with 32 × 32 pixels), i.e., the best accuracies of KPCA + LA vs DLF with different classifiers are 90.00% vs 91.67% (SVM), 96.67% vs 100.00% (KNN) and 96.67% vs 98.33% (Naive Bayes), respectively.

In summary, DLF achieves higher accuracies than classical KPCA + LA in most cases with different kernel widths, and obtains a better best accuracy than classical KPCA + LA in all cases on the different classification tasks. The experimental results confirm that the proposed DLF-based image classification algorithm is able to mine deep classification knowledge from the image dataset. When we fix the kernel widths of the multi-layer KPCAs, the data obtained by DLF contain more information than the data obtained by a classical KPCA. In other words, the
Fig. 7. Performance of DLF and KPCA + LA (δ = 0.1) with different feature dimensions on Task 2; the three panels plot classification accuracy against feature dimension for the SVM, KNN and Naive Bayes classifiers.
Table 7
Resolutions used were 16 × 16 pixels on Task 3.
ingenious learning technique can concentrate richer information in data of a fixed dimension. This is a main reason that the proposed algorithm is highly effective and outperforms the classical KPCA + LA method on different image classification tasks.

4.2.2. Fix the kernel width for KPCA
In this subsection, the classification performance of DLF is compared with that of KPCA + LA by extracting different numbers of feature dimensions from the set {2, 4, 6, 8, 10, 12, 14, 16}. To save space, we only choose the kernel width δ = 0.1 and an original image size of 16 × 16 pixels in this experiment; the detailed results are illustrated in Figs. 6–8.
It can be seen from Figs. 6–8 that the classification performance of the proposed DLF-based image classification algorithm is better than that of the KPCA + LA method across the different extracted feature dimensions. The experimental results confirm that, with a fixed kernel width, the ingenious learning technique can concentrate richer information in the data across different dimensions. This is a main reason that the proposed DLF method is highly effective and outperforms
Fig. 8. Performance of DLF and KPCA + LA (δ = 0.1) with different feature dimensions on Task 3; the three panels plot classification accuracy against feature dimension for the SVM, KNN and Naive Bayes classifiers.
Table 8
Resolutions used were 32 × 32 pixels on Task 3.
the classical KPCA + LA method with different feature dimensions on different image classification tasks.

5. Conclusions and future works

In this paper, the architecture of feedforward kernel neural networks (FKNNs) is proposed to cover a considerably large family of existing feedforward neural networks. The first central result of this study is that, with the architecture of an FKNN network fixed, a KPCA is implicitly performed and hence its learning may be hidden-layer-tuning-free; accordingly, a generalized LLM is developed for training the network on datasets. It is also revealed that the rigorous Mercer kernel condition is not required in FKNNs. The second central result of this study is that a deep FKNN architecture is developed with the explicit execution of multi-layer KPCAs, and its deep learning framework DLF, as an alternative form of deep learning, has a strong theoretical guarantee. While many authors have claimed that deep learning can be performed by data transformation plus a learning algorithm in the last layer of a deep structure, it has never been demonstrated why this property holds. Our experimental results indicate that the proposed FKNN deep architecture and its DLF-based learning algorithm indeed lead to enhanced performance in image classification.

Future works will mainly focus on the proposed deep FKNN network, including: (1) instead of the several given kernel widths in our experiments, we should attempt to determine their appropriate values by using a certain optimization method, as pointed out in the fourth step of DLF, in the near future; (2) our preliminary experimental results demonstrate that the DLF-based learning of a deep FKNN network is time-consuming, and how to speed up its learning is an interesting topic; (3) as shown in Fig. 4, the proposed deep FKNN network can provide a wide range of feature representations at varying levels of abstraction, and how to improve the current DLF version such that its learning can synthesize feature representations at varying levels of abstraction is another interesting topic; (4) as shown in Fig. 4, another interesting topic is how to improve the current DLF version such that it can share a common transformed data space across multiple tasks; (5) although the proposed deep FKNN networks and the deep learning framework DLF have been justified by our preliminary results on image classification, their effectiveness should be further verified by exploring more applications and by comparative studies with existing deep learning methods.

Acknowledgements

This work was supported in part by the Hong Kong Polytechnic University under Grant G-UA3W, and by the National Natural Science Foundation of China under Grants 61272210, 2015NSFC, and by the Natural Science Foundation of Jiangsu Province under
Grants BK2011003 and BK2011417, by the JiangSu 333 Expert Engineering Project under Grant BRA2011142, by the Fundamental Research Funds for the Central Universities under Grant JUSRP111A38, and by the 2013 Postgraduate Student's Creative Research Fund of Jiangsu Province.

References

[1] K. Hornik, Approximation capabilities of feedforward networks, Neural Netw. (1991) 251–257.
[2] M. Leshno, V.Y. Lin, A. Pinkus, S. Schocken, Multilayer feedforward networks with a nonpolynomial activation function can approximate any function, Neural Netw. 6 (1993) 861–867.
[3] K.L. Du, M.N.S. Swamy, Neural Networks in a Softcomputing Framework, Springer-Verlag, 2006.
[4] G.B. Huang, X. Ding, H. Zhou, Optimization method based extreme learning machine for classification, Neurocomputing 74 (1–3) (2010) 155–163.
[5] G.B. Huang, L. Chen, C.-K. Siew, Universal approximation using incremental constructive feedforward networks with random hidden nodes, IEEE Trans. Neural Netw. 17 (4) (2006) 879–892.
[6] G.B. Huang, Q.Y. Zhu, C.-K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (2006) 489–501.
[7] H.J. Rong, G.B. Huang, P. Saratchandran, N. Sundararajan, On-line sequential fuzzy extreme learning machine for function approximation and classification problems, IEEE Trans. Syst. Man Cybern. Part B 39 (4) (2009) 1067–1072.
[8] G. Feng, G.-B. Huang, Q. Lin, R. Gay, Error minimized extreme learning machine with growth of hidden nodes and incremental learning, IEEE Trans. Neural Netw. 20 (8) (2009) 1352–1357.
[9] Q.Y. Zhu, A.K. Qin, P.N. Suganthan, G.-B. Huang, Evolutionary extreme learning machine, Pattern Recognit. 38 (1) (2005) 1759–1763.
[10] M. Yoan, A. Sorjamaa, P. Bas, O. Simula, C. Jutten, A. Lendasse, OP-ELM: optimally pruned extreme learning machine, IEEE Trans. Neural Netw. 21 (1) (2010) 158–162.
[11] G. Huang, H. Zhou, X. Ding, R. Zhang, Extreme learning machine for regression and multiclass classification, IEEE Trans. Syst. Man Cybern. Part B 42 (2) (2012) 513–529.
[12] H.T. Huynh, Y. Won, Extreme learning machine with fuzzy activation function, in: Proc. NCM'2009, 2009, pp. 303–307.
[13] H.X. Tian, Z.Z. Mao, An ensemble ELM based on modified AdaBoost.RT algorithm for predicting the temperature of molten steel in ladle furnace, IEEE Trans. Autom. Sci. Eng. 7 (1) (2008) 73–80.
[14] S. Wang, F.-L. Chung, J. Wang, J. Wu, A fast learning method for feedforward neural networks, Neurocomputing 149 (2015) 295–307.
[15] J. Wu, S.T. Wang, F.L. Chung, Positive and negative fuzzy rule systems, extreme learning machine and image classification, J. Mach. Learn. Cybern. 2 (4) (2011) 261–271.
[16] S.T. Wang, K.F.L. Chung, Z.H. Deng, D.W. Hu, Robust fuzzy clustering neural network based on epsilon-insensitive loss function, Appl. Soft Comput. 7 (2) (2007) 577–584.
[17] S.T. Wang, D. Fu, M. Xu, D.W. Hu, Advanced fuzzy cellular neural network: application to CT liver images, Artif. Intell. Med. 39 (1) (2007) 66–77.
[18] D. Achlioptas, F. McSherry, B. Schölkopf, Sampling techniques for kernel methods, in: NIPS 2001, 2001, pp. 335–342.
[19] B. Schölkopf, A. Smola, Learning with Kernels, MIT Press, Cambridge, MA, 2002.
[20] J. Friedman, Multivariate adaptive regression splines (with discussion), Ann. Stat. 19 (1) (1991) 1–141.
[21] S. Mitaim, B. Kosko, The shape of fuzzy sets in adaptive function approximation, IEEE Trans. Fuzzy Syst. 9 (2001) 637–656.
[22] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, Cambridge, 2004.
[23] C. Zhang, F. Nie, S. Xiang, A general kernelization framework for learning algorithms based on kernel PCA, Neurocomputing 73 (4–6) (2010) 959–967.
[24] J. Yang, D. Zhang, A.F. Frangi, J. Yang, Two-dimensional PCA: a new approach to appearance-based face representation and recognition, IEEE Trans. Pattern Anal. Mach. Intell. 26 (1) (2004) 131–137.
[25] X.M. Wang, F.L. Chung, S.T. Wang, On minimum class locality preserving variance support vector machine, Pattern Recognit. 43 (2010) 2753–2762.
[26] X.M. Wang, Research on Dimension-Reduction and Classification Techniques in Intelligent Computation (Ph.D. thesis), JiangNan University, 2010.
[27] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.
[28] S.T. Wang, F.L. Chung, J. Wu, J. Wang, Least learning machine and its experimental studies on regression capability, Appl. Soft Comput. 21 (2014) 677–684.
[29] I.W.H. Tsang, J.T.Y. Kwok, J.M. Zurada, Generalized core vector machines, IEEE Trans. Neural Netw. 17 (5) (2006) 1126–1140.
[30] S.T. Wang, J. Wang, F.L. Chung, Kernel density estimation, kernel methods, and fast learning in large data sets, IEEE Trans. Cybern. 44 (1) (2014) 1–20.
[31] K. Hajebi, Y. Abbasi-Yadkori, H. Shahbazi, H. Zhang, Fast approximate nearest-neighbor search with k-nearest neighbor graph, in: Proc. Twenty-Second International Joint Conference on Artificial Intelligence, 2011, pp. 1312–1317.
[32] Z.H. Deng, F.L. Chung, S.T. Wang, FRSDE: fast reduced set density estimator using minimal enclosing ball approximation, Pattern Recognit. 41 (4) (2008) 1363–1372.
[33] P.J. Qian, F.L. Chung, S.T. Wang, Z.H. Deng, Fast graph-based relaxed clustering for large data sets using minimal enclosing ball, IEEE Trans. Syst. Man Cybern. Part B 42 (3) (2012) 672–687.
[34] W.J. Hu, K.F.L. Chung, S.T. Wang, The maximum vector-angular margin classifier and its fast training on large datasets using a core vector machine, Neural Netw. 27 (2012) 60–73.
[35] B. Mitchell, J. Sheppard, Deep structure learning: beyond connectionist approaches, in: Proc. 11th International Conference on Machine Learning and Applications, 2012, pp. 162–167.
[36] Y. Cho, L. Saul, Kernel methods for deep learning, in: Y. Bengio, D. Schuurmans, C. Williams, J. Lafferty, A. Culotta (Eds.), Advances in Neural Information Processing Systems 22 (NIPS'09), 2010, pp. 342–350.
[37] G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313 (2006) 504–507.
[38] Y. Bengio, Learning deep architectures for AI, Found. Trends Mach. Learn. 2 (1) (2009) 1–127 (also published as a book, Now Publishers, 2009).
[39] Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle, Greedy layer-wise training of deep networks, in: Advances in Neural Information Processing Systems 19 (NIPS'06), 2007, pp. 153–160.
[40] D. Erhan, Y. Bengio, A. Courville, P. Manzagol, P. Vincent, Why does unsupervised pre-training help deep learning? J. Mach. Learn. Res. 11 (2010) 625–660.
[41] W. Liao, A. Pizurica, W. Philips, Y. Pi, A fast iterative kernel PCA feature extractor for hyperspectral images, in: Proc. 2010 IEEE 17th International Conference on Image Processing, 2010, pp. 26–29.
[42] F. Wang, B. Zhao, C.S. Zhang, Linear time maximum margin clustering, IEEE Trans. Neural Netw. 21 (2) (2010) 319–332.
[43] A. Oliva, A. Torralba, Modeling the shape of the scene: a holistic representation of the spatial envelope, Int. J. Comput. Vision 42 (3) (2001) 145–175.
[44] R. Duda, P. Hart, D. Stork, Pattern Classification, Wiley, New York, 2000.
[45] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (3) (1995) 273–297.
[46] D. Grossman, P. Domingos, Learning Bayesian network classifiers by maximizing conditional likelihood, in: Proc. Twenty-First International Conference on Machine Learning, ACM, 2004, p. 46.
[47] G. Li, K. Chang, S.C.H. Hoi, Multi-view semi-supervised learning with consensus, IEEE Trans. Knowl. Data Eng. 24 (11) (2012) 2040–2051.
[48] Y. Liu, J. Chen, Correntropy kernel learning for nonlinear system identification with outliers, Ind. Eng. Chem. Res. 53 (13) (2014) 5248–5260.
[49] Y. Liu, Z. Gao, P. Li, H. Wang, Just-in-time kernel learning with adaptive parameter selection for soft sensor modeling of batch processes, Ind. Eng. Chem. Res. 51 (11) (2012) 4313–4327.