PCANet: A Simple Deep Learning Baseline for Image Classification?
Abstract—In this work, we propose a very simple deep learning network for image classification which comprises only the most basic data processing components: cascaded principal component analysis (PCA), binary hashing, and block-wise histograms. In the proposed architecture, PCA is employed to learn multistage filter banks. It is followed by simple binary hashing and block histograms for indexing and pooling. This architecture is thus named the PCA network (PCANet) and can be designed and learned extremely easily and efficiently. For comparison and better understanding, we also introduce and study two simple variations of the PCANet, namely the RandNet and the LDANet. They share the same topology as the PCANet, but their cascaded filters are either selected randomly or learned from LDA. We have tested these basic networks extensively on many benchmark visual datasets for different tasks, such as LFW for face verification; MultiPIE, Extended Yale B, AR, and FERET for face recognition; and MNIST for hand-written digit recognition. Surprisingly, for all tasks, such a seemingly naive PCANet model is on par with state-of-the-art features, whether prefixed, highly hand-crafted, or carefully learned (by DNNs). Even more surprisingly, it sets new records for many classification tasks on the Extended Yale B, AR, and FERET datasets and on the MNIST variations. Additional experiments on other public datasets also demonstrate the potential of the PCANet serving as a simple but highly competitive baseline for texture classification and object recognition.
Index Terms—Convolutional Neural Network, Deep Learning, PCA Network, Random Network, LDA Network, Face Recognition, Handwritten Digit Recognition, Object Classification.
[Fig. 1 pipeline: Input Image → Stage 1 (PCA filter bank 1) → Stage 2 (PCA filter bank 2) → Output layer (binarization and binary-to-decimal conversion, composed block-wise histograms).]
Fig. 1. Illustration of how the proposed PCANet extracts features from an image through the three simplest processing components: PCA filters, binary hashing, and histograms.

such as face recognition where the intra-class variability includes significant illumination change and corruption.

1.1 Motivations

An initial motivation of our study is to resolve some apparent discrepancies between ConvNet and ScatNet. We want to achieve two simple goals: First, we want to design a simple deep learning network which should be very easy, even trivial, to train and to adapt to different data and tasks. Second, such a basic network could serve as a good baseline for people to empirically justify the use of more advanced processing components or more sophisticated architectures for their deep learning networks.

The solution comes as no surprise: We use the most basic and easy operations to emulate the processing layers in a typical (convolutional) neural network mentioned above: the data-adapting convolution filter bank in each stage is chosen to be the most basic PCA filters; the nonlinear layer is set to be the simplest binary quantization (hashing); and for the feature pooling layer, we simply use the block-wise histograms of the binary codes, which are considered the final output features of the network. For ease of reference, we name such a data-processing network a PCA Network (PCANet). As an example, Figure 1 illustrates how a two-stage PCANet extracts features from an input image.

At least one characteristic of the PCANet model seems to challenge common wisdom in building a deep learning network such as ConvNet [4], [5], [8] and ScatNet [6], [10]: there are no nonlinear operations in the early stages of the PCANet until the very last output layer, where binary hashing and histograms are conducted to compute the output features. Nevertheless, as we will see through extensive experiments, such drastic simplification does not seem to undermine performance of the network on some of the typical datasets.

A network closely related to PCANet could be two-stage oriented PCA (OPCA), which was first proposed for audio processing [11]. Noticeable differences from PCANet lie in that OPCA does not couple with hashing and local histograms in the output layer. Given the covariance of noises, OPCA gains additional robustness to noises and distortions. The baseline PCANet could also incorporate the merit of OPCA, likely offering more invariance to intra-class variability. To this end, we have also explored a supervised extension of PCANet, where we replace the PCA filters with filters learned from linear discriminant analysis (LDA), called LDANet. As we will see through extensive experiments, the additional discriminative information does not seem to improve performance of the network; see Sections 2.3 and 3. Another, somewhat extreme, variation of PCANet is to replace the PCA filters with totally random filters (say, filter entries that are i.i.d. Gaussian variables), called RandNet. In this work, we conducted extensive experiments and fair comparisons of these types of networks with other existing networks such as ConvNet and ScatNet. We hope our experiments and observations will help people gain a better understanding of these different networks.

1.2 Contributions

Although our initial intention of studying the simple PCANet architecture was to have a simple baseline for comparing and justifying other more advanced deep learning components or architectures, our findings lead to some pleasant but thought-provoking surprises: the very basic PCANet, in fair experimental comparison, is already quite on par with, and often better than, the state-of-the-art features (prefixed, hand-crafted, or learned from DNNs) for almost all image classification tasks, including face images, hand-written digits, texture images, and object images. More specifically, for face recognition with one gallery image per person, it achieves 99.58% accuracy on the Extended Yale B dataset, and over 95% accuracy across the disguise/illumination subsets of the AR dataset. On the FERET dataset, it obtains the state-of-the-art average accuracy of 97.25% and achieves the best accuracy of 95.84% and 94.02% on the Dup-1 and Dup-2 subsets, respectively.¹ On the LFW dataset, it achieves a competitive 86.28% face verification accuracy under the "unsupervised setting". On the MNIST datasets, it achieves state-of-the-art results for subtasks such as basic, background random, and background image. See Section 3 for more details. Overwhelming empirical evidence demonstrates the effectiveness of the proposed PCANet in learning robust invariant features for various image classification tasks.

¹ The results were obtained by following the FERET standard training CD, and could be marginally better when the PCANet is trained on the MultiPIE database.
The method hardly contains any deep or new techniques, and our study so far is entirely empirical.² Nevertheless, a thorough report on such a baseline system has tremendous value to the deep learning and visual recognition community, sending both sobering and encouraging messages: On one hand, for future study, PCANet can serve as a simple but surprisingly competitive baseline to empirically justify any advanced designs of multistage features or networks. On the other hand, the empirical success of PCANet (even that of RandNet) confirms again certain remarkable benefits of cascaded feature learning or extraction architectures. Even more importantly, since PCANet consists of only a (cascaded) linear map, followed by binary hashing and block histograms, it is now amenable to mathematical analysis and justification of its effectiveness. That could lead to fundamental theoretical insights about general deep networks, which seem in urgent need for deep learning nowadays.

² We would be surprised if something similar to PCANet or variations of OPCA [11] had not been suggested or experimented with before in the vast learning literature.

2 CASCADED LINEAR NETWORKS

2.1 Structures of the PCA Network (PCANet)

Suppose that we are given $N$ input training images $\{I_i\}_{i=1}^N$ of size $m \times n$, and assume that the patch size (or 2D filter size) is $k_1 \times k_2$ at all stages. The proposed PCANet model is illustrated in Figure 2, and only the PCA filters need to be learned from the input images $\{I_i\}_{i=1}^N$. In what follows, we describe each component of the block diagram more precisely.

2.1.1 The first stage: PCA

Around each pixel, we take a $k_1 \times k_2$ patch, and we collect all (overlapping) patches of the $i$th image; i.e., $x_{i,1}, x_{i,2}, \ldots, x_{i,mn} \in \mathbb{R}^{k_1 k_2}$, where each $x_{i,j}$ denotes the $j$th vectorized patch in $I_i$. We then subtract the patch mean from each patch and obtain $\bar{X}_i = [\bar{x}_{i,1}, \bar{x}_{i,2}, \ldots, \bar{x}_{i,mn}]$, where $\bar{x}_{i,j}$ is a mean-removed patch. By constructing the same matrix for all input images and putting them together, we get

$$X = [\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_N] \in \mathbb{R}^{k_1 k_2 \times Nmn}. \quad (1)$$

Assuming that the number of filters in layer $i$ is $L_i$, PCA minimizes the reconstruction error within a family of orthonormal filters, i.e.,

$$\min_{V \in \mathbb{R}^{k_1 k_2 \times L_1}} \|X - V V^T X\|_F^2, \quad \text{s.t.} \ V^T V = I_{L_1}, \quad (2)$$

where $I_{L_1}$ is an identity matrix of size $L_1 \times L_1$. The solution is the $L_1$ principal eigenvectors of $X X^T$, and the PCA filters are therefore expressed as

$$W_l^1 \doteq \mathrm{mat}_{k_1,k_2}(q_l(X X^T)) \in \mathbb{R}^{k_1 \times k_2}, \quad l = 1, 2, \ldots, L_1, \quad (3)$$

where $\mathrm{mat}_{k_1,k_2}(v)$ is a function that maps $v \in \mathbb{R}^{k_1 k_2}$ to a matrix $W \in \mathbb{R}^{k_1 \times k_2}$, and $q_l(X X^T)$ denotes the $l$th principal eigenvector of $X X^T$. The leading principal eigenvectors capture the main variation of all the mean-removed training patches. Of course, similar to DNN or ScatNet, we can stack multiple stages of PCA filters to extract higher-level features.

2.1.2 The second stage: PCA

Almost repeating the same process as the first stage, let the $l$th filter output of the first stage be

$$I_i^l \doteq I_i * W_l^1, \quad i = 1, 2, \ldots, N, \quad (4)$$

where $*$ denotes 2D convolution, and the boundary of $I_i$ is zero-padded before convolving with $W_l^1$ so as to make $I_i^l$ have the same size as $I_i$. Like the first stage, we can collect all the overlapping patches of $I_i^l$, subtract the patch mean from each patch, and form $\bar{Y}_i^l = [\bar{y}_{i,l,1}, \bar{y}_{i,l,2}, \ldots, \bar{y}_{i,l,mn}] \in \mathbb{R}^{k_1 k_2 \times mn}$, where $\bar{y}_{i,l,j}$ is the $j$th mean-removed patch in $I_i^l$. We further define $Y^l = [\bar{Y}_1^l, \bar{Y}_2^l, \ldots, \bar{Y}_N^l] \in \mathbb{R}^{k_1 k_2 \times Nmn}$ as the matrix collecting all mean-removed patches of the $l$th filter output, and concatenate $Y^l$ over all the filter outputs as

$$Y = [Y^1, Y^2, \ldots, Y^{L_1}] \in \mathbb{R}^{k_1 k_2 \times L_1 Nmn}. \quad (5)$$

The PCA filters of the second stage are then obtained as

$$W_\ell^2 \doteq \mathrm{mat}_{k_1,k_2}(q_\ell(Y Y^T)) \in \mathbb{R}^{k_1 \times k_2}, \quad \ell = 1, 2, \ldots, L_2. \quad (6)$$

For each input $I_i^l$ of the second stage, we will have $L_2$ outputs, each convolving $I_i^l$ with $W_\ell^2$ for $\ell = 1, 2, \ldots, L_2$:

$$O_i^l \doteq \{I_i^l * W_\ell^2\}_{\ell=1}^{L_2}. \quad (7)$$

The number of outputs of the second stage is $L_1 L_2$. One can simply repeat the above process to build more (PCA) stages if a deeper architecture is found to be beneficial.

2.1.3 Output stage: hashing and histogram

Each of the $L_1$ input images $I_i^l$ of the second stage has $L_2$ real-valued outputs $\{I_i^l * W_\ell^2\}_{\ell=1}^{L_2}$ from the second stage. We binarize these outputs and obtain $\{H(I_i^l * W_\ell^2)\}_{\ell=1}^{L_2}$, where $H(\cdot)$ is a Heaviside step (like) function whose value is one for positive entries and zero otherwise.

Around each pixel, we view the vector of $L_2$ binary bits as a decimal number. This converts the $L_2$ outputs in $O_i^l$ back into a single integer-valued "image":

$$T_i^l \doteq \sum_{\ell=1}^{L_2} 2^{\ell-1} H(I_i^l * W_\ell^2), \quad (8)$$

whose every pixel is an integer in the range $[0, 2^{L_2} - 1]$. The order and weights of the $L_2$ outputs are irrelevant, as we here treat each integer as a distinct "word."

For each of the $L_1$ images $T_i^l$, $l = 1, \ldots, L_1$, we partition it into $B$ blocks. We compute the histogram (with $2^{L_2}$ bins) of the decimal values in each block, and concatenate all the $B$ histograms into one vector, denoted as $\mathrm{Bhist}(T_i^l)$. After this encoding process, the "feature" of the input image $I_i$ is then defined to be the set of block-wise histograms; i.e.,

$$f_i \doteq [\mathrm{Bhist}(T_i^1), \ldots, \mathrm{Bhist}(T_i^{L_1})]^T \in \mathbb{R}^{(2^{L_2}) L_1 B}. \quad (9)$$
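To make the pipeline above concrete, the following is a minimal NumPy sketch of two-stage PCANet feature extraction (Eqs. (1)–(9)). It is our own illustrative reimplementation, not the authors' released code: the helper names (extract_patches, pca_filters, conv2d_same, pcanet_feature) are ours, non-overlapping histogram blocks are assumed for simplicity, and the loops favor readability over speed.

```python
import numpy as np

def extract_patches(img, k1, k2):
    """All overlapping k1 x k2 patches of a 2D image, vectorized and patch-mean removed."""
    H, W = img.shape
    cols = [img[i:i + k1, j:j + k2].ravel()
            for i in range(H - k1 + 1) for j in range(W - k2 + 1)]
    X = np.asarray(cols).T                       # shape: k1*k2 x (number of patches)
    return X - X.mean(axis=0, keepdims=True)     # remove the mean of each patch

def pca_filters(images, k1, k2, L):
    """Leading L eigenvectors of X X^T reshaped into k1 x k2 filters (Eqs. (2)-(3))."""
    X = np.hstack([extract_patches(im, k1, k2) for im in images])
    eigval, eigvec = np.linalg.eigh(X @ X.T)     # symmetric eigendecomposition
    top = np.argsort(eigval)[::-1][:L]
    return [eigvec[:, l].reshape(k1, k2) for l in top]

def conv2d_same(img, filt):
    """Zero-padded 2D convolution that keeps the input size (as assumed in Eq. (4))."""
    k1, k2 = filt.shape
    pad = np.pad(img, ((k1 // 2, k1 - 1 - k1 // 2), (k2 // 2, k2 - 1 - k2 // 2)))
    f = filt[::-1, ::-1]                         # flip the kernel for true convolution
    out = np.empty_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(pad[i:i + k1, j:j + k2] * f)
    return out

def pcanet_feature(img, W1, W2, block=(7, 7)):
    """Binary hashing (Eq. (8)) and block-wise histograms (Eq. (9)) for one image."""
    L2 = len(W2)
    feats = []
    for w1 in W1:                                # one hashed "image" T_i^l per stage-1 filter
        I_l = conv2d_same(img, w1)
        T = np.zeros(img.shape, dtype=np.int64)
        for l, w2 in enumerate(W2):              # weight the l-th binary map by 2^l
            T += (conv2d_same(I_l, w2) > 0).astype(np.int64) << l
        bh, bw = block                           # non-overlapping blocks for simplicity
        for i in range(0, img.shape[0] - bh + 1, bh):
            for j in range(0, img.shape[1] - bw + 1, bw):
                hist, _ = np.histogram(T[i:i + bh, j:j + bw],
                                       bins=2 ** L2, range=(0, 2 ** L2))
                feats.append(hist)
    return np.concatenate(feats)                 # length (2^L2) * L1 * B

# Toy usage: learn filters on random "images" and extract one feature vector.
rng = np.random.default_rng(0)
train = [rng.standard_normal((28, 28)) for _ in range(10)]
W1 = pca_filters(train, 7, 7, L=8)
W2 = pca_filters([conv2d_same(im, w) for im in train for w in W1], 7, 7, L=8)
print(pcanet_feature(train[0], W1, W2).shape)    # (2^8 bins) x (8 filters) x (16 blocks)
```

On real data one would of course use the datasets and parameter settings of Section 3 (e.g., overlapping blocks for digit images) and a vectorized convolution.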
[Fig. 2. Block diagram of the proposed (two-stage) PCANet: input layer → first stage (patch-mean removal, PCA filter convolution) → second stage (patch-mean removal, PCA filter convolution) → output layer (binary quantization and mapping, concatenated image and block-wise histogram).]
The local blocks can be either overlapping or non-overlapping, depending on the application. Our empirical experience suggests that non-overlapping blocks are suitable for face images, whereas overlapping blocks are appropriate for hand-written digits, textures, and object images. Furthermore, the histogram offers some degree of translation invariance in the extracted features, as in hand-crafted features (e.g., the scale-invariant feature transform (SIFT) [12] or histograms of oriented gradients (HOG) [13]), learned features (e.g., the bag-of-words (BoW) model [14]), and the average or maximum pooling process in ConvNet [3]–[5], [8], [9].

The model parameters of PCANet include the filter size $k_1, k_2$, the number of filters in each stage $L_1, L_2$, the number of stages, and the block size for the local histograms in the output layer. PCA filter banks require that $k_1 k_2 \geq L_1, L_2$. In our experiments in Section 3, we always set $L_1 = L_2 = 8$, inspired by the common setting of Gabor filters [15] with 8 orientations, although some fine-tuned $L_1, L_2$ could lead to marginal performance improvement. Moreover, we have noticed empirically that a two-stage PCANet is in general sufficient to achieve good performance and a deeper architecture does not necessarily lead to further improvement. Also, a larger block size for the local histograms provides more translation invariance in the extracted feature $f_i$.

2.1.4 Comparison with ConvNet and ScatNet

Clearly, PCANet shares some similarities with ConvNet [5]. The patch-mean removal in PCANet is reminiscent of local contrast normalization in ConvNet.³ This operation moves all the patches to be centered around the origin of the vector space, so that the learned PCA filters can better catch major variations in the data. In addition, PCA can be viewed as the simplest class of auto-encoders, which minimizes reconstruction error.

³ We have tested the PCANet without patch-mean removal and the performance degrades significantly.

The PCANet contains no non-linearity between or within stages, running contrary to the common wisdom of building deep learning networks; e.g., the absolute rectification layer in ConvNet [5] and the modulus layer in ScatNet [6], [10]. We have tested the PCANet with an absolute rectification layer added right after the first stage, but we did not observe any improvement in the final classification results. The reason could be that the use of quantization plus local histograms (in the output layer) already introduces sufficient invariance and robustness in the final feature.

The overall process prior to the output layer in PCANet is completely linear. One may wonder what happens if we merge the two stages into just one that has an equivalent number of PCA filters and size of receptive field. To be specific, one may be interested in how a single-stage PCANet with $L_1 L_2$ filters of size $(k_1 + k_2 - 1) \times (k_1 + k_2 - 1)$ would perform, in comparison to the two-stage PCANet we described in Section 2.1. We have experimented with such settings on faces and hand-written digits and observed that the two-stage PCANet outperforms this single-stage alternative in most cases; see the last rows of Tables 2, 9, and 10. In comparison to the filters learned by the single-stage alternative, the resulting two-stage PCA filters essentially have a low-rank factorization, possibly having a lower chance of overfitting the dataset. As for why we need the deep structure, from a computational perspective, the single-stage alternative requires learning filters with $L_1 L_2 (k_1 + k_2 - 1)^2$ variables, whereas the two-stage PCANet only learns filters with $L_1 k_1^2 + L_2 k_2^2$ variables in total. Another benefit of the two-stage PCANet is the larger receptive field, as it contains more holistic observations of the objects in images, and learning invariance from it can essentially capture more semantic information. Our comparative experiments validate that hierarchical architectures with large receptive fields and multiple stacked stages are more efficient in learning semantically related representations, which coincides with what has been observed in [7].
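As a quick sanity check on the variable counts just mentioned, the two parameterizations can be compared numerically; this small sketch (the variable names are ours) uses $k_1 = k_2 = 7$ and $L_1 = L_2 = 8$, the setting adopted for the MNIST experiments in Section 3.3.

```python
# Number of filter variables: single-stage alternative vs. two-stage PCANet,
# with k1 = k2 = 7 and L1 = L2 = 8 as in the MNIST experiments (Section 3.3).
k1 = k2 = 7
L1 = L2 = 8
single_stage = L1 * L2 * (k1 + k2 - 1) ** 2   # L1*L2 filters of size (k1+k2-1)^2
two_stage = L1 * k1 ** 2 + L2 * k2 ** 2       # L1 + L2 filters of size k1 x k2
print(single_stage, two_stage)                # 10816 vs. 784
```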
2.2 Computational Complexity

The components for constructing the PCANet are extremely basic and computationally efficient. To see how light the computational complexity of PCANet is, let us take the two-stage PCANet as an example. In each stage of PCANet, forming the patch-mean-removed matrix $X$ costs $k_1 k_2 + k_1 k_2 mn$ flops; the inner product $X X^T$ has a complexity of $2(k_1 k_2)^2 mn$ flops; and the complexity of the eigen-decomposition is $O((k_1 k_2)^3)$. The PCA filter convolution takes $L_i k_1 k_2 mn$ flops for stage $i$. In the output layer, the conversion of $L_2$ binary bits to a decimal number costs $2 L_2 mn$, and the naive histogram operation is of complexity $O(mn B L_2 \log 2)$. Assuming $mn \gg \max(k_1, k_2, L_1, L_2, B)$, the overall complexity of PCANet is easily verified to be

$$O(mn k_1 k_2 (L_1 + L_2) + mn (k_1 k_2)^2).$$

The above computational complexity applies to both the training phase and the testing phase of PCANet, as the extra computational burden of the training phase over the testing phase is the eigen-decomposition, whose complexity is negligible when $mn \gg \max(k_1, k_2, L_1, L_2, B)$.

In comparison to ConvNet, the SGD for filter learning is also a simple gradient-based optimization solver, but the overall training time is still much longer than that of PCANet. For example, training PCANet on around 100,000 images of 80×60 pixel dimension took only half an hour, but CNN-2 took 6 hours, excluding the fine-tuning process; see Section 3.1.1.D for details.

2.3 Two Variations: RandNet and LDANet

The PCANet is an extremely simple network, requiring only minimal learning of the filters from the training data. One could immediately think of two possible variations of the PCANet in two opposite directions:

1) We could further eliminate the need for training data and replace the PCA filters at each layer with random filters of the same size. To be more specific, the elements of the random filters $W_l^1$ and $W_l^2$ are generated following a standard Gaussian distribution. We call such a network a Random Network, or RandNet for short. It is natural to wonder how much such a randomly chosen network degrades in comparison with PCANet.

2) If the task of the learned network is classification, we could further enhance the supervision of the learned filters by incorporating the class-label information in the training data and learn the filters based on the idea of multi-class linear discriminant analysis (LDA). We call the so-composed network an LDA Network, or LDANet for ease of reference. Again, we are interested in how much the enhanced supervision would help improve the performance of the network.

To be more clear, we here describe in more detail how to construct the LDANet. Suppose that the $N$ training images are classified into $C$ classes $\{I_i\}_{i \in S_c}$, $c = 1, 2, \ldots, C$, where $S_c$ is the set of indices of images in class $c$, and the mean-removed patches associated with each image of the distinct classes, $\bar{X}_i \in \mathbb{R}^{k_1 k_2 \times mn}$, $i \in S_c$ (in the same spirit as $\bar{X}_i$ in (1)), are given. We can first compute the class mean $\Gamma_c$ and the intra-class variability $\Sigma_c$ of all the patches as follows,

$$\Gamma_c = \sum_{i \in S_c} \bar{X}_i / |S_c|, \quad (10)$$

$$\Sigma_c = \sum_{i \in S_c} (\bar{X}_i - \Gamma_c)(\bar{X}_i - \Gamma_c)^T / |S_c|. \quad (11)$$

Each column of $\Gamma_c$ denotes the mean of the patches around each pixel in class $c$, and $\Sigma_c$ is the sum of all the patch-wise sample covariances in class $c$. Likewise, the inter-class variability of the patches is defined as

$$\Phi = \sum_{c=1}^{C} (\Gamma_c - \Gamma)(\Gamma_c - \Gamma)^T / C, \quad (12)$$

where $\Gamma$ is the mean of the class means. The idea of LDA is to maximize the ratio of the inter-class variability to the sum of the intra-class variability within a family of orthonormal filters; i.e.,

$$\max_{V \in \mathbb{R}^{k_1 k_2 \times L_1}} \frac{\mathrm{Tr}(V^T \Phi V)}{\mathrm{Tr}\big(V^T (\sum_{c=1}^{C} \Sigma_c) V\big)}, \quad \text{s.t.} \ V^T V = I_{L_1}, \quad (13)$$

where $\mathrm{Tr}(\cdot)$ is the trace operator. The solution is known as the $L_1$ principal eigenvectors of $\tilde{\Phi} = (\sum_{c=1}^{C} \Sigma_c)^\dagger \Phi$, where the superscript $\dagger$ denotes the pseudo-inverse. The pseudo-inverse is used to deal with the case when $\sum_{c=1}^{C} \Sigma_c$ is not of full rank, though there might be another way of handling this with better numerical stability [16]. The LDA filters are thus expressed as $W_l^1 = \mathrm{mat}_{k_1,k_2}(q_l(\tilde{\Phi})) \in \mathbb{R}^{k_1 \times k_2}$, $l = 1, 2, \ldots, L_1$. A deeper network can be built by repeating the same process as above.
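The LDA filter construction just described (Eqs. (10)–(13)) can be sketched in a few lines of NumPy. This is our own illustration (the function name lda_filters and the toy data are not from the paper), assuming the mean-removed patch matrices of Eq. (1) and the class labels are already available.

```python
import numpy as np

def lda_filters(Xbar, labels, k1, k2, L):
    """LDA filters per Eqs. (10)-(13): leading eigenvectors of pinv(sum_c Sigma_c) @ Phi."""
    classes = np.unique(labels)
    Gammas, Sigma_sum = [], 0.0
    for c in classes:
        Xc = [X for X, y in zip(Xbar, labels) if y == c]
        Gamma_c = sum(Xc) / len(Xc)                                            # class mean, Eq. (10)
        Sigma_c = sum((X - Gamma_c) @ (X - Gamma_c).T for X in Xc) / len(Xc)   # Eq. (11)
        Gammas.append(Gamma_c)
        Sigma_sum = Sigma_sum + Sigma_c
    Gamma = sum(Gammas) / len(Gammas)                                          # mean of class means
    Phi = sum((G - Gamma) @ (G - Gamma).T for G in Gammas) / len(Gammas)       # Eq. (12)
    eigval, eigvec = np.linalg.eig(np.linalg.pinv(Sigma_sum) @ Phi)            # solution of Eq. (13)
    top = np.argsort(eigval.real)[::-1][:L]
    return [eigvec[:, l].real.reshape(k1, k2) for l in top]

# Toy usage: 6 "images" in 2 classes, each with 100 mean-removed 3x3 patches.
rng = np.random.default_rng(1)
Xbar = [rng.standard_normal((9, 100)) for _ in range(6)]
W = lda_filters(Xbar, labels=np.array([0, 0, 0, 1, 1, 1]), k1=3, k2=3, L=4)
print(len(W), W[0].shape)   # 4 filters of shape (3, 3)
```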
3 EXPERIMENTS

In this section, we evaluate the performance of the proposed PCANet and the two simple variations (RandNet and LDANet) on various tasks, including face recognition, face verification, hand-written digit recognition, texture discrimination, and object recognition.

3.1 Face Recognition on Many Datasets

We first focus on the problem of face recognition with one gallery image per person. We use part of the MultiPIE dataset to learn the PCA filters in PCANet, and then apply the trained PCANet to extract features of new subjects in the MultiPIE, Extended Yale B, AR, and FERET datasets for face recognition.
Fig. 5. Recognition rate of PCANet on MultiPIE cross-illumination test set, for different PCANet block size and
deformation to the test image. Two block sizes [8 6] and [12 9] for histogram aggregation are tested. (a) Simultaneous
translation in x and y directions. (b) Translation in x direction. (c) Translation in y direction. (d) In-plane rotation. (e)
Scale variation.
TABLE 4
Recognition rates (%) on AR dataset.
TABLE 2
Comparison of face recognition rates (%) of various methods on MultiPIE test sets. The filter size k1 = k2 = 5 is set in RandNet, PCANet, and LDANet unless specified otherwise.
differs from ours largely. These two works require some outside database to train the ConvNet, and the face images have to be more precisely aligned; e.g., [34] uses a 3-dimensional model for face alignment and [35] extracts multi-scale features based on detected landmark positions. On the contrary, we only trained PCANet on LFW-a [32], an aligned version of the LFW images produced with the commercial alignment system of face.com.

TABLE 6
Comparison of verification rates (%) on LFW under the unsupervised setting.

Methods               Accuracy
POEM [26]             82.70 ± 0.59
High-dim. LBP [36]    84.08
High-dim. LE [36]     84.58
SFRD [37]             84.81
I-LQP [27]            86.20 ± 0.46
OCLBP [33]            86.66 ± 0.30
PCANet-1              81.18 ± 1.99
PCANet-1 (sqrt)       82.55 ± 1.48
PCANet-2              85.20 ± 1.46
PCANet-2 (sqrt)       86.28 ± 1.14

3.3 Digit Recognition on MNIST Datasets

We now move forward to test the proposed PCANet, along with RandNet and LDANet, on MNIST [4] and the MNIST variations [38], a widely used benchmark for testing hierarchical representations. There are 9 classification tasks in total, as listed in Table 8. All the images are of size 28 × 28. In the following, we use MNIST basic as the dataset to investigate the influence of the number of filters and of different block overlap ratios for RandNet, PCANet, and LDANet, and then compare with other state-of-the-art methods on all the MNIST datasets.

3.3.1 Impact of the number of filters

We vary the number of filters in the first stage L1 from 2 to 12 for the one-stage networks. Regarding the two-stage networks, we set L2 = 8 and change L1 from 4 to 24. The filter size of the networks is k1 = k2 = 7, the block size is 7×7, and the overlapping region between blocks is half of the block size. The results are shown in Figure 8. The results are consistent with those for the MultiPIE face database in Figure 3; PCANet outperforms RandNet and LDANet in almost all cases.

3.3.2 Impact of the block overlap ratio

The number of filters is fixed to L1 = L2 = 8, the filter size is again k1 = k2 = 7, and the block size is 7×7. We only vary the block overlap ratio (BOR) from 0.1 to 0.7. Table 7 tabulates the results of RandNet-2, PCANet-2, and LDANet-2. Clearly, PCANet-2 and LDANet-2 achieve their minimum error rates for BOR equal to 0.5 and 0.6, respectively, and PCANet-2 performs the best under all conditions.

TABLE 7
Error rates (%) of PCANet-2 on basic dataset for varying block overlap ratios (BORs).

BOR        0.1   0.2   0.3   0.4   0.5   0.6   0.7
RandNet-2  1.31  1.35  1.23  1.34  1.18  1.14  1.24
PCANet-2   1.12  1.12  1.07  1.06  1.06  1.02  1.05
LDANet-2   1.14  1.14  1.11  1.05  1.05  1.05  1.06

3.3.3 Comparison with the state of the art

We compare RandNet, PCANet, and LDANet with ConvNet [5], 2-stage ScatNet (ScatNet-2) [6], and other existing methods. In ScatNet, the number of scales and the number of orientations are set to 3 and 8, respectively. Regarding the parameters of PCANet, we set the filter size k1 = k2 = 7 and the number of PCA filters L1 = L2 = 8; the block size is tuned by cross-validation for MNIST, and by the validation sets for the MNIST variations.⁹ The overlapping region between blocks is half of the block size. Unless otherwise specified, we use a linear SVM classifier for ScatNet as well as for RandNet, PCANet, and LDANet in the 9 classification tasks.

⁹ Using either cross-validation or a validation set, the optimal block size is obtained as 7×7 for MNIST, basic, and rect-img, 4×4 for rot, bg-img, bg-rnd, and bg-img-rot, 14×14 for rect, and 28×28 for convex.
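The classification step just mentioned (a linear SVM on top of the extracted features) can be sketched as follows. scikit-learn's LinearSVC is our choice of implementation, since the paper specifies only a linear SVM, and the feature matrices below are random stand-ins for real PCANet features.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Stand-in feature matrices; in practice each row would be the PCANet feature
# vector of one digit image (e.g., from the sketch in Section 2.1), with its label in y.
rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, 4096))
y_train = rng.integers(0, 10, size=200)
X_test = rng.standard_normal((50, 4096))

clf = LinearSVC(C=1.0)          # default regularization; the paper does not specify C
clf.fit(X_train, y_train)
print(clf.predict(X_test)[:5])  # predicted digit labels for the first five test images
```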
The testing error rates of the various methods on MNIST are shown in Table 9. For fair comparison, we do not include the results of methods using training samples augmented with distortions or other information, for which the best known result is 0.23% [39]. We see that RandNet-2, PCANet-2, and LDANet-2 are comparable with the state-of-the-art methods on this standard MNIST task. However, as MNIST has many training data, all methods perform very well and very close to each other – the difference is not so statistically meaningful.

Accordingly, we also report results of the different methods on the MNIST variations in Table 10. To the best of our knowledge, PCANet-2 achieves the state-of-the-art results for four out of the eight remaining tasks: basic, bg-img, bg-img-rot, and convex. Especially for bg-img, the error rate reduces from 12.25% [40] to 10.95%.

Table 10 also shows the result of PCANet-1 with L1 L2 filters of size (k1 + k2 − 1) × (k1 + k2 − 1). The PCANet-1 with such a parameter setting is meant to mimic the reported PCANet-2 in a single-stage structure. PCANet-2 still outperforms this PCANet-1 alternative.

Furthermore, we also draw the learned PCANet filters in Figure 9 and Figure 10. An intriguing pattern is observed in the filters for the rect and rect-img datasets. For rect, we can see both horizontal and vertical stripes, as these patterns attempt to capture the edges of the rectangles. When there is some image background in rect-img, several filters become low-pass, in order to secure the responses from the background images.
TABLE 8
Details of the 9 classification tasks on MNIST and MNIST variations.
Data Sets    Description                                      Num. of classes   Train-Valid-Test
MNIST        Standard MNIST                                   10                60000-0-10000
basic        Smaller subset of MNIST                          10                10000-2000-50000
rot          MNIST with rotation                              10                10000-2000-50000
bg-rand      MNIST with noise background                      10                10000-2000-50000
bg-img       MNIST with image background                      10                10000-2000-50000
bg-img-rot   MNIST with rotation and image background         10                10000-2000-50000
rect         Discriminate between tall and wide rectangles    2                 1000-200-50000
rect-img     Dataset rect with image background               2                 10000-2000-50000
convex       Discriminate between convex and concave shapes   2                 6000-2000-50000
TABLE 10
Comparison of testing error rates (%) of the various methods on MNIST variations.

TABLE 9
Comparison of error rates (%) of the methods on MNIST.

[Fig. 8. Error rate (%) on MNIST basic for varying numbers of filters; the legend includes RandNet-1, PCANet-1, and RandNet-2 (L2 = 8).]

[Fig. 10 panels, one per dataset: basic, rot, bg-rand, bg-img, bg-img-rot, rect, rect-img, convex.]
Fig. 10. The PCANet filters learned on various MNIST datasets. For each dataset, the top row shows the filters of the
first stage; the bottom row shows the filters of the second stage.
TABLE 12
Comparison of accuracy (%) of the methods on CIFAR10
with no data augmentation.
Methods Accuracy
Tiled CNN [54] 73.10
Improved LCC [55] 74.50
KDES-A [56] 76.00
K-means (Triangle, 4000 features) [57] 79.60
Cuda-convnet2 [58] 82.00
Stochastic pooling ConvNet [44] 84.87
CNN + Spearmint [59] 85.02
Conv. Maxout + Dropout [3] 88.32
NIN [60] 89.59
PCANet-2 77.14
PCANet-2 (combined) 78.67
larger scale datasets or problems.

Regardless, the extensive experiments given in this paper sufficiently establish two facts: 1) the PCANet is a very simple deep learning network, which effectively extracts useful information for the classification of faces, digits, and texture images; 2) the PCANet can be a valuable baseline for studying advanced deep learning architectures for large-scale image classification tasks.

REFERENCES

[1] G. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[2] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: a review and new perspectives," IEEE TPAMI, vol. 35, no. 8, pp. 1798–1828, 2013.
[3] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, "Maxout networks," in ICML, 2013.
[4] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[5] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, "What is the best multi-stage architecture for object recognition?" in ICCV, 2009.
[6] J. Bruna and S. Mallat, "Invariant scattering convolution networks," IEEE TPAMI, vol. 35, no. 8, pp. 1872–1886, 2013.
[7] H. Lee, R. Grosse, R. Ranganath, and A. Ng, "Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations," in ICML, 2009.
[8] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012.
[9] K. Kavukcuoglu, P. Sermanet, Y. Boureau, K. Gregor, M. Mathieu, and Y. LeCun, "Learning convolutional feature hierarchies for visual recognition," in NIPS, 2010.
[10] L. Sifre and S. Mallat, "Rotation, scaling and deformation invariant scattering for texture discrimination," in CVPR, 2013.
[11] C. J. C. Burges, J. C. Platt, and S. Jana, "Distortion discriminant analysis for audio fingerprinting," IEEE TSAP, vol. 11, no. 3, pp. 165–174, 2003.
[12] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[13] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in IEEE CVPR, 2005.
[14] L. Fei-Fei and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in IEEE CVPR, 2005.
[15] C. Liu and H. Wechsler, "Gabor feature based classification using the enhanced Fisher linear discriminant model for face recognition," IEEE TIP, vol. 11, no. 4, pp. 467–476, 2002.
[16] H. Yu and J. Yang, "A direct LDA algorithm for high-dimensional data—with application to face recognition," Pattern Recognition, vol. 34, no. 10, pp. 2067–2069, 2001.
[17] R. Gross, I. Matthews, and S. Baker, "Multi-PIE," in IEEE Conference on Automatic Face and Gesture Recognition, 2008.
[18] T. Ahonen, A. Hadid, and M. Pietikainen, "Face description with local binary patterns: application to face recognition," IEEE TPAMI, vol. 28, no. 12, pp. 2037–2041, 2006.
[19] Y. Jia, "Caffe: An open source convolutional architecture for fast feature embedding," http://caffe.berkeleyvision.org/, 2013.
[20] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, "From few to many: Illumination cone models for face recognition under variable lighting and pose," IEEE TPAMI, vol. 23, no. 6, pp. 643–660, June 2001.
[21] X. Tan and B. Triggs, "Enhanced local texture feature sets for face recognition under difficult lighting conditions," IEEE TIP, vol. 19, no. 6, pp. 1635–1650, 2010.
[22] A. Martinez and R. Benavente, "The AR face database," CVC Technical Report 24, 1998.
[23] W. Deng, J. Hu, and J. Guo, "Extended SRC: Undersampled face recognition via intraclass variant dictionary," IEEE TPAMI, vol. 34, no. 9, pp. 1864–1870, Sept. 2012.
[24] P. J. Phillips, H. Wechsler, J. Huang, and P. J. Rauss, "The FERET database and evaluation procedure for face-recognition algorithms," Image Vision Comput., vol. 16, no. 5, pp. 295–306, 1998.
[25] J. Lu, Y.-P. Tan, and G. Wang, "Discriminative multi-manifold analysis for face recognition from a single training sample per person," IEEE TPAMI, vol. 35, no. 1, pp. 39–51, 2013.
[26] N.-S. Vu and A. Caplier, "Enhanced patterns of oriented edge magnitudes for face recognition and image matching," IEEE TIP, vol. 21, no. 3, pp. 1352–1368, 2012.
[27] S. Hussain, T. Napoleon, and F. Jurie, "Face recognition using local quantized patterns," in BMVC, 2012.
[28] S. Xie, S. Shan, X. Chen, and J. Chen, "Fusing local patterns of Gabor magnitude and phase for face recognition," IEEE TIP, vol. 19, no. 5, pp. 1349–1361, 2010.
[29] N.-S. Vu, "Exploring patterns of gradient orientations and magnitudes for face recognition," IEEE Trans. Information Forensics and Security, vol. 8, no. 2, pp. 295–304, 2013.
[30] Z. Chai, Z. Sun, H. Méndez-Vázquez, R. He, and T. Tan, "Gabor ordinal measures for face recognition," IEEE Trans. Information Forensics and Security, vol. 9, no. 1, pp. 14–26, Jan. 2014.
[31] G. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, "Labeled faces in the wild: a database for studying face recognition in unconstrained environments," Technical Report 07-49, University of Massachusetts, Amherst, 2007.
[32] L. Wolf, T. Hassner, and Y. Taigman, "Effective face recognition by combining multiple descriptors and learned background statistics," IEEE TPAMI, vol. 33, no. 10, 2011.
[33] O. Barkan, J. Weill, L. Wolf, and H. Aronowitz, "Fast high dimensional vector multiplication face recognition," in IEEE ICCV, 2013.
[34] Y. Taigman, M. Yang, M. A. Ranzato, and L. Wolf, "DeepFace: Closing the gap to human-level performance in face verification," in IEEE CVPR, 2014.
[35] H. Fan, Z. Cao, Y. Jiang, and Q. Yin, "Learning deep face representation," arXiv:1403.2802v1, 2014.
[36] D. Chen, X. Cao, F. Wen, and J. Sun, "Blessing of dimensionality: high-dimensional feature and its efficient compression for face verification," in CVPR, 2013.
[37] Z. Cui, W. Li, D. Xu, S. Shan, and X. Chen, "Fusing robust face region descriptors via multiple metric learning for face recognition in the wild," in CVPR, 2013.
[38] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio, "An empirical evaluation of deep architectures on problems with many factors of variation," in ICML, 2007.
[39] D. Ciresan, U. Meier, and J. Schmidhuber, "Multi-column deep neural networks for image classification," in CVPR, 2012.
[40] K. Sohn, G. Zhou, C. Lee, and H. Lee, "Learning and selecting features jointly with point-wise gated Boltzmann machines," in ICML, 2013.
[41] K. Yu, Y. Lin, and J. Lafferty, "Learning image representations from the pixel level via hierarchical sparse coding," in CVPR, 2011.
[42] S. Belongie, J. Malik, and J. Puzicha, "Shape matching and object recognition using shape contexts," IEEE TPAMI, vol. 24, no. 4, pp. 509–522, 2002.
[43] D. Keysers, T. Deselaers, C. Gollan, and H. Ney, "Deformation models for image recognition," IEEE TPAMI, vol. 29, no. 8, pp. 1422–1435, 2007.
[44] M. D. Zeiler and R. Fergus, "Stochastic pooling for regularization of deep convolutional neural networks," in ICLR, 2013.
[45] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio, "Contractive auto-encoders: explicit invariance during feature extraction," in ICML, 2011.
[46] K. Sohn and H. Lee, "Learning invariant representations with local transformations," in ICML, 2012.
[47] M. Varma and A. Zisserman, "A statistical approach to material classification using image patch exemplars," IEEE TPAMI, vol. 31, no. 11, pp. 2032–2047, 2009.
[48] E. Hayman, B. Caputo, M. Fritz, and J. O. Eklundh, "On the significance of real-world conditions for material classification," in ECCV, 2004.
[49] M. Crosier and L. Griffin, "Using basic image features for texture classification," IJCV, pp. 447–460, 2010.
[50] R. E. Broadhurst, "Statistical estimation of histogram variation for texture classification," in Proc. Workshop on Texture Analysis and Synthesis, 2006.
[51] K. Grauman and T. Darrell, "The pyramid match kernel: Discriminative classification with sets of image features," in ICCV, 2005.
[52] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: spatial pyramid matching for recognizing natural scene categories," in CVPR, 2006.
[53] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in ECCV, 2014.
[54] Q. V. Le, J. Ngiam, Z. Chen, D. Chia, P. W. Koh, and A. Y. Ng, "Tiled convolutional neural networks," in NIPS, 2010.
[55] K. Yu and T. Zhang, "Improved local coordinate coding using local tangents," in ICML, 2010.
[56] L. Bo, X. Ren, and D. Fox, "Kernel descriptors for visual recognition," in NIPS, 2010.
[57] A. Coates, H. Lee, and A. Y. Ng, "An analysis of single-layer networks in unsupervised feature learning," in NIPS Workshop, 2010.
[58] A. Krizhevsky, "cuda-convnet," http://code.google.com/p/cuda-convnet/, July 18, 2014.
[59] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," in NIPS, 2012.
[60] M. Lin, Q. Chen, and S. Yan, "Network in network," arXiv:1312.4400v3, 2014.

Kui Jia received the B.Eng. degree in marine engineering from Northwestern Polytechnical University, China, in 2001, the M.Eng. degree in electrical and computer engineering from the National University of Singapore in 2003, and the Ph.D. degree in computer science from Queen Mary, University of London, London, U.K., in 2007. He is currently a Research Scientist at the Advanced Digital Sciences Center. His research interests are in computer vision, machine learning, and image processing.

Shenghua Gao received the B.E. degree from the University of Science and Technology of China in 2008, and the Ph.D. degree from Nanyang Technological University in 2013. He is currently a postdoctoral fellow at the Advanced Digital Sciences Center, Singapore. He was awarded the Microsoft Research Fellowship in 2010. His research interests include computer vision and machine learning.

Jiwen Lu is currently a research scientist at the Advanced Digital Sciences Center (ADSC), Singapore. His research interests include computer vision, pattern recognition, machine learning, and biometrics. He has authored/co-authored more than 70 scientific papers in peer-reviewed journals and conferences, including top venues such as TPAMI, TIP, CVPR, and ICCV. He is a member of the IEEE.

Yi Ma (F'13) is a professor in the School of Information Science and Technology of ShanghaiTech University. He received his Bachelor's degree in Automation and Applied Mathematics from Tsinghua University, China, in 1995. He received his M.S. degree in EECS in 1997, M.A. degree in Mathematics in 2000, and his Ph.D. degree in EECS in 2000, all from UC Berkeley. From 2000 to 2011, he was an associate professor in the ECE Department of the University of Illinois at Urbana-Champaign, where he now holds an adjunct position. From 2009 to early 2014, he was a principal researcher and manager of the visual computing group at Microsoft Research Asia. His main research areas are in computer vision and high-dimensional data analysis. Yi Ma was the recipient of the David Marr Best Paper Prize at ICCV 1999 and Honorable Mention for the Longuet-Higgins Best Paper Award at ECCV 2004. He received the CAREER Award from the National Science Foundation in 2004 and the Young Investigator Program Award from the Office of Naval Research in 2005. He has been an associate editor for IJCV, SIIMS, IEEE Trans. PAMI, and Information Theory.