
PCANet: A Simple Deep Learning Baseline for Image Classification?
Tsung-Han Chan, Kui Jia, Shenghua Gao, Jiwen Lu, Zinan Zeng, and Yi Ma

Abstract—In this work, we propose a very simple deep learning network for image classification which comprises only the very
basic data processing components: cascaded principal component analysis (PCA), binary hashing, and block-wise histograms. In the
proposed architecture, PCA is employed to learn multistage filter banks. It is followed by simple binary hashing and block histograms
for indexing and pooling. This architecture is thus named as a PCA network (PCANet) and can be designed and learned extremely
easily and efficiently. For comparison and better understanding, we also introduce and study two simple variations to the PCANet,
namely the RandNet and LDANet. They share the same topology of PCANet but their cascaded filters are either selected randomly
or learned from LDA. We have tested these basic networks extensively on many benchmark visual datasets for different tasks, such
as LFW for face verification, MultiPIE, Extended Yale B, AR, FERET datasets for face recognition, as well as MNIST for hand-written
digits recognition. Surprisingly, for all tasks, such a seemingly naive PCANet model is on par with the state of the art features, either
prefixed, highly hand-crafted or carefully learned (by DNNs). Even more surprisingly, it sets new records for many classification tasks
in Extended Yale B, AR, FERET datasets, and MNIST variations. Additional experiments on other public datasets also demonstrate the
potential of the PCANet serving as a simple but highly competitive baseline for texture classification and object recognition.

Index Terms—Convolutional Neural Network, Deep Learning, PCA Network, Random Network, LDA Network, Face Recognition, Handwritten Digit Recognition, Object Classification.

• Tsung-Han Chan, Kui Jia, Shenghua Gao, Jiwen Lu, and Zinan Zeng are with the Advanced Digital Sciences Center (ADSC), Singapore.
• Yi Ma is with the School of Information Science and Technology of ShanghaiTech University, and with the ECE Department of the University of Illinois at Urbana-Champaign.

1 INTRODUCTION

Image classification based on visual content is a very challenging task, largely because there is usually a large amount of intra-class variability, arising from different lightings, misalignment, non-rigid deformations, occlusion and corruptions. Numerous efforts have been made to counter the intra-class variability by manually designing low-level features for the classification tasks at hand. Representative examples are Gabor features and local binary patterns (LBP) for texture and face classification, and SIFT and HOG features for object recognition. While the low-level features can be hand-crafted with great success for some specific data and tasks, designing effective features for new data and tasks usually requires new domain knowledge since most hand-crafted features cannot be simply adopted to new conditions [1], [2].

Learning features from the data of interest is considered a plausible way to remedy the limitation of hand-crafted features. An example of such methods is learning through deep neural networks (DNNs), which has drawn significant attention recently [1]. The idea of deep learning is to discover multiple levels of representation, with the hope that higher-level features represent more abstract semantics of the data. Such abstract representations learned from a deep network are expected to provide more invariance to intra-class variability. One key ingredient for the success of deep learning in image classification is the use of convolutional architectures [3]–[10]. A convolutional deep neural network (ConvNet) architecture [3]–[5], [8], [9] consists of multiple trainable stages stacked on top of each other, followed by a supervised classifier. Each stage generally comprises "three layers" – a convolutional filter bank layer, a nonlinear processing layer, and a feature pooling layer. To learn a filter bank in each stage of a ConvNet, a variety of techniques has been proposed, such as restricted Boltzmann machines (RBM) [7] and regularized auto-encoders or their variations; see [2] for a review and references therein. In general, such a network is typically learned by the stochastic gradient descent (SGD) method. However, learning a network useful for classification critically depends on expertise in parameter tuning and some ad hoc tricks.

While many variations of deep convolutional networks have been proposed for different vision tasks and their success is usually justified empirically, arguably the first instance that has led to clear mathematical justification is the wavelet scattering networks (ScatNet) [6], [10]. The only difference there is that the convolutional filters in ScatNet are prefixed – they are simply wavelet operators, hence no learning is needed at all. Somewhat surprisingly, such a prefixed filter bank, once utilized in a similar multistage architecture of ConvNet or DNNs, has demonstrated superior performance over ConvNet and DNNs in several challenging vision tasks such as handwritten digit and texture recognition [6], [10]. However, as we will see in this paper, such a prefixed architecture does not generalize so well to tasks

such as face recognition, where the intra-class variability includes significant illumination change and corruption.

Fig. 1. Illustration of how the proposed PCANet extracts features from an image through three simplest processing components: PCA filters, binary hashing, and histogram.

1.1 Motivations

An initial motivation of our study is trying to resolve some apparent discrepancies between ConvNet and ScatNet. We want to achieve two simple goals: First, we want to design a simple deep learning network which should be very easy, even trivial, to train and to adapt to different data and tasks. Second, such a basic network could serve as a good baseline for people to empirically justify the use of more advanced processing components or more sophisticated architectures for their deep learning networks.

The solution comes as no surprise: We use the most basic and easy operations to emulate the processing layers in a typical (convolutional) neural network mentioned above: The data-adapting convolution filter bank in each stage is chosen to be the most basic PCA filters; the nonlinear layer is set to be the simplest binary quantization (hashing); for the feature pooling layer, we simply use the block-wise histograms of the binary codes, which are considered as the final output features of the network. For ease of reference, we name such a data-processing network a PCA Network (PCANet). As an example, Figure 1 illustrates how a two-stage PCANet extracts features from an input image.

At least one characteristic of the PCANet model seems to challenge common wisdom in building a deep learning network such as ConvNet [4], [5], [8] and ScatNet [6], [10]: there are no nonlinear operations in early stages of the PCANet until the very last output layer, where binary hashing and histograms are conducted to compute the output features. Nevertheless, as we will see through extensive experiments, such drastic simplification does not seem to undermine performance of the network on some of the typical datasets.

A network closely related to PCANet could be two-stage oriented PCA (OPCA), which was first proposed for audio processing [11]. Noticeable differences from PCANet lie in that OPCA does not couple with hashing and local histograms in the output layer. Given the covariance of noises, OPCA gains additional robustness to noises and distortions. The baseline PCANet could also incorporate the merit of OPCA, likely offering more invariance to intra-class variability. To this end, we have also explored a supervised extension of PCANet, in which we replace the PCA filters with filters learned from linear discriminant analysis (LDA), called LDANet. As we will see through extensive experiments, the additional discriminative information does not seem to improve performance of the network; see Sections 2.3 and 3. Another, somewhat extreme, variation to PCANet is to replace the PCA filters with totally random filters (say the filter entries are i.i.d. Gaussian variables), called RandNet. In this work, we conducted extensive experiments and fair comparisons of these types of networks with other existing networks such as ConvNet and ScatNet. We hope our experiments and observations will help people gain a better understanding of these different networks.

1.2 Contributions

Although our initial intention of studying the simple PCANet architecture is to have a simple baseline for comparing and justifying other more advanced deep learning components or architectures, our findings lead to some pleasant but thought-provoking surprises: The very basic PCANet, in fair experimental comparison, is already quite on par with, and often better than, the state-of-the-art features (prefixed, hand-crafted, or learned from DNNs) for almost all image classification tasks, including face images, hand-written digits, texture images, and object images. More specifically, for face recognition with one gallery image per person, it achieves 99.58% accuracy on the Extended Yale B dataset, and over 95% accuracy on the across-disguise/illumination subsets of the AR dataset. On the FERET dataset, it obtains the state-of-the-art average accuracy of 97.25% and achieves the best accuracy of 95.84% and 94.02% in the Dup-1 and Dup-2 subsets, respectively.1 On the LFW dataset, it achieves a competitive 86.28% face verification accuracy under the "unsupervised setting". On the MNIST datasets, it achieves the state-of-the-art results for subtasks such as basic, background random, and background image. See Section 3 for more details. Overwhelming empirical evidence demonstrates the effectiveness of the proposed PCANet in learning robust invariant features for various image classification tasks.

1. The results were obtained by following the FERET standard training CD, and could be marginally better when the PCANet is trained on the MultiPIE database.

The method hardly contains any deep or new tech- eigenvectors capture the main variation of all the mean-
niques and our study so far is entirely empirical.2 Never- removed training patches. Of course, similar to DNN or
theless, a thorough report on such a baseline system has ScatNet, we can stack multiple stages of PCA filters to
tremendous value to the deep learning and visual recog- extract higher level features.
nition community, sending both sobering and encouraging
messages: On one hand, for future study, PCANet can 2.1.2 The second stage: PCA
serve as a simple but surprisingly competitive baseline
Almost repeating the same process as the first stage. Let
to empirically justify any advanced designs of multistage
the lth filter output of the first stage be
features or networks. On the other hand, the empirical
success of PCANet (even that of RandNet) confirms .
Iil = Ii ∗ Wl1 , i = 1, 2, ..., N, (4)
again certain remarkable benefits from cascaded feature
learning or extraction architectures. Even more impor- where ∗ denotes 2D convolution, and the boundary
tantly, since PCANet consists of only a (cascaded) linear of Ii is zero-padded before convolving with Wl1 so
map, followed by binary hashing and block histograms, as to make Iil having the same size of Ii . Like the
it is now amenable to mathematical analysis and jus- first stage, we can collect all the overlapping patches
tification of its effectiveness. That could lead to funda- of Iil , subtract patch mean from each patch, and form
mental theoretical insights about general deep networks, Ȳil = [ȳi,l,1 , ȳi,l,2 , ..., ȳi,l,mn ] ∈ Rk1 k2 ×mn , where ȳi,l,j is
which seems in urgent need for deep learning nowadays. the jth mean-removed patch in Iil . We further define
Y l = [Ȳ1l , Ȳ21 , ..., ȲNl ] ∈ Rk1 k2 ×N mn for the matrix col-
2 C ASCADED L INEAR N ETWORKS lecting all mean-removed patches of the lth filter output,
and concatenate Y l for all the filter outputs as
2.1 Structures of the PCA Network (PCANet)
Suppose that we are given N input training images Y = [Y 1 , Y 2 , ..., Y L1 ] ∈ Rk1 k2 ×L1 N mn . (5)
{Ii }N
i=1 of size m × n, and assume that the patch size
(or 2D filter size) is k1 × k2 at all stages. The proposed The PCA filters of the second stage are then obtained as
PCANet model is illustrated in Figure 2, and only the .
W`2 = matk1 ,k2 (q` (Y Y T )) ∈ Rk1 ×k2 , ` = 1, 2, ..., L2 . (6)
PCA filters need to be learned from the input images
{Ii }N
i=1 . In what follows, we describe each component For each input Iil of the second stage, we will have L2
of the block diagram more precisely. outputs, each convolves Iil with W`2 for ` = 1, 2, ..., L2 :
.
2.1.1 The first stage: PCA Oil = {Iil ∗ W`2 }L
`=1 .
2
(7)
Around each pixel, we take a k1 × k2 patch, and we The number of outputs of the second stage is L1 L2 . One
collect all (overlapping) patches of the ith image; i.e., can simply repeat the above process to build more (PCA)
xi,1 , xi,2 , ..., xi,mn ∈ Rk1 k2 where each xi,j denotes the stages if a deeper architecture is found to be beneficial.
jth vectorized patch in Ii . We then subtract patch mean
from each patch and obtain X̄i = [x̄i,1 , x̄i,2 , ..., x̄i,mn ],
where x̄i,j is a mean-removed patch. By constructing 2.1.3 Output stage: hashing and histogram
the same matrix for all input images and putting them For each of the L1 input images Iil for the second
together, we get stage, it has L2 real-valued outputs {Iil ∗ W`2 }L
`=1 from
2

the second stage. We binarize these outputs and get


X = [X̄1 , X̄2 , ..., X̄N ] ∈ Rk1 k2 ×N mn . (1) {H(Iil ∗ W`2 )}L
`=1 , where H(·) is a Heaviside step (like)
2

Assuming that the number of filters in layer i is Li , PCA function whose value is one for positive entries and zero
minimizes the reconstruction error within a family of otherwise.
orthonormal filters, i.e., Around each pixel, we view the vector of L2 binary
bits as a decimal number. This converts the L2 outputs
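To make the first stage concrete, the sketch below shows one way Eqs. (1)–(3) could be implemented with NumPy. The helper name extract_patches, the zero-padding of the image boundary, and the row-major reshape of each eigenvector are our own illustrative assumptions, not details taken from the paper or its released code.

```python
import numpy as np

def extract_patches(img, k1, k2):
    """Collect all overlapping k1 x k2 patches around every pixel of a
    zero-padded image; one vectorized patch per column (k1*k2 x m*n)."""
    m, n = img.shape
    padded = np.pad(img, ((k1 // 2, k1 // 2), (k2 // 2, k2 // 2)))
    cols = np.empty((k1 * k2, m * n))
    for i in range(m):
        for j in range(n):
            cols[:, i * n + j] = padded[i:i + k1, j:j + k2].ravel()
    return cols

def pca_filters(images, k1, k2, L1):
    """First-stage PCA filters of Eqs. (1)-(3): leading eigenvectors of X X^T."""
    blocks = []
    for img in images:
        P = extract_patches(img, k1, k2)
        blocks.append(P - P.mean(axis=0, keepdims=True))   # remove each patch's mean
    X = np.hstack(blocks)                                   # X in Eq. (1), k1k2 x N*m*n
    eigval, eigvec = np.linalg.eigh(X @ X.T)                # small k1k2 x k1k2 problem
    top = eigvec[:, np.argsort(eigval)[::-1][:L1]]          # L1 principal eigenvectors
    return [top[:, l].reshape(k1, k2) for l in range(L1)]   # W_l^1 = mat_{k1,k2}(q_l(XX^T))
```

The second-stage filters would follow by applying the same procedure to the mean-removed patch matrices collected from all first-stage filter outputs, as in Eqs. (4)–(6) below.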
2.1.2 The second stage: PCA

We almost repeat the same process as in the first stage. Let the lth filter output of the first stage be

I_i^l = I_i ∗ W_l^1,  i = 1, 2, ..., N,   (4)

where ∗ denotes 2D convolution, and the boundary of I_i is zero-padded before convolving with W_l^1 so as to make I_i^l have the same size as I_i. Like the first stage, we can collect all the overlapping patches of I_i^l, subtract the patch mean from each patch, and form Ȳ_i^l = [ȳ_{i,l,1}, ȳ_{i,l,2}, ..., ȳ_{i,l,mn}] ∈ R^{k1 k2 × mn}, where ȳ_{i,l,j} is the jth mean-removed patch in I_i^l. We further define Y^l = [Ȳ_1^l, Ȳ_2^l, ..., Ȳ_N^l] ∈ R^{k1 k2 × N mn} as the matrix collecting all mean-removed patches of the lth filter output, and concatenate Y^l over all the filter outputs as

Y = [Y^1, Y^2, ..., Y^{L1}] ∈ R^{k1 k2 × L1 N mn}.   (5)

The PCA filters of the second stage are then obtained as

W_ℓ^2 = mat_{k1,k2}(q_ℓ(Y Y^T)) ∈ R^{k1 × k2},  ℓ = 1, 2, ..., L2.   (6)

For each input I_i^l of the second stage, we will have L2 outputs, each of which convolves I_i^l with W_ℓ^2 for ℓ = 1, 2, ..., L2:

O_i^l = {I_i^l ∗ W_ℓ^2}_{ℓ=1}^{L2}.   (7)

The number of outputs of the second stage is L1 L2. One can simply repeat the above process to build more (PCA) stages if a deeper architecture is found to be beneficial.

2.1.3 Output stage: hashing and histogram

Each of the L1 input images I_i^l of the second stage has L2 real-valued outputs {I_i^l ∗ W_ℓ^2}_{ℓ=1}^{L2} from the second stage. We binarize these outputs and get {H(I_i^l ∗ W_ℓ^2)}_{ℓ=1}^{L2}, where H(·) is a Heaviside step (like) function whose value is one for positive entries and zero otherwise.

Around each pixel, we view the vector of L2 binary bits as a decimal number. This converts the L2 outputs in O_i^l back into a single integer-valued "image":

T_i^l = Σ_{ℓ=1}^{L2} 2^{ℓ−1} H(I_i^l ∗ W_ℓ^2),   (8)

whose every pixel is an integer in the range [0, 2^{L2} − 1]. The order and weights for the L2 outputs are irrelevant, as we here treat each integer as a distinct "word."

For each of the L1 images T_i^l, l = 1, ..., L1, we partition it into B blocks. We compute the histogram (with 2^{L2} bins) of the decimal values in each block, and concatenate all the B histograms into one vector, denoted Bhist(T_i^l).

Fig. 2. The detailed block diagram of the proposed (two-stage) PCANet: input layer, first stage (patch-mean removal and PCA filter convolution), second stage (patch-mean removal and PCA filter convolution), and output layer (binary quantization and mapping, concatenated image and block-wise histogram).

After this encoding process, the "feature" of the input image I_i is then defined to be the set of block-wise histograms; i.e.,

f_i = [Bhist(T_i^1), ..., Bhist(T_i^{L1})]^T ∈ R^{(2^{L2}) L1 B}.   (9)

The local blocks can be either overlapping or non-overlapping, depending on the application. Our empirical experience suggests that non-overlapping blocks are suitable for face images, whereas overlapping blocks are appropriate for hand-written digits, textures, and object images. Furthermore, the histogram offers some degree of translation invariance in the extracted features, as in hand-crafted features (e.g., the scale-invariant feature transform (SIFT) [12] or histograms of oriented gradients (HOG) [13]), learned features (e.g., the bag-of-words (BoW) model [14]), and the average or maximum pooling process in ConvNet [3]–[5], [8], [9].
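The following sketch illustrates the output stage of Eqs. (8)–(9) for a single first-stage map, assuming non-overlapping blocks and using scipy.signal.convolve2d for the zero-padded convolutions; the function name and block-partition convention are hypothetical, not taken from the paper.

```python
import numpy as np
from scipy.signal import convolve2d

def hash_and_histogram(I_l, W2, block_h, block_w):
    """Binarize the L2 second-stage responses of one first-stage map I_l into the
    decimal image T of Eq. (8), then pool it into block-wise histograms Bhist(T)."""
    L2 = len(W2)
    T = np.zeros(I_l.shape, dtype=int)
    for ell, W in enumerate(W2):
        response = convolve2d(I_l, W, mode='same')        # zero-padded 2D convolution
        T += (2 ** ell) * (response > 0).astype(int)      # Heaviside step + weight 2^(l-1)
    m, n = T.shape
    hists = []
    for bi in range(0, m - block_h + 1, block_h):         # non-overlapping blocks
        for bj in range(0, n - block_w + 1, block_w):
            block = T[bi:bi + block_h, bj:bj + block_w]
            hists.append(np.bincount(block.ravel(), minlength=2 ** L2))
    return np.concatenate(hists)
```

The final feature f_i of Eq. (9) would then be obtained by concatenating this vector over all L1 first-stage maps.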
The model parameters of PCANet include the filter size k1, k2, the number of filters in each stage L1, L2, the number of stages, and the block size for the local histograms in the output layer. PCA filter banks require that k1 k2 ≥ L1, L2. In our experiments in Section 3, we always set L1 = L2 = 8, inspired by the common setting of Gabor filters [15] with 8 orientations, although some fine-tuned L1, L2 could lead to marginal performance improvement. Moreover, we have noticed empirically that a two-stage PCANet is in general sufficient to achieve good performance and a deeper architecture does not necessarily lead to further improvement. Also, a larger block size for the local histograms provides more translation invariance in the extracted feature f_i.
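As a quick sanity check of the feature dimension in Eq. (9), one can plug in the face-recognition setting used later in Section 3.1.1 (80×60 images, L1 = L2 = 8, non-overlapping 8×6 blocks); the block count and dimension computed below assume that the blocks tile the image exactly.

```python
L1, L2 = 8, 8
m, n = 80, 60                     # image size used in Section 3.1.1
bh, bw = 8, 6                     # non-overlapping histogram blocks
B = (m // bh) * (n // bw)         # 10 * 10 = 100 blocks
dim = (2 ** L2) * L1 * B          # 256 * 8 * 100 = 204,800-dimensional feature f_i
print(B, dim)
```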
2.1.4 Comparison with ConvNet and ScatNet

Clearly, PCANet shares some similarities with ConvNet [5]. The patch-mean removal in PCANet is reminiscent of local contrast normalization in ConvNet.3 This operation moves all the patches to be centered around the origin of the vector space, so that the learned PCA filters can better catch the major variations in the data. In addition, PCA can be viewed as the simplest class of auto-encoders, which minimizes reconstruction error.

3. We have tested the PCANet without patch-mean removal and the performance degrades significantly.

The PCANet contains no non-linearity between or within stages, running contrary to the common wisdom of building deep learning networks; e.g., the absolute rectification layer in ConvNet [5] and the modulus layer in ScatNet [6], [10]. We have tested the PCANet with an absolute rectification layer added right after the first stage, but we did not observe any improvement on the final classification results. The reason could be that the use of quantization plus local histograms (in the output layer) already introduces sufficient invariance and robustness in the final feature.

The overall process prior to the output layer in PCANet is completely linear. One may wonder what happens if we merge the two stages into just one that has an equivalent number of PCA filters and size of receptive field. To be specific, one may be interested in how a single-stage PCANet with L1 L2 filters of size (k1 + k2 − 1) × (k1 + k2 − 1) would perform, in comparison to the two-stage PCANet we described in Section 2.1. We have experimented with such settings on faces and hand-written digits and observed that the two-stage PCANet outperforms this single-stage alternative in most cases; see the last rows of Tables 2, 9, and 10. In comparison to the filters learned by the single-stage alternative, the resulting two-stage PCA filters essentially have a low-rank factorization, possibly having a lower chance of over-fitting the dataset. As for why we need the deep structure, from a computational perspective, the single-stage alternative requires learning filters with L1 L2 (k1 + k2 − 1)^2 variables, whereas the two-stage PCANet only learns filters with L1 k1^2 + L2 k2^2 variables in total. Another benefit of the two-stage PCANet is the larger receptive field, as it contains more holistic observations of the objects in images, and learning invariance from it can essentially capture more semantic information. Our comparative experiments validate that hierarchical architectures with large receptive fields and multiple stacked stages are more efficient in learning semantically related representations, which coincides with what has been observed in [7].

2.2 Computational Complexity

The components for constructing the PCANet are extremely basic and computationally efficient. To see how light the computational complexity of PCANet would be, let us take the two-stage PCANet as an example. In each stage of PCANet, forming the patch-mean-removed matrix X costs k1 k2 + k1 k2 mn flops; the inner product X X^T has complexity of 2(k1 k2)^2 mn flops; and the complexity of the eigen-decomposition is O((k1 k2)^3). The PCA filter convolution takes L_i k1 k2 mn flops for stage i. In the output layer, the conversion of L2 binary bits to a decimal number costs 2 L2 mn, and the naive histogram operation is of complexity O(mn B L2 log 2). Assuming mn ≫ max(k1, k2, L1, L2, B), the overall complexity of PCANet is easily verified to be

O(mn k1 k2 (L1 + L2) + mn (k1 k2)^2).

The above computational complexity applies to both the training and testing phases of PCANet, as the extra computational burden of the training phase over the testing phase is the eigen-decomposition, whose complexity is negligible when mn ≫ max(k1, k2, L1, L2, B).

In comparison to ConvNet, the SGD for filter learning is also a simple gradient-based optimization solver, but the overall training time is still much longer than for PCANet. For example, training PCANet on around 100,000 images of 80×60 pixel dimension took only half an hour, but CNN-2 took 6 hours, excluding the fine-tuning process; see Section 3.1.1.D for details.

2.3 Two Variations: RandNet and LDANet

The PCANet is an extremely simple network, requiring only minimal learning of the filters from the training data. One could immediately think of two possible variations of the PCANet in two opposite directions:
1) We could further eliminate the necessity of training data and replace the PCA filters at each layer with random filters of the same size. To be more specific, the random filters, i.e., the elements of W_l^1 and W_l^2, are generated following the standard Gaussian distribution. We call such a network a Random Network, or RandNet for short. It is natural to wonder how much such a randomly chosen network would degrade in comparison with PCANet.
2) If the task of the learned network is classification, we could further enhance the supervision of the learned filters by incorporating the class label information in the training data and learning the filters based on the idea of multi-class linear discriminant analysis (LDA). We call the so-composed network an LDA Network, or LDANet for ease of reference. Again we are interested in how much the enhanced supervision would help improve the performance of the network.

To be more clear, we here describe in more detail how to construct the LDANet. Suppose that the N training images are classified into C classes {I_i}_{i∈S_c}, c = 1, 2, ..., C, where S_c is the set of indices of images in class c, and the mean-removed patches associated with each image of the distinct classes, X̄_i ∈ R^{k1 k2 × mn}, i ∈ S_c (in the same spirit as X̄_i in (1)), are given. We can first compute the class mean Γ_c and the intra-class variability Σ_c of all the patches as follows,

Γ_c = Σ_{i∈S_c} X̄_i / |S_c|,   (10)

Σ_c = Σ_{i∈S_c} (X̄_i − Γ_c)(X̄_i − Γ_c)^T / |S_c|.   (11)

Each column of Γ_c denotes the mean of the patches around each pixel in class c, and Σ_c is the sum of all the patch-wise sample covariances in class c. Likewise, the inter-class variability of the patches is defined as

Φ = Σ_{c=1}^{C} (Γ_c − Γ)(Γ_c − Γ)^T / C,   (12)

where Γ is the mean of the class means. The idea of LDA is to maximize the ratio of the inter-class variability to the sum of the intra-class variability within a family of orthonormal filters; i.e.,

max_{V ∈ R^{k1 k2 × L1}} Tr(V^T Φ V) / Tr(V^T (Σ_{c=1}^{C} Σ_c) V),  s.t. V^T V = I_{L1},   (13)

where Tr(·) is the trace operator. The solution is known as the L1 principal eigenvectors of Φ̃ = (Σ_{c=1}^{C} Σ_c)† Φ, where the superscript † denotes the pseudo-inverse. The pseudo-inverse is to deal with the case when Σ_{c=1}^{C} Σ_c is not of full rank, though there might be another way of handling this with better numerical stability [16]. The LDA filters are thus expressed as W_l^1 = mat_{k1,k2}(q_l(Φ̃)) ∈ R^{k1 × k2}, l = 1, 2, ..., L1. A deeper network can be built by repeating the same process as above.
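A minimal sketch of the LDA filter computation in Eqs. (10)–(13) is given below; it reuses the hypothetical extract_patches helper from the first-stage sketch in Section 2.1.1, assumes all images share the same size, and takes the real parts of the eigenvectors as a simple way of handling numerical asymmetry of the pseudo-inverse product.

```python
import numpy as np

def lda_filters(images, labels, k1, k2, L1):
    """LDANet filters: leading eigenvectors of (sum_c Sigma_c)^+ Phi, Eqs. (10)-(13)."""
    classes = sorted(set(labels))
    d = k1 * k2
    Sigma = np.zeros((d, d))                     # summed intra-class variability
    class_means = []
    for c in classes:
        Xc = [extract_patches(img, k1, k2) for img, y in zip(images, labels) if y == c]
        Xc = [P - P.mean(axis=0, keepdims=True) for P in Xc]                 # mean-removed patches
        Gamma_c = sum(Xc) / len(Xc)                                          # Eq. (10)
        Sigma += sum((P - Gamma_c) @ (P - Gamma_c).T for P in Xc) / len(Xc)  # Eq. (11)
        class_means.append(Gamma_c)
    Gamma = sum(class_means) / len(class_means)                              # mean of class means
    Phi = sum((G - Gamma) @ (G - Gamma).T for G in class_means) / len(class_means)  # Eq. (12)
    M = np.linalg.pinv(Sigma) @ Phi              # pseudo-inverse handles rank deficiency
    eigval, eigvec = np.linalg.eig(M)
    order = np.argsort(eigval.real)[::-1][:L1]   # L1 principal eigenvectors of Phi_tilde
    return [eigvec[:, l].real.reshape(k1, k2) for l in order]
```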
3 EXPERIMENTS

In this section we evaluate the performance of the proposed PCANet and the two simple variations (RandNet and LDANet) on various tasks, including face recognition, face verification, hand-written digit recognition, texture discrimination, and object recognition.

3.1 Face Recognition on Many Datasets

We first focus on the problem of face recognition with one gallery image per person. We use part of the MultiPIE dataset to learn the PCA filters in PCANet, and then apply the trained PCANet to extract features of new subjects in the MultiPIE, Extended Yale B, AR, and FERET datasets for face recognition.

3.1.1 Training and Testing on MultiPIE Dataset.

Generic faces training set. The MultiPIE dataset [17] contains 337 subjects across simultaneous variation in pose, expression, and illumination. Of these 337 subjects, we select the images of the 129 subjects that enrolled in all four sessions. The images of a subject under all illuminations and all expressions at poses −30° to +30° with step size 15°, a total of 5 poses, were collected. We manually select eye corners as the ground truth for registration, and down-sample the images to 80×60 pixels. The distance between the two outer eye corners is normalized to 50 pixels. This generic faces training set comprises around 100,000 images, and all images are converted to gray scale.

We use these assembled face images to train the PCANet, and together with the data labels to learn the LDANet, and then apply the trained networks to extract features of the new subjects in the MultiPIE dataset. As mentioned above, the 129 subjects enrolled in all four sessions are used for PCANet training. The remaining 120 new subjects in Session 1 are used for gallery training and testing. The frontal view of each subject with neutral expression and frontal illumination is used in the gallery, and the rest is for testing. We classify all the possible variations into 7 test sets, namely cross illumination, cross expression, cross pose, cross expression-plus-pose, cross illumination-plus-expression, cross illumination-plus-pose, and cross illumination-plus-expression-and-pose. The cross-pose test set is specifically collected over the poses −30°, −15°, +15°, +30°.

A. Impact of the number of filters. Before comparing RandNet, PCANet, and LDANet with existing methods on all the 7 test sets, we first investigate the impact of the number of filters of these networks on the cross-illumination test set only. The filter size of the networks is k1 = k2 = 5 and their non-overlapping blocks are of size 8×6. We vary the number of filters in the first stage L1 from 2 to 12 for one-stage networks. When considering two-stage networks, we set L2 = 8 and vary L1 from 4 to 24. The results are shown in Figure 3. One can see that PCANet-1 achieves the best results for L1 ≥ 4 and PCANet-2 is the best for all L1 under test. Moreover, the accuracy of PCANet and LDANet (for both one-stage and two-stage networks) increases for larger L1, and RandNet shows a similar performance trend. However, some performance fluctuation is observed for RandNet due to the filters' randomness.

Fig. 3. Recognition accuracy of PCANet on the MultiPIE cross-illumination test set for varying number of filters in the first stage (L1). (a) PCANet-1; (b) PCANet-2 with L2 = 8.

B. Impact of the block size. We next examine the impact of the block size (for histogram computation) on the robustness of PCANet against image deformations. We use the cross-illumination test set, and introduce artificial deformations to the testing images with a translation, in-plane rotation or scaling; see Figure 4. The parameters of PCANet are set to k1 = k2 = 5 and L1 = L2 = 8. Two block sizes, 8×6 and 12×9, are considered. Figure 5 shows the recognition accuracy for each artificial deformation. It is observed that PCANet-2 achieves more than 90 percent accuracy with translation up to 4 pixels in all directions, up to 8° in-plane rotation, or with scale varying from 0.9 to 1.075. Moreover, the results suggest that PCANet-2 with larger block size provides more robustness against various deformations, but a larger block size may sacrifice some performance for PCANet-1.

Fig. 4. Original input face and its artificially deformed images: (-6,-6)-translated, 10°-rotated, and 0.9-scaled.

C. Impact of the number of generic faces training samples. We also report the recognition accuracy of the PCANet for different numbers of generic faces training images. Again, we use the cross-illumination test set. We randomly select S images from the generic training set to train the PCANet, and vary S from 10 to 50,000. The parameters of PCANet are set to k1 = k2 = 5, L1 = L2 = 8, and block size 8×6. The results are tabulated in Table 1. Surprisingly, the accuracy of PCANet is not very sensitive to the number of generic training images. The performance of PCANet-1 gradually improves as the number of generic training samples increases, and PCANet-2 keeps perfect recognition even when there are only 100 generic training samples.

TABLE 1
Face recognition rates (%) of PCANet on the MultiPIE cross-illumination test set, with respect to different amounts of generic faces training images (S).

S          100      500      1,000    5,000    10,000   50,000
PCANet-1   98.01    98.44    98.61    98.65    98.70    98.70
PCANet-2   100.00   100.00   100.00   100.00   100.00   100.00

D. Comparison with state of the arts. We compare the

Fig. 5. Recognition rate of PCANet on the MultiPIE cross-illumination test set, for different PCANet block sizes and deformations of the test image. Two block sizes [8 6] and [12 9] for histogram aggregation are tested. (a) Simultaneous translation in x and y directions. (b) Translation in x direction. (c) Translation in y direction. (d) In-plane rotation. (e) Scale variation.

Fig. 6. The PCANet filters learned on the MultiPIE dataset. Top row: the first stage. Bottom row: the second stage.

RandNet, PCANet, and LDANet with Gabor4 [15], LBP5 [18], and two-stage ScatNet (ScatNet-2) [6]. We set the parameters of PCANet to filter size k1 = k2 = 5, number of filters L1 = L2 = 8, and 8×6 block size; the learned PCANet filters are shown in Figure 6. The number of scales and the number of orientations in ScatNet-2 are set to 3 and 8, respectively. We use the nearest neighbor (NN) classifier with the chi-square distance for RandNet, PCANet, LDANet and LBP, or with the cosine distance for Gabor and ScatNet. Using the NN classifier with different distance measures is to secure the best performance of the respective features.

4. Each face is convolved with a family of Gabor kernels with 5 scales and 8 orientations. Each filter response is down-sampled by a 3 × 3 uniform lattice, and normalized to zero mean and unit variance.

5. Each face is divided into several blocks, each of the same size as in PCANet. The histogram of 59 uniform binary patterns is then computed, where the patterns are generated by thresholding 8 neighboring pixels in a circle of radius 2 using the central pixel value.
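For concreteness, the nearest-neighbor rule with the chi-square distance used above could look like the sketch below; the paper does not spell out which chi-square variant it uses, so the symmetric histogram form here is an assumption on our part.

```python
import numpy as np

def chi_square_dist(f, g, eps=1e-10):
    """Symmetric chi-square distance between two non-negative histogram features."""
    return 0.5 * np.sum((f - g) ** 2 / (f + g + eps))

def nn_classify(probe_feat, gallery_feats, gallery_labels):
    """Nearest-neighbor classification of one probe feature against the gallery."""
    dists = [chi_square_dist(probe_feat, g) for g in gallery_feats]
    return gallery_labels[int(np.argmin(dists))]
```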
We also compare with CNN. Since we could not find any work that successfully applies CNN to the same face recognition tasks, we use the Caffe framework [19] to pre-train a two-stage CNN (CNN-2) on the generic faces training set. The CNN-2 is a fully-supervised network with filter size 5×5, 20 channels for the first stage and 50 channels for the second stage. Each convolution output is followed by a rectified linear function relu(x) = max(x, 0) and 2×2 max-pooling. The output layer is a softmax classifier. After pre-training the CNN-2 on the generic faces training set, the CNN-2 is also fine-tuned on the 120 gallery images for 500 epochs.

The performance of all methods is given in Table 2. Except for the cross-pose test set, the PCANet yields the best precision. For all test sets, the performance of RandNet and LDANet is inferior to that of PCANet, and LDANet does not seem to take advantage of the discriminative information. One can also see that whenever there is illumination variation, the performance of LBP drops significantly. The PCANet overcomes this drawback and offers comparable performance to LBP for cross-pose and cross-expression variations. As a final note, ScatNet and CNN do not seem to perform well.6 This is the case for all face-related experiments below, and therefore ScatNet and CNN are not included for comparison in these experiments. We also do not include RandNet and LDANet in the following face-related experiments, as they did not show performance superior to PCANet.

6. The performance of CNN could be further promoted if the model parameters are more fine-tuned.

TABLE 2
Comparison of face recognition rates (%) of various methods on MultiPIE test sets. The filter size k1 = k2 = 5 is set in RandNet, PCANet, and LDANet unless specified otherwise.

Test Sets           Illum.   Exps.   Pose    Exps.+Pose   Illum.+Exps.   Illum.+Pose   Illum.+Exps.+Pose
Gabor [15]          68.75    94.17   84.17   64.70        38.09          39.76         25.92
LBP [18]            79.77    98.33   95.63   86.88        53.77          50.72         40.55
ScatNet-2 [6]       20.88    66.67   71.46   54.37        14.51          15.00         14.47
CNN-2 [8]           46.71    75.00   73.54   57.50        23.38          25.05         18.74
RandNet-1           80.88    98.33   87.50   75.62        46.57          42.80         31.85
RandNet-2           97.64    97.50   83.13   75.21        63.87          53.50         42.47
PCANet-1            98.70    99.17   94.17   87.71        72.40          65.76         53.80
PCANet-2            100.00   99.17   93.33   87.29        87.89          75.29         66.49
LDANet-1            99.95    98.33   92.08   82.71        77.89          68.55         57.97
LDANet-2            96.02    99.17   93.33   83.96        65.78          60.14         46.72
PCANet-1 (k1 = 9)   100      99.17   89.58   81.46        75.74          67.59         56.95

The last row of Table 2 shows the result of PCANet-1 with L1 L2 filters of size (k1 + k2 − 1) × (k1 + k2 − 1). The PCANet-1 with such a parameter setting is meant to mimic the reported PCANet-2 in a single-stage network, as both have the same number of PCA filters and size of receptive field. PCANet-2 outperforms the PCANet-1 alternative, showing the advantage of deeper networks.

Another issue worth mentioning is the efficiency of the PCANet. Training PCANet-2 on the generic faces training set (i.e., around 100,000 face images of 80×60 pixel dimension) took only half an hour, but CNN-2 took 6 hours, excluding the fine-tuning process.

3.1.2 Testing on Extended Yale B Dataset.

We then apply the PCANet with the PCA filters learned from MultiPIE to the Extended Yale B dataset [20]. The Extended Yale B dataset consists of 2414 frontal-face images of 38 individuals. The cropped and normalized 192×168 face images were captured under various laboratory-controlled lighting conditions. For each subject, we select the frontal illumination as the gallery image, and the rest for testing. To challenge ourselves, in the test images, we also simulate various levels of contiguous occlusion, from 0 percent to 80 percent, by replacing a randomly located square block of each test image with an unrelated image; see Figure 7 for an example. The size of the non-overlapping blocks in the PCANet is set to 8×8. We compare with LBP [18] and with LBP of the test images processed by illumination normalization, P-LBP [21]. We use the NN classifier with the chi-square distance measure.

Fig. 7. Illustration of varying levels of an occluded test face image: 0%, 40%, and 80% occlusion.

The experimental results are given in Table 3. One can see that the PCANet outperforms the P-LBP for different levels of occlusion. It is also observed that the PCANet is not only illumination-insensitive, but also robust against block occlusion. Under such a single-sample-per-person setting and various difficult lighting conditions, the PCANet surprisingly achieves almost perfect recognition of 99.58%, and still sustains 86.49% accuracy when 60% of the pixels of every test image are occluded! The reason could be that each PCA filter can be seen as a detector with the maximum response for patches from a face. In other words, the contribution from occluded patches would somehow be ignored after PCA filtering and not passed onto the output layer of the PCANet, thereby yielding striking robustness to occlusion.

TABLE 3
Recognition rates (%) on Extended Yale B dataset.

Percent occluded   0%      20%     40%     60%     80%
LBP [18]           75.76   65.66   54.92   43.22   18.06
P-LBP [21]         96.13   91.84   84.13   70.96   41.29
PCANet-1           97.77   96.34   93.81   84.60   54.38
PCANet-2           99.58   99.16   96.30   86.49   51.73

3.1.3 Testing on AR Dataset.

We further evaluate the ability of the MultiPIE-learned PCANet to cope with real, possibly malicious, occlusions using the AR dataset [22]. The AR dataset consists of over 4,000 frontal images of 126 subjects. These images include different facial expressions, illumination conditions and disguises. In the experiment, we chose a subset of the data consisting of 50 male subjects and 50 female subjects. The images are cropped to dimension 165×120 and converted to gray scale. For each subject, we select the face with frontal illumination and neutral expression for the gallery, and the rest are all for testing. The size of the non-overlapping blocks in the PCANet is set to 8×6. We also compare with LBP [18] and P-LBP [21]. We use the NN classifier with the chi-square distance measure.

The results are given in Table 4. For the test set of illumination variations, the recognition by PCANet is again almost perfect, and for the cross-disguise related test sets, the accuracy is more than 95%. The results are consistent with those on the MultiPIE and Extended Yale B datasets: PCANet is insensitive to illumination and robust to occlusions. To the best of our knowledge, no single feature with a simple classifier can achieve such performance, even in extended representation-based classification (ESRC) [23]!

TABLE 4
Recognition rates (%) on AR dataset.

Test sets    Illum.   Exps.   Disguise   Disguise + Illum.
LBP [18]     93.83    81.33   91.25      79.63
P-LBP [21]   97.50    80.33   93.00      88.58
PCANet-1     98.00    85.67   95.75      92.75
PCANet-2     99.50    85.00   97.00      95.00

3.1.4 Testing on FERET Dataset.

We finally apply the MultiPIE-learned PCANet to the popular FERET dataset [24], which is a standard dataset used for facial recognition system evaluation. FERET contains images of 1,196 different individuals with up to 5 images per individual, captured under different lighting conditions, with non-neutral expressions and over a period of three years. The complete dataset is partitioned into disjoint sets: gallery and probe. The probe set is further subdivided into four categories: Fb with different expression changes; Fc with different lighting conditions; Dup-I taken within a period of three to four months; Dup-II taken at least one and a half years apart. We use the gray-scale images, cropped to an image size of 150×90 pixels. The size of the non-overlapping blocks in the PCANet is set to 15×15. To compare fairly with prior methods, the dimension of the PCANet features is reduced to 1000 by a whitening PCA (WPCA),7 where the projection matrix is learned from the features of the gallery samples.

7. The PCA projection directions are weighted by the inverse of their corresponding square-root energies, respectively.

The NN classifier with cosine distance is used. Moreover, in addition to the PCANet trained from the MultiPIE database, we also train a PCANet on the FERET generic training set, consisting of 1,002 images of 429 people listed in the FERET standard training CD.

The results of the PCANet and other state-of-the-art methods are listed in Table 5. Surprisingly, both the simple MultiPIE-learned PCANet-2 and the FERET-learned PCANet-2 (with Trn. CD in parentheses) achieve the state-of-the-art accuracies of 97.25% and 97.26% on average. As the variations in the MultiPIE database are much richer than the standard FERET training set, it is natural to see that the MultiPIE-learned PCANet slightly outperforms the FERET-learned PCANet. More importantly, PCANet-2 breaks the records on Dup-I and Dup-II.

TABLE 5
Recognition rates (%) on FERET dataset.

Probe sets           Fb      Fc      Dup-I   Dup-II   Avg.
LBP [18]             93.00   51.00   61.00   50.00    63.75
DMMA [25]            98.10   98.50   81.60   83.20    89.60
P-LBP [21]           98.00   98.00   90.00   85.00    92.75
POEM [26]            99.60   99.50   88.80   85.00    93.20
G-LQP [27]           99.90   100     93.20   91.00    96.03
LGBP-LGXP [28]       99.00   99.00   94.00   93.00    96.25
sPOEM+POD [29]       99.70   100     94.90   94.00    97.15
GOM [30]             99.90   100     95.70   93.10    97.18
PCANet-1 (Trn. CD)   99.33   99.48   88.92   84.19    92.98
PCANet-2 (Trn. CD)   99.67   99.48   95.84   94.02    97.25
PCANet-1             99.50   98.97   89.89   86.75    93.78
PCANet-2             99.58   100     95.43   94.02    97.26

Conclusive remarks on face recognition. A prominent message drawn from the above experiments in Sections 3.1.1, 3.1.2, 3.1.3, and 3.1.4 is that training PCANet on a face dataset can be very effective for capturing the abstract representation of new subjects or new datasets. After the PCANet is trained, extracting the PCANet-2 feature of one test face takes only 0.3 second in Matlab. We can anticipate that the performance of PCANet could be further improved and moved toward practical use if the PCANet is trained upon a wide and deep dataset that collects sufficiently many inter-class and intra-class variations.

3.2 Face Verification on LFW Dataset

Besides the tests with laboratory face datasets, we also evaluate the PCANet on the LFW dataset [31] for unconstrained face verification. LFW contains 13,233 face images of 5,749 different individuals, collected from the web with large variations in pose, expression, illumination, clothing, hairstyles, etc. We consider the "unsupervised setting", which is the best choice for evaluating the learned features, for it does not depend on metric learning and discriminative model learning. The aligned version of the faces, namely LFW-a, provided by Wolf et al. [32] is used, and the face images were cropped to 150 × 80 pixel dimensions. We follow the standard evaluation protocol, which splits the View 2 dataset into 10 subsets, with each subset containing 300 intra-class pairs and 300 inter-class pairs. We perform 10-fold cross validation using the 10 subsets of pairs in View 2. In PCANet, the filter size, the number of filters, and the (non-overlapping) block size are set to k1 = k2 = 7, L1 = L2 = 8, and 15×13, respectively. The performance is measured by averaging over the 10-fold cross validation. We project the PCANet features onto 400 and 3,200 dimensions using WPCA for PCANet-1 and PCANet-2, respectively, and use the NN classifier with cosine distance.

Table 6 tabulates the results.8 Note that PCANet followed by sqrt in parentheses denotes the PCANet feature with a square-root operation applied. One can see that the square-root PCANet outperforms PCANet, and this performance boost from the square-root operation has also been observed for other features on this dataset [33]. Moreover, the square-root PCANet-2, which achieves 86.28% accuracy, is quite competitive with the current state-of-the-art methods. This shows that the proposed PCANet is also effective in learning invariant features for face images captured in less controlled conditions.

8. For fair comparison, we only report results of single descriptors. The best known LFW result under the unsupervised setting is 88.57% [33], which is inferred from four different descriptors.

While preparing this paper, we became aware of two concurrent works [34], [35] that employ ConvNet for LFW face verification. While both works achieve very impressive results on LFW, their experimental setting

differs from ours greatly. These two works require an outside database to train the ConvNet, and the face images have to be more precisely aligned; e.g., [34] uses a 3-dimensional model for face alignment and [35] extracts multi-scale features based on detected landmark positions. On the contrary, we only trained PCANet based on LFW-a [32], an aligned version of the LFW images produced using the commercial alignment system of face.com.

TABLE 6
Comparison of verification rates (%) on LFW under unsupervised setting.

Methods              Accuracy
POEM [26]            82.70±0.59
High-dim. LBP [36]   84.08
High-dim. LE [36]    84.58
SFRD [37]            84.81
I-LQP [27]           86.20±0.46
OCLBP [33]           86.66±0.30
PCANet-1             81.18±1.99
PCANet-1 (sqrt)      82.55±1.48
PCANet-2             85.20±1.46
PCANet-2 (sqrt)      86.28±1.14

3.3 Digit Recognition on MNIST Datasets

We now move forward to test the proposed PCANet, along with RandNet and LDANet, on MNIST [4] and the MNIST variations [38], a widely-used benchmark for testing hierarchical representations. There are 9 classification tasks in total, as listed in Table 8. All the images are of size 28 × 28. In the following, we use MNIST basic as the dataset for investigating the influence of the number of filters and of different block overlap ratios for RandNet, PCANet and LDANet, and then compare with other state-of-the-art methods on all the MNIST datasets.

3.3.1 Impact of the number of filters

We vary the number of filters in the first stage L1 from 2 to 12 for one-stage networks. Regarding two-stage networks, we set L2 = 8 and change L1 from 4 to 24. The filter size of the networks is k1 = k2 = 7, the block size is 7×7, and the overlapping region between blocks is half of the block size. The results are shown in Figure 8. The results are consistent with those for the MultiPIE face database in Figure 3; PCANet outperforms RandNet and LDANet in almost all the cases.

3.3.2 Impact of the block overlap ratio

The number of filters is fixed to L1 = L2 = 8, and the filter size is again k1 = k2 = 7 with block size 7×7. We only vary the block overlap ratio (BOR) from 0.1 to 0.7. Table 7 tabulates the results of RandNet-2, PCANet-2, and LDANet-2. Clearly, PCANet-2 achieves its minimum error rate at BOR 0.6, LDANet-2 at BOR between 0.4 and 0.6, and PCANet-2 performs the best for all conditions.

TABLE 7
Error rates (%) of PCANet-2 on basic dataset for varying block overlap ratios (BORs).

BOR         0.1    0.2    0.3    0.4    0.5    0.6    0.7
RandNet-2   1.31   1.35   1.23   1.34   1.18   1.14   1.24
PCANet-2    1.12   1.12   1.07   1.06   1.06   1.02   1.05
LDANet-2    1.14   1.14   1.11   1.05   1.05   1.05   1.06

3.3.3 Comparison with state of the arts

We compare RandNet, PCANet, and LDANet with ConvNet [5], 2-stage ScatNet (ScatNet-2) [6], and other existing methods. In ScatNet, the number of scales and the number of orientations are set to 3 and 8, respectively. Regarding the parameters of PCANet, we set the filter size k1 = k2 = 7 and the number of PCA filters L1 = L2 = 8; the block size is tuned by cross-validation for MNIST, and on the validation sets for the MNIST variations9. The overlapping region between blocks is half of the block size. Unless otherwise specified, we use a linear SVM classifier for ScatNet, RandNet, PCANet and LDANet on the 9 classification tasks.

9. Using either cross-validation or the validation set, the optimal block size is obtained as 7×7 for MNIST, basic, rec-img, 4×4 for rot, bg-img, bg-rnd, bg-img-rot, 14×14 for rec, and 28×28 for convex.

The testing error rates of the various methods on MNIST are shown in Table 9. For fair comparison, we do not include the results of methods using training samples augmented with distortions or other information, for which the best known result is 0.23% [39]. We see that RandNet-2, PCANet-2, and LDANet-2 are comparable with the state-of-the-art methods on this standard MNIST task. However, as MNIST has many training data, all methods perform very well and very close to each other – the difference is not statistically meaningful.

Accordingly, we also report results of the different methods on the MNIST variations in Table 10. To the best of our knowledge, the PCANet-2 achieves the state-of-the-art results for four out of the eight remaining tasks: basic, bg-img, bg-img-rot, and convex. Especially for bg-img, the error rate reduces from 12.25% [40] to 10.95%.

Table 10 also shows the result of PCANet-1 with L1 L2 filters of size (k1 + k2 − 1) × (k1 + k2 − 1). The PCANet-1 with such a parameter setting is meant to mimic the reported PCANet-2 in a single-stage structure. PCANet-2 still outperforms this PCANet-1 alternative.

Furthermore, we also draw the learned PCANet filters in Figure 9 and Figure 10. An intriguing pattern is observed in the filters for the rect and rect-img datasets. For rect, we can see both horizontal and vertical stripes, as these patterns attempt to capture the edges of the rectangles. When there is some image background in rect-img, several filters become low-pass, in order to secure the responses from the background images.

TABLE 8
Details of the 9 classification tasks on MNIST and MNIST variations.

Data Sets    Description                                      Num. of classes   Train-Valid-Test
MNIST        Standard MNIST                                   10                60000-0-10000
basic        Smaller subset of MNIST                          10                10000-2000-50000
rot          MNIST with rotation                              10                10000-2000-50000
bg-rand      MNIST with noise background                      10                10000-2000-50000
bg-img       MNIST with image background                      10                10000-2000-50000
bg-img-rot   MNIST with rotation and image background         10                10000-2000-50000
rect         Discriminate between tall and wide rectangles    2                 1000-200-50000
rect-img     Dataset rect with image background               2                 10000-2000-50000
convex       Discriminate between convex and concave shapes   2                 6000-2000-50000

TABLE 10
Comparison of testing error rates (%) of the various methods on MNIST variations.

Methods              basic   rot     bg-rand   bg-img   bg-img-rot   rect   rect-img   convex
CAE-2 [45]           2.48    9.66    10.90     15.50    45.23        1.21   21.54      -
TIRBM [46]           -       4.20    -         -        35.50        -      -          -
PGBM + DN-1 [40]     -       -       6.08      12.25    36.76        -      -          -
ScatNet-2 [6]        1.27    7.48    12.30     18.40    50.48        0.01   8.02       6.50
RandNet-1            1.86    14.25   18.81     15.97    51.82        0.21   15.94      6.78
RandNet-2            1.25    8.47    13.47     11.65    43.69        0.09   17.00      5.45
PCANet-1             1.44    10.55   6.77      11.11    42.03        0.15   25.55      5.93
PCANet-2             1.06    7.37    6.19      10.95    35.48        0.24   14.08      4.36
LDANet-1             1.61    11.40   7.16      13.03    43.86        0.15   23.63      6.89
LDANet-2             1.05    7.52    6.81      12.42    38.54        0.14   16.20      7.22
PCANet-1 (k1 = 13)   1.21    8.30    6.88      11.97    39.06        0.03   13.94      6.75

Fig. 8. Error rate of PCANet on the MNIST basic test set for varying number of filters in the first stage (L1). (a) PCANet-1; (b) PCANet-2 with L2 = 8.

TABLE 9
Comparison of error rates (%) of the methods on MNIST, excluding methods that augment the training data. The filter size k1 = k2 = 7 is set in RandNet, PCANet, and LDANet unless specified otherwise.

Methods                           MNIST
HSC [41]                          0.77
K-NN-SCM [42]                     0.63
K-NN-IDM [43]                     0.54
CDBN [7]                          0.82
ConvNet [5]                       0.53
Stochastic pooling ConvNet [44]   0.47
Conv. Maxout + Dropout [3]        0.45
ScatNet-2 (SVMrbf) [6]            0.43
RandNet-1                         1.32
RandNet-2                         0.63
PCANet-1                          0.94
PCANet-2                          0.66
LDANet-1                          0.98
LDANet-2                          0.62
PCANet-1 (k1 = 13)                0.62

Fig. 9. The PCANet filters learned on the MNIST dataset. Top row: the first stage. Bottom row: the second stage.

3.4 Texture Classification on CUReT Dataset

The CUReT texture dataset contains 61 classes of image textures. Each texture class has images of the same material with different pose and illumination conditions. Besides the above variations, specularities, shadowing and surface normal variations also make this classification challenging. In this experiment, a subset of the dataset with azimuthal viewing angle less than 60 degrees is selected, thereby yielding 92 images in each class. A central 200 × 200 region is cropped from each of the selected images. The dataset is randomly split into a training and a testing set, with 46 training images for each class, as in [47]. The PCANet is trained with filter size k1 = k2 = 5, number of filters L1 = L2 = 8, and block size 50×50. We use a linear SVM classifier. The testing error rates averaged over 10 different random splits are shown in Table 11. We see that the PCANet-1 outperforms ScatNet-1, but the improvement from PCANet-1 to PCANet-2 is not as large as that of ScatNet.

Fig. 10. The PCANet filters learned on the various MNIST datasets (basic, rot, bg-rand, bg-img, bg-img-rot, rect, rect-img, convex). For each dataset, the top row shows the filters of the first stage; the bottom row shows the filters of the second stage.

Fig. 11. The PCANet filters learned on the CUReT database. Top row: the first stage. Bottom row: the second stage.

TABLE 11
Comparison of error rates (%) on CUReT.

Methods               Error rates
Textons [48]          1.50
BIF [49]              1.40
Histogram [50]        1.00
ScatNet-1 (PCA) [6]   0.50
ScatNet-2 (PCA) [6]   0.20
RandNet-1             0.61
RandNet-2             0.46
PCANet-1              0.45
PCANet-2              0.39
LDANet-1              0.69
LDANet-2              0.54

Note that ScatNet-2 followed by a PCA-based classifier gives the best result [6].

3.5 Object Recognition on CIFAR10

We finally evaluate the performance of PCANet on the CIFAR10 database for object recognition. CIFAR10 is a set of natural RGB images of 32×32 pixels. It contains 10 classes with 50000 training samples and 10000 test samples. Images in CIFAR10 vary significantly not only in object position and object scale within each class, but also in the colors and textures of these objects.

The motivation here is to explore the limitation of such a simple PCANet on a relatively complex database, in comparison to the databases of faces, digits, and textures we have experimented with, which could somehow be roughly aligned or prepared. To begin with, we extend PCA filter learning so as to accommodate the RGB images in object databases. In the same spirit as constructing the data matrix X in (1), we gather the same individual matrices for the RGB channels of the images, denoted by X_r, X_g, X_b ∈ R^{k1 k2 × N mn}, respectively. Following the key steps in Section 2.1.1, the multichannel PCA filters can be easily verified to be

W_l^{r,g,b} = mat_{k1,k2,3}(q_l(X̃ X̃^T)) ∈ R^{k1 × k2 × 3},   (14)

where X̃ = [X_r^T, X_g^T, X_b^T]^T and mat_{k1,k2,3}(v) is a function that maps v ∈ R^{3 k1 k2} to a tensor W ∈ R^{k1 × k2 × 3}. An example of the learned multichannel PCA filters is shown in Figure 12. In addition to the modification above, we also connect spatial pyramid pooling (SPP) [51]–[53] to the output layer of PCANet, with the aim of extracting information invariant to large poses and complex backgrounds, as usually seen in object databases. The SPP essentially helps object recognition, but gave no significant improvement in the previous experiments on faces, digits and textures.
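A sketch of the multichannel filter learning in Eq. (14) is given below, again reusing the hypothetical extract_patches helper from the first-stage sketch in Section 2.1.1; whether the patch mean is removed per channel or over the stacked vector, and how the 3 k1 k2 eigenvector is unfolded into a k1 × k2 × 3 tensor, are our own assumptions.

```python
import numpy as np

def multichannel_pca_filters(rgb_images, k1, k2, L):
    """Multichannel PCA filters of Eq. (14): stack the per-channel patch matrices
    X_r, X_g, X_b and reshape each leading eigenvector into a k1 x k2 x 3 filter."""
    blocks = []
    for img in rgb_images:                              # img: m x n x 3 array
        chans = []
        for c in range(3):
            P = extract_patches(img[:, :, c], k1, k2)
            chans.append(P - P.mean(axis=0, keepdims=True))   # per-channel patch-mean removal
        blocks.append(np.vstack(chans))                 # one column block of X_tilde
    X = np.hstack(blocks)                               # 3*k1*k2 x N*m*n
    eigval, eigvec = np.linalg.eigh(X @ X.T)
    top = eigvec[:, np.argsort(eigval)[::-1][:L]]
    return [top[:, l].reshape(3, k1, k2).transpose(1, 2, 0) for l in range(L)]
```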
We use a linear SVM classifier in the experiments. In the first experiment, we train PCANet on CIFAR10 with filter size k1 = k2 = 5, number of filters L1 = 40, L2 = 8, and block size equal to 8 × 8. Also, we

TABLE 12
Comparison of accuracy (%) of the methods on CIFAR10
with no data augmentation.

Methods Accuracy
Tiled CNN [54] 73.10
Improved LCC [55] 74.50
KDES-A [56] 76.00
K-means (Triangle, 4000 features) [57] 79.60
Cuda-convnet2 [58] 82.00
Stochastic pooling ConvNet [44] 84.87
CNN + Spearmint [59] 85.02
Conv. Maxout + Dropout [3] 88.32
NIN [60] 89.59
PCANet-2 77.14
PCANet-2 (combined) 78.67

Fig. 12. The PCANet filters learned on the CIFAR10 database. Top: the first stage. Bottom: the second stage.

We use a linear SVM classifier in the experiments. In the first experiment, we train PCANet on CIFAR10 with filter size k1 = k2 = 5, the number of filters L1 = 40, L2 = 8, and block size equal to 8×8. Also, we set the overlapping region between blocks to half of the block size, and connect SPP to the output layer of PCANet; i.e., the maximum response in each bin of the block histograms is pooled in a pyramid of 4×4, 2×2, and 1×1 subregions. This yields 21 pooled histograms, each of dimension L1·2^L2. The dimension of each pooled feature is reduced to 1280 by PCA.
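To make the pooling step concrete, here is a rough sketch (again our own illustration, not the released code) of max-pooling the block histograms over the 4×4, 2×2, and 1×1 pyramid; the row-major block-grid layout and the variable names are assumptions.

import numpy as np

def spp_pool(block_hists, grid_h, grid_w, levels=(4, 2, 1)):
    """block_hists: (grid_h*grid_w, L1 * 2**L2) local block histograms, laid out row-major."""
    H = block_hists.reshape(grid_h, grid_w, -1)
    pooled = []
    for s in levels:                          # 4x4 + 2x2 + 1x1 = 21 spatial regions
        for i in range(s):
            for j in range(s):
                r0, r1 = i * grid_h // s, (i + 1) * grid_h // s
                c0, c1 = j * grid_w // s, (j + 1) * grid_w // s
                region = H[r0:r1, c0:c1].reshape(-1, H.shape[-1])
                pooled.append(region.max(axis=0))   # max response in each histogram bin
    return np.concatenate(pooled)             # length 21 * L1 * 2**L2

With L1 = 40 and L2 = 8, each pooled histogram has 40·2^8 = 10240 entries (215040 over all 21 regions); the PCA step described above then reduces each pooled feature to 1280 dimensions. The sketch assumes the block grid has at least four rows and four columns.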
In the second experiment, we concatenate PCANet features learned with two different filter sizes, k1 = k2 = 3 and k1 = k2 = 5. All the processes and model parameters are fixed identical to the single descriptor described in the last paragraph, except that L1 = 12 and L1 = 28 are set for filter sizes 3 and 5, respectively. This is to ensure that the combined features have the same dimension as the single descriptor, for fairness.
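As a quick check of this dimension-matching claim (our own arithmetic, assuming the same 21-region pyramid and $L_2 = 8$ as in the first experiment), the two descriptors together contribute

\[
  (12 + 28)\cdot 2^{8}\cdot 21 \;=\; 40\cdot 2^{8}\cdot 21 \;=\; 215040,
\]

which equals the dimension of the single descriptor with $L_1 = 40$.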
The results are shown in Table 12. PCANet-2 achieves an accuracy of 77.14% and gains about a 1.5% improvement when combining the two features learned with different filter sizes (marked "combined" in parentheses). While PCANet-2 suffers around 11% accuracy degradation in comparison to the state-of-the-art method (with no data augmentation), the performance of the fully unsupervised and extremely simple PCANet-2 shown here is still encouraging.
cated (say more sophisticated filters possibly with dis-
criminative learning) or deeper (more number of stages)
4 C ONCLUSION PCANet that could accommodate the aforementioned
In this paper, we have proposed arguably the simplest issues. Some preprocessing of pose alignment and scale
unsupervised convolutional deep learning network— normalization might be needed for good performance
PCANet. The network processes input images by cas- guarantee. The current bottleneck that keeps PCANet
caded PCA, binary hashing, and block histograms. Like from growing deeper (e.g., more than two stages) is that
the most ConvNet models, the network parameters such the dimension of the resulted feature would increase ex-
as the number of layers, the filter size, and the number ponentially with the number of stages. This fortunately
of filters have to be given to PCANet. Once the pa- seems able to be fixed by replacing the 2-dimensional
rameters are fixed, training PCANet is extremely simple convolution filters with tensor-like filters as in (14), and
and efficient, for the filter learning in PCANet does not it will be our future study. Furthermore, we will also
involve regularized parameters and does not require leave as future work to augment PCANet with a simple,
numerical optimization solver. Moreover, building the scalable baseline classifier, readily applicable to much
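To make the growth concrete, here is a back-of-the-envelope count of ours, extrapolating the two-stage feature dimension to deeper networks: with $B$ histogram blocks per image, the output feature dimension scales as

\[
  \underbrace{2^{L_2} L_1 B}_{\text{two stages}}
  \;\longrightarrow\;
  \underbrace{2^{L_3} L_1 L_2 B}_{\text{three stages}}
  \;\longrightarrow\;
  \underbrace{2^{L_4} L_1 L_2 L_3 B}_{\text{four stages}},
\]

so each added stage multiplies the dimension by the number of filters in the previously last stage, on top of the exponential histogram factor.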

Regardless, the extensive experiments given in this paper sufficiently support two conclusions: 1) the PCANet is a very simple deep learning network that effectively extracts useful information for the classification of faces, digits, and texture images; 2) the PCANet can be a valuable baseline for studying advanced deep learning architectures for large-scale image classification tasks.
REFERENCES

[1] G. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[2] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: a review and new perspectives," IEEE TPAMI, vol. 35, no. 8, pp. 1798–1828, 2013.
[3] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, "Maxout networks," in ICML, 2013.
[4] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[5] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, "What is the best multi-stage architecture for object recognition?" in ICCV, 2009.
[6] J. Bruna and S. Mallat, "Invariant scattering convolution networks," IEEE TPAMI, vol. 35, no. 8, pp. 1872–1886, 2013.
[7] H. Lee, R. Grosse, R. Ranganath, and A. Ng, "Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations," in ICML, 2009.
[8] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012.
[9] K. Kavukcuoglu, P. Sermanet, Y. Boureau, K. Gregor, M. Mathieu, and Y. LeCun, "Learning convolutional feature hierarchies for visual recognition," in NIPS, 2010.
[10] L. Sifre and S. Mallat, "Rotation, scaling and deformation invariant scattering for texture discrimination," in CVPR, 2013.
[11] C. J. C. Burges, J. C. Platt, and S. Jana, "Distortion discriminant analysis for audio fingerprinting," IEEE TSAP, vol. 11, no. 3, pp. 165–174, 2003.
[12] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[13] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in IEEE CVPR, 2005.
[14] L. Fei-Fei and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in IEEE CVPR, 2005.
[15] C. Liu and H. Wechsler, "Gabor feature based classification using the enhanced Fisher linear discriminant model for face recognition," IEEE TIP, vol. 11, no. 4, pp. 467–476, 2002.
[16] H. Yu and J. Yang, "A direct LDA algorithm for high-dimensional data—with application to face recognition," Pattern Recognition, vol. 34, no. 10, pp. 2067–2069, 2001.
[17] R. Gross, I. Matthews, and S. Baker, "Multi-PIE," in IEEE Conference on Automatic Face and Gesture Recognition, 2008.
[18] T. Ahonen, A. Hadid, and M. Pietikainen, "Face description with local binary patterns: application to face recognition," IEEE TPAMI, vol. 28, no. 12, pp. 2037–2041, 2006.
[19] Y. Jia, "Caffe: An open source convolutional architecture for fast feature embedding," http://caffe.berkeleyvision.org/, 2013.
[20] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, "From few to many: Illumination cone models for face recognition under variable lighting and pose," IEEE TPAMI, vol. 23, no. 6, pp. 643–660, June 2001.
[21] X. Tan and B. Triggs, "Enhanced local texture feature sets for face recognition under difficult lighting conditions," IEEE TIP, vol. 19, no. 6, pp. 1635–1650, 2010.
[22] A. Martinez and R. Benavente, "The AR face database," CVC Technical Report 24, 1998.
[23] W. Deng, J. Hu, and J. Guo, "Extended SRC: Undersampled face recognition via intraclass variant dictionary," IEEE TPAMI, vol. 34, no. 9, pp. 1864–1870, Sept. 2012.
[24] P. J. Phillips, H. Wechsler, J. Huang, and P. J. Rauss, "The FERET database and evaluation procedure for face-recognition algorithms," Image Vision Comput., vol. 16, no. 5, pp. 295–306, 1998.
[25] J. Lu, Y.-P. Tan, and G. Wang, "Discriminative multi-manifold analysis for face recognition from a single training sample per person," IEEE TPAMI, vol. 35, no. 1, pp. 39–51, 2013.
[26] N.-S. Vu and A. Caplier, "Enhanced patterns of oriented edge magnitudes for face recognition and image matching," IEEE TIP, vol. 21, no. 3, pp. 1352–1368, 2012.
[27] S. Hussain, T. Napoleon, and F. Jurie, "Face recognition using local quantized patterns," in BMVC, 2012.
[28] S. Xie, S. Shan, X. Chen, and J. Chen, "Fusing local patterns of Gabor magnitude and phase for face recognition," IEEE TIP, vol. 19, no. 5, pp. 1349–1361, 2010.
[29] N.-S. Vu, "Exploring patterns of gradient orientations and magnitudes for face recognition," IEEE Trans. Information Forensics and Security, vol. 8, no. 2, pp. 295–304, 2013.
[30] Z. Chai, Z. Sun, H. Méndez-Vázquez, R. He, and T. Tan, "Gabor ordinal measures for face recognition," IEEE Trans. Information Forensics and Security, vol. 9, no. 1, pp. 14–26, Jan. 2014.
[31] G. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, "Labeled faces in the wild: a database for studying face recognition in unconstrained environments," Technical Report 07-49, University of Massachusetts, Amherst, 2007.
[32] L. Wolf, T. Hassner, and Y. Taigman, "Effective face recognition by combining multiple descriptors and learned background statistics," IEEE TPAMI, vol. 33, no. 10, 2011.
[33] O. Barkan, J. Weill, L. Wolf, and H. Aronowitz, "Fast high dimensional vector multiplication face recognition," in IEEE ICCV, 2013.
[34] Y. Taigman, M. Yang, M. A. Ranzato, and L. Wolf, "DeepFace: Closing the gap to human-level performance in face verification," in IEEE CVPR, 2014.
[35] H. Fan, Z. Cao, Y. Jiang, and Q. Yin, "Learning deep face representation," arXiv:1403.2802v1, 2014.
[36] D. Chen, X. Cao, F. Wen, and J. Sun, "Blessing of dimensionality: high-dimensional feature and its efficient compression for face verification," in CVPR, 2013.
[37] Z. Cui, W. Li, D. Xu, S. Shan, and X. Chen, "Fusing robust face region descriptors via multiple metric learning for face recognition in the wild," in CVPR, 2013.
[38] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio, "An empirical evaluation of deep architectures on problems with many factors of variation," in ICML, 2007.
[39] D. Ciresan, U. Meier, and J. Schmidhuber, "Multi-column deep neural networks for image classification," in CVPR, 2012.
[40] K. Sohn, G. Zhou, C. Lee, and H. Lee, "Learning and selecting features jointly with point-wise gated Boltzmann machine," in ICML, 2013.
[41] K. Yu, Y. Lin, and J. Lafferty, "Learning image representations from the pixel level via hierarchical sparse coding," in CVPR, 2011.
[42] S. Belongie, J. Malik, and J. Puzicha, "Shape matching and object recognition using shape contexts," IEEE TPAMI, vol. 24, no. 4, pp. 509–522, 2002.
[43] D. Keysers, T. Deselaers, C. Gollan, and H. Ney, "Deformation models for image recognition," IEEE TPAMI, vol. 29, no. 8, pp. 1422–1435, 2007.
[44] M. D. Zeiler and R. Fergus, "Stochastic pooling for regularization of deep convolutional neural networks," in ICLR, 2013.
[45] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio, "Contractive auto-encoders: explicit invariance during feature extraction," in ICML, 2011.
[46] K. Sohn and H. Lee, "Learning invariant representations with local transformations," in ICML, 2012.
[47] M. Varma and A. Zisserman, "A statistical approach to material classification using image patch exemplars," IEEE TPAMI, vol. 31, no. 11, pp. 2032–2047, 2009.
[48] E. Hayman, B. Caputo, M. Fritz, and J. O. Eklundh, "On the significance of real-world conditions for material classification," in ECCV, 2004.
[49] M. Crosier and L. Griffin, "Using basic image features for texture classification," IJCV, pp. 447–460, 2010.
[50] R. E. Broadhurst, "Statistical estimation of histogram variation for texture classification," in Proc. Workshop on Texture Analysis and Synthesis, 2006.
[51] K. Grauman and T. Darrell, "The pyramid match kernel: Discriminative classification with sets of image features," in ICCV, 2005.

[52] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: spatial pyramid matching for recognizing natural scene categories," in CVPR, 2006.
[53] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in ECCV, 2014.
[54] Q. V. Le, J. Ngiam, Z. Chen, D. Chia, P. W. Koh, and A. Y. Ng, "Tiled convolutional neural networks," in NIPS, 2010.
[55] K. Yu and T. Zhang, "Improved local coordinate coding using local tangents," in ICML, 2010.
[56] L. Bo, X. Ren, and D. Fox, "Kernel descriptors for visual recognition," in NIPS, 2010.
[57] A. Coates, H. Lee, and A. Y. Ng, "An analysis of single-layer networks in unsupervised feature learning," in NIPS Workshop, 2010.
[58] A. Krizhevsky, "cuda-convnet," http://code.google.com/p/cuda-convnet/, July 18, 2014.
[59] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," in NIPS, 2012.
[60] M. Lin, Q. Chen, and S. Yan, "Network in network," arXiv:1312.4400v3, 2014.

Tsung-Han Chan received the B.S. degree from the Department of Electrical Engineering, Yuan Ze University, Taiwan, in 2004 and the Ph.D. degree from the Institute of Communications Engineering, National Tsing Hua University, Taiwan, in 2009. He is currently working as a Project Lead R&D Engineer with Sunplus Technology Co., Hsinchu, Taiwan. His research interests are in image processing and convex optimization, with a recent emphasis on computer vision and hyperspectral remote sensing.

Kui Jia received the B.Eng. degree in marine engineering from Northwestern Polytechnical University, China, in 2001, the M.Eng. degree in electrical and computer engineering from National University of Singapore in 2003, and the Ph.D. degree in computer science from Queen Mary, University of London, London, U.K., in 2007. He is currently a Research Scientist at Advanced Digital Sciences Center. His research interests are in computer vision, machine learning, and image processing.

Shenghua Gao received the B.E. degree from the University of Science and Technology of China in 2008, and received the Ph.D. degree from the Nanyang Technological University in 2013. He is currently a postdoctoral fellow in Advanced Digital Sciences Center, Singapore. He was awarded the Microsoft Research Fellowship in 2010. His research interests include computer vision and machine learning.

Jiwen Lu is currently a research scientist at the Advanced Digital Sciences Center (ADSC), Singapore. His research interests include computer vision, pattern recognition, machine learning, and biometrics. He has authored/co-authored more than 70 scientific papers in peer-reviewed journals and conferences including some top venues such as the TPAMI, TIP, CVPR and ICCV. He is a member of the IEEE.

Zinan Zeng received the master's degree and B.E. degree with First Honour from the School of Computer Engineering, Nanyang Technological University, Singapore. He is now a senior software engineer in Advanced Digital Sciences Center, Singapore. His research interests include statistical learning and optimization with applications in computer vision.

Yi Ma (F'13) is a professor of the School of Information Science and Technology of ShanghaiTech University. He received his Bachelor's degree in Automation and Applied Mathematics from Tsinghua University, China in 1995. He received his M.S. degree in EECS in 1997, M.A. degree in Mathematics in 2000, and his Ph.D. degree in EECS in 2000, all from UC Berkeley. From 2000 to 2011, he was an associate professor of the ECE Department of the University of Illinois at Urbana-Champaign, where he now holds an adjunct position. From 2009 to early 2014, he was a principal researcher and manager of the visual computing group of Microsoft Research Asia. His main research areas are in computer vision and high-dimensional data analysis. Yi Ma was the recipient of the David Marr Best Paper Prize from ICCV 1999 and Honorable Mention for the Longuet-Higgins Best Paper Award from ECCV 2004. He received the CAREER Award from the National Science Foundation in 2004 and the Young Investigator Program Award from the Office of Naval Research in 2005. He has been an associate editor for IJCV, SIIMS, IEEE Trans. PAMI and Information Theory.
