ENSEMBLE METHODS FOR
HANDWRITTEN DIGIT RECOGNITION
L. K. Hansen¹, C. Liisberg², and P. Salamon³
Abstract. Neural network ensembles are applied to handwritten digit recognition. The individual networks of the ensemble are combinations of sparse Look-Up Tables with random receptive fields. It is shown that the consensus of a group of networks outperforms the best individual of the ensemble, and further we show that it is possible to estimate the ensemble performance as well as the learning curve on a medium-size database. In addition we present a preliminary analysis of experiments on a large database and show that state of the art performance can be obtained using the ensemble approach by optimizing the receptive fields.
INTRODUCTION
Recognition of handwritten digits is a serious, current candidate for a "real world" benchmark problem to assess pattern recognition methods; for a recent review see [4]. It has been the object of a recent state of the art application of neural networks [5].
Neural network ensembles were introduced recently as a means for improving network training and performance. The consensus of a neural network ensemble may outperform individual networks [1], and ensembles can be used to implement oracle functions [2]. Furthermore, the consensus may be used for realization of fault tolerant neural network architectures [3]. Within the present system for recognition of handwritten digits, we find that the ensemble consensus outperforms the best individual of the ensemble by 20-25%. However, due to correlation among errors made by the participating networks, the marginal benefit obtained by increasing the ensemble size is low once the ensemble size is ≥ 15. Our findings are in line with the results obtained in [1]. We illustrate the theoretical tools for predicting the performance of the ensemble consensus, and we demonstrate the use of the ensemble as an oracle in its capacity of predicting the learning curve, i.e., the number of test errors as a function of the number of training examples. This and other ensemble oracle functions were introduced by Salamon et al. [2]. Real world applications face the problem of
¹CONNECT, Electronics Institute B349, The Technical University of Denmark, DK-2800 Lyngby, Denmark, lars@eiffel.ei.dth.dk
²CONNECT, Dept. of Optics and Fluid Dynamics, Risø National Laboratory, DK-4000 Roskilde, liisberg@risoe.dk
3Dept. of Mathematical Sciences, San Diego State University, San Diego CA 92182 USA,
salamon@math.sdsu.edu
0-7803-0557-4/92 $03.00 © 1992 IEEE
Authorized licensed use limited to: Danmarks Tekniske Informationscenter. Downloaded on July 06,2010 at 09:47:05 UTC from IEEE Xplore. Restrictions apply.
Figure 1: Correlation of test errors among members of an ensemble of 11
networks trained on a database of 3471 handwritten digits. The test set of 3500
examples was written by an independent group of 140 people. Note that the
level in the diagonal is the average performance. If the errors were independent,
the level outside the diagonal would be the square of the level in the diagonal.
noise (e.g., misclassifications or sloppy handwriting), and an important problem relates to the estimation of such noise levels. In this context we introduce, and use the ensemble to estimate, the modelling deficiency. This quantity is defined as the residual test error using the model, and it is estimated as the frequency of consensus errors in an infinite ensemble. The modelling deficiency is determined by the model design as well as by the inherent noise level. As a specific result we present evidence that there is, on average, only one dominant alternative (misclassification) for each digit.
The present study is based on a pattern recognition strategy designed by Liisberg et al. [7]. The individual network device is a collection of Look-Up Tables (LUT's) with sparse random receptive fields. The randomness of the individual networks tends to differentiate their generalization (test) errors, creating an ideal setting for ensembles. By combining such LUT's, one may obtain improved performance by using the consensus of the ensemble. Figure 1 illustrates the level of correlation between networks for the particular application. The heights of the columns are the fractions of test examples with coincident errors in an ensemble of 11 networks.
In the next section we discuss the LUT-approach and section three reviews
the basic notions of ensemble theory. In section four we discuss experiments
on a 7000-digit database, while section five contains some concluding remarks
as well as an outline of future work on extended databases. In particular we
present a preliminary analysis of experiments on the NIST database [8]; we show that by using optimized receptive fields it is possible to obtain state of
the art performance.
SPARSE RANDOM LOOK-UP TABLES
The network system is a feed-forward net with one "hidden layer". The units of the hidden layer are equipped with sparse random receptive fields. Each hidden unit receives binary input from $n_R$ input units. The activity pattern in the receptive field is interpreted as the bit-pattern of an address in a Look-Up Table. Each LUT has $2^{n_R}$ rows of $M$ entries, with $M$ being the number of output categories.
Training Phase. In the training phase an example, consisting of a binary image and the corresponding classification, is loaded onto the network by incrementing the activity of the entry corresponding to the given address and the given output category. After a single pass through the training set the entries will have different activities, reflecting the correlations among bit-patterns/addresses and classes.
The size of the receptive fields $n_R$ determines, together with $M$, the number of entries in the LUT's. To ensure proper generalization we fix $n_R$ by the heuristic $n_R = \lfloor \log_2(m) - 1 \rfloor$, where $m$ is the number of training examples; this leaves two examples per row on the average. A network consists of $N_{LUT}$ Look-Up Tables. $N_{LUT}$ is in turn determined by the constraint that all pixels should contribute: $N_{LUT} = \lceil n_{pixels} / n_R \rceil$, where $n_{pixels}$ is the number of pixels in the visual field. The assignment of the receptive fields is a random permutation of the pixel indices, ensuring that all pixels participate in the classification.
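The sizing heuristic above amounts to a few lines of arithmetic; the sketch below illustrates it for the database used later in the paper (the function names are illustrative, not from the authors' implementation):

```python
import math

def receptive_field_size(m):
    # n_R = floor(log2(m) - 1): leaves about two training examples per LUT row
    return int(math.floor(math.log2(m) - 1))

def number_of_luts(n_pixels, n_r):
    # enough tables that every pixel falls into some receptive field
    return math.ceil(n_pixels / n_r)

# the paper's first series: 3471 training examples, 16 x 16 binary images
n_r = receptive_field_size(3471)
print(n_r, number_of_luts(16 * 16, n_r))  # 10 26
```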
Application of the Trained Network. In the application phase an example, i.e., a binary image with unknown classification, produces a specific bit-pattern in each receptive field of the LUT's. Each pattern is translated into the proper address, and the classification is given by the class of the particular address having the highest activity. The output of the LUT is fed to the output layer, where the activity of the given class is incremented by one. As a result of the LUT processing, each digit obtains a score. The network output is the digit obtaining the highest LUT activity.
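Putting the two phases together, a minimal sketch of one such LUT network might look as follows. This is not the authors' code; class and method names are my own, and the dictionary-backed tables stand in for whatever storage the original system used:

```python
import random
from collections import defaultdict

class LUTNetwork:
    """Sketch of a sparse random Look-Up Table classifier."""

    def __init__(self, n_pixels, n_r, n_classes=10, seed=0):
        rng = random.Random(seed)
        pixels = list(range(n_pixels))
        rng.shuffle(pixels)  # random permutation: every pixel lands in one field
        self.fields = [pixels[i:i + n_r] for i in range(0, n_pixels, n_r)]
        # one table per receptive field: address -> per-class activity counts
        self.tables = [defaultdict(lambda: [0] * n_classes) for _ in self.fields]
        self.n_classes = n_classes

    def _address(self, image, field):
        # the binary activity pattern in the field, read as an integer address
        return sum(image[p] << i for i, p in enumerate(field))

    def train(self, image, label):
        # single pass: increment the (address, class) entry in each LUT
        for field, table in zip(self.fields, self.tables):
            table[self._address(image, field)][label] += 1

    def classify(self, image):
        # each LUT votes for the class with highest activity at its address
        score = [0] * self.n_classes
        for field, table in zip(self.fields, self.tables):
            row = table[self._address(image, field)]
            if any(row):
                score[row.index(max(row))] += 1
        return score.index(max(score))
```

Training is a single counting pass, so the scheme is very fast compared with gradient-based nets, which is part of its appeal in [7].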
ENSEMBLE METHODS
The consensus decision of an ensemble of pattern recognition devices may be more reliable than that of the individuals if two basic qualitative criteria are met: 1) the individual networks perform reasonably well; 2) the errors of the different networks are to some degree independent. The necessity of the first criterion is obvious; there is no such thing as a free lunch. Provided that the second criterion also holds, the consensus decision is usually superior to the best individual even when there is a large range in individual performance [1].
Estimating Ensemble Performance. Correlation of errors among networks may be present for two reasons. First, the method used to create the networks may induce correlation in the way the networks err. Secondly, we have to face the fact that some examples of the pattern recognition problem at hand are more difficult than others. The distribution of pattern difficulty is naturally discussed in terms of the ensemble. The difficulty $\theta$ of an example is defined as the fraction in a large ensemble of trained networks that misclassifies it. This induces a difficulty distribution $p(\theta)$ that gives the probability of seeing an example with difficulty $\theta$ in a random choice from the set of possible patterns which are candidates for classification. Using $p(\theta)$, we may estimate the performance of the plurality decision (where the option with the largest number of votes wins) using the tools proposed in [1].
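In practice the difficulties are estimated directly from the ensemble's test errors. A minimal sketch, assuming a boolean error matrix with one row per network and one column per test example:

```python
def difficulties(errors):
    """errors[i][j] is truthy if network i misclassifies test example j.
    Returns per-example difficulty: the fraction of networks that err."""
    n_nets = len(errors)
    n_examples = len(errors[0])
    return [sum(errors[i][j] for i in range(n_nets)) / n_nets
            for j in range(n_examples)]

# toy ensemble of 4 networks on 3 test examples
errs = [[0, 1, 1],
        [0, 1, 0],
        [0, 1, 0],
        [0, 0, 1]]
print(difficulties(errs))  # [0.0, 0.75, 0.5]
```

A histogram of these values is an empirical estimate of $p(\theta)$.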
In this presentation we consider plurality decisions among $M = 10$ categories, and we will estimate the consensus performance taking the difficulty distribution into account. The prediction is based on the observation that, once we have formulated the problem in the difficulty representation, the errors can be treated as if they were independent. We first compute the error probability of an ensemble of $N$ devices on the slice of example space that has difficulty $\theta$:

$$P^{N,M}_{plurality}(\theta) = \sum_{n_1=0}^{N} \binom{N}{n_1} (1-\theta)^{n_1} \theta^{N-n_1} \, p_{err}(N, M, n_1) \qquad (1)$$

with:

$$p_{err}(N, M, n_1) = (M-1) \sum_{n_2=n_1+1}^{N-n_1} \binom{N-n_1}{n_2} \Big(\frac{1}{M-1}\Big)^{n_2} \Big(\frac{M-2}{M-1}\Big)^{N-n_1-n_2} - H(N, M, n_1) \qquad (2)$$

and with the proviso that binomial coefficients with negative entries are zero. $P^{N,M}_{plurality}(\theta)$ is the probability that any one of the $M-1$ equally likely alternatives gets more votes than the correct alternative [1]. The formulas are best used with $M$ as an adjustable parameter: the effective degree of confusion, $M_{eff}$. The value of $M_{eff}$ is typically less than the actual number of output categories for the problem. Note that for $M = 2$, $H(N, 2, n_1) = 0$ for all $n_1 > N/2$, and thus the second term on the RHS of equation (2) vanishes. To get the average performance we integrate the above expression using the difficulty distribution:

$$P^{N,M}_{plurality} = \int_0^1 P^{N,M}_{plurality}(\theta) \, p(\theta) \, d\theta \qquad (3)$$
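The model behind equations (1)-(3) can also be checked by simulation: each of $N$ networks is correct with probability $1-\theta$, and wrong votes are split uniformly among the $M-1$ alternatives. The sketch below is illustrative only; in particular, resolving ties in favor of the correct class is my assumption, since the paper counts an error only when an alternative gets strictly more votes:

```python
import random

def plurality_error(theta, n_nets, m_eff, trials=20000, seed=1):
    """Monte Carlo estimate of the consensus error rate on the slice of
    example space with difficulty theta, assuming independent errors and
    wrong votes split uniformly over the m_eff - 1 alternatives."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        votes = [0] * m_eff          # index 0 is the correct class
        for _ in range(n_nets):
            if rng.random() < theta:
                votes[rng.randrange(1, m_eff)] += 1  # a wrong alternative
            else:
                votes[0] += 1
        if max(votes[1:]) > votes[0]:  # some alternative beats the correct class
            errors += 1
    return errors / trials

# for M_eff = 2 the error rate approaches a step at theta = 0.5 as N grows
print(plurality_error(0.3, 31, 2))  # close to 0 for theta well below 0.5
```

Averaging this quantity over an empirical difficulty distribution implements the integral in equation (3).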
Prediction of the Learning Curve. Applications tend to run short of examples. The necessary number of training examples is therefore a most important issue. A central result, obtained by Schwartz et al. [9], predicts the test error as a function of the training set size. Schwartz et al. consider the distribution of generalization proficiencies $g$ (test performances) of the ensemble of possible networks specified by (say) a given architecture. Assuming the distribution of generalization abilities, $p_{m_0}(g)$, to be given at some specific training set size $m_0$, the predicted distribution of networks compatible with a larger set of $m + m_0$ examples is approximately given by

$$p_{m+m_0}(g) = g^m p_{m_0}(g) \Big/ \int_0^1 g^m p_{m_0}(g) \, dg.$$

This in turn makes it possible to predict the learning curve from the given measured $p_{m_0}$:

$$\bar{g}(m + m_0) = \int_0^1 g \, p_{m+m_0}(g) \, dg. \qquad (4)$$
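Numerically, the reweighting by $g^m$ is straightforward. The sketch below uses a crude Riemann sum on a toy uniform prior (both the prior and the function name are illustrative, not from the paper):

```python
def predicted_mean_generalization(g_grid, p0, m):
    # reweight p_m0(g) by g^m and return the normalized mean: the
    # predicted generalization after m additional training examples
    w = [p * g ** m for g, p in zip(g_grid, p0)]
    return sum(g * wi for g, wi in zip(g_grid, w)) / sum(w)

# toy prior: uniform on [0.5, 1.0] at training set size m0
grid = [0.5 + 0.001 * i for i in range(501)]
prior = [1.0] * len(grid)
for m in (0, 10, 100):
    print(m, round(predicted_mean_generalization(grid, prior, m), 3))
```

The predicted mean generalization rises monotonically with $m$, which is the qualitative shape of the learning curves in fig. 4.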
To estimate $p_{m_0}$ from a finite sample of networks, we apply a Maximum Entropy argument. We choose the distribution $p_{m_0}$ with the largest entropy consistent with the values of three measurements: the mean $\bar{g}$, the variance $v_g$, and the range of generalizations $[0, g_{max}]$. While $\bar{g}$ and $v_g$ are readily computed from a given ensemble, we need to develop an estimate of $g_{max}$. We propose here to use the extrapolated performance of an infinite ensemble trained on the given training set, by taking the limit $N \to \infty$ in equation (3). The resulting maximum entropy distribution is of the form $p_{m_0}(g) = Z^{-1} \exp(-(g-a)^2/2b)$, where the parameters $a, b$ are determined by the constraint that the mean and variance are given by the measured values $\bar{g}, v_g$, and where $Z = Z(a, b, g_{max})$ ensures the proper normalization on the interval $[0, g_{max}]$. The resulting distribution may then be inserted in (4) to predict the learning curve.
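Matching the truncated Gaussian's parameters $(a, b)$ to the measured mean and variance can be done with any root-finder; the sketch below uses a deliberately naive grid search just to make the moment-matching explicit (all names and grid ranges are illustrative assumptions):

```python
import math

def truncated_gauss_moments(a, b, g_max, n=400):
    # mean and variance of Z^-1 exp(-(g-a)^2 / 2b) restricted to [0, g_max],
    # computed by a midpoint rule on n cells
    dg = g_max / n
    gs = [(i + 0.5) * dg for i in range(n)]
    w = [math.exp(-(g - a) ** 2 / (2 * b)) for g in gs]
    z = sum(w)
    mean = sum(g * wi for g, wi in zip(gs, w)) / z
    var = sum((g - mean) ** 2 * wi for g, wi in zip(gs, w)) / z
    return mean, var

def fit_max_entropy(target_mean, target_var, g_max):
    # crude search for (a, b) reproducing the measured mean and variance
    best, best_err = None, float("inf")
    for ia in range(41):
        a = g_max * ia / 40
        for ib in range(1, 41):
            b = 0.002 * ib
            m, v = truncated_gauss_moments(a, b, g_max)
            err = (m - target_mean) ** 2 + (v - target_var) ** 2
            if err < best_err:
                best, best_err = (a, b), err
    return best
```

Note that $a$ need not equal $\bar{g}$: truncation at $0$ and $g_{max}$ shifts the moments, which is exactly why the fit is needed.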
EXPERIMENTAL
The primary body of experiments for this study is based on a digit database consisting of 6973 handwritten digits written by 280 people, who all had filled in a one-page form with preprinted boxes and rectangles for the digits. The digits were scanned as binary images with a resolution of 200 dots per inch, segmented, and scaled to fit within a 16 by 16 grid. Digits whose width was bigger than their height were scaled so width equalled height; otherwise the proportions were kept and the resulting digits left-adjusted. There is approximately the same number of examples of each of the digits (0-9) in the database. The database has been inspected and cleaned for false classifications, segmentation errors, and dirt. All digits are readable by a human; however, for some of the digits the
human readers needed to inspect the writing style context of other digits. The test set and training set were written by two different groups of 140 people each. Two series of experiments were conducted. In the first series, network ensembles with 5, 7, 15, and 30 members were trained on a training set of 3471 digits. In the second experiment a 7-member ensemble was trained on training sets of size 200, 500, 1000, 1500, 2000, 2500, 3000, and 3471. In both cases the test set was 3500 digits. Consensus decisions were implemented using the plurality scheme, where the digit that collects the maximum number of ensemble votes wins. A vote is forced from each network, i.e., there is no rejection at the individual member or at the consensus level.
We cross-validate the networks on the 3500-digit test set. The ensemble average, the best-of-ensemble, and the plurality performance are depicted in fig. 2 as a function of the ensemble size. Our first observation is that the consensus outperforms the best individual of the ensemble by 20-25%. The benefit of increasing the ensemble size beyond 15 members is marginal, however. We use eqs. (1)-(3) to estimate the consensus performance of larger ensembles based on measurements from the 7-member ensemble. The difficulty distribution was recorded and used for prediction. In fig. 3 we show the predictions using various effective degrees of confusion $M_{eff}$. The conclusion is that the best fit to the performance of the 7-member ensemble is obtained by $M_{eff} = 2$. This indicates that, when a network makes an error, there is only one dominant alternative available for each example. This conclusion is corroborated by a direct inspection of the distribution of errors. In particular we counted the number of wrong alternatives that had obtained more votes than the correct classification. This count estimated the average number of alternatives considered by the networks to be 1.4.
Figure 3: Theoretical prediction of consensus performance for varying effective degrees of confusion, $M_{eff}$. Experimental data (○).
Figure 4: Learning curves for the average of the ensemble (full line), the best of ensemble (dashed line), and for the consensus (dotted line).
Figure 5: Theoretical prediction of the learning curve based on the data of the 7-member ensemble trained on 500 examples (dash-dotted lines). The lower prediction in the plot is based on $g_{max} = 1.0$, while the upper curve is based on the estimate of the modelling deficiency: $g_{max} = 0.91$.
The learning curves for the average and for the plurality consensus of the 7-member ensemble are presented in fig. 4. We use the ensemble to compute the mean and variance based on 500 training examples. The estimated maximum generalization ability of the model, given the noise level, is derived from the extrapolated performance of an infinite ensemble trained on 500 examples. As noted above, using $M_{eff} = 2$ makes the second term on the RHS of equation (2) vanish. Taking the limit of the resulting $P^{N,M_{eff}}_{plurality}(\theta)$ as $N \to \infty$ gives $P^{N,M_{eff}}_{plurality}(\theta) = 1$ if $\theta > 0.5$ and $P^{N,M_{eff}}_{plurality}(\theta) = 0$ if $\theta < 0.5$. Using this in equation (3) gives the simple expression [1]: $P_{max} = \int_{0.5}^{1} p(\theta) \, d\theta$. Based on this estimator we find the modelling deficiency to be $1 - g_{max} = P_{max} = 0.09$. In fig. 5 we show the learning curve and the two predictions based on $g_{max} = 0.91$ and on $g_{max} = 1.0$ respectively. The extrapolated error rate is remarkably close to the experimental limit of the average performance if we use the estimated modelling deficiency.
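The estimator $P_{max}$ thus reduces to counting hard examples, i.e., the empirical mass of $p(\theta)$ above one half. A minimal sketch (the strict inequality at $\theta = 0.5$ is my choice; the paper does not specify the boundary case):

```python
def modelling_deficiency(difficulties):
    # limiting consensus error of an infinite ensemble with M_eff = 2:
    # the fraction of examples with difficulty strictly above 0.5
    return sum(1 for d in difficulties if d > 0.5) / len(difficulties)

print(modelling_deficiency([0.0, 0.2, 0.6, 1.0, 0.4]))  # 0.4
```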
In order to improve the classification performance of the individual networks we invoke a modified design based on optimized LUT's. In this case the networks are constructed in an iterative scheme. While constructing the networks we simulate a leave-one-out cross validation procedure on a pool of candidate LUT's. The result of the cross validation test can be computed in a very compact form using a score table quantifying the network activity for each example [10]. The successful candidate, which together with the current network leads to the optimal validation result, is included in the network, and the procedure is carried on until the gain is negligible. Furthermore we apply an ensemble reject criterion, simply enforcing that the plurality winner beats the runner-up by a
Figure 6: Experiment on the NIST database: the error rate of an 18-member ensemble, versus fraction of rejected patterns.
given margin. We trained ensembles with up to 30 networks of 30 LUT's each, having 15-unit receptive fields, on digits of the NIST database [8]. To facilitate training we divided the training set among the ensemble members, using 7000 examples per member. In fig. 6 we present the error rate versus fraction of rejected patterns of an 18-member ensemble on a test set of 10482 examples. While it is impossible to compare directly with other reported results, since these are based on different databases, it is evident that the above results are on a level with state of the art results as reported in [5, 6].
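The reject criterion described above can be sketched in a few lines; the margin value is a free parameter, since the paper does not report the margin actually used:

```python
from collections import Counter

def consensus_with_reject(votes, margin=2):
    """Plurality consensus with an ensemble reject rule: accept the winner
    only if it beats the runner-up by at least `margin` votes."""
    counts = Counter(votes).most_common()
    if len(counts) == 1 or counts[0][1] - counts[1][1] >= margin:
        return counts[0][0]
    return None  # reject the pattern

print(consensus_with_reject([3, 3, 3, 5, 8]))     # 3 (wins 3 votes to 1)
print(consensus_with_reject([3, 3, 5, 5, 8], 2))  # None (tie is rejected)
```

Raising the margin trades a higher reject rate for a lower error rate on the accepted patterns, which is the trade-off plotted in fig. 6.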
CONCLUSION
In this work we have discussed the use of ensemble methods for the identification of handwritten digits. The problem is of great practical importance. We have shown that it is possible to improve performance significantly by introducing moderate-size ensembles; in particular we found a 20-25% improvement. Our ensemble of random LUT's, when trained on a medium-size database, reaches a performance (without rejects) of 94% correct classification on digits written by an independent group of people. As a comparison, the state of the art system developed in [5] obtained an error rate of 1% with 9% rejects. In preliminary analysis of ensembles of optimized LUT networks trained on the large NIST database we reach an error rate of 0.8% with 9% rejects. The notion of modelling deficiency has been introduced and an estimator for it has been proposed. Using the estimated modelling deficiency we are able to predict the learning curve. We have presented arguments that the networks tend to confuse pairs of classifications. This, however, is not simply explained by a pair-wise confusion of digits. Rather, some subset of the instances of a given digit is confused with subsets of the instances of another digit. This merits a more detailed future
study, using the available international digit databases.
ACKNOWLEDGEMENTS
We thank Nils Hoffmann and Jan Larsen for helpful comments on the manuscript. This work is supported by the Danish Natural Science and Technical Research Councils through the Computational Neural Network Center (CONNECT).
References
[1] L. K. Hansen and P. Salamon: Neural Network Ensembles, IEEE Transactions on Pattern Analysis and Machine Intelligence, 12, 993-1001 (1990).
[2] P. Salamon, L. K. Hansen, B. E. Felts III, and C. Svarer: The Ensemble Oracle, AMSE Conference on Neural Networks, San Diego, 1991.
[3] L. K. Hansen and P. Salamon: Self-Repair in Neural Network Ensembles, AMSE Conference on Neural Networks, San Diego, 1991.
[4] V. K. Govindan and A. P. Shivaprasad: Character Recognition - A Review, Pattern Recognition 23 (1990).
[5] Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel: Handwritten Digit Recognition with a Back-Propagation Network. In Advances in Neural Information Processing Systems II (Denver 1989), ed. D. S. Touretzky, 396-404. San Mateo: Morgan Kaufmann (1990).
[6] Y. Lee: Handwritten Digit Recognition Using K Nearest-Neighbor, Radial-Basis Function, and Backpropagation Neural Networks, Neural Computation 3, 440-449 (1991).
[7] C. Liisberg: Low-priced and robust expert systems are possible using neural networks and minimal entropy coding. To appear in: Expert Systems with Applications. Pergamon Press, 1991.
[8] National Institute of Standards and Technology: NIST Special Data Base 3, Handwritten Segmented Characters of Binary Images, HWSC Rel. 4-1.1 (1992).
[9] D. B. Schwartz, V. K. Samalam, S. A. Solla, and J. S. Denker: Exhaustive Learning, Neural Computation 2, 371-382 (1990).
[10] Thomas Martini Jørgensen and Christian Liisberg: Private communication.