
International Journal of Computational Intelligence Research, ISSN 0973-1873, Vol. 2, No. 1 (2006), pp. 1-9.
Research India Publications, http://www.ijcir.info

Achieving Compatible Numeral Handwriting Recognition Rate by a Simple Activation Function


Daniel Brüderle (1), Khamron Sunat (2,*), Sirapat Chiewchanwattana (3), Chidchanok Lursinsap (4) and Suchada Siripant (4)

(1) Kirchhoff Institute for Physics, Ruprecht-Karls University, INF 227, 69120 Heidelberg, Germany. bruederle@kip.uni-heidelberg.de

(2) Department of Computer Engineering, Mahanakorn University, Cheum-Sampan Road, Nong-Chok, Bangkok 10530, Thailand. khamron_sunat@yahoo.com

(3) Department of Computer Science, Khon Kaen University, Khon Kaen 40002, Thailand. sunkra@kku.ac.th

(4) AVIC Research Center, Department of Mathematics, Chulalongkorn University, Phayathai Road, Patumwan, Bangkok 10330, Thailand. lchidcha@chula.ac.th, ssuchada@chula.ac.th

(*) Corresponding author

Abstract: Most supervised neural networks for numeral handwriting recognition employ the sigmoidal activation function to generate their outputs. Although this function performs rather well, its computation is costly and its hardware realization is complicated. Here, we introduce a simple activation function in the form of a recursive piecewise polynomial. The accuracy of recognition can be adjusted through the parameters of the function. In addition, a new risk function measuring the discrepancy between the correct and the estimated classification of the network is presented to improve the performance. When tested on a benchmark data set, the proposed activation function and risk function achieve an accuracy comparable to that obtained with the sigmoidal function.

Keywords: activation function, handwriting recognition, risk function, supervised neural network.

I. Introduction

Numeral handwriting recognition [2] is still one of the most studied problems in the neural pattern recognition area. The recognition is achieved by training a supervised neural network with an appropriate activation function and measuring the success by a risk function. The most popular activation function employed for this task is the sigmoidal or logistic function [3], since it can be adjusted to imitate a threshold function and it is also differentiable. However, this activation function suffers from costly computational time and space, even though the recognition rate is high. In this paper, we introduce a new, simple activation function in the form of a recursive piecewise polynomial, not only to reduce the computational time but also to achieve a comparable recognition rate. This activation function is adapted from the activation function previously suggested by Sunat [6]. The data set of 6000 samples is obtained from the CENPARMI database at Concordia University in Canada. The rest of the paper is organized as follows. Section 2 explains how the features are extracted from a given image. Section 3 introduces our proposed activation function and risk function. Section 4 summarizes the comparison results. Section 5 concludes the paper.

II. Basic Setup

The features of a given numeral handwriting image are extracted by adopting the multi-resolution representation concept introduced by Liu et al. [2]. The concept is summarized as follows.

In [2], Liu et al. proposed a single-layer network for solving the CENPARMI classification task. Each numeral image is captured by a multi-resolution representation, i.e. any numeral image is presented to the network in three different resolutions, namely 8x8, 4x4 and 2x2 pixels. This approach facilitates the processing of the data on both pixel and region levels for the network, which can preserve both local and global features on the same scale. It is also an important remedy for some shortcomings of the locally expanded high-order inputs introduced further below. Before the original numeral image is processed into the three resolution sub-images, it is normalized to a 32x32 pixel array. Then, the stroke-width variations in the normalized image are eliminated by Hilditch's algorithm [2], [4]. Two different methods for extracting a low-resolution image from the original one are suggested and tested in [2]. In this work, only the more successful one of both will be used, namely the Gaussian pyramid representation. This method achieves resolution reduction by low-pass filtering the given image. Starting from the original image $f_0(x, y)$ (resolution level $r = 1$) with $2^n \times 2^n$ pixels (CENPARMI: $2^n = 32$), each image of lower resolution $r = 2^j$, containing $2^{n-j} \times 2^{n-j}$ pixels, is computed by convolution of the original with an impulse response function

$$h(x, y) = \frac{1}{2\pi\sigma_x^2} \exp\!\left(-\frac{x^2 + y^2}{2\sigma_x^2}\right). \qquad (1)$$

Hence, including a reasonable sampling position within the old image for every pixel in the new image, every single layer $j$ of the Gaussian pyramid, with the original image at the bottom ($j = 0$) and a single pixel at the top ($j = n$), is computed by

$$f_j(x, y) = \sum_{x_j} \sum_{y_j} f_0(x_j, y_j)\, h\!\left(2^j x + 2^{j-1} - x_j,\; 2^j y + 2^{j-1} - y_j\right). \qquad (2)$$
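For readers who want to experiment with this preprocessing, the following Python sketch shows one way the reduction of (1) and (2) could be implemented. It is only an illustration, not the authors' code: the kernel truncation radius, the zero padding at the image border and the stand-in input image are implementation assumptions.

    import numpy as np

    def h(x, y, sigma):
        # Impulse response of Eq. (1): an isotropic 2D Gaussian.
        return np.exp(-(x**2 + y**2) / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)

    def pyramid_level(f0, j, sigma=2.0 / 3.0, radius=3):
        # Low-resolution image f_j of Eq. (2), computed from the 2^n x 2^n image f0.
        # Kernel truncation at +/- radius pixels and zero padding outside f0 are
        # assumptions, not taken from [2]; sigma defaults to the value 2/3 quoted below.
        n = f0.shape[0]
        size = n // (2 ** j)                      # 2^(n-j) pixels per side
        fj = np.zeros((size, size))
        for x in range(size):
            for y in range(size):
                cx = 2 ** j * x + 2 ** (j - 1)    # sampling position in the original image
                cy = 2 ** j * y + 2 ** (j - 1)
                acc = 0.0
                for xj in range(max(cx - radius, 0), min(cx + radius + 1, n)):
                    for yj in range(max(cy - radius, 0), min(cy + radius + 1, n)):
                        acc += f0[xj, yj] * h(cx - xj, cy - yj, sigma)
                fj[x, y] = acc
        return fj

    # Example: reduce a normalized 32x32 numeral image to 8x8, 4x4 and 2x2.
    f0 = np.random.rand(32, 32)                   # stand-in for a preprocessed image
    subimages = [pyramid_level(f0, j) for j in (2, 3, 4)]
    print([im.shape for im in subimages])         # [(8, 8), (4, 4), (2, 2)]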

Liu et al. empirically found the value of $\sigma_x$ to be close to an optimum when it is set to $2/3$. Another method proposed by Liu et al. and adopted for this work is the usage of locally expanded high-order inputs. This means that not only the pixel gray-scale values themselves are presented to the network input, but also the pairwise products of the values of all neighboring pixels in each single sub-image are considered. For keeping input values at about the same magnitude, the product of two pixel values is replaced by its square root,

$$f_h(x_1, y_1, x_2, y_2) = \sqrt{f(x_1, y_1)\, f(x_2, y_2)}. \qquad (3)$$

Figure 1 illustrates the pixel arrays of different resolutions, the connection bars between neighboring pixels representing the second-order products.

Figure 1. 2D pixel patterns with resolutions 8x8, 4x4 and 2x2. The bars represent the local second-order expansions described in the text. Every dot and every bar denote an input value to the network. The picture is taken from [2].

Every dot and every bar in the figure denotes an input to the neural network. The total number of inputs becomes 8*8 + 4*4 + 2*2 (pixel arrays of the differently resolved images) + 7*7*4 + 2*7 (second-order inputs for the 8x8 image) + 3*3*4 + 2*3 (second-order inputs for the 4x4 image) + 6 (second-order inputs for the 2x2 image) = 342. Let $\alpha_j$ be the activation of a neuron $j$, i.e. the sum of all input values $x_i$ running into this neuron, weighted with an individual synaptic strength $w_{ij}$,

$$\alpha_j = \sum_i w_{ij}\, x_i. \qquad (4)$$

Since the CENPARMI handwriting recognition task is a 10-class problem, the network presented in [2] consists of $N_{class} = 10$ output neurons with a so-called logistic activation function (AF), i.e. the neuron output $y_j$ is determined by its activation as follows:

$$y_j^{lgt}(\alpha_j) = \frac{1}{1 + e^{-\alpha_j}}, \qquad j = 1, \ldots, N_{class}. \qquad (5)$$
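Continuing the preprocessing sketch above, the following Python snippet assembles the locally expanded input vector and evaluates the output layer of Eqs. (4) and (5). The neighborhood used for the second-order products (horizontal, vertical and both diagonal directions, each pair counted once) is inferred from the input counts quoted above and is therefore an assumption, as is the random stand-in data; the bias input used later in the experiments is omitted here.

    import numpy as np

    def neighbor_pairs(size):
        # All pairs of neighboring pixels (horizontal, vertical and the two
        # diagonal directions) in a size x size grid, each pair counted once.
        pairs = []
        for x in range(size):
            for y in range(size):
                for dx, dy in ((0, 1), (1, 0), (1, 1), (1, -1)):
                    x2, y2 = x + dx, y + dy
                    if 0 <= x2 < size and 0 <= y2 < size:
                        pairs.append(((x, y), (x2, y2)))
        return pairs

    def input_vector(subimages):
        # Pixel values of all sub-images plus the square-root products of Eq. (3).
        features = []
        for im in subimages:
            features.extend(im.ravel())           # first-order inputs
            for (x1, y1), (x2, y2) in neighbor_pairs(im.shape[0]):
                features.append(np.sqrt(im[x1, y1] * im[x2, y2]))   # Eq. (3)
        return np.array(features)

    subimages = [np.random.rand(s, s) for s in (8, 4, 2)]   # stand-ins for the pyramid output
    x = input_vector(subimages)
    print(x.size)                                 # 342, as stated above

    # Single-layer classifier of Eqs. (4) and (5): ten logistic output neurons.
    w = np.zeros((10, x.size))                    # synaptic weights w_ij (untrained)
    alpha = w @ x                                 # Eq. (4): activations
    y = 1.0 / (1.0 + np.exp(-alpha))              # Eq. (5): logistic outputs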

The network serves as a classifier, i.e. for every input pattern presented to the input there should always be exactly one output neuron active, namely the one associated with the correct class, while all other output neurons should exhibit their minimum value. This desired ideal classification function can be written as

$$\mathrm{class}: X \to S \qquad (6)$$

with $X$ being the set of input patterns (for this study the preprocessed CENPARMI database) and

$$S = \left\{ s^{(j)} \;\middle|\; s^{(j)} = (0, \ldots, 0, 1, 0, \ldots, 0) \in \{0,1\}^{N_{class}} \text{ with the single 1 at position } j,\; j = 1, \ldots, N_{class} \right\}. \qquad (7)$$

Let $\mathbf{w}$ be the vector of all synaptic weights within the network, i.e. the vector of all free parameters $w_{ij}$ in the proposed setup. Furthermore, let $s(x) \in S$ be the desired network output vector for an applied input pattern $x \in X$. Then $\mathbf{w}$ can be trained in a supervised way, namely by reducing the mean square error function (MSE) [3]

$$E_{MSE}(\mathbf{w}) = \frac{1}{2} \sum_{j=1}^{N_{class}} (y_j - s_j)^2 \qquad (8)$$

with the widely used error back-propagation (BP) algorithm. Minimizing an arbitrary error function with BP can easily be derived from the following equation:

$$-\frac{\partial E(\mathbf{w})}{\partial w_{ij}} = -\frac{\partial E(\mathbf{w})}{\partial \alpha_j}\,\frac{\partial \alpha_j}{\partial w_{ij}} = -\frac{\partial E(\mathbf{w})}{\partial y_j}\,\frac{\partial y_j}{\partial \alpha_j}\, x_i =: \delta_j\, x_i. \qquad (9)$$

The so-called generalized delta term $\delta_j$ encapsulates

$$\delta_j = -\frac{\partial E(\mathbf{w})}{\partial y_j}\,\frac{\partial y_j(\alpha_j)}{\partial \alpha_j}. \qquad (10)$$

Equation (9) leads to the well-known gradient descent rule for weight updating,

$$\Delta w_{ij} = \eta\, \delta_j\, x_i, \qquad (11)$$

including some small, dimensionless learning rate $\eta$. This weight change normally is applied iteratively, i.e. for every available input pattern a single weight update is performed, usually multiple times in a round-robin manner, until the network output satisfies some predefined criterion (typically an error value to be under-run). The classification decision for a non-ideal classification network is based on a simple maximum search over all $N_{class}$ output neurons. The one with the largest output value is associated with the guessed class.

For the MSE error function defined in (8) and the logistic AF defined in (5), the learning rule (11) concretizes to

$$\Delta^{MSE} w_{ij} = \eta\, (s_j - y_j)\, y_j'\, x_i. \qquad (12)$$

Inserting the logistic AF (see (5)) and its derivative $y_{lgt}' = y_{lgt} (1 - y_{lgt})$, the gradient descent rule used in [2], [3] emerges:

$$\Delta^{MSE} w_{ij,lgt} = \eta\, (s_j - y_j)\, y_j (1 - y_j)\, x_i. \qquad (13)$$

So far, all setup suggestions by Liu et al. have been followed for this study. A control experiment exactly following their methods led to the same results as theirs, see Sec. 4 for details.

III. New Approach

Computing an ANN output mainly means computing the output of each involved neuron as a function of its activation. Inspired by, among others, [5] and [7], the author of [6] proposed a specific method to model a sigmoid-like AF with a piecewise defined polynomial. The basic polynomial formulation of a function close to the hyperbolic tangent is

$$g(\alpha, p) = \begin{cases} -1, & \alpha \le -2^{p+1}, \\ -1 + \left(1 + 2^{-(p+1)}\,\alpha\right)^{2^{p+1}}, & -2^{p+1} < \alpha < 0, \\ 1 - \left(1 - 2^{-(p+1)}\,\alpha\right)^{2^{p+1}}, & 0 \le \alpha < 2^{p+1}, \\ +1, & \alpha \ge 2^{p+1}. \end{cases} \qquad (14)$$

As shown in [6], this piecewise polynomial with $p$ being zero or a positive integer can be implemented in a fast recursive way for software simulations, leading to the term p-recursive piecewise polynomial (p-RPP) for the new AF. It is also proven that $g(\alpha, p)$ is differentiable in $\alpha$, and a fast pseudo-code algorithm providing both $g$ and $g'$ is given. Figure 2 shows the graph of the second-order 0-RPP (solid line) compared with the hyperbolic tangent (dash-dotted line). Even for this lowest order both curves fit quite well, as the difference plot (dashed line) exhibits. Hence, the main advantage of replacing a real sigmoid AF by the p-RPP is its computational efficiency. In [6] the speed-up is documented with some systematic experiments. The results show a computation time decrease for 3-RPP of 30% to 64% compared to the sigmoid AF (5), and a decrease of 40% up to more than 80% compared with the hyperbolic tangent function, all results being strongly compiler- and precision-dependent. Of course the speed-up is even larger for p-RPP functions of lower orders, e.g. 65% less computation time being the worst value for 0-RPPs compared with the hyperbolic tangent.
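To make the recursive evaluation concrete, the following Python sketch implements the reconstructed form of Eq. (14) together with its derivative. Evaluating the polynomial by squaring (p+1) times is how the recursion of [6] is understood here; equivalence with the exact pseudo-code algorithm of [6] is assumed, not verified.

    import numpy as np

    def p_rpp(alpha, p):
        # Sigmoid-like p-RPP activation of Eq. (14) and its derivative.
        bound = 2.0 ** (p + 1)                        # saturation threshold 2^(p+1)
        a = np.asarray(alpha, dtype=float)
        t = np.clip(1.0 - np.abs(a) / bound, 0.0, None)   # 1 - |alpha|/2^(p+1), floored at 0
        poly = t.copy()
        for _ in range(p + 1):                        # poly = t^(2^(p+1)) by repeated squaring
            poly = poly * poly
        g = np.sign(a) * (1.0 - poly)                 # odd symmetry around alpha = 0
        safe_t = np.where(t > 0.0, t, 1.0)
        dg = np.where(t > 0.0, poly / safe_t, 0.0)    # t^(2^(p+1)-1), zero in the flat ranges
        return g, dg

    # The 0-RPP is a second-order polynomial that saturates at |alpha| = 2:
    g, dg = p_rpp(np.array([-3.0, -1.0, 0.0, 1.0, 3.0]), p=0)
    print(g)    # approximately [-1, -0.75, 0, 0.75, 1], close to tanh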

Figure 2. The graph of the 0-RPP function (solid) shown together with the hyperbolic tangent (dash-dotted), and the difference between both (dashed). The picture is taken from [6].

A major problem for the training of an ANN with a gradient descent method like the BP algorithm are AFs exhibiting ranges of constant values, since no gradient means no weight change. If input patterns generate an activation value in these regions, the synaptic weights will not change anymore, i.e. training can run into saturation without reaching a global minimum. Typical AFs like the logistic function (see (5)) or the hyperbolic tangent do not have regions with a gradient being totally zero. Therefore, they may run into phases of very slow learning, but never will completely get stuck. Another problem of constant output for larger ranges of activation arises especially for classification tasks. The classification guess of a network normally is decided by just associating the neuron having the largest output with the predicted class. We will call this the classify-by-max (CBM) criterion. But for the p-RPP function, which outputs its maximum (minimum) value for all activation values $|\alpha| \ge 2^{p+1}$, the probability of two neurons exhibiting exactly the same output value for a specific input pattern becomes significantly large. If, for example, two 0-RPP neurons are activated to the values $\alpha_1 = 2.01$ and $\alpha_2 = 13.35$, both will output $y_1 = y_2 = 1$. Hence, the zero-gradient viz. stereotypic-output ranges are a problematic feature of p-RPP neurons. This has to be countervailed with extra efforts. A promising idea is to find a mechanism that keeps the activation of p-RPP neurons within the bounds for non-linearity, $[-2^{p+1}, 2^{p+1}]$, while network training is performed.

Furthermore, a situation that surely should be avoided is a pre-training synapse weight initialization that already pushes neurons into saturation for many input patterns due to large activation values. This will probably happen with large initial weights. In the following section the results of the handwriting recognition experiment as proposed in [2] are reproduced. The same experiment is repeated using p-RPP neurons without further improvements. A strong dependency of training success on the weight initialization can be shown. The loss of training performance for high initial weights is obvious and can be assigned primarily to the zero-gradient ranges and the problems resulting from them as described above.

A method supporting and developing a network-wide moderate neuron activation will be introduced now. If learning means reducing the MSE error function as defined in (8), the intention - and most times the result - is a network exhibiting one large output value for signaling a class recognition and small output values for all other neurons. The tendency during training with MSE is to manipulate the synaptic weights such that for all input patterns this large-small gap is optimized. Thus, BP training with MSE often pushes some synapses to extreme weights in order to reduce the error contribution from a subset of input patterns and their according targets. This may lead to extreme activation values for some other patterns. As explained above, such a development is problematic for AFs with stereotypic output ranges like the p-RPP. In [1] an error functional is proposed that can be transformed into an error function for the handwriting recognition problem, involving all output neurons at the same time for a single weight update. The main idea is to add a defuzzification layer right behind the output of the neural network. This output layer has as many units as the network has output neurons or classes, and shall perform a mapping from any vector $y$ of all possible network outputs $Y$ to the desired target vector $s \in S$, with $S$ according to the definition in (7),

$$c: Y \to S. \qquad (15)$$

Vector $y$ is computed by the activation function having $x$ and $\mathbf{w}$ as its variables. Hence, an ideal defuzzification layer satisfies $c(y(x, \mathbf{w})) = \mathrm{class}(x)$ for all input patterns $x \in X$. Obviously, an error function based on such crisp outputs, associating input patterns with elements of $S$, is not differentiable and therefore not suited to be used within a BP-like network training. The authors of [1] approximated the fuzzy-to-crisp mapping $c$ by the differentiable function

$$c^*(y(x, \mathbf{w})) = \frac{y(x, \mathbf{w})^u}{\left\| y(x, \mathbf{w})^u \right\|_q} \qquad (16)$$

with $u$ and $q$ being values larger than 1 and with the convention that $(v)^u = (v_1, v_2, \ldots, v_m)^u := (v_1^u, v_2^u, \ldots, v_m^u)$.
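As a concrete illustration of the defuzzification layer, the short Python sketch below evaluates c* and the resulting per-pattern error of Eq. (18) introduced below. Reading the normalization in (16) as the q-norm of y^u is an assumption made for this reconstruction, and the default parameter values are merely examples taken from the experimental section.

    import numpy as np

    def c_star(y, u=2, q=6):
        # Soft defuzzification of Eq. (16): c*(y) = y^u / ||y^u||_q.
        yu = np.power(y, u)                                   # convention (v)^u, element-wise
        norm = np.power(np.sum(np.power(yu, q)), 1.0 / q)     # q-norm of y^u
        return yu / norm

    def aderf_error(y, target_index, u=2, q=6):
        # Per-pattern error J of Eq. (18): squared distance of c*(y) to the crisp target.
        s = np.zeros_like(y)
        s[target_index] = 1.0
        d = c_star(y, u, q) - s
        return float(d @ d)

    y = np.array([0.42, 0.35, 0.38, 0.61, 0.40, 0.37, 0.43, 0.39, 0.41, 0.36])
    print(int(np.argmax(c_star(y))))          # the classify-by-max decision is preserved: 3
    print(aderf_error(y, target_index=3))     # per-pattern ADERF error for target class 3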

An easy-to-follow derivation in their publication shows that

$$\lim_{u, q \to \infty} c^*(y(x, \mathbf{w})) = c(y(x, \mathbf{w})). \qquad (17)$$

Using this function $c^*$, we can define a new error function

$$J(\mathbf{w}) := \left\| c^*(y(x, \mathbf{w})) - \mathrm{class}(x) \right\|^2. \qquad (18)$$

This error function can be used on a subset of all possible input data, which may be assumed to exhibit the same distribution as the total data. Then, the sum over all elements of this training set is an empirical estimation of the general misclassification rate of the network. Therefore, the authors of [1] call the accumulation of the new error function over a training set an approximate differentiable empirical risk functional (ADERF). The term functional is based on the definition of a risk functional (RF): within statistical learning theory it denotes the expected value of an arbitrary loss function L measuring the discrepancy between the correct and the estimated classification of a system. We will keep that acronym for convenience, $E_{ADERF} \equiv J$, but still will call $E_{ADERF}$ a function, since we apply weight changes iteratively during training (and do not calculate a loss function estimate based upon the whole training set). The great advantage of the ADERF error function for the p-RPP AF arises from the following fact: assuming a nearly ideal defuzzification layer, the ADERF error can be pushed close to zero without the need of pulling the network output to the rails, i.e. close to its maximum viz. minimum values. After training with this error function is stopped and the defuzzification layer is removed, the network output will classify well in terms of the CBM criterion, but the winner in most cases will not output close to its maximum value. Roughly spoken, we abandon the goal of a network output close to the vectors defined by S, which is unreachable in a perfect way for most problems anyway, and are satisfied with a correct CBM classification instead. With this relinquishment, weight configurations emerge from training that result in activation values more probably located in the non-linear ranges of the p-RPP and, therefore, tend to avoid stereotypic network outputs. For the ADERF error function, the general weight updating rule (11) can be given the specific form

$$\Delta^{ADERF} w_{ij} = \eta\, \delta_j^{ADERF}\, y_j'\, x_i \qquad (19)$$

with

$$\delta_j^{ADERF} = -\sum_{k=1}^{N_{class}} \frac{\partial E_{ADERF}}{\partial c_k^*}\,\frac{\partial c_k^*}{\partial y_j} \qquad (20)$$

$$\phantom{\delta_j^{ADERF}} = -\sum_{k=1}^{N_{class}} (c_k^* - s_k)\, \frac{u\, y_k^{u-1}}{\left\| y^u \right\|_q} \left( \delta_{kj}^{Kr} - \frac{y_k\, y_j^{uq-1}}{\left\| y^u \right\|_q^q} \right).$$

Note that $\delta_{ij}^{Kr} = \begin{cases} 0, & i \ne j \\ 1, & i = j \end{cases}$ is the Kronecker delta and has nothing to do with $\delta_j^{ADERF}$. With this new weight updating rule, we hope to alleviate the problems emerging from the appliance of the p-RPP AF. The parameters $u$ and $q$ still have to be selected sensibly. See the following section for an experimental parameter adjustment.

IV. Experimental Results

All experiments presented in this section are approaches to solve the classification task based on the CENPARMI handwriting recognition database, consisting of 6000 images of handwritten numerals. In order to gain comparable results, this set usually is divided into a training set T_tr (first 4000 samples) and a test set T_te (remaining 2000 samples). This division has been kept throughout the work. For all experiments, the network architecture was a single-layer neural network consisting of ten neurons, with multi-resolution locally expanded second-order input, altogether 342 float input signals. Furthermore, a bias input providing a value of 1 was fed into each neuron, hence the total number of trainable synaptic weights became N_syn = 3430. Every experiment was performed N_exp times with different random synapse weight initializations in order to get statistical information about reliability, mean and standard deviation of training success. The weight values were selected randomly from an interval [-W_max, W_max] with a uniform distribution. If not mentioned explicitly, all results are the recognition rates of the network after 1000 epochs. The recognition rate G of correctly classified patterns from T_te after training with only T_tr, i.e. the generalization capability of the trained network, is considered the measure for training success. For each epoch, every pattern of the training data set was applied to the network input and a weight change was performed according to (11). The learning rate was set to a fixed value of 0.05. All data presented here have been gathered with a full-custom C++ implementation of the setup.

In a first step, the experiment proposed in [2], i.e. BP training based on the MSE error function and neurons computing the logistic AF, was reproduced for control reasons. The weights were initialized with W_max = 1, in accordance with the published data; the number of runs for statistics was N_exp = 20. The achieved recognition rate of G = (97.10 +/- 0.15)% fits the published result of 97.05% (unfortunately given without error estimation) very well.
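To connect the equations with the experimental protocol just described, here is a compact Python sketch of the control experiment: per-pattern MSE training of a single-layer logistic network with the delta rule (13), uniform weight initialization in [-W_max, W_max], and classify-by-max evaluation on the test set. It is a simplified stand-in for the authors' C++ implementation; the synthetic data at the end are only there to make the snippet runnable.

    import numpy as np

    rng = np.random.default_rng(0)

    def train_mse_logistic(X_tr, labels_tr, X_te, labels_te,
                           w_max=1.0, eta=0.05, epochs=1000, n_classes=10):
        # X_* are pattern matrices whose last column is the constant bias input 1.
        n_inputs = X_tr.shape[1]
        w = rng.uniform(-w_max, w_max, size=(n_classes, n_inputs))   # random initialization
        targets = np.eye(n_classes)[labels_tr]                       # one-hot target vectors s(x)
        for _ in range(epochs):
            for x, s in zip(X_tr, targets):                          # round-robin, one update per pattern
                y = 1.0 / (1.0 + np.exp(-(w @ x)))                   # Eq. (5)
                delta = (s - y) * y * (1.0 - y)                      # Eqs. (12)/(13)
                w += eta * np.outer(delta, x)                        # Eq. (11)
        y_te = 1.0 / (1.0 + np.exp(-(X_te @ w.T)))
        return np.mean(np.argmax(y_te, axis=1) == labels_te)         # recognition rate G (CBM)

    # Tiny synthetic usage; the real setup uses the 342 CENPARMI features plus bias.
    X_tr = np.hstack([rng.random((200, 342)), np.ones((200, 1))])
    X_te = np.hstack([rng.random((100, 342)), np.ones((100, 1))])
    labels_tr = rng.integers(0, 10, 200)
    labels_te = rng.integers(0, 10, 100)
    print(train_mse_logistic(X_tr, labels_tr, X_te, labels_te, epochs=5))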

In a second step the new p-RPP AF was introduced to all neurons, and the modified experiment was repeated, again with N_exp = 20. Due to the existence of zero-gradient ranges we expected a worse result compared to the logistic AF, possibly depending on the initialization of the synaptic weights. As will be seen below, some runs still gained recognition rates similar to the training with the logistic AF, but others indeed obviously got stuck and converged at much lower rates. For the logistic AF, the maximum deviation from the best result G* over a series of identical (except random initialization) experiments was below 1% of G*. Thus, we propose the percentage R of training runs which achieved at least 99% of the maximum value over all runs to be a coarse measure for training success reliability,

$$R = \frac{N_{99}}{N_{exp}}, \qquad \text{with } N_{99} = \#\{\text{runs } i: G_i \ge 0.99\, G^*\}. \qquad (21)$$

Table 1 summarizes the results of the first p-RPP experiment, namely the training of p-RPP and logistic AF networks (standard setup, N_exp = 20) with different values for the weight initialization parameter, ranging from W_max = 0.25 to W_max = 1.5. For the logistic AF, the reliability of training success is found to be independent of W_max, at least for values W_max <= 1.25. In contrast, for the p-RPP AF the data exhibit a strong dependency on W_max, just as expected due to the zero-gradient regions of the piecewise polynomial.

Table 1. Recognition rates G on the test set (in %, with standard deviations) after training with different values of the weight initialization parameter W_max, given with the measured success reliability R (in %, in parentheses). Each column lists the results for another AF. The errors for R are difficult to estimate, but a very coarse statistical estimation following ΔN ≈ √N leads to ΔR ≈ 20%.

W_max   Logistic            0-RPP               1-RPP               2-RPP               3-RPP
0.25    97.09 ± 0.05 (100)  95.4 ± 3.9   (80)   97.21 ± 0.08 (100)  97.23 ± 0.06 (100)  97.22 ± 0.09 (100)
0.50    97.10 ± 0.09 (100)  90.3 ± 5.3   (35)   97.22 ± 0.09 (100)  97.24 ± 0.12 (100)  97.22 ± 0.08 (100)
0.75    97.09 ± 0.10 (100)  88.0 ± 8.9   (40)   94.4 ± 4.4   (70)   95.8 ± 3.4   (85)   97.23 ± 0.11 (100)
1.00    97.10 ± 0.15 (100)  84.0 ± 8.2   (15)   89.4 ± 8.2   (40)   94.8 ± 5.2   (80)   96.7 ± 2.1   (95)
1.25    97.09 ± 0.10 (100)  72.2 ± 12.1  (25)   88.2 ± 7.1   (25)   92.0 ± 4.9   (45)   94.4 ± 4.4   (70)
1.50    96.62 ± 2.12 (95)   72.0 ± 14.8  (10)   84.8 ± 9.9   (15)   86.6 ± 9.3   (30)   92.9 ± 4.7   (55)

Figure 3. Solid line: Recognition rates G of a 1-RPP network after training with different weight initializations (W_max = maximum initial weight). Dash-dotted line: Measured success reliability R for the same experiments, i.e. the ratio of experiment runs that do not differ more than 1% from the best run. Dashed line: Mean and standard deviation over only those runs lying within the reliability range.
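Eq. (21) is straightforward to compute from a series of runs; the following few lines of Python show the measure on made-up example values.

    import numpy as np

    def reliability(recognition_rates):
        # Success reliability R of Eq. (21): share of runs reaching at least
        # 99% of the best recognition rate G* of the series.
        g = np.asarray(recognition_rates, dtype=float)
        return float(np.mean(g >= 0.99 * g.max()))

    # Hypothetical series of four runs, two of which got stuck:
    print(reliability([97.1, 96.9, 88.0, 72.5]))    # 0.5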

Figure 3 illustrates the break of training success reliability for weight initializations selected too large. It is a plot of the W_max sweep found in Table 1 for the 1-RPP AF. The solid line represents the mean recognition rate G, the error bars denoting the standard deviation over all runs. Since for higher values of W_max training success obviously decreases, the experimentally found reliability measure R, i.e. the ratio of experiment runs that do not differ more than 1% from the best run, is also included (dash-dotted line). Obviously, for large values of W_max the solid line contains dirty data, i.e. runs that got stuck due to the insufficiency of the p-RPP AF. In that case the standard deviation is not a good error measure for the uncleaned data. The dashed line gives the mean and standard deviation over only those runs lying within the reliability range and is therefore called cleaned. Together with the reliability plot it provides much better information about the characteristics of a training setup.

Another insight from these results is that selecting initial weights small enough already makes the p-RPP AF a fast and well performing alternative to continuously defined AFs. This can be seen, for example, from the recognition rate for 1-RPP with W_max = 0.25. The number even exceeds the best value for the logistic AF. Generally, all results for experiment series with R = 100% provided similar or even better recognition rates compared to the logistic AF runs. But how to select the initial weight range for an arbitrary classification problem has not yet been answered. It is not practical to perform each experiment multiple times with different weight initializations and hope to find a setup providing reliably successful training. Reducing or eliminating the break of performance for large starting weights is fundamental for making the p-RPP AF a useful alternative to continuously defined sigmoidal functions. The ADERF error function, with its inherent feature of optimizing classification without pushing weights too far, might be the appropriate tool.

Hence, the MSE error was replaced by the ADERF error function in a third step, while both the p-RPP and the logistic AF were used in different experiment series. A question not yet answered is how to select the ADERF parameters u and q (see (16)). A priori, we consider them to be integers larger than 0, because the necessary exponent computation can be performed in a fast recursive way. The larger both values are set, the more exactly the defuzzification layer will provide a crisp classification. But computation time will grow, too, since more multiplications have to be performed. Computational costs during training are not considered the topic of optimization for this work; we focus on speeding up classification after training, i.e. when the network is exposed to arbitrary patterns (including untrained ones). Anyway, choosing the step-size of u and q smaller than 1 was decided to be not necessary. Furthermore, the differentiability of ADERF as one of its important properties arises from the approximated character of ADERF. Hence, training might be supported better if u and q are selected low. This leads to two opposing demands, namely large vs. small values for the ADERF parameters.

Thus, a systematic parameter sweep was performed, repeating the network training with the p-RPP AF and the ADERF error function and applying different integer pairs (u, q). For every parameter pair, N_exp = 40 experiments with 300 epochs each were recorded. Each of the 40 runs had an individual random weight initialization determined by W_max = 1, since this value was problematic for all MSE approaches with the p-RPP AF shown so far. Table 2 summarizes the experimentally found reliability values R gained from these runs.

Table 2. Experimentally measured reliability R of training success (in %) for p-RPP networks with the ADERF error function, depending on different choices for the ADERF parameters u and q. Again, the error can only be guessed via ΔN ≈ √N, giving ΔR ≈ 12%.

(u, q)    0-RPP    1-RPP    2-RPP    3-RPP
(1, 5)       5        5       18       18
(2, 5)      25       78       95      100
(3, 5)      55       68       90       88
(4, 5)      28       25       65       47
(5, 5)      15       25       43       43
(2, 1)      28       28       22       25
(2, 2)      75       80       78       90
(2, 3)      55       88       95       95
(2, 4)      43       75       97      100
(2, 10)     40       65       90       95

The experiments show that, for a high weight initialization, some (u, q) configurations cannot remedy the phenomenon of recognition rate convergence far below the values normally reached with the logistic AF and MSE, but that for others the success reliability can be increased significantly. Exemplarily for the 2-RPP AF setup, Figure 4 plots the uncleaned version of G (solid line) and the reliability R (dash-dotted) plus the cleaned G (dashed line) for the sweeps of the ADERF parameters u and q. Considering all results, selecting u = 2 and q >= p + 2 proves to be a reasonable choice when training networks with the ADERF error function and p-RPP neurons with p <= 3. The dependency q_optimal(p) probably arises from the insufficiency of the low-order sigmoid approximation and the possibility to compensate it by a softer gradient of c*.

For all p-RPP AFs, the new ADERF error function with well chosen parameters increases the robustness of training against the weight initialization problem and, inherently, avoids pushing the activation of p-RPP neurons out of their non-linear regions.

Hence, one can fully benefit from the advantage of the p-RPP AF, namely its fast computability, without loss of classification performance. Furthermore, when summarizing the recognition results for MSE and ADERF (u = 2, q = 6), applied to the two AFs considered so far plus a new one (we call it the absolute sigmoid),

$$y_{abssig}(\alpha) = \frac{\alpha}{1 + |\alpha|}, \qquad (22)$$

we find that ADERF training on networks with the new, more sophisticated function (22) even beats the result achieved with MSE (see Table 3). For the other two functions the decrease of recognition rate arising from the use of ADERF is marginal. The absolute sigmoid AF trained with ADERF and the 3-RPP AF trained with both MSE and ADERF provide results which are better than the data presented in [2] and thus are also better than all other published results reviewed in that paper.
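For completeness, the following small Python sketch evaluates the absolute sigmoid of (22) together with its derivative; the derivative expression follows from elementary calculus and is not quoted in the text.

    import numpy as np

    def abs_sigmoid(alpha):
        # Absolute sigmoid AF of Eq. (22) and its derivative 1 / (1 + |alpha|)^2.
        a = np.asarray(alpha, dtype=float)
        return a / (1.0 + np.abs(a)), 1.0 / (1.0 + np.abs(a)) ** 2

    y, dy = abs_sigmoid(np.array([-4.0, 0.0, 4.0]))
    print(y)    # [-0.8  0.   0.8]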

Figure 4. Recognition results G (solid line), success reliability R (dash-dotted) and the cleaned version of G (dashed) for training a 2-RPP network with the ADERF error function, varying the ADERF parameters u and q. Upper (a): Parameter q is swept from 1 to 5 for a fixed value u = 2. Lower (b): Parameter u is swept from 1 to 5 for a fixed value q = 5.

Table 3. Recognition results G (in %, with standard deviations) and measured training success reliability R (in %, in parentheses) for the logistic, 3-RPP and absolute sigmoid AFs. Each of them was trained with MSE and ADERF. For both error functions, a weight initialization of W_max = 0.25 was chosen in order to maximize training reliability for the p-RPP. The parameters for the ADERF runs have been set to u = 2 and q = 6.

Error function   Logistic            3-RPP               Abs. sigmoid
MSE              97.09 ± 0.05 (100)  97.22 ± 0.09 (100)  96.69 ± 0.06 (100)
ADERF (2, 6)     96.82 ± 0.09 (100)  97.18 ± 0.07 (100)  97.20 ± 0.06 (100)

V. Conclusion

The task of pattern recognition with feed-forward networks of formal neurons was introduced as the general topic of this work. A specific classification benchmark problem, namely the CENPARMI handwritten numeral database, served as a standard test for all methods developed within this work. The already published multi-resolution locally expanded high-order neural network architecture was explained and successfully developed further by changing the activation function of the neurons from continuously defined sigmoidal curves to piecewise defined, sigmoid-like polynomials. In this context, success is meant in terms of speed and correct classification rates in generalization mode, i.e. during normal operation after training, when the network is exposed to patterns not trained. The insufficiency of the piecewise defined polynomial activation function was proven to be compensable by using a new error function for training. This error function gains its advantage for being used in combination with the polynomial activation function from its inherent mechanism of keeping the neuron activations of the trained network low, thus preventing the neurons from running into activation regions of constant output. For the CENPARMI data set, the combination of piecewise polynomials and this new error function was shown to provide faster and better classification results than the work referred to as a starting point.

References

[1] G. Castellano, A. M. Fanelli and C. Mencar. An Empirical Risk Functional to Improve Learning in a Neuro-Fuzzy Classifier, IEEE Transactions on Systems, Man, and Cybernetics, 34, pp. 725-731, February 2004.
[2] C.-L. Liu, J. H. Kim and R.-W. Dai. Multiresolution Locally Expanded HONN for Handwritten Numeral Recognition, Pattern Recognition Letters, 18(10), pp. 1019-1025, 1997.
[3] S. Haykin. Neural Networks: A Comprehensive Foundation, 2nd Ed., Prentice Hall Inc., New Jersey, 1999.
[4] C. J. Hilditch. Linear Skeletons from Square Cupboards, Machine Intelligence, 4, pp. 403-420, 1969.
[5] H. K. Kwan. Simple Sigmoid-Like Activation Function Suitable for Digital Hardware Implementation, IEE Electronics Letters, 28(15), pp. 1379-1380, 1992.
[6] K. Sunat. Principles of Convergent Rate and Generalization Enhancement for Feedforward Sigmoid-Like Network. PhD thesis, Chulalongkorn University, Bangkok, 2003.
[7] M. Zhang, S. Vassiliadis and J. G. Delgado-Frias. Sigmoid Generators for Neural Computing Using Piecewise Approximation, IEEE Trans. Computers, 45(2), pp. 1045-1049, 1996.

Author Biographies

Daniel Brüderle was born in Offenburg, Germany, in 1978. He studied physics with a focus on computer science in Heidelberg, Germany. He gained his diploma in physics at the Kirchhoff Institute in Heidelberg in 2004. He now works as a Ph.D. student in the "Electronic Vision(s)" group at the Kirchhoff Institute. His major fields of interest are the application of hardware neural networks and liquid computing.

Khamron Sunat was born in Trad, Thailand, in 1965. He graduated in chemical engineering from Chulalongkorn University, Thailand, in 1989. He received his M.Sc. in computational science in 1998 and his Ph.D. in computer science from Chulalongkorn University in 2004. He now works as a lecturer in the Department of Computer Engineering at Mahanakorn University of Technology, Thailand, and has joined the research group of the Advanced Virtual and Intelligent Computing (AVIC) Research Center, Chulalongkorn University. His research interests are in neural networks, soft computing, fuzzy systems and pattern recognition.

Sirapat Chiewchanwattana graduated in statistics from Khon Kaen University, Thailand. She received her M.Sc. in computer science from the National Institute of Development Administration, Thailand. She is now a Ph.D. student at Chulalongkorn University, Thailand. She works as a lecturer at Khon Kaen University and has joined the research group of the Advanced Virtual and Intelligent Computing (AVIC) Research Center, Chulalongkorn University. Her research interests are in neural networks, soft computing, and pattern recognition.

Chidchanok Lursinsap was born in Bangkok, Thailand, in 1956. He graduated in computer engineering from Chulalongkorn University, Thailand. He received his M.S. and Ph.D. in computer science from the University of Illinois at Urbana-Champaign, USA. He is a professor in computer science at the Department of Mathematics and head of the AVIC Center at Chulalongkorn University. His research interests are in neural computing, bioinformatics, and plant simulation.

Suchada Siripant graduated in mathematics from Elon College, USA, in 1972. She received her M.A. in mathematics from the University of North Carolina, USA, in 1974. She is an associate professor in computer science at the Department of Mathematics, Chulalongkorn University. Her research interests are in computer graphics, visualization and plant modeling.
