Enhancing Alexnet For Arabic Handwritten Words Recognition Using Incremental Dropout
Abstract— Currently, the growth of mobile technologies leads to a need for handwritten recognition applications. While the recognition of handwritten Latin and Chinese has been extensively investigated using various techniques, little work has been done on Arabic handwritten recognition, and none of the existing techniques is accurate enough for practical application. Over the past few years, deeper convolutional neural networks (CNNs) have widely been employed for improving handwritten recognition performance. In this paper, we enhance the popular AlexNet for Arabic Handwritten Words Recognition (HWR). By adopting dropout regularization, we protect our system against the overfitting problem and reduce the recognition error rate. We also investigate the performance of the ReLU and tanH activation functions in the fully connected layers. Through several experimental settings using the benchmark IFN/ENIT database, we achieve new state-of-the-art classification accuracies of 92.13% and 92.55%. Lastly, we compare our best results to those of the previous state of the art.

Keywords—Arabic handwritten; AlexNet; Overfitting; Dropout

I. INTRODUCTION AND STATE OF THE ART

Automatic handwritten word recognition is among the most important axes of Natural Language Processing (NLP). Applications like postal address and zip code recognition, passport validation and check processing demonstrate the need for recognition of handwritten characters. During the last few decades, numerous research results have been reported on handwriting recognition. Although there are promising results for recognizing Latin, Chinese and Japanese scripts, accuracies on recognizing handwritten Arabic scripts fall behind. This is due to the unlimited variation in human handwriting, the large variety of Arabic character shapes, the presence of ligatures between characters and the overlapping of components. The different approaches to handwritten word recognition (HWR) fall into either the on-line or the off-line category. In on-line HWR, the computer recognizes the words as they are written. Off-line recognition is performed after the writing is completed. We here focus on off-line HWR, which has traditionally been tackled by following two main approaches: (i) the analytic approach and (ii) the holistic approach. In the analytic approach [1], [2], a word is decomposed into a set of smaller components (e.g., characters, graphemes, allographs) and then features are extracted for each component. Finally, the word is transformed into sequential feature vectors suited for training and recognition. A large variety of techniques and classifiers has been employed for the analytic approach. El Hajj et al. [3] proposed an offline Arabic handwritten recognition system based on the combination of three HMM classifiers. Baseline-independent and baseline-dependent features were extracted using a sliding window in three directions in order to handle writing inclination, overlapping ascenders and descenders, and the shifted position of some diacritical marks. Then, an HMM classifier was applied in each direction to classify the extracted features. The results showed that this combination gave better results than a single HMM, and the accuracy obtained by the proposed model was higher than 90%. Jayech et al. [4] developed a dynamic hierarchical Bayesian network for Arabic handwriting recognition. After the preprocessing step, an explicit segmentation based on the smoothed vertical histogram projection was applied. Then, a set of statistical features was extracted for each character using invariant moments, such as Hu and Zernike moments. After that, the dynamic hierarchical Bayesian network was used to recognize the Arabic words. Based on the nature of Arabic writing, Parvez et al. [5] proposed an off-line Arabic handwritten text recognition system using structural techniques. A text line was segmented into words and sub-words, and dots were extracted, leaving the Parts of Arabic Words (PAWs). The Arabic characters were modeled by fuzzy polygons, and a fuzzy polygon matching algorithm was later applied to recognize the Arabic word. Khémiri et al. [6] proposed a system based on Probabilistic Graphical Models (PGMs). The system is divided into three stages: preprocessing, feature extraction and word classification. Preprocessing includes baseline estimation. Structural features (ascenders, descenders, loops and diacritic points) and statistical features at the pixel level (pixel density distributions and local pixel configurations) are then extracted from word images. Words are recognized using a variety of PGMs.

Although successful, the performance of such approaches has always been substantially dependent on the selection of the right representative features, which is a difficult task for cursive writing. In the holistic approach, the entire word is recognized without prior decomposition [7], [8]. In this case, the feature vectors are extracted from the word as a whole. Convolutional Neural Networks (CNNs) are the current state-of-the-art model architecture for handwritten recognition tasks. CNNs apply a sequence of filters to the raw image data to extract features automatically.
A. Deep Neural Networks

A DNN is one of the most advanced machine learning techniques. It consists of a succession of convolutional and max-pooling layers, and each layer receives connections from its preceding layer. The most popular image classification structure of a DNN is built from three primary processing layers: the convolutional layer, the pooling layer and the fully connected layer (or classification layer). The DNN units are described below.

Convolutional layer: let $X_i^{l} \in \mathbb{R}^{m \times n}$ represent the $i$th map in the $l$th layer, let $k_{ij}^{l} \in \mathbb{R}^{p \times q}$ denote the $j$th kernel filter in the $l$th layer connected to the $i$th map in the $(l-1)$th layer, and let $M_j = \{\, i \mid \text{the } i\text{th map in the } (l-1)\text{th layer is connected to the } j\text{th map in the } l\text{th layer} \,\}$ be the index map set. The convolution operation can be given by equation (1):

$$X_j^{l} = f\Big(\sum_{i \in M_j} X_i^{l-1} * k_{ij}^{l} + b_j^{l}\Big) \quad (1)$$

where $b_j^{l}$ is the bias of the $j$th map and $f(\cdot)$ is the activation function.

The output layer uses a softmax activation. The fully connected layers simply implement the dot product between the input and weight vectors, where each neuron in layer $l$ is connected to all outputs of the neurons in layer $l-1$. Moreover, the network uses the dropout regularization method to reduce overfitting in the fully connected layers, and applies Rectified Linear Units (ReLUs) as the activation function of the fully connected and convolutional layers to speed up the learning process.

[Figure: AlexNet-style architecture used in this work, with convolution layers C1–C5 (5×5 and 3×3 kernels), max-pooling layers M1–M3 and fully connected layers FC1–FC3.]
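To make equation (1) concrete, the following is a minimal NumPy sketch of computing a single output map of a convolutional layer. The map sizes, kernel size and ReLU activation are illustrative assumptions, not the configuration used in the paper, and the kernel is applied without flipping (cross-correlation), as is standard in CNN implementations.

# Minimal sketch of the convolution operation in equation (1), using plain NumPy.
import numpy as np

def conv2d_valid(x, k):
    """'Valid' 2-D convolution of one input map x with one kernel k (no flipping)."""
    m, n = x.shape
    p, q = k.shape
    out = np.zeros((m - p + 1, n - q + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(x[r:r + p, c:c + q] * k)
    return out

def conv_layer_map(prev_maps, kernels, bias, connected, f=lambda z: np.maximum(z, 0.0)):
    """Compute one output map X_j^l = f(sum_{i in M_j} X_i^{l-1} * k_ij^l + b_j^l)."""
    total = sum(conv2d_valid(prev_maps[i], kernels[i]) for i in connected)
    return f(total + bias)

# Toy example: two 8x8 input maps, 3x3 kernels, M_j = {0, 1}.
rng = np.random.default_rng(0)
prev_maps = [rng.standard_normal((8, 8)) for _ in range(2)]
kernels = [rng.standard_normal((3, 3)) for _ in range(2)]
out_map = conv_layer_map(prev_maps, kernels, bias=0.1, connected=[0, 1])
print(out_map.shape)  # (6, 6)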
We first built AlexNet from scratch and called it Net1; then Net2, Net3 and Net4 were built using the baseline Net1. Each of the four networks classifies the input image and produces the same number of classes. All the networks share the same configuration in terms of the number of layers, the number of filters and the dropout after the fully connected layers. The four networks are meant to answer the following two questions: where should the dropout be added, and with which value? And which activation function improves the performance, ReLU or tanH? The network architectures are depicted in Fig. 3, and the description of each network is as follows:

Net1 (fully connected dropout): this net represents the standard AlexNet. In Net1 the dropout value is set to 0.5 and the dropout is applied immediately after the fully connected layers.

Net2 (convolution dropout): the dropout value is set to 0.5 after convolution layer 1 and convolution layer 2.

Net3 (max-pooling dropout): the dropout value is set to 0.5 after max-pool layer 1, max-pool layer 2 and max-pool layer 3.

Net4 (convolution and max-pooling dropout): in this net, the dropout probability is increased with respect to the depth of the network; the dropout rate is increased by 0.1 after each convolution layer and by 0.2 after each max-pool layer. The purpose of doing so is that the bottom layers of a deep network are usually harder to train than the top layers [24]. A sketch of this incremental schedule is given below.
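As a rough illustration of the Net4 idea, the sketch below builds an AlexNet-like model in Keras with dropout inserted after every convolution and max-pooling layer and a rate that grows with depth. The filter counts and the exact rate schedule (0.0 up to 0.5) are assumptions taken loosely from Fig. 3, not the paper's exact configuration.

# A hedged Keras sketch of Net4-style incremental dropout placement.
from tensorflow.keras import layers, models

def build_net4_like(num_classes=937, input_shape=(227, 227, 1)):
    conv_specs = [(96, 0.0), (256, 0.1), (384, 0.2), (384, 0.3), (256, 0.4)]
    pool_rates = {0: 0.1, 1: 0.3, 4: 0.5}        # max-pooling follows conv layers 1, 2 and 5
    model = models.Sequential()
    for idx, (filters, conv_rate) in enumerate(conv_specs):
        kwargs = {'input_shape': input_shape} if idx == 0 else {}
        model.add(layers.Conv2D(filters, (3, 3), padding='same', activation='relu', **kwargs))
        model.add(layers.Dropout(conv_rate))      # dropout rate grows with network depth
        if idx in pool_rates:
            model.add(layers.MaxPooling2D(pool_size=(3, 3), strides=2))
            model.add(layers.Dropout(pool_rates[idx]))
    model.add(layers.Flatten())
    for _ in range(2):                            # FC1 and FC2 with 512 units each
        model.add(layers.Dense(512, activation='relu'))
        model.add(layers.Dropout(0.5))
    model.add(layers.Dense(num_classes, activation='softmax'))
    return model

Calling build_net4_like() and printing model.summary() shows where each Dropout layer sits; the Net1, Net2 and Net3 variants differ only in which of these Dropout layers are kept and at what rate.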
IV. EXPERIMENTAL RESULTS

To examine the performance of the proposed models, we have conducted many experiments. The networks' performance has been measured in terms of Classification Accuracy (CA). The objective of the experimental part was twofold. First, we wanted to compare against the state of the art. Second, we wanted to check whether dropout was better after the convolution or after the max-pooling layers, and with which values.

A. Training Method

Suppose we have $T$ categories and the training data for each category are denoted as $(x_i, y_i)$, where $i \in \{1, \dots, N\}$, with $x_i$ being the feature vector and $y_i$ the corresponding label. The objective of training is to iteratively minimize the following cross-entropy loss function:

$$J(\theta) = -\frac{1}{N}\left[\sum_{i=1}^{N}\sum_{j=1}^{T} 1\{y_i = j\}\,\log\frac{e^{\theta_j^{\top} x_i}}{\sum_{l=1}^{T} e^{\theta_l^{\top} x_i}}\right]$$

where $\theta$ denotes the model parameters, $\sum_{l=1}^{T} e^{\theta_l^{\top} x_i}$ is a normalization factor, and $1\{\cdot\}$ is the indicator function. The loss function $J(\theta)$ can be minimized during the training process using an optimization algorithm such as Stochastic Gradient Descent (SGD), Adam or Adadelta.
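A minimal NumPy sketch of the cross-entropy loss $J(\theta)$ above, assuming a simple linear score $\theta_j^{\top} x_i$ for each of the $T$ categories; the names and dimensions below are illustrative.

# Minimal sketch of the softmax cross-entropy loss J(theta).
import numpy as np

def cross_entropy_loss(theta, X, y):
    """theta: (T, d) parameters, X: (N, d) feature vectors, y: (N,) integer labels."""
    scores = X @ theta.T                                   # (N, T) class scores
    scores -= scores.max(axis=1, keepdims=True)            # numerical stability
    exp_scores = np.exp(scores)
    probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)  # softmax normalization
    # the indicator 1{y_i = j} simply selects the probability of the true class
    log_likelihood = np.log(probs[np.arange(len(y)), y])
    return -log_likelihood.mean()

# Toy example with T = 3 categories and N = 4 samples of dimension d = 5.
rng = np.random.default_rng(1)
theta = rng.standard_normal((3, 5))
X = rng.standard_normal((4, 5))
y = np.array([0, 2, 1, 0])
print(cross_entropy_loss(theta, X, y))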
To compare our results with other state-of-the-art methods, we have used the offline IFN/ENIT benchmark database. The IFN/ENIT database contains 32492 binary images of Arabic handwritten words written by more than 400 writers: 19724 words for training and 12768 for testing. The words represent 937 Tunisian town/village names. Before training, we shuffle the training data and normalize the offline word images to a size of 227 x 227.

The experiments were conducted with the open source library Keras [25]. As mentioned before, AlexNet is composed of 5 convolution layers and three max-pooling layers; a max-pooling layer follows convolution layers 1, 2 and 5. The receptive field of each convolutional layer is 3×3. Three fully connected (FC) layers follow the fifth convolutional layer: the first two have 512 channels each. Since the original AlexNet was trained on 1000 classes, its last fully connected layer produces 1000 outputs. We replace this layer with a new fully connected layer that has as many outputs as the number of classes (937 for the IFN/ENIT database).
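A small sketch of the preprocessing just described (shuffling the training data and normalizing each word image to 227 x 227). The Pillow-based loading code and file-path handling are assumptions, since the paper does not detail its input pipeline.

# Hedged sketch of the image preprocessing: resize to 227x227 and shuffle.
import numpy as np
from PIL import Image

def load_and_resize(path, size=(227, 227)):
    img = Image.open(path).convert('L')      # IFN/ENIT word images are binary/greyscale
    img = img.resize(size)
    return np.asarray(img, dtype=np.float32) / 255.0

def shuffled(samples, labels, seed=0):
    idx = np.random.default_rng(seed).permutation(len(samples))
    return [samples[i] for i in idx], [labels[i] for i in idx]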
[Fig. 3: Architectures of Net1–Net4, showing where dropout is inserted and with which rate. Net1: dropout 0.5 after FC1 and FC2; Net2: dropout 0.5 after C1 and C2; Net3: dropout 0.5 after M1, M2 and M3; Net4: dropout rates increasing with depth from 0.0 up to 0.5 across the convolution and max-pool layers; all nets keep dropout 0.5 after FC1 and FC2. Legend: convolution layer, max-pool layer, dropout layer, fully connected layer.]
The final layer is the soft-max layer. As we use the latest version of Keras, we replace the Local Response Normalization layer with a Batch Normalization layer, which is useful and versatile. The training is carried out using Stochastic Gradient Descent (SGD), a highly common optimization algorithm used in deep networks to update the weights. The batch size was set to 64 and the momentum to 0.9. The training was regularized with a weight decay of 5·10⁻⁴ and dropout regularization for the first two fully connected layers (dropout ratio set to 0.5). The learning rate was set to 10⁻². The type of non-linearity used for the convolution layers is the Rectified Linear Unit (ReLU). AlexNet was trained for 200 epochs. The whole training procedure for a single network took at most 3 hours on a desktop PC with an Intel i7 3770 processor, an NVidia GTX 780 graphics card and 16 gigabytes of RAM.
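The training configuration above can be expressed in Keras roughly as follows. Expressing the 5·10⁻⁴ weight decay as an L2 kernel regularizer is an assumption, and model, x_train, y_train, x_test and y_test are placeholders for the network and data prepared earlier.

# Hedged Keras sketch of the training setup: SGD, momentum 0.9, lr 1e-2, batch 64, 200 epochs.
from tensorflow.keras import layers, optimizers, regularizers

def compile_for_training(model):
    """SGD with momentum 0.9 and learning rate 1e-2, as described above."""
    model.compile(optimizer=optimizers.SGD(learning_rate=1e-2, momentum=0.9),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# Weight decay of 5e-4 can be expressed per layer as an L2 kernel regularizer, e.g.:
fc = layers.Dense(512, activation='relu', kernel_regularizer=regularizers.l2(5e-4))

# Training with batch size 64 for 200 epochs (x_train/y_train are placeholders):
# history = compile_for_training(model).fit(x_train, y_train, batch_size=64,
#                                           epochs=200, validation_data=(x_test, y_test))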
B. The impact of tanH and ReLU activation functions

ReLU (shown in Fig. 4) is an effective activation function for use in neural networks. The ReLU function is given by:

f(x) = max(0, x)

Unlike fully connected layers, convolution layers extract features. Feature extraction requires sparsity in the input feature maps, and it should set as many features as possible to 0. Unlike with ReLU, this sparsity does not come into effect with other activation functions, as they can generate small values instead of zeros. The sparsity in the features helps to speed up the computation process by removing the undesired features. In AlexNet, the fully connected layers used tanH (shown in Fig. 4) as the activation function. The tanH function is given by:

f(x) = tanh(x)

The focus of the fully connected layers is to generate new features rather than to extract features as the convolution layers do. Moreover, as the fully connected layers are close to the output layer, they are less affected by the vanishing gradient problem. A short numerical illustration of the sparsity difference is given below.
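The sparsity argument above can be checked numerically: ReLU maps every non-positive input to exactly zero, whereas tanH only squashes it to a small non-zero value. The sketch below is a simple illustration, not code from the paper.

# Counting exact zeros produced by ReLU versus tanh on the same inputs.
import numpy as np

x = np.linspace(-3, 3, 13)
relu = np.maximum(0.0, x)       # f(x) = max(0, x)
tanh = np.tanh(x)               # f(x) = tanh(x)

print('zeros from ReLU:', np.count_nonzero(relu == 0.0))   # 7 of the 13 activations
print('zeros from tanh:', np.count_nonzero(tanh == 0.0))   # only the point x = 0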
[Fig. 4: The ReLU and TanH activation functions.]

The training accuracy reached 100% after 86 epochs. Net2 achieved an accuracy equal to 85.96% when using tanH and 85.32% when using ReLU; Net2 performed the worst among the four nets. Net3 achieved an accuracy equal to 89.55%, and when replacing its activation function with ReLU, the accuracy grows to 90.43%. The best performer among our nets is Net4. This net achieved the highest classification accuracies: 92.13% using ReLU and 92.55% using the tanH activation function. These two results outperform the state-of-the-art results. The results are reported in Table 1 in terms of classification accuracy (CA). The overall performances of the four networks using ReLU are depicted in Fig. 5(a), and using tanH in Fig. 5(b).

C. The impact of dropout

Dropout consists of setting to zero the output of each hidden neuron with probability p. If neurons in a CNN are dropped out, they do not contribute to the forward pass and do not participate in backpropagation. In this paragraph, we provide additional insight into the performance of the proposed method. In Fig. 5(a) and (b), we show the accuracy performance during training and testing of the model. The blue and the orange ascending curves correspond to the classification accuracy values for training and testing, respectively. As shown in Fig. 5, for Net1 (dropout in the fully connected layers) the training accuracy increases very fast and reaches 100% after 26 epochs. However, the test accuracy of Net1 is not promising, due to overfitting on the training data (see Table 1). Net2's performance is the poorest compared with the other networks; this may be because a high drop probability (dropout equal to 0.5) is applied to the first and second convolution layers. With the help of dropout, Net3 performs better than Net1 and Net2: dropout after each max-pool layer significantly improves the accuracy, to 89.55% and 90.43%. Net4 outperforms Net1, Net2 and Net3, achieving the highest accuracies. The very high accuracies of 92.13% and 92.55% prove the effectiveness of incremental dropout in improving the generalization performance of the deep neural network. However, when dropout is applied to every convolution and max-pool layer in a deep CNN, the training process can be slow, since activation signals are dropped exponentially as dropout is repeatedly applied [19]. Note that in the testing process, dropout is no longer used for any of the nets. The results of using dropout in the four networks are reported in Table 1.
Fig. 5. The networks' performance over 200 epochs, plotted as classification accuracy per epoch for Net 1–Net 4: (a) using the ReLU activation function; (b) using the TanH activation function.
D. Comparison with the state of the art

To show the performance of the proposed enhanced AlexNet, we compare the performance of different methods on the IFN/ENIT database. The results are given in Table 2. Elleuch et al. [11] achieved 83.7% recognition accuracy (16.3% error rate). Graves et al. [10] achieved 91.4% recognition accuracy (8.6% error rate); the result of Graves was the highest in the literature. It can be seen that Net4 achieved the highest accuracies on the IFN/ENIT database: AlexNet with Incremental Dropout + ReLU achieved 92.13% classification accuracy (7.87% error rate), and AlexNet with Incremental Dropout + TanH outperformed the state of the art with 92.55% recognition accuracy (7.45% error rate).
TABLE 2. COMPARISON WITH THE STATE OF THE ART

Author              | Model                                         | CA%
Present work        | Net4: AlexNet with Incremental Dropout + TanH | 92.55%
Present work        | Net4: AlexNet with Incremental Dropout + ReLU | 92.13%
Graves et al. [10]  | MDLSTM                                        | 91.4%
Elleuch et al. [11] | CDBN                                          | 83.7%
V. CONCLUSION

In this paper, we enhance the well-known AlexNet for the purpose of Arabic HWR and demonstrate its efficiency on the IFN/ENIT database. In order to protect the network against overfitting, dropout is applied at different positions in Net2, Net3 and Net4. Dropout regularization helps to reach a good accuracy. We show incremental improvements of the word recognition comparable to approaches that used Deep Belief Networks (DBN) or Recurrent Neural Networks (RNN). We have achieved promising performance which is superior to the state-of-the-art systems: the classification accuracy was 92.13% using the ReLU activation function and 92.55% using the tanH activation function. To the best of our knowledge, these two results are the new state-of-the-art records obtained with a deep convolutional neural network. Using incremental dropout, the classification accuracy is further improved consistently and significantly. As future work, we plan to explore the performance of other deep networks such as VGGNet, GoogLeNet and ResNet on the IFN/ENIT database.

ACKNOWLEDGMENT

This research was supported in part by the Science & Technology Pillar Program of Hubei Province under Grant #2014BAA146, the Nature Science Foundation of Hubei Province under Grant #2015CFA059, and the Science and Technology Open Cooperation Program of Henan Province under Grant #152106000048.

REFERENCES

[1] Kim K K, Jin H K, Yun K C, et al. Legal Amount Recognition Based on the Segmentation Hypotheses for Bank Check Processing[C]// International Conference on Document Analysis and Recognition. IEEE, 2001: 964-967.
[2] Vinciarelli A. A survey on off-line Cursive Word Recognition[J]. Pattern Recognition, 2002, 35(7): 1433-1446.
[3] Al-Hajj Mohamad R, Likforman-Sulem L, Mokbel C. Combining slanted-frame classifiers for improved HMM-based Arabic handwriting recognition[J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2009, 31(7): 1165.
[4] Jayech K, Trimech N, Mahjoub M A, et al. Dynamic hierarchical Bayesian network for Arabic handwritten word recognition[C]// Fourth International Conference on Information and Communication Technology and Accessibility. IEEE, 2014: 1-6.
[5] Parvez M T, Mahmoud S A. Arabic handwriting recognition using structural and syntactic pattern attributes[J]. Pattern Recognition, 2013, 46(1): 141-154.
[6] Khemiri A, Kacem A, Belaid A. Towards Arabic Handwritten Word Recognition via Probabilistic Graphical Models[C]// International Conference on Frontiers in Handwriting Recognition. IEEE, 2014: 678-683.
[7] Madhvanath S, Govindaraju V. The role of holistic paradigms in handwritten word recognition[J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2001, 23(2): 149-164.
[8] Ruiz-Pinales J, Jaime-Rivas R, Castro-Bleda M J. Holistic cursive word recognition based on perceptual features[J]. Pattern Recognition Letters, 2007, 28(13): 1600-1609.
[9] Wu C, Fan W, He Y, et al. Handwritten Character Recognition by Alternately Trained Relaxation Convolutional Neural Network[C]// International Conference on Frontiers in Handwriting Recognition. IEEE, 2014: 291-296.
[10] Graves A. Offline Arabic Handwriting Recognition with Multidimensional Recurrent Neural Networks[J]. Advances in Neural Information Processing Systems, 2012: 545-552.
[11] Elleuch M, Tagougui N, Kherallah M. Deep Learning for Feature Extraction of Arabic Handwritten Script[M]// Computer Analysis of Images and Patterns. Springer International Publishing, 2015: 371-382.
[12] Deng J, Dong W, Socher R, et al. ImageNet: A large-scale hierarchical image database[C]// IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2009: 248-255.
[13] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks[C]// International Conference on Neural Information Processing Systems. Curran Associates Inc., 2012: 1097-1105.
[14] Szegedy C, Liu W, Jia Y, et al. Going deeper with convolutions[C]// IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015: 1-9.
[15] Simonyan K, Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition[J]. Computer Science, 2014.
[16] Hinton G E, Srivastava N, Krizhevsky A, et al. Improving neural networks by preventing co-adaptation of feature detectors[J]. Computer Science, 2012, 3(4): 212-223.
[17] Wan L, Zeiler M, Zhang S, et al. Regularization of neural networks using DropConnect[C]// International Conference on Machine Learning, 2013: 1058-1066.
[18] Aburas A A, Gumah M E. Arabic handwriting recognition: Challenges and solutions[C]// International Symposium on Information Technology. IEEE, 2008: 1-6.
[19] Srihari S N, Ball G. An Assessment of Arabic Handwriting Recognition Technology[M]// Guide to OCR for Arabic Scripts. Springer London, 2012: 3-34.
[20] Lemley J, Bazrafkan S, Corcoran P. Smart Augmentation-Learning an Optimal Data Augmentation Strategy[J]. IEEE Access, 2017.
[21] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks[C]// Advances in Neural Information Processing Systems, 2012: 1097-1105.
[22] Fraser-Thomas J, Côté J, Deakin J. Understanding dropout and prolonged engagement in adolescent competitive sport[J]. Psychology of Sport and Exercise, 2008, 9(5): 645-662.
[23] Wu H, Gu X. Towards dropout training for convolutional neural networks[J]. Neural Networks, 2015, 71: 1-10.
[24] Zhang X Y, Bengio Y, Liu C L. Online and Offline Handwritten Chinese Character Recognition: A Comprehensive Study and New Benchmark[J]. Pattern Recognition, 2016, 61: 348-360.
[25] Chollet F. "Keras," https://github.com/fchollet/keras, 2015.