
2018 IEEE 14th International Colloquium on Signal Processing & its Applications (CSPA 2018), 9-10 March 2018, Penang, Malaysia

Improved Optical Character Recognition with Deep Neural Network

Tan Chiang Wei, Intel Microelectronics (M) Sdn. Bhd., Pulau Pinang, Malaysia, chiang.wei.tan@intel.com
U. U. Sheikh, Faculty of Electrical Engineering, Universiti Teknologi Malaysia, Johor Bahru, Malaysia, usman@fke.utm.my
Ab Al-Hadi Ab Rahman, Faculty of Electrical Engineering, Universiti Teknologi Malaysia, Johor Bahru, Malaysia, hadi@fke.utm.my

Abstract— Optical Character Recognition (OCR) plays an important role in converting pixel-based images into searchable and machine-editable text. In old or poorly printed documents, printed characters are typically broken and blurred, making character recognition considerably more difficult. In this work, a deep neural network based on Inception V3 is trained and used to perform OCR. The Inception V3 network is trained with 53,342 noisy character images collected from receipts and newspapers. Our experimental results show that the proposed deep neural network achieves significantly better recognition accuracy on poor quality text images and yields an overall 21.5% reduction in error rate compared to existing OCRs.

Index Terms—OCR (Optical Character Recognition), Deep Learning, Transfer Learning

I. INTRODUCTION

Text character recognition commonly deals with the recognition of optically processed characters and is also known as optical character recognition (OCR). The basic idea of OCR is to convert any handwritten or printed text into data files that can be edited and read by a machine. With OCR, any article or book can be scanned directly and the resulting image can then be easily converted to text using a computer. An OCR system has two major advantages: the ability to increase productivity by reducing staff involvement and the ability to store text efficiently. The areas where OCR is generally applied include postal departments, banks, the publication industry, government agencies, education, finance and health care [1]. A typical OCR system consists of three main steps: image acquisition and preprocessing, feature extraction, and classification [1]. The image preprocessing phase cleans up and enhances the image through noise removal, correction, binarization, dilation, color adjustment and text segmentation. Feature extraction captures information from the acquired text image to be used for classification. In the classification phase, each segmented portion of text in the document image is mapped to its equivalent textual representation.

There are several existing OCR solutions which are commonly used in machine learning and pattern recognition. However, recognizing broken or faded English characters remains a challenging problem. The performance of OCR depends directly on the quality of the input image or document, which makes character recognition in scene images considerably more complicated. In addition, poor quality English characters are typically obtained from old printed documents, and some are caused by damaged print cartridges. Unfortunately, such training samples are not yet available in existing solutions. In order to recognize poor quality English characters, an improved OCR with sufficient training data is needed.

In transfer learning, training samples can be used to pre-train a network in the source domain, and these well-trained learned characteristics can be transferred to benefit the training process in the target domain of a second network. In recent years, traditional methods in the field of OCR research have been almost entirely replaced with deep learning methods such as Convolutional Neural Networks (CNN). Oquab et al. proposed that using a CNN to learn image representations on a large annotated dataset can effectively transfer this information to other visual recognition tasks with a limited amount of training data [2]. Tang et al. proposed adding an adaptation layer to a CNN using transfer learning, which achieves a performance improvement in historical Chinese character recognition tasks [3]. Inspired by these works, we propose to apply a deep neural network with transfer learning to broken English character recognition.

II. METHODOLOGY

A. OCR Design Model

The methodology adopted in this paper is inspired by the OCR system of Tang et al. [3]. Although that work focused on Chinese text characters, the transfer learning concept can be applied to English text characters with the same aims of reducing training time and improving recognition accuracy. A pre-trained model is used together with transfer learning in this paper to enhance the recognition results and speed up the training process. The proposed OCR system is designed with the help of various modules as shown in Fig. 1.

Fig. 1. The proposed OCR model




Fig. 2. Transfer-learning based on Inception V3 Model


Fig. 3. Segmentation results using different kernel settings for the bounding rectangle algorithm
The process is split into four major blocks: input acquisition and pre-processing, training, testing and validation. The first block is input acquisition and pre-processing of receipts and old newspapers. The prepared documents are scanned into images, which are then processed and segmented using the Maximally Stable Extremal Regions algorithm. After segmentation, all the segmented text characters become raw input that requires labeling. After labeling, the next step is training with a deep neural network. In this work, the deep neural network used is a pre-trained Inception V3 model with transfer learning, shown in Fig. 2. Once all the labeled images have been processed through the Inception V3 model and the resulting transfer-values (the feature activations that feed the classifier layers) have been saved to a cache file, these transfer-values can be used as the input to another neural network. The second neural network is trained using the classes of the labeled images, so that it learns to classify images based on the transfer-values from the Inception V3 model. In this way, the Inception V3 model is used to extract useful information from the images and another neural network is then used for the actual classification.
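The two-stage arrangement described above, with Inception V3 acting as a fixed feature extractor and a small classifier trained on the cached transfer-values, can be sketched in a few lines of tf.keras. This is only an illustration, not the authors' code (which builds on the tutorials in [4] and [5]); the image size, optimizer and the single placeholder hidden layer are assumptions, and the actual classifier sizes are tuned in Section III-B.

# Minimal sketch of transfer-value caching with a pre-trained Inception V3
# and a separate classifier network; file paths and hyperparameters are
# illustrative assumptions.
import numpy as np
import tensorflow as tf

# First network: pre-trained Inception V3 without its top layers acts as a
# fixed feature extractor ("avg" pooling gives one 2048-dim vector per image).
feature_extractor = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", pooling="avg")

def compute_transfer_values(image_paths):
    """Run each labeled character image through Inception V3 once and return
    the pooled feature vectors (the transfer-values)."""
    batch = []
    for path in image_paths:
        img = tf.keras.utils.load_img(path, target_size=(299, 299))
        batch.append(tf.keras.applications.inception_v3.preprocess_input(
            tf.keras.utils.img_to_array(img)))
    return feature_extractor.predict(np.stack(batch), verbose=0)

# Cache the transfer-values so the classifier can be trained without
# re-running Inception V3 at every iteration:
#   np.save("transfer_values.npy", compute_transfer_values(labeled_paths))

# Second network: a small classifier trained on the cached transfer-values.
NUM_CLASSES = 94  # 0-9, A-Z, a-z and symbols (see Section III)
classifier = tf.keras.Sequential([
    tf.keras.layers.Dense(1024, activation="relu", input_shape=(2048,)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
classifier.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
# classifier.fit(cached_transfer_values, labels, batch_size=128, epochs=...)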
After 50,000 optimization iterations, the network is trained and ready for testing. The test dataset is given to the trained network for optical character recognition. Any fine-tuning in the validation stage requires a rerun of the network training and testing.

B. Image Processing and Segmentation

Image processing and segmentation are essential for extracting segments of symbols and text characters. The image processing strategy has four basic steps: grayscale conversion, binarization, dilation and segmentation. The process starts by converting the image to grayscale and then to binary. Binarization helps to enhance the symbols and text characters and removes disjoint pixels. The characters are then enhanced by applying image dilation.

The next step is character segmentation, which localizes image blobs by applying a contour algorithm to the dilated image. A bounding rectangle algorithm is used to bound each contour with the smallest possible rectangle. Fig. 3 shows the segmentation results obtained for different kernel settings. When the kernel (x, y) size is set too high, several characters are localized in the same rectangle, which is undesirable. A kernel of (3, 1) is found suitable in this work. Segmentation generates text character images as output; these images are not yet labeled.
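As an illustration of this pipeline, the sketch below uses OpenCV. The input file name, the use of Otsu thresholding for binarization and the interpretation of the (3, 1) kernel as (rows, columns) are assumptions rather than details given in the paper.

# Minimal segmentation sketch: grayscale -> binarize -> dilate -> contours ->
# bounding rectangles; settings are illustrative assumptions.
import cv2
import numpy as np

image = cv2.imread("scanned_receipt.png")            # assumed input file
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Binarization (inverted so characters become white blobs on black).
_, binary = cv2.threshold(gray, 0, 255,
                          cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Dilation with a small (3, 1) kernel joins broken strokes vertically
# without merging neighbouring characters horizontally.
kernel = np.ones((3, 1), np.uint8)
dilated = cv2.dilate(binary, kernel, iterations=1)

# Each contour is bounded by the smallest possible rectangle and the
# corresponding patch is written out as an unlabeled character image.
contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_SIMPLE)
for i, contour in enumerate(contours):
    x, y, w, h = cv2.boundingRect(contour)
    cv2.imwrite(f"char_{i:04d}.png", gray[y:y + h, x:x + w])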

III. NETWORK MODEL SELECTION AND IMPLEMENTATION

A. Model Selection

The network model consists of two networks: the first network performs feature extraction and the second network performs classification. Three network models were trained and tested with text characters; the resulting accuracy is shown in Table I.

TABLE I
FIRST NETWORK MODEL SELECTION

Network Model | Layer Details | Accuracy (%) | Training Time
10-layer CNN | 5 conv/pool, 4 fully connected, 1 softmax | 77.0 | 36 hours
5-layer CNN | 2 conv/pool, 2 fully connected, 1 softmax | 80.0 | 12 hours
Inception V3 Model | 120 conv/pool, 2 fully connected, 1 softmax | 90.6 | 1 hour 13 min

The first network model is a deep neural network with 10 CNN layers that was initially used for flower recognition [4]. In this experiment, the model was trained with a character dataset (characters 0-9) and the accuracy obtained was 77%. The accuracy is acceptable but the training time is too long for just 10 classes. The second network is a model for the CIFAR-10 dataset, made up of 5 CNN layers [5]. This model was trained using the CIFAR-10 dataset, which consists of 60,000 color images of size 32x32 in 10 classes (animals and transport), with 6,000 images per class split into 50,000 training images and 10,000 test images. The accuracy obtained was 80%. The training process took around 12 hours and the result is acceptable for a network that needs to be trained only once. The last network is a pre-trained model, the Inception V3 model with transfer learning. In this experiment, the model was trained with a 94-class clean character dataset (0-9, A-Z, a-z and symbols). The results show that using more layers achieves a higher accuracy of 90.6% compared to the other two network models, while the training time is only 1 hour 13 minutes. Using a pre-trained model with transfer learning thus offers fast training with high accuracy.

B. Network Parameters

The following experiments aim to select the best network parameters for the second (classification) network.

1) Number of Fully Connected Layers: This experiment analyzes how the accuracy and training time are affected by the number and size of the fully connected layers in the second network. When training a deep convolutional neural network, the first layer trains itself to recognize very basic features such as edges, the next layer trains itself to recognize collections of edges such as shapes, and subsequent layers learn finer details. Table II shows that a larger layer size results in a longer training time. The highest accuracy for a single fully connected layer, 64.5%, was obtained not with the largest layer size of 4096 but with a layer size of 1024. The highest accuracy with two fully connected layers was 66.6% (1st: 4096, 2nd: 2048), but with a very long training time. Adding layers further increases the training time, thus three fully connected layers are not considered.

TABLE II
EXPERIMENT ON THE NUMBER OF FULLY CONNECTED LAYERS

Number of Layers | Layer Size (1st, 2nd, 3rd) | Accuracy (%) | Time (mins)
1 | 256 | 63.0 | 11.72
1 | 512 | 62.9 | 14.20
1 | 1024 | 64.5 | 20.30
1 | 2048 | 64.1 | 31.15
1 | 4096 | 64.4 | 54.02
2 | 1024, 256 | 61.1 | 21.60
2 | 2048, 256 | 61.9 | 35.23
2 | 4096, 1024 | 65.8 | 75.61
2 | 1024, 4096 | 65.6 | 44.28
2 | 4096, 2048 | 66.6 | 105.05
3 | 4096, 2048, 1024 | 61.4 | 105.73
3 | 4096, 1024, 256 | 61.6 | 75.32
3 | 1024, 4096, 1024 | 60.9 | 63.40
3 | 4096, 1024, 1024 | 64.4 | 81.88
3 | 4096, 4096, 1024 | 61.0 | 158.2
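The layer-size sweep of Table II can be scripted along the following lines. This is a hedged sketch: the optimizer, epoch count and the 2048-dimensional feature input (the cached Inception V3 transfer-values) are assumptions, and only a subset of the table's configurations is listed.

# Sketch of a fully-connected-layer sweep over the second network;
# hyperparameters are illustrative assumptions.
import time
import tensorflow as tf

def build_classifier(fc_sizes, num_classes=94, feature_dim=2048):
    """Second-network classifier with a configurable stack of fully
    connected layers on top of the cached transfer-values."""
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.InputLayer(input_shape=(feature_dim,)))
    for size in fc_sizes:
        model.add(tf.keras.layers.Dense(size, activation="relu"))
    model.add(tf.keras.layers.Dense(num_classes, activation="softmax"))
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# A subset of the configurations examined in Table II.
CONFIGURATIONS = [(256,), (1024,), (4096,),
                  (1024, 256), (4096, 2048), (4096, 2048, 1024)]

def run_sweep(train_x, train_y, val_x, val_y):
    for fc_sizes in CONFIGURATIONS:
        model = build_classifier(fc_sizes)
        start = time.time()
        model.fit(train_x, train_y, batch_size=64, epochs=5, verbose=0)
        _, accuracy = model.evaluate(val_x, val_y, verbose=0)
        print(f"{fc_sizes}: accuracy={accuracy:.3f}, "
              f"time={(time.time() - start) / 60:.1f} min")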
2) Training Iterations: The next experiment investigates how the accuracy and time usage are affected by the number of training iterations. Table III illustrates that the accuracy increases from 1,000 iterations up to 50,000 iterations. The accuracy converges after 50,000 iterations and no further improvement is noted.

TABLE III
ITERATION VS ACCURACY AND TIME USAGE

Iteration No | Accuracy (%) | Time (mins)
1000 | 40.40 | 1.50
5000 | 63.80 | 7.48
10000 | 63.80 | 18.32
50000 | 65.50 | 73.50
100000 | 63.20 | 146.13
150000 | 65.10 | 221.57
250000 | 65.70 | 470.62

3) Batch Size: In this experiment, the batch size rather than the number of iterations is evaluated. The batch size defines the number of image samples that are propagated through the network at a time. Batch size influences training time and accuracy, but there is no general rule for determining the optimal value. We experimented with three batch sizes: 32, 64 and 128. Table IV shows that increasing the batch size increases the training time. A batch size of 64 obtained the highest accuracy with an acceptable time. To avoid overfitting (such as when the batch size is 128), batch normalization or dropout is applied. Table V shows that a dropout of 0.2 overcomes the overfitting (dropping 20% of the batch).

TABLE IV
BATCH SIZE VS ACCURACY

Batch Size | Time (mins) | Accuracy (%)
32 | 70.32 | 62.8
64 | 73.35 | 73.1
128 | 82.57 | 64.1

TABLE V
DROPOUT TEST ON BATCH SIZE 128

Batch Size | Dropout Rate | Batch Size after Dropout (approx.) | Time (mins) | Accuracy (%)
128 | 0.8 | 25.6 | 81.23 | 69.0
128 | 0.5 | 64.0 | 81.02 | 68.8
128 | 0.2 | 102.4 | 83.40 | 70.6

4) Activation Function: The purpose of an activation function is to convert the input signal of a node in a neural network into an output signal [6]. Without activation functions, neural networks would not be able to learn and model complicated information such as video, audio, speech and images [7]. Different types of activation function can be applied, such as ReLU (Rectified Linear Unit), sigmoid and tanh. We conducted accuracy measurements using ReLU, sigmoid and tanh. Experimental results show that ReLU has the highest accuracy (73.1%) while sigmoid and tanh achieved 67.5% and 68.8% accuracy respectively. ReLU does not suffer from the vanishing gradient problem, as its gradient is a constant 1 for positive inputs. The only disadvantage of ReLU is that it can cause overfitting, but this can be addressed with dropout. The final network implementation is as follows: we use the Inception V3 network model with two fully connected layers (1st: 4096, 2nd: 2048), trained for 50,000 iterations with a batch size of 128, a dropout factor of 0.2, and ReLU as the activation function for the proposed OCR system.
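A sketch of this final classifier configuration, sitting on top of the cached Inception V3 transfer-values, is given below. The optimizer, loss and the translation of 50,000 iterations into epochs are assumptions and not settings reported in the paper.

# Final second-network configuration: FC 4096 -> FC 2048 -> dropout 0.2 ->
# 94-way softmax, with ReLU activations throughout.
import tensorflow as tf

NUM_CLASSES = 94
BATCH_SIZE = 128

final_classifier = tf.keras.Sequential([
    tf.keras.layers.Dense(4096, activation="relu", input_shape=(2048,)),
    tf.keras.layers.Dense(2048, activation="relu"),
    tf.keras.layers.Dropout(0.2),   # counters overfitting at batch size 128
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
final_classifier.compile(optimizer="adam",
                         loss="sparse_categorical_crossentropy",
                         metrics=["accuracy"])

# Roughly 50,000 optimization steps at batch size 128 (assumed mapping):
#   steps_per_epoch = len(train_x) // BATCH_SIZE
#   final_classifier.fit(train_x, train_y, batch_size=BATCH_SIZE,
#                        epochs=max(1, 50_000 // steps_per_epoch))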
IV. RESULTS

A. Benchmark with Existing OCR

The proposed OCR is benchmarked against an existing OCR, a9t9 (also known as OCR.space). The a9t9 OCR was released in 2017 and its API is available online [8]. a9t9 OCR supports image dimensions from 40 by 40 pixels up to 2600 by 2600 pixels. Since a9t9 OCR does not provide any standard public test data, the test sets collected from real-world samples are used for benchmarking.


We use four real-world samples (see Fig. 4(a-d)), containing 657 text characters in total, to benchmark both OCR systems. Table VI indicates that the proposed OCR achieves better performance than a9t9 OCR on poor quality test samples.
Fig. 4. (a-d) Test samples used for benchmarking.

TABLE VI
BENCHMARK RESULTS BETWEEN A9T9 AND THE PROPOSED OCR

Test Image | Total Characters in Image | a9t9 OCR Recognized | a9t9 OCR Accuracy (%) | Proposed OCR Recognized | Proposed OCR Accuracy (%)
(a) | 587 | 254 | 43.3 | 409 | 69.0
(b) | 27 | 18 | 66.7 | 19 | 70.4
(c) | 23 | 6 | 26.1 | 19 | 82.6
(d) | 20 | 18 | 90.0 | 19 | 95.0
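Accuracy in Table VI is simply the number of recognized characters divided by the total characters in each test image; the small check below recomputes these ratios from the counts.

# Recompute the Table VI accuracy columns from the character counts.
table_vi = {  # image: (total characters, a9t9 recognized, proposed recognized)
    "(a)": (587, 254, 409),
    "(b)": (27, 18, 19),
    "(c)": (23, 6, 19),
    "(d)": (20, 18, 19),
}
for image, (total, a9t9, proposed) in table_vi.items():
    print(f"{image}: a9t9 {100 * a9t9 / total:.1f}%, "
          f"proposed {100 * proposed / total:.1f}%")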

B. Text Character's Noise Pattern Analysis

The aim of this experiment is to analyze the impact of different noise patterns on the detection rate of text characters. Fig. 5 shows the hand-crafted test data, which was specially designed with 8 different noise patterns for each class; all of this hand-crafted data is used in this experiment. A total of 135 images were used.

Fig. 5. Eight different noise patterns used in noise pattern analysis.

Fig. 6 shows that characters with "corner missing" and "split-up" noise patterns are easily classified compared to the other types of noise. Notably, the proposed OCR performed poorly when subjected to characters distorted with "white line", "ink leakage" and "pepper noise". The recognition accuracy is strongly affected by the size of the damaged area in the image. In order to improve the classification rate, the deep neural network must be trained with sample images containing similar noise patterns.

Fig. 6. OCR classification accuracy for different simulated noise patterns (white line distortion, ink leakage, black line distortion, pepper noise, printer head damage, edge distortion, split-up/division and corner missing); per-pattern accuracies ranged from 59.7% to 90.3%.
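Consistent with this observation, damaging noise patterns can be synthesized and injected into the training images. The sketch below shows two such corruptions for grayscale character images (pepper noise and a white occluding line); the noise fraction and line thickness are illustrative assumptions, not the settings used to build the hand-crafted data of Fig. 5.

# Illustrative synthetic corruptions for augmenting character training images.
import numpy as np

def add_pepper_noise(char_img, fraction=0.05, rng=None):
    """Set a random fraction of pixels to black in a grayscale uint8 image."""
    rng = rng or np.random.default_rng()
    noisy = char_img.copy()
    noisy[rng.random(char_img.shape) < fraction] = 0
    return noisy

def add_white_line(char_img, thickness=2, rng=None):
    """Overwrite a horizontal band with white, simulating print drop-out."""
    rng = rng or np.random.default_rng()
    corrupted = char_img.copy()
    row = int(rng.integers(0, max(1, char_img.shape[0] - thickness)))
    corrupted[row:row + thickness, :] = 255
    return corrupted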

V. CONCLUSION

In this work, we proposed an OCR system for the recognition of printed text in poor quality images. This was achieved by building a transfer learning based OCR using a pre-trained deep neural network (Inception V3). The proposed deep neural network based OCR was trained and tested using real-world samples and a standard English text character dataset. From the experimental results, the system achieved significantly better recognition accuracy, averaging 78% on poor quality text images, and yielded an overall 21.5% reduction in error rate compared to a9t9 OCR. Furthermore, the OCR also maintained 90.6% accuracy on a good quality image test dataset. Further experiments showed that the proposed OCR performs well on noisy character images. The results also show that the training process for a network must include not only good quality training data but also poor quality training data to improve the learning of the network.


ACKNOWLEDGMENTS

The authors thank the Ministry of Education Malaysia and Universiti Teknologi Malaysia (UTM) for their support under grant number Q.J130000.2623.14J48.

REFERENCES

[1] R. Anil, K. Manjusha, S. S. Kumar, and K. P. Soman, "Convolutional Neural Networks for the Recognition of Malayalam Characters," in Proceedings of the 3rd International Conference on Frontiers of Intelligent Computing: Theory and Applications (FICTA) 2014, vol. 328, Springer International Publishing, 2015, pp. 493–500.
[2] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, "Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks," in Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1717–1724.
[3] Y. Tang, L. Peng, Q. Xu, Y. Wang, and A. Furuhata, "CNN Based Transfer Learning for Historical Chinese Character Recognition," in Proceedings of the 12th IAPR Workshop on Document Analysis Systems (DAS), 2016, pp. 25–29.
[4] Cho Yee Phy, "Tensorflow input image by tfrecord," 2017. [Online]. Available: https://github.com/yeephycho/tensorflow_input_image_by_tfrecord. [Accessed: 4-Mar-2018].
[5] L. Hvass, "Tensorflow tutorial 06 CIFAR-10," 2016. [Online]. Available: https://www.youtube.com/watch?v=3BXfw_1_TF4. [Accessed: 4-Mar-2018].
[6] P. Ramachandran, B. Zoph, and Q. V. Le, "Searching for Activation Functions," arXiv, vol. 2, Oct. 2017.
[7] H. Chung, S. J. Lee, and J. G. Park, "Deep neural network using trainable activation functions," in Neural Networks (IJCNN), 2016 International Joint Conference on, 2016, pp. 348–352.
[8] OCR SPACE, "Free OCR API and Online OCR," 2018. [Online]. Available: https://ocr.space/. [Accessed: 4-Mar-2018].
