Deep Learning for Amharic Text-Image Recognition
Thesis
approved by
the Department of Computer Science
Technische Universität Kaiserslautern
for the award of the Doctoral Degree
Doctor of Engineering (Dr.-Ing.)
to
D 386
Abstract
Optical Character Recognition (OCR) is one of the central problems in pattern recognition. Its applications have played a great role in the digitization of document images collected from heterogeneous sources. Many of the well-known scripts have OCR systems with sufficiently high performance to enable OCR applications in industrial/commercial settings. However, OCR systems yield very good results only on narrow domains and very specific use cases. Thus, OCR remains a challenging task, and there are other exotic languages with indigenous scripts, such as Amharic, for which no well-developed OCR systems exist.
As many as 100 million people speak Amharic, and it is an official working language of Ethiopia. Amharic script contains about 317 different characters, derived from 34 consonants with small changes. The changes involve shortening or elongating one of a consonant's main legs or adding small diacritics to the right, left, top, or bottom of the character. Such modifications lead the characters to have similar shapes and make the recognition task complex, which is what makes Amharic particularly interesting for character recognition research. So far, Amharic script recognition models have been developed based on classical machine learning techniques, and they are very limited in addressing the issues of Amharic OCR. The motivation of this thesis is, therefore, to explore and tailor contemporary deep learning techniques to the OCR of Amharic.
This thesis addresses the challenges in Amharic OCR through two main contributions. The first is an algorithmic contribution, in which we investigate deep learning approaches that suit the demands of Amharic OCR. The second is a technical contribution comprising several works towards OCR model development: it introduces a new Amharic database consisting of collections of images annotated at the character and text-line levels. It also presents a novel CNN-based framework designed by leveraging the grapheme of characters in Fidel-Gebeta (the full set of Amharic characters arranged in a matrix structure), which achieves 94.97% overall character recognition accuracy.
In addition to character-level methods, text-line-level methods are also investigated and developed based on sequence-to-sequence learning. These models avoid several of the pre-processing stages used in prior works by eliminating the need to segment individual characters. In this design, we use a stack of CNNs before the Bi-LSTM layers and train end-to-end. This model outperforms the LSTM-CTC-based network by 3.75% CER, on average, on the ADOCR test set. Motivated by the success of attention in addressing the problem of long sequences in Neural Machine Translation (NMT), we propose a novel attention-based methodology that blends the attention mechanism into the CTC objective function. This model performs far better than existing techniques, with a CER of 1.04% and 0.93% on printed and synthetic text-line images, respectively.
Finally, this thesis provides details on various tasks that have been performed for the development
of Amharic OCR. As per our empirical analysis, the majority of the errors are due to poor annotation
of the dataset. As future work, the methods proposed in this thesis should be further investigated
and extended to deal with handwritten and historical Amharic documents.
Acknowledgments
First, I would like to thank Prof. Dr. Didier Stricker for allowing me to conduct my Ph.D. research at the Augmented Vision lab, DFKI-Kaiserslautern. Without his ongoing support, supervision, feedback, and the freedom to investigate different approaches, this thesis would not have been possible. Further, I would like to thank Prof. Dr. Marcus Liwicki for becoming my co-supervisor; his deep insights, invaluable feedback, fruitful discussions, and many great ideas are the cornerstones of this research work. I would like to thank Dr. Million Meshesha and Dr. Gebeyehu Belay, my mentors in the Ph.D. program, for their guidance and feedback on my research.
I would like to thank Tewodros, who sparked the first conceptual ideas for this thesis. Not only the long technical discussions but also your continuous motivation and support have always been my asset. Thank you for taking the time for the many fruitful discussions that not only challenged me but also helped me develop many of the ideas behind this thesis, and for teaching me the importance of hard work. Then, Fatemeh: you and Teddy were always there to facilitate my conference registrations and make things easy, wherever and whenever it was. Special thanks to Reza, who initially helped me a lot in understanding various ideas regarding my topic and who introduced me to the OCRopus tool. Many thanks also to Kina for her valuable feedback and proofreading towards the end of this thesis. I would like to acknowledge David, who was always there to maintain my computing infrastructure and provide an immediate solution whenever it faced any problem. I would like to acknowledge Dr. Jason for his support during the thesis write-up; whenever I dropped him an email, he always replied immediately and to the point. Last but not least, special thanks to Leivy, who helped me a lot from the very beginning of my arrival at DFKI until I was officially accepted as a Ph.D. student at TU-Kaiserslautern. Further, I would like to thank Keonna, Nicole, and Eric for helping me with various administrative issues.
I would like to thank Abrham Gebru for spending countless hours together, especially the random weekend trips we made to Frankfurt to buy Injera. Even though it was my first winter semester in Germany, he made me feel like it was summer. My best wishes to him. I would like to thank Dr. Tsegaye, Dr. Dereje, and Dr. Mulugeta for all the times we spent in KL. I would especially like to thank Dere for being there for me during the pandemic; he is a very sociable person, we had a good friendship and spent lots of time together, and he was a great chef in our kitchen.
It is a pleasure to express my gratitude wholeheartedly to my wife, Masti, for all her love and patience, and for having my back in stressful times. She waited and stayed with me till midnight, and she was always there to encourage me when I worked through the night. My daughter, Nissan, and my son, Zephaniah, are not only my kids but also my friends, who understood when I worked and brought out the best in me when I got bored. I am also very thankful to my brother, Workye Hailu, for caring for me in every way, and to my mother, Enatalem, for encouraging and supporting me in all my studies from early school onwards. I am also very grateful to my father, Hailu Belay, and the rest of my family and friends, who contributed significantly and helped me in one way or another during my research work.
Birhanu Hailu
Contents
Abstract
Abbreviations
1 Introduction
1.1 Motivation
1.2 Statement of the Problem
1.3 Research Objectives
1.4 Contributions
1.5 Organization of the Thesis
1.6 List of Publications
Appendices
A.1 Attention Mechanism
A.1.1 Generic Model of Attention Mechanism
A.1.2 Types of Attention and Seq2seq Model
A.1.3 Basic Parameters of Attention-based Networks
Bibliography
Chapter 1
Introduction
[Figure 1.1 diagram: a pipeline from document image to document texts, comprising preprocessing (noise removal, skew detection, binarization), layout analysis (column, image/non-image, and title/keyword/line detection; line, word, and/or character segmentation), feature extraction (hand-crafted or automatic), recognition models, and post-processing (dictionary correction, language model, alignment correction) yielding corrected text.]
Figure 1.1: A generic Document Image Analysis (DIA) flow diagram. This generic DIA flow diagram can be broken down into four major steps. The first step is image preprocessing, which helps to enhance the quality of the image and involves a broad range of imaging functions such as image rotation, binarization, and de-skewing. The second, document analysis, step defines the areas for text recognition and delivers information about the layout and formatting elements of each individual page as well as the structure of the document as a whole. The actual texts are predicted at the recognition step. Then the OCR errors are corrected and the model is updated at the post-processing stage.
• Preprocessing stage: In this stage, the quality of the images is enhanced by employing different image preprocessing techniques, and the data of interest are located (see the sketch after this list).
• Classification stage: Here, the feature vectors extracted in the previous stage are processed to recognize characters, words, or text-lines.
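To make the preprocessing stage concrete, the following is a minimal sketch, not the exact pipeline used in this thesis, of two typical operations, Otsu binarization and deskewing, using OpenCV; the input file name is a placeholder.

```python
import cv2
import numpy as np

# Load a scanned page as a grey-level image (path is a placeholder).
image = cv2.imread("sample_page.png", cv2.IMREAD_GRAYSCALE)

# Binarization: Otsu's method picks a global threshold automatically.
_, binary = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Skew detection: estimate the dominant angle of the ink (black) pixels
# from the minimum-area rectangle that encloses them. Note that the angle
# convention of minAreaRect differs across OpenCV versions.
coords = np.column_stack(np.where(binary == 0)).astype(np.float32)
angle = cv2.minAreaRect(coords)[-1]
if angle < -45:
    angle += 90

# De-skewing: rotate the page about its centre by the estimated angle.
h, w = binary.shape
matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
deskewed = cv2.warpAffine(binary, matrix, (w, h),
                          flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)
```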
OCR applications have been widely used and implemented for the digitization of various documents, written in both Latin and non-Latin scripts, ranging from historical to modern documents [Mes08, BHL+19a, CPT+19, IFD+19]. Researchers have achieved good recognition accuracy, and most of these scripts now even have workable, commercial off-the-shelf OCR applications, including functionality to generate ground truth from existing printed text so as to train further models [Bre08]. However, OCR gives good recognition accuracy only on narrow domains and very specific use cases. Besides, multiple other indigenous scripts are underrepresented in the areas of Natural Language Processing (NLP) and Document Image Analysis (DIA). Hence, researchers are motivated towards the development of either a generic multi-script OCR model or an OCR model for specific scripts.
Dating back to the 12th century, most of the historical and literary documents in Ethiopia are written and documented using Amharic script [Mey06]. In parallel, there are some documents written in Ge'ez script, which shares the same writing system and characters as Amharic script. Since then, multiple documents with various contents have been stored in different places, such as Ethiopian Orthodox Tewahdo churches, museums, monasteries, and public and academic libraries, in the form of correspondence letters, magazines, newspapers, pamphlets, and books [Hag15].
1 See the online database Mazgaba seelat (Ruba Kusa); on the historical carpets of Rubakusa, see Gervers 2004:292-93; Grant ES. Ethio-SPaRe Cultural Heritage of Christian Ethiopia: Salvation, Preservation and Research.
Figure 1.2: Sample Amharic and Ge'ez document collections. The top image of this figure is a Ge'ez document1 while the others are Amharic documents. Each document image depicts variations of fonts, styles, and layouts of Amharic script that limit the implementation of a robust OCR system. The characters of Ge'ez are a subset of Amharic characters; thus, from the OCR perspective, a recognition model for Amharic script can be directly applied and/or extended to Ge'ez script recognition.
These documents contain information that has played a crucial role in shaping national strategy and supporting digital innovation by public organizations in their implementation of digital initiatives [CBC19, NAL]. To support librarians, historians, and other experts in the humanities, academic and industrial researchers have been attempting to develop OCR models to digitize textual resources and documents for various scripts. However, the nature of a script, such as the type of fonts, representation of encoding, quality of printing, and support from operating systems, affects document standardization, causing languages to have diverse features [Mes08]. Consequently, these issues add greatly to the complexity of the design and implementation of document image recognition systems for many scripts, including Amharic.
1.1 Motivation
The main motivation for this thesis is the research gap in OCR development for Amharic script. Amharic follows a unique writing system that has its own specific linguistic composition and indigenous characters. The Amharic language has a rich collection of documents, gathered from various sources over centuries, that should be digitized for further research and knowledge dissemination. Consequently, there is a need to develop an OCR model for Amharic script. However, despite this need, there have been only very limited research attempts at developing a robust and automatic OCR system for Amharic script.
So far, the research attempts made at Amharic OCR development are not well organized. These attempts are also implemented based on classical machine learning algorithms, which require a lot of effort for preprocessing and feature extraction. The type and number of characters considered in the training sets are not enough to represent the occurrence of characters even in common Amharic documents. In addition, the performance of OCR models in general, and of the OCR models developed for Amharic script recognition specifically, is affected by the nature and variety of document images. The unavailability of organized datasets for training is also one of the limiting factors in Amharic OCR development. Hence, a large annotated real-world dataset is required to train OCR systems and improve their performance. However, it is a very challenging and time-consuming task to manually annotate real-world documents. Therefore, a method is needed which can automatically generate ground truth for character or text-line images that represents the real-world artifacts of Amharic documents. This motivated the author to introduce a simple method for the preparation of printed texts and automatic ground-truth generation of synthetic Amharic text-line images.
The present growth of document digitization and changes in the world's linguistic landscape demand an immediate solution for enabling information access for everyone. This requires research in the areas of natural language processing and document image analysis. In parallel, the demand for language technology is increasing for resource-limited languages, to break language barriers. Amharic is one of these resource-limited languages, with about 100 million speakers around the world. It has an indigenous script with its own alphabet. Amharic is rich in historical documents collected over centuries. However, from an OCR perspective, there has been only limited research effort on Amharic script. The nature of the script and the lack of available data pose many challenges for document image analysis and recognition tasks.
For decades, OCR was the only way to turn printed documents into data that could be processed by computers, and it remains the tool of choice for converting paper documents into editable data that can be processed further. Ultimately, OCR will continue to be a valuable tool for bridging the gaps between humanity and the digital world. Since the truly paperless business does not yet exist, data extraction is still a useful tool that can augment the development of OCR technology and minimize the scope for errors.
Deep learning-based approaches have improved over the last few years, reviving interest in the OCR problem, where neural networks can be used to combine the task of localizing texts in an image with understanding what the texts are. CNN architectures, Attention mechanisms, and RNNs have gone a long way in this regard. However, to the best of our knowledge, the research done on Amharic OCR so far has been uncoordinated, and none of the researchers have taken advantage of deep learning techniques such as sequence-to-sequence learning; instead, segmentation-based traditional machine learning techniques have been employed in nearly all of these attempts. In addition, the attempts at Amharic OCR in the literature have neither shown results on a large dataset nor considered all possible characters used in Amharic script. Therefore, this work aims to explore the problem of OCR for Amharic script and solve some of these problems by applying contemporary deep learning techniques.
1.3 Research Objectives
Given the increasing need to digitize documents and its impact on the accessibility of indigenous knowledge, this thesis aims to construct a robust OCR model for Amharic script recognition using contemporary deep learning algorithms. This thesis also aims to reduce the research gap for the Amharic language and to develop a baseline database for Amharic OCR, which can help researchers further improve the OCR system. To achieve the objectives of this thesis, we pose the following theoretical hypotheses, which we test through empirical investigation.
4. Given that the second and third hypotheses are correct, using the best of both worlds and blending the capability of the Attention mechanism into the CTC objective function boosts the recognition performance of the OCR model.
1.4 Contributions
In this thesis, we focus on designing an OCR model for Amharic script. Some
of the main conceptual and technical contributions of this thesis are summa-
rized as follows:
This thesis is organized into eight chapters. The first chapter provides an overview of the general background of OCR in general and of Amharic script recognition in particular. The motivation behind the present work, the statement of the problem, the research objectives, the list of publications from this thesis, and the major contributions are also briefly described. In chapter two, we present the trends in OCR research and the various methodologies that have emerged over decades of OCR research. Both the traditional and state-of-the-art OCR approaches that were and are being applied in DIA and OCR research are presented. In particular, it highlights the sequence-to-sequence learning techniques, such as the Attention mechanism and CTC-based recurrent networks, that are utilized in this thesis for Amharic script.
Chapter four is about the dataset used in this thesis; it introduces and describes the details of an Amharic database called ADOCR, which contains both printed and synthetically generated images. The dataset is organized into two levels: character-level and text-line-level images. Data collection and synthetic data generation techniques are also described.
Chapters six and seven provide the system architectures of various sequence-to-sequence learning models. Chapter 6 focuses on using LSTMs for sequence-to-sequence mapping together with CTC. CNNs are also covered and employed as feature extractor modules stacked before the recurrent networks. Further, various recurrent network-based approaches are incorporated in the learning processes, and experimental results are presented to show their applicability and performance in the recognition of Amharic text-images.
The Attention mechanism has received much interest for NLP tasks such as neural machine translation. Based on the idea of neural machine translation, an Attention-based encoder-decoder network is proposed in Chapter 7. In the first part of this chapter, an end-to-end Amharic OCR model is proposed using LSTM networks, without CTC alignment, as the encoder and decoder units. Attention is embedded between the encoder and decoder LSTMs to allow the decoder to attend to different parts of the source sequence and to let the model learn how to generate a context vector for each output time step. This model gives state-of-the-art results on the ADOCR test datasets. As an extension of this work, a blended Attention-CTC network, formulated by directly taking advantage of the Attention mechanism and the CTC network, is reported in the second part of this chapter. The performance of this blended Attention-CTC network is evaluated against the ADOCR test sets, and it performs significantly better than the other sequence-to-sequence models.
Part of the work presented in this thesis has been accepted and/or presented in peer-reviewed conferences or journals. The list of publications derived from this thesis work is provided as follows.
Journals
This chapter provides a highlight of the existing methods and trends in OCR technology. Technically, OCR comprises numerous procedures and techniques that have emerged over a century of research. The rapid growth and availability of computing power has facilitated the use of sophisticated and diversified approaches in OCR development. The following sections of this chapter review the multiple approaches of OCR research, which can be broadly categorized into two groups. The first category, the classical OCR approach, is commonly called segmentation-based OCR; it involves the segmentation of a page into paragraphs, then into text-lines, followed by words, and finally into individual characters. A character recognizer then recognizes the segmented characters one by one. Section 2.1 describes various methods developed for this type of OCR approach. The second category is called segmentation-free OCR in the literature and the modern OCR approach in this thesis. Section 2.2 describes this OCR approach in detail. Section 2.3 gives an overview of deep neural networks, and Section 2.4 presents the state of the art and emerging trends of the segmentation-free OCR approach and Seq2Seq tasks in general. Recognition in the segmentation-free approach is done at the word or text-line level; it avoids the task of extracting characters from text-lines and minimizes the effort required for preprocessing and feature extraction.
Segmentation-based OCR techniques have been used for years; during this period, significant works have been done for the recognition of different scripts [PC04, CL96, LV12], and they have also been widely applied to Amharic script recognition [Mes08, AB08, MJ07, CH03]. The smallest individual components fed to the final recognition stage in the segmentation-based OCR technique are characters. In general, character segmentation is achieved using either explicit or implicit segmentation techniques, where the former is employed in segmentation-based OCR while the latter is a feature of the segmentation-free OCR approach [Cho14, HS95, SG97]. The detailed procedures of this segmentation technique are described as follows:
Figure 2.1: Character segmentation process and segmentation problems on the sample Amharic word መገንባት. The character መ is over-segmented into ዐ ዐ (i.e., a single Amharic character pronounced as "me" is taken for two other characters pronounced as "a") [UH16].
$$f(x, y) = \frac{\sum_{x,y}\left[I(x, y) - \bar{I}_{u,v}\right]\left[T(x - u, y - v) - \bar{T}\right]}{\sqrt{\sum_{x,y}\left[I(x, y) - \bar{I}_{u,v}\right]^2 \sum_{x,y}\left[T(x - u, y - v) - \bar{T}\right]^2}} \qquad (2.1)$$

where $I$ is the image, $T$ is the template, and $\bar{T}$ and $\bar{I}_{u,v}$ are the mean of the template and the mean of the region under the template $I(x, y)$, respectively.
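The normalized cross-correlation of Equation (2.1) can be evaluated directly. The following NumPy sketch computes it for a single template offset (u, v) and slides the template to find the best match; the array sizes are illustrative assumptions.

```python
import numpy as np

def ncc_at(image, template, u, v):
    """Normalized cross-correlation f at offset (u, v), per Equation (2.1)."""
    th, tw = template.shape
    region = image[v:v + th, u:u + tw]   # region of I under the template
    r_mean = region.mean()               # mean of the region, I-bar_{u,v}
    t_mean = template.mean()             # mean of the template, T-bar
    num = np.sum((region - r_mean) * (template - t_mean))
    den = np.sqrt(np.sum((region - r_mean) ** 2)
                  * np.sum((template - t_mean) ** 2))
    return num / den if den > 0 else 0.0

# Slide the template over the image and keep the best-matching offset.
image = np.random.rand(64, 64)
template = np.random.rand(8, 8)
best_score, best_offset = max((ncc_at(image, template, u, v), (u, v))
                              for v in range(64 - 8) for u in range(64 - 8))
```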
This section describes deep neural networks, artificial neural networks with multiple layers between the inputs and outputs, in detail. The concept of the deep neural network originated from Artificial Neural Networks (ANN) [HS06], which are inspired by the structure and function of the biological brain. Until the birth of deep learning [HOT06], ANNs and other statistical classifiers, such as Support Vector Machines (SVM), K-Nearest Neighbors (KNN) [HM16], and Random Forests (RF) [Pal05], were adopted and widely implemented for various applications. Those classical machine learning techniques, in general, consist of two modules: the feature extractor and the classifier. The feature extractor is responsible for the extraction of discriminant features, usually called handcrafted features, from the given data. Once those features are extracted, they are subsequently fed into the classifiers. However, after the introduction of deep neural networks such as Convolutional Neural Networks (CNNs), such stepwise procedures were unified into one framework in which both the feature extractor and classifier modules are trained in an end-to-end fashion using back-propagation [LFZ+15, SJS13].
to the next layer, while pooling layers are responsible for reducing the spatial size of the input [PZLL20].
In addition to these two basic layers, CNNs may contain fully connected layers, which take the flattened matrix from previous layers and compute the class scores, and they are governed by various network hyper-parameters, such as stride, padding, and depth (number of filters), which control the learning algorithm [BSS17]. Convolutional neural networks have been successfully applied to various computer vision applications such as image recognition [TT18], facial recognition [HML+18], image segmentation [YNDT18], and character recognition [BHL+19b]. CNNs are the dominant methods in various image-based tasks, and they continue to attract the interest of the machine learning community across many fields, including researchers in document image analysis and recognition.
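As a worked illustration of how stride and padding control a layer's output, the spatial output size of a convolution follows the standard formula $o = \lfloor (w - f + 2p)/s \rfloor + 1$; a minimal sketch:

```python
def conv_output_size(w, f, p, s):
    """Spatial output size for input width w, filter f, padding p, stride s."""
    return (w - f + 2 * p) // s + 1

# A 32x32 character image convolved with a 3x3 filter, padding 1, stride 1
# keeps its spatial size; with stride 2 the size is halved.
assert conv_output_size(32, 3, 1, 1) == 32
assert conv_output_size(32, 3, 1, 2) == 16
```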
$$O_t = h_t W_o + b_o \qquad (2.3)$$

where $O_t$ is the output, $W_o$ is the weight at the output layer, $b_o$ is the bias, and $h_t$ is the hidden state variable at time step $t$.
In Connectionist Temporal Classification (CTC), the probability of a path $\pi$ given an input sequence $x$ is the product of the per-time-step label probabilities, $p(\pi|x) = \prod_{t=1}^{T} y_{\pi_t}^t$, where $t$ is the time step and $\pi_t$ is the label of path $\pi$ at $t$. CTC is independent of the underlying neural network structure; rather, it refers to the output and its scoring. The difficulty of training comes from there being many more observations than actual labels; in a text-line, for example, a single character can span multiple time slices. Since we do not know the alignment of the observed sequence with the target labels, we predict a probability distribution at each time step [Han17], and the CTC is then used to align the output activations of the neural network with the target labels.
The probability of a target sequence $y$ given the input sequence $x$ is the sum of the probabilities of all paths that reduce to this label sequence under the collapsing function $B$, and is obtained using Equation (2.7):

$$p(y|x) = \sum_{\pi \in B(y)} p(\pi|x) \qquad (2.7)$$
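As an illustration, the collapsing function B removes consecutive repeated labels and then blanks; the sketch below implements B and shows how a CTC loss can be attached in Keras via ctc_batch_cost. The tensor shapes are illustrative assumptions, not the exact configuration used in this thesis.

```python
import tensorflow as tf

BLANK = 0  # blank index used only for this standalone illustration of B

def collapse_B(path):
    """B: remove consecutive repeats, then remove blanks."""
    out, prev = [], None
    for label in path:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return out

# Two frame-level paths that both reduce to the label sequence [5, 3]:
assert collapse_B([5, 5, 0, 3, 3]) == [5, 3]
assert collapse_B([0, 5, 0, 0, 3]) == [5, 3]

# CTC loss over a batch: y_pred holds per-time-step class distributions of
# shape (batch, time, classes). Note that Keras' ctc_batch_cost reserves
# the *last* class index for the blank label.
y_true = tf.constant([[5, 3]], dtype=tf.int32)          # target labels
y_pred = tf.nn.softmax(tf.random.uniform((1, 10, 20)))  # network output
input_length = tf.constant([[10]])                      # time steps per sample
label_length = tf.constant([[2]])                       # labels per sample
loss = tf.keras.backend.ctc_batch_cost(y_true, y_pred,
                                       input_length, label_length)
```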
[Figure: a CTC-based recurrent network. Input layers feed stacked RNN hidden layers; the output layers are aligned by CTC to the predicted text "አማርኛ የኢትዮጵያ ፡ መደበኛ ፡ ቋንቋ ፡ ነው::".]
Attention has been used in different contexts across multiple disciplines. For example, psychologists define attention as the behavioral and cognitive process of selectively concentrating on a discrete aspect of information while ignoring other perceivable information [Jam07]. Biologically, human visual attention allows us to focus on a certain region with high resolution while perceiving the surroundings in low resolution, and then to adjust the focal point or do the inference accordingly [All89]. In computing, Attention is a component of the network architecture that is in charge of managing and quantifying the interdependence of sequences [BCB14].
As stated in the literature, the way the alignment score is calculated and the position at which the attention mechanism is introduced into the decoder are the major differences among the various attention mechanisms. For example, in Luong's local attention with the Concat alignment score function, the decoder hidden state and encoder hidden state do not have their own individual weight matrices but a shared one, unlike in the attention model proposed by Bahdanau. The decoder hidden state and encoder hidden states are therefore first added together and passed through a linear layer; a tanh activation function is then applied to the output before it is multiplied by a weight matrix to produce the alignment score. A generic flow of Attention-based sequence learning, designed based on the work of Bahdanau [BCB14], is shown in Figure 2.3.
Figure 2.3: Illustration of Attention-based sequence learning. In the attention-based sequence learning paradigm, two RNNs (encoder and decoder) are employed. An encoder processes the input sequence $(x_1, \ldots, x_{T_x})$ of length $T_x$ and compresses the information into a context vector ($\text{context}_i$) of fixed length. This representation is expected to be a good summary of the meaning of the whole input sequence, while the decoder is initialized with the context vector to emit the transformed output sequence $(y_1, \ldots, y_{T_y})$ of length $T_y$.
The context vectors are customizable for each output element. Since the context vector has access to the entire input sequence, we do not need to worry about forgetting. The alignment between the source and the target is learned and controlled by the context vector.
Figure 2.4: A typical attention model. This design computes the output $c$ from the initial decoder hidden state $s_0$ and the parts of the given sequence $h_i$, where the decoder hidden state at $t - 1$ is first concatenated with each encoder hidden state and passed through a tanh activation. The tanh layer can be replaced by any other network capable of producing an output from $s_0$ and $h_i$. Here, $w_i$ and $a_i$ denote attention weights and alignment scores, respectively.
The encoder is a bidirectional RNN with forward hidden states $\overrightarrow{h}_i$ and backward hidden states $\overleftarrow{h}_i$. With the motivation of including both the preceding and following words in the annotation of one word, the encoder state $h_i$ is the concatenation of the two hidden states, represented as $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$ for $i = 1, \ldots, T_x$. The decoder network has hidden state $s_j = f(s_{j-1}, y_{j-1}, c_j)$ for the output sequence at position $j$, $j = 1, \ldots, T_y$, where the context vector $c_j = \sum_{i=1}^{T_x} a_{ji} h_i$ is a sum of the hidden states of the input sequence, weighted by alignment scores $a_{ji} = \text{Align}(y_j, x_i)$, which indicate how well the two sequences $y_j$ and $x_i$ are aligned. The alignment score $a_{ji}$, as described in the work of Bahdanau [BCB14], is computed by a feed-forward network with a tanh activation function that is jointly trained with the other parts of the model. The alignment score is given as $\text{score}(s_j, h_i) = v_a^T \tanh(W_a [s_j; h_i])$, where both $v_a$ and $W_a$ are weight parameters learned by the alignment model. A typical attention layer architecture that computes a context vector for the first decoding time step is
depicted in Figure 2.4. Even though the Attention mechanism was designed and has been used to solve problems in NMT, it is nowadays also widely applied in various fields such as image captioning [LTD+17], visual question answering [AHB+18], and OCR [ZDD18, PV17, CV18], and promising experimental results have been reported.
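A minimal NumPy sketch of one step of this additive (Bahdanau-style) attention, computing the alignment scores, weights, and context vector for a single decoding time step; the dimensions and random weights are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

Tx, enc_dim, dec_dim, att_dim = 5, 8, 8, 16
rng = np.random.default_rng(0)

H = rng.normal(size=(Tx, enc_dim))   # encoder hidden states h_1 ... h_Tx
s = rng.normal(size=dec_dim)         # current decoder hidden state s_j
W_a = rng.normal(size=(att_dim, dec_dim + enc_dim))  # alignment weights
v_a = rng.normal(size=att_dim)       # scoring vector

# score(s, h_i) = v_a^T tanh(W_a [s; h_i]) for every encoder position i
scores = np.array([v_a @ np.tanh(W_a @ np.concatenate([s, h])) for h in H])
weights = softmax(scores)            # alignment weights a_ji
context = weights @ H                # context vector c_j: weighted sum of h_i
```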
This chapter provides an overview of the script, Amharic, that we deal with in this thesis. Amharic is an exotic script that has its own indigenous alphabet and a rich collection of documents. Understanding the language and its writing system is an indispensable step in Natural Language Processing (NLP) tasks in general, and an OCR application heavily depends on the writing system of the language. This chapter first presents the demographic distribution of Amharic speakers and the genetic structure of Amharic script. Furthermore, the writing system and orthographic identity of Amharic characters, with their corresponding unique features in the script, are presented. This chapter also provides a brief description of related works on Amharic script that were done prior to and/or in parallel with this thesis. Finally, a summary of the chapter is presented.
In Ethiopia, there are more than 86 languages, divided into Semitic, Cushitic, Omotic, and Nilo-Saharan groups [EF20], with up to 200 different dialects spoken [MJ05, Mes08]. The languages are written with either Ge'ez/Amharic and/or Latin scripts. Of the 86 languages in Ethiopia, 41 are living languages at the institutional level, 14 are developing, 18 are vigorous, 8 are in danger of extinction, and 5 are near extinction [EF20]. Of these languages, Amharic is an official working language of the Federal Democratic Republic of Ethiopia and is spoken by more than 50 million people as their mother language and by over 100 million as a second language.
The script of the Amharic language originated from the Ge'ez script, an ancient Ethiopian script, around 300 A.D. Amharic script consists of the full set of Ge'ez characters and includes its own additional characters. Nowadays, Amharic characters are used to write various languages in Ethiopia, including Amharic, Guragegna, Awi, Argoba, and Tigrigna. Figure 3.2 depicts the family of the Amharic language.
3 https://www.adams.africa/works/ethiopia/
[Figure 3.2 tree: Afro-Asiatic branches into East-Semitic and West-Semitic; West-Semitic into Central-Semitic and Ethiopian-Semitic; Ethiopian-Semitic into North and South.]
Figure 3.2: Family of the Amharic language.
been retired to reserved use, primarily for calendar dates and sometimes for paging. As stated in the literature, the Amharic writing system does not have symbols for zero, negative numbers, the decimal point, or mathematical operators. Therefore, Arabic numerals are used everywhere else, while Latin mathematical operators are used for arithmetic computation.
• The direction of writing and reading the characters is from left to right and top to bottom, where characters are written in a disconnected manner with proportional spacing.
Figure 3.4: Other symbols of Amharic Fidel. (a) Labialized characters. (b) Numerals and (c) punctuation marks. In most modern Amharic documents, Amharic numerals are replaced by Latin numbers, while some of the punctuation marks, such as the section mark and paragraph separators, are no longer used.
• Vowel characters are derived from the consonants with a small modification of their shapes. For example, as shown in Figure 3.3, the 1st column consists of the basic symbols of Amharic characters with no explicit vowel indicator, usually called consonants. In the 2nd and 3rd columns, the corresponding consonants are modified by adding a hyphen-like (-) diacritic halfway down the right leg and at the base of the right leg of the character, respectively. The 4th column has a short left leg, while the 5th column has a loop/circle on the right leg. The 6th and 7th columns are less structured and lack consistency, but some regular modification is made on either the left or right leg of the base characters; otherwise, other modifications such as loops or small diacritics may be added to the top of the characters.
• Sentences are separated with the Ethiopic four-dot full stop (።), while paragraphs are separated with a full line of horizontal space or can be started with indentation.
• Unlike the Latin characters, there is no capital and lower case distinc-
tion among Amharic characters.
• There is a noticeable variance in width and height among characters: some characters are vertically longer than others, while the latter are horizontally wider than the former.
In addition, the lines of printed Amharic script lie at the same level, having no ascent and descent features, so there is always a white line between two consecutive lines of characters. Sample Amharic script, including some of the features stated above, is illustrated in Figure 3.5.
Even though there is a great demand for Amharic OCR, research in this field is still behind that of well-known scripts, such as Latin, and remains largely untouched. Amharic script is much more complex compared to Latin scripts: the character shapes are far more numerous and vary widely across different non-standard fonts. Since Amharic is an indigenous language with its own unique characters and writing system, understanding the characteristics and distinctive features of this script is challenging and also interesting, since it is of enormous use for the development of an optical character recognition system for Amharic script. Some of the generic and script/language-specific challenges are outlined below:
• Visual similarity among characters: This is due to shape similarity between consonant characters, and also across vowels, which are derived from the base characters with small modifications. Since characters in Amharic script are derived from 34 basic characters, there is a high degree of visual similarity among the derived characters.
5 https://www.bbc.com/amharic/news-51780956
[Figure 3.5 content: the same Amharic news passage rendered three ways, as (a) words separated with the Ethiopic two-dot word separator, (b) words separated by two dots followed by a blank space, and (c) words separated with blank spaces only.]
Figure 3.5: Sample Amharic scripts. (a) Words are separated with two dots. (b) Words are separated by two dots followed by a blank space. (c) Words are separated with a blank space. The texts are obtained from BBC Amharic news5.
• Printing quality and font variations: Printed documents are written using various fonts, styles, and sizes, with different qualities of printing materials such as ink and printing paper, and even of the printing device itself. As a result, the shape and appearance of characters vary substantially. For printing in Amharic script, there are different fonts (for example, Power Geez, Visual Geez, Nyala, and Agafari) with several stylistic variants such as Normal, Bold, and Italic, in addition to different font sizes, including 10, 12, 14, and 16 point [Mes08]. Since these fonts, styles, and sizes produce texts that vary greatly in appearance, developing character recognition systems is a challenging task. These challenges, however, are independent of the nature of the script; rather, they are incurred by either font variation or printing quality.
Even though numerous attempts have been made to overcome the above-stated general and language-specific challenges of Amharic script, only a few attempts have been made to develop an OCR model for Amharic character recognition. The OCR models developed by researchers so far are based on classical machine learning techniques, and only a limited set of Amharic characters was considered for training; as a result, these OCR models yield poor character recognition performance, and there is not even an organized, publicly available OCR dataset. Therefore, new paradigms should be researched, and the accuracy of the existing algorithms should be verified and optimized. The next section presents the efforts that have been made to develop Amharic OCR.
Following the first attempt, made in 1997 by Worku [Ale97], different statistical machine learning techniques have been explored by many researchers for Amharic script recognition. In 1999, Dereje [Tef99] conducted research on recognizing typewritten Amharic text with the aim of improving Amharic OCR, and recommended that the algorithms adopted for Amharic OCR should not be very sensitive to the writing styles of characters. A year later, Million [Mil00] adopted a recognition algorithm that can handle the different typefaces of Amharic characters, with the aim of investigating and extracting the attributes of Amharic characters so as to generalize the previous research works on Amharic OCR.
In this chapter, we have discussed an overview of Amharic script. The origin of Amharic script, its character sets and their formation, the generic structure, and its relationship with other Semitic languages were presented. In addition, we highlighted the difficulties and challenges related to Amharic script that have a bearing on OCR development, especially the existence of visually similar symbols and the use of a number of distinct fonts in printed documents that are designed in an unstructured manner.
We also presented a short survey of related works on indigenous languages, with a particular emphasis on Amharic script, which is currently used as the official working language of Ethiopia and has its own indigenous alphabet and writing system called Fidel. Amharic script has 34 base characters, each of which occurs in seven orders representing syllable combinations consisting of a consonant and a following vowel. The total number of unique characters used in Amharic script is about 317, including 238 basic characters and other symbols representing labialization, numerals, and punctuation marks.
Developing an OCR model for an indigenous but resource-limited language is challenging, and new algorithms are required to address these challenges. Unlike most modern Latin scripts, which enjoy a collection of resources including databases and benchmark OCR results, the scarcity of such needed resources makes OCR development very challenging for Amharic script. In the next chapter, a baseline database prepared for Amharic OCR development is described in detail.
Chapter 4
Amharic OCR and Baseline Datasets
This chapter gives a brief explanation of a baseline database developed for Amharic OCR and the techniques employed for its creation. First, it highlights the effect of dataset unavailability on Amharic character recognition tasks. Next, it introduces related works regarding datasets in general and for Amharic script in particular. Finally, it gives a summary of all works and contributions regarding the Amharic OCR database.
4.1 Introduction
The primary issue in the field of OCR in general, and for Amharic OCR in particular, is the lack of an annotated dataset. Therefore, this section describes the various approaches followed to develop the database and the details of the Amharic OCR database called ADOCR [BHL+19a]. The ADOCR database contains character-level and text-line-level images. Both datasets are freely available and can also be obtained from the author for research purposes. Sample character images from this dataset are illustrated in Figure 4.1. The following section presents the details of the ADOCR database.
Despite their numerous attempts to do so, researchers have not been able to prepare Amharic datasets that are large enough to train deep learning models. In addition, none of them made their datasets publicly available. Therefore, to advance OCR research for Amharic script, a new dataset of Amharic characters is proposed.
This dataset is organized into two groups. The first group, Type-I, consists of 80,000 Amharic character images with the corresponding Ground-Truth (GT) texts. The character images and their corresponding GT in this group are in png and txt formats, respectively. Since it has no explicitly separated training and test sets, researchers are free to split the training and test sets as appropriate.
However, the second group of the character-level dataset contains 72,194 sample Amharic character images as a training set and 7,800 sample images as a test set. In this dataset, Type-II, both the character images and the corresponding GT are stored in numpy file format. Moreover, the images are grey-level with a size of 32 by 32 pixels and were generated synthetically using OCRopus [Bre08], an open-source OCR framework. Except for about 2,006 distorted Amharic character images, which were removed from Type-I, both Type-I and
Figure 4.2: Fidel Gebeta. Basic Amharic characters arranged in row-column-wise (33 × 7) order, read from left to right: the first column gives the consonant characters, while the next six columns are derived by combining the consonant with vowel sounds.
Type-II datasets contain the same character images. In this dataset, we only considered the 231 basic Amharic characters that are arranged in a matrix-like structure, called Fidel Gebeta, illustrated in Figure 4.2.
The text-line images have two parts: the first part is synthetically generated Amharic text-line images, while the second part is printed Amharic text-line images.
Following the popularity of using artificial data in computer vision and pattern recognition, the shortage of training datasets is now partially solved. A similar approach is employed in this thesis to address the problem of limited labeled Amharic script data. During synthetic data generation, to make the synthetic data closely resemble scanned documents with real-world properties (such as degradation, font, and skew), we use different parameters of the OCRopus framework, such as a threshold with values between 0.3 and 0.5 and a jitter value of 0.5.
As shown in Figure 4.3, utf-8-encoded text lines and ttf-type font files are used as the inputs of the OCRopus-linegen engine, which generates text-line images with the corresponding GT texts. Since the OCRopus-linegen engine does not generate character-level images directly, we first prepare a text file that consists of one character per line and then feed this file to the engine for character-level image generation. Furthermore, some necessary changes were made to the OCRopus framework code so as to adapt it to documents written in Amharic script.
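As an illustration of the idea behind synthetic line generation (OCRopus-linegen itself is used in this thesis), the following Pillow-based sketch renders a ground-truth string as a degraded text-line image; the font path and the degradation parameters are illustrative assumptions.

```python
from PIL import Image, ImageDraw, ImageFont
import numpy as np

def render_line(text, font_path, height=48, width=1024):
    """Render a GT string as a degraded, binarized text-line image."""
    font = ImageFont.truetype(font_path, size=32)   # Amharic-capable ttf
    img = Image.new("L", (width, height), color=255)
    ImageDraw.Draw(img).text((4, 4), text, font=font, fill=0)
    arr = np.asarray(img, dtype=np.float32)
    # Mild degradation: additive noise plus a randomized binarization
    # threshold, loosely mimicking OCRopus parameters such as threshold.
    arr += np.random.normal(0, 12, arr.shape)
    threshold = np.random.uniform(0.3, 0.5) * 255
    return Image.fromarray(((arr > threshold) * 255).astype(np.uint8))

line_text = "አማርኛ የኢትዮጵያ መደበኛ ቋንቋ ነው።"
image = render_line(line_text, "fonts/PowerGeez.ttf")  # hypothetical path
image.save("line_0001.png")  # GT text is saved alongside, e.g. line_0001.gt.txt
```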
During printed text-line image dataset preparation, the contents of the documents are an important factor; thus, the collection of contents should
[Figure: a scanned page is processed with OCRopus-nlbin (binarization) and OCRopus-gpageseg (page segmentation).]
represent the real facts of Amharic script. In this regard, an electronic copy of Amharic documents, such as letters, newspapers, books, and magazines written in 12-point font and covering cultural, economic, political, and religious content, was collected. Once the documents were collected and organized, the printing was done using a laser-jet printer with the usual settings used in the Amharic document printing process.
The next step after printing is image acquisition, which can be done using different technologies such as cameras, scanners, and other mobile-based devices. In most document digitization, scanner-based acquisition is considered the standard approach and is abundantly used for scanning books, magazines, and other textual resources. The scanned contents of the printed Amharic texts were acquired by scanning 721 pages of printed Amharic documents using a flatbed scanner at 300 dpi. A sample scanned page of an Amharic document is shown in Figure 4.4.
This approach reduces the time and effort required by traditional text-line image annotation, which is usually done by transcribing each text-line manually by human operators or language experts.
From the total set of text-line images, 40,929 are printed text-lines written with the Power Ge'ez font, while 197,484 and 98,924 are synthetically generated text-lines with the Power Ge'ez and Visual Ge'ez fonts, respectively. The GT texts contain 280 unique Amharic characters and punctuation marks, which are depicted in Figure 4.6, and all characters that exist in the test dataset also exist in the training set. The text-lines in the ground truth of the ADOCR dataset have a minimum sequence length of 3 and a maximum length of 32 characters. In total, this dataset consists of 851,711 words composed of 4,916,663 characters, of which 269,419 characters are from the test datasets and 4,647,244 characters are from the training dataset.
Since the text-line images have different lengths, and the deep learning framework used in this thesis, Keras, requires a fixed
Figure 4.8: Sample Amharic text-line images from the ADOCR database. All images in this figure are normalized to a size of 48 by 128 pixels. (a) Printed text-line images written with the Power Ge'ez font type. (b) Synthetically generated text-line images with the Visual Ge'ez font type. (c) Synthetically generated text-line images with the Power Ge'ez font type.
sequence length to optimize GPU parallelization, all text-line images are resized to a width of 128. During text-line image length normalization, we first create a white target image of size 48 × 128 and then copy each text-line image into the white target image, as shown in Figure 4.7. All the text-line images are thus resized to an equal, fixed length, which makes batch training easy when the batch size is greater than one.
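A minimal NumPy sketch of this length normalization, assuming grey-level line images with a white (255) background:

```python
import numpy as np

def normalize_line(image, target_h=48, target_w=128, background=255):
    """Copy a text-line image onto a fixed-size white target canvas."""
    canvas = np.full((target_h, target_w), background, dtype=np.uint8)
    h = min(image.shape[0], target_h)   # crop if a line exceeds the canvas
    w = min(image.shape[1], target_w)
    canvas[:h, :w] = image[:h, :w]
    return canvas

# Every batch element now has the same fixed 48 x 128 shape.
line = np.random.randint(0, 256, size=(48, 97), dtype=np.uint8)
assert normalize_line(line).shape == (48, 128)
```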
The synthetic text-line images in this database are generated and stored at different levels of degradation. The text-line images and their GT text files in the training set are in the order of printed images followed by synthetically generated images with the Power Ge'ez and Visual Ge'ez fonts, respectively. All images and their corresponding GT texts are stored in numpy file format. The images in the test set are organized as separate numpy files, each with its corresponding GT texts. Sample printed and synthetically generated text-line images taken from the ADOCR database are shown in Figure 4.8. The test dataset contains a total of 18,631 randomly selected text-line images, of which 2,907 are scanned printed text-lines, while 9,245 and 6,479 are synthetically generated text-line images with the Power Geez and Visual Geez font types, respectively.
In this section, we introduce the first attempt at Amharic character image recognition based on deep learning benchmarks. A standard Convolutional Neural Network (CNN) is employed for the recognition of basic Amharic characters. The proposed method is trained with synthetically generated Amharic character images, so as to compensate for the shortage of training data, and the performance of the model is evaluated using both synthetically generated and printed Amharic characters. The detailed implementation procedure and related works are provided in the following sections.
5.1.1 Introduction
Amharic is an old Semitic language that has been and is widely spoken in Ethiopia. It is widely used in the country and serves as the official working language of the state. Most historic and literary documents are written and documented in this language. Despite the importance of the language, scanned-image documentation and retrieval in Amharic script is still a tedious process. Technologies that can automate the process of changing such scanned images into full, editable documents would be of enormous use in many sectors of the Amharic-speaking nation's economy. However, unlike the well-known Latin scripts, such as English and Fraktur, and non-Latin scripts, such as Arabic, there is no workable and effective off-the-shelf OCR software for Amharic script recognition. Over the last few decades, a limited amount of research effort has been devoted to the recognition of Amharic script, and almost all these attempts are based on classical machine learning techniques.
OCR has been and is widely used for many scripts as a method of digitizing printed texts, which can then be electronically edited, searched, stored more compactly, and displayed online, and which can facilitate human-to-machine and machine-to-machine communication such as machine translation, text-to-speech, key data, and text mining [YBJL12, Mes08, GDS10]. Previous research works on Amharic character recognition follow a long procedure that starts by segmenting the input image into lines and characters, followed by feature extraction and selection. The selected features are used as input for a classifier that produces a recognition lattice [YBJL12, Tef99]; this also requires more time and techniques during preprocessing and feature extraction.
For character recognition, various methods have been proposed, and high character recognition performance has been reported for the OCR of various scripts such as English [BCFX14], Fraktur [BUHAAS13], Arabic [GDS10], Devanagari [BCFX14], Malayalam [AMKS15], and Chinese [HZMJ15]. In the last two decades, many researchers [Tef99, Ass02, MJ07, Hag15] have also attempted to address the problems of Amharic script recognition.
Recognition algorithms that are not very sensitive to the writing styles of characters help to enhance the recognition rate of Amharic OCR. Million [MJ07] conducted empirical research to investigate and extract the attributes of Amharic characters written in different fonts and then generalized previously adopted recognition algorithms. Using different test cases, recognition accuracies of 49.38%, 26.04%, and 15.75% were registered for the WashRa, Agafari, and Visual Geez fonts, respectively. In addition, Yaregal [Ass02] noted that font variation is also a problem in designing an OCR system for printed Ethiopic documents.
Shatnawi and Abdallah [SA15] tried to model real distortions in Arabic script using real handwritten character examples to recognize characters. The distortion models were employed to synthesize handwritten examples that are more realistic, achieving 73.4% recognition accuracy. Haifeng [ZHZ17] proposed a deep learning method based on convolutional neural networks to recognize scanned characters from real life, with the characteristics of illumination variance, cluttered backgrounds, and geometric distortion, and reported promising results.
In summary, the above studies noted that the writing style of characters, which includes font variation, level of degradation, and size, is an important factor for OCR. Preprocessing, feature extraction, and classification algorithm selection are also major steps in constructing an effective recognition model. Even though many researchers have attempted to recognize small sets of Amharic character images using hand-crafted features, which are usually not robust, to the best of our knowledge no attempts have been made at Amharic character recognition using deep learning-based techniques. Therefore, there is still room for improving Amharic character image recognition using state-of-the-art techniques that can reduce preprocessing steps and computational resources.
The documents were printed using the usual printer settings for Amharic documents and then scanned using a Kyocera TASKalfa 5501i flatbed scanner at a resolution of 300 DPI. Preprocessing steps, including binarization and character image segmentation, were done using the same tool, the OCRopus framework. The OCRopus OCR framework and the detailed dataset generation procedure are described in Chapter 4, Sections 4.3.1 and 4.3.2.
For training, we consider Amharic character images with an input size of 32 × 32 pixels. To optimize the proposed network model, training is carried out using the stochastic gradient descent method with a mini-batch size of 150 and a momentum value of 0.9. A dropout rate of 0.25 is used to regularize the network parameters. The output of each convolutional block, with ReLU activations, is fed to the next network block until the probabilities of the last layer are calculated. The stochastic gradient descent method optimizes and finds the parameters of the connected network layers that minimize the prediction loss. The final output of the model is determined by a softmax function, which gives the probability of each class being true, computed using Equation (5.1).
f(z_j) = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \quad \text{for } j = 1, \dots, K \qquad (5.1)

where z is the vector of inputs to the output layer and K is the number of outputs.
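A minimal Keras sketch of the character-classification setup described above is given below. The stated hyper-parameters (SGD with momentum 0.9, mini-batch size 150, dropout 0.25, softmax output as in Equation (5.1)) follow the text; the number of convolutional blocks and filter counts are illustrative assumptions, not the exact thesis configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 231  # assumption: one class per Amharic character in the full class set

model = models.Sequential([
    # 32 x 32 greyscale character image as input
    layers.Conv2D(32, 3, padding="same", activation="relu", input_shape=(32, 32, 1)),
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(512, activation="relu"),
    layers.Dropout(0.25),                              # regularization, as in the text
    layers.Dense(NUM_CLASSES, activation="softmax"),   # Equation (5.1)
])

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
# model.fit(x_train, y_train, batch_size=150, validation_split=0.1, epochs=...)
```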
[Figure: CNN architecture for Amharic character recognition — convolutional layers interleaved with pooling operations, followed by fully connected layers mapping to the character classes ሀ, ለ, ሃ, ..., ፖ.]
To the best of our knowledge, there is no publicly available dataset for Amharic script recognition, and all researchers have conducted experiments on their own private datasets. Therefore, it may not be directly meaningful to compare the current Amharic character recognition model with other works performed under different experimental conditions and datasets. However, the comparison between these works can be used as an indication of research progress in Amharic character image recognition. The result of the current Amharic character recognition model and the results of previous classical machine learning-based research works are shown in Table 5.1.
Table 5.1: Character level recognition accuracy (out of 100%) and perfor-
mance comparison with related research works.
5.1.7 Conclusion
Based on the empirical results, confusion among visually similar characters across rows and columns was the major challenge. Therefore, we need to develop an alternative method to handle such confusion among these visually similar Amharic characters. The next section, Section 5.2, presents an alternative OCR approach which is designed by leveraging the grapheme of basic Amharic characters in Fidel-Gebeta.
In section 5.1, we noted that the similar shapes of characters across rows, together with the large number of classes, affect the performance of the OCR model developed for Amharic character recognition and make the model complex. However, there is a way to overcome the class-size complexity and the character similarity across rows of the Fidel-Gebeta: based on the structure of Fidel-Gebeta, the large set of classes can be divided into two smaller sets of subclasses (row-wise and column-wise classes).
To the best of our knowledge, no attempts have been made to exploit this property and recognize a character based on the row-column order of Amharic characters in Fidel-Gebeta, which has motivated us to work on it. Therefore, in this section, we propose a method called Factored Convolutional Neural Network (FCNN) for Amharic OCR. FCNN has two classifiers that share a common feature space before they fork out at their task-specific layers, where the two task-specific layers are responsible for detecting the row and column components of an input character.
Extensive research efforts have been made to develop OCR models for various scripts, and most of these scripts now have robust OCR applications, while other low-resource scripts, like Amharic, remain relatively unexplored [BHS18]. This section presents various character-level OCR works that have been done, using either classical machine learning or deep learning-based techniques, for different scripts including Amharic.
The first work on Amharic OCR was done by Worku in 1997, who adopted a tree classification scheme built using topological features of a character [Ale97]. It was only able to recognize Amharic characters written in the WashRa font at 12-point size. Other attempts followed, ranging from typewritten [Tef99] and machine-printed [MJ07] text, Amharic braille document image recognition [HA17], Amharic document image recognition and retrieval [Mes08], and numeric recognition [RRB18] to handwritten text [AB09]. However, all these attempts are based on classical machine learning techniques and trained with a limited number of private datasets. In addition, several other works have combined CNN-based methods with the classical approaches. Maitra et al. [MBP15] proposed a CNN as a trainable feature extractor with an SVM classifier for numeric character recognition in multiple scripts. In a similar study, Elleuch et al. [ETK16] employed a CNN as a feature extractor and an SVM as recognizer to classify Arabic handwritten script using the HACDB database.
Even though the attempts made for Amharic OCR so far are based on traditional machine learning techniques, in section 5.1 of this thesis, a CNN-based architecture for Amharic character image recognition is presented. Besides, all of the methods mentioned above consider each character as a single class [MJ07, Tef99, BHS18] and do not consider the row and column location of the character in Fidel-Gebeta. The problems of these approaches are that they treat structurally similar characters as different classes and use a large number of classes, which makes the network complex. It is also difficult to obtain good performance when each structurally similar character is considered as a separate class without explicitly using the information shared among them.
There are very few databases used in the various works on Amharic OCR reported in the literature. As reported in [Tef99], the authors considered the most frequently used Amharic characters with a small number of samples, about 5172 core Amharic characters with 16 sample images per class. Later work by Million et al. [Mes08] used 76,800 character images with different font types and sizes across 231 classes. Other researchers working on Amharic OCR [Ale97, AB09, RRB18] reported using their own private databases, but none of them made their databases publicly available for research purposes.
In this section, we use the same Amharic character image database employed in section 5.1, which contains 80,000 sample Amharic character images. In this database, there are many deformed and meaningless character images. Therefore, for our experiment, we removed about 2006 distorted character images from the database. We also labeled each character image in row-column order following the arrangement of Fidel-Gebeta.
This section describes the Multi-Task Learning (MTL) paradigm, one of the central training paradigms in machine learning, which can learn and solve multiple related tasks at the same time while exploiting commonalities and differences across all the tasks [DCBC15, DHK13]. Learning multiple related tasks simultaneously is inspired by the biological nature of learning in humans, where we often apply knowledge already acquired from related tasks in the environment so as to learn new tasks in the same and/or a different environment. MTL has been applied in many domains. Zhang et al. [ZLLT14] propose a convolutional neural network to find facial landmarks and then recognize face attributes. Kisilev et al. [KSBH16] present a multi-task CNN approach for detection and semantic description of lesions in diagnostic images. Yang et al. [YZX+18] introduce a multi-task learning algorithm for cross-domain image captioning. The common element in all these works is to use the same parameters for the bottom layers of the deep neural network and task-specific parameters at the top layers. As explained by Argyriou et al. [AEP07], learning multiple related tasks simultaneously has often been shown to yield significantly better performance than learning each task independently. Compared to training tasks separately, MTL improves learning efficiency and prediction accuracy of the task-specific models. MTL also provides benefits such as implicit data augmentation, attention focusing, and eavesdropping [AM90].
MTL can follow different training strategies with the objective of improving the performance of all the tasks [AEP07, ZY17]. Joint training is one of the most commonly used strategies, where the model uses hard parameter sharing and tries to optimize a single joint loss L computed from each task as

L = \sum_{i=1}^{N} l_i \qquad (5.2)

where l_i is the loss of task T_i and N is the total number of tasks involved in the training procedure. The character recognition model proposed in this section is designed by taking advantage of the learning strategies of MTL, as the sketch below illustrates.
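The following is a minimal sketch of hard parameter sharing with a joint loss in the sense of Equation (5.2): two task heads share a common trunk, and Keras minimizes the (weighted) sum of the per-task losses during training. The layer sizes and class counts here are generic placeholders, not the thesis configuration.

```python
from tensorflow.keras import layers, Model

x = layers.Input(shape=(128,))
shared = layers.Dense(64, activation="relu")(x)                      # hard-shared layers
out_1 = layers.Dense(10, activation="softmax", name="t1")(shared)    # task 1 head
out_2 = layers.Dense(5, activation="softmax", name="t2")(shared)     # task 2 head

mtl = Model(x, [out_1, out_2])
# With one loss per output, Keras optimizes the single summed objective L = l_1 + l_2.
mtl.compile(optimizer="adam",
            loss=["categorical_crossentropy", "categorical_crossentropy"],
            loss_weights=[1.0, 1.0])
# fit() then expects one label array per task:
# mtl.fit(X, [y_task1, y_task2], epochs=10)
```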
Figure 5.4: Factored CNN architecture. This method has two classifiers which share a common feature space in the lower layers and then fork out at their last layers. FC1 and FC2, at the last layer, represent the fully connected layers of the row and column detectors, having 33 and 7 neurons corresponding to the number of classes of each task respectively.
The details of the proposed system architecture, designed based on MTL for Amharic character image recognition, are described in the next section.
l_{joint} = l_r + l_c \qquad (5.3)

where l_r and l_c denote the losses of the row and column detectors, each computed with the categorical cross-entropy loss, -\sum_{i=1}^{C} t_i \log(s_i), where C is the number of classes, t_i is the ground truth for each class, and s_i is the score of each class calculated by Equation (5.1).
Figure 5.5: A typical diagram that shows the prediction of the FCNN model. For an input character image, the row component detector recognizes zero and the column component detector recognizes three, so the model output becomes the character (ሃ), which is located at row zero and column three of the Fidel-Gebeta.
The remaining character images of the database described above were used for training and testing. The proposed FCNN architecture contains two multi-class classification problems (the row and column detectors), and the convolutional layers have 32 filters of size 3 × 3. Every two blocks of convolutional layers are followed by a 2 × 2 max-pooling. The two fully connected layers have 512 neurons each, while the last forked layers have sizes corresponding to the number of unique classes in each specific task, in our case 33 for the row detector and 7 for the column/vowel component detector. Considering the size of the dataset, we trained our network using different batch sizes, and the best test result was recorded with a batch size of 256 and the Adam optimizer running for 15 epochs. Based on the results recorded during experimentation, 96.61% of the row and 95.96% of the column components of the characters are correctly detected by the row detector and column detector respectively.
An overall character recognition accuracy of 94.97% was recorded, and training converges faster since we reduced the number of classes from 231 to 40 (33 row-wise and 7 column-wise classes). As can be observed in Table 5.2, we achieved better recognition performance compared with works done on Amharic OCR using 231 classes. The value 'character' in the third column of Table 5.2 refers to recognition performed with 231 classes, while the row and column values refer to the factored recognition of the two subtasks. A minimal sketch of the FCNN follows.
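The sketch below follows the description above: stages of two 3 × 3 convolutional layers (32 filters) followed by 2 × 2 max-pooling, two shared fully connected layers of 512 neurons, and two forked softmax heads for the 33 row classes and 7 column classes. The number of conv/pool stages is an assumption; the text does not fix it exactly.

```python
from tensorflow.keras import layers, Model

inp = layers.Input(shape=(32, 32, 1))
x = inp
for _ in range(2):                                   # assumed two conv/pool stages
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)

x = layers.Flatten()(x)
x = layers.Dense(512, activation="relu")(x)          # shared FC layers
x = layers.Dense(512, activation="relu")(x)

row_out = layers.Dense(33, activation="softmax", name="row")(x)  # FC1: row detector
col_out = layers.Dense(7, activation="softmax", name="col")(x)   # FC2: column detector

fcnn = Model(inp, [row_out, col_out])
fcnn.compile(optimizer="adam",                       # Adam, as reported in the text
             loss={"row": "categorical_crossentropy",
                   "col": "categorical_crossentropy"})
# fcnn.fit(images, {"row": row_labels, "col": col_labels}, batch_size=256, epochs=15)
```

At prediction time, the two arg-max outputs are combined into a (row, column) pair that indexes the recognized character in Fidel-Gebeta, as in Figure 5.5.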
5.2.6 Conclusions
In the classical OCR approach, different handcrafted features were extracted from segmented components of a given script, and pattern classification algorithms were then employed for recognition. However, contemporary deep learning techniques can learn the distinguishing features automatically, thereby eliminating the manual feature extraction steps. Sequence learning is based on the sequence-to-sequence matching principle, where a neural network is trained such that both input and output are variable-length sequences. This part of the thesis overviews state-of-the-art segmentation-free OCR techniques and contains two sections. Each section provides a detailed analysis of the corresponding OCR techniques and reports their recognition performance on Amharic text-image recognition.
The remainder of this chapter is organized as follows: Section 6.1 describes the use of LSTM-CTC-based networks for Amharic OCR. Section 6.2 presents the details of a hybrid model based on CNN and LSTM networks for Amharic text-line image recognition. Section 6.3 concludes the chapter with a short summary.
This section provides the necessary details about the LSTM-based OCR methodology that has been used for Amharic text-image recognition, together with a detailed analysis of the results. The performance of various open-source OCR engines was evaluated on 19th-century Fraktur scripts, and the evaluation shows that open-source OCR engines can outperform commercial OCR applications [RSWP18]. However, the OCR of many scripts, especially those indigenous to the African continent (such as Amharic), remains under-researched, and no researchers have taken advantage of deep learning techniques such as LSTM, used for many other languages, to develop Amharic OCR.
6.1.2 Database
                         Printed      |        Synthetic
Font type                Power Geez   |  Power Geez    Visual Geez
Number of samples        40,929       |  197,484       98,924
No. of test-samples      2,907        |  9,245         6,479
No. of train-samples     38,022       |  188,239       92,445
No. of unique chars.     280          |  261           210
The CTC objective function can automatically learn the alignment between the input and output sequences [GFGS06], and the CTC loss function is used to train the network. During training, CTC requires only an input sequence and the corresponding transcription. The detailed formulation of the CTC cost function is explained in chapter 2, section 2.4.1. A minimal sketch of how the CTC loss is invoked is given below.
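The snippet below illustrates the point made above: the CTC loss needs only the per-frame softmax output, the label sequence, and their lengths; no frame-level alignment is required. The shapes (63 time steps, 281 symbols) are illustrative.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K

# Dummy batch: 2 text lines, 63 time steps, 281 output symbols (chars + blank).
y_pred = tf.nn.softmax(tf.random.normal((2, 63, 281)))      # network output
y_true = np.array([[5, 17, 42, 0, 0], [7, 7, 9, 13, 2]])    # padded label ids
input_length = np.array([[63], [63]])                        # frames per sample
label_length = np.array([[3], [5]])                          # true label lengths

loss = K.ctc_batch_cost(y_true, y_pred, input_length, label_length)
print(loss.shape)  # one CTC loss value per batch element: (2, 1)
```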
In all experiments described below, we use the same network architecture, and the performance of the proposed model is measured in terms of the character error rate, which is computed by counting the characters inserted, substituted, and deleted in each sequence, divided by the total number of characters in the ground truth (see Equation (6.2)).
6.1.4 Experiments
Two sets of experiments were carried out on the ADOCR database. The first experiment tests our model with synthetic Amharic text-line images generated with the Power Geez and Visual Geez fonts. The second experiment tests how the model, trained with synthetic and part of the printed text-line images, performs on printed text-line images written with the Power Geez font type. Sample Amharic text-line images taken from the ADOCR database are illustrated in Figure 6.2.
We randomly split the data into training and test sets, using a total of 318,706 samples for training and 18,631 samples for testing. For validation, we use 7% of randomly selected images from the training dataset. Detailed information on the database used for our experiments is shown in Table 6.1. The training and validation loss of the proposed model recorded over 50 epochs is depicted in Figure 6.3. The network was trained for 50 epochs with a batch size of 200, and character error rates of 4.24% on synthetically generated text-line images with the Visual Geez font, 8.54% on printed text-lines with the Power Geez font, and 2.28% on synthetic text-line images generated with the Power Geez font were recorded. The results recorded during experimentation are summarized in Table 6.2. The performance of the proposed model is measured using the Character Error Rate (CER), formulated as follows:
CER(P, T) = \frac{1}{q} \left( \sum_{n \in P, m \in T} D(n, m) \right) \times 100 \qquad (6.2)
where q is the total number of target labels, P and T are the predicted and ground-truth labels, and D(n, m) is the edit distance between sequences n and m. The edit distance is the difference between the predicted and ground-truth texts, which can occur due to the deletion, substitution, and insertion of characters during recognition. The results show that using a synthetic dataset is a viable way to compensate for the shortage of real training data. A small sketch of this metric is given below.
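The following is a small self-contained sketch of the CER in Equation (6.2): the Levenshtein edit distance between prediction and ground truth, summed over the test set and normalized by the total number of ground-truth characters.

```python
def edit_distance(a: str, b: str) -> int:
    """Dynamic-programming Levenshtein distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(predictions, ground_truths) -> float:
    total_chars = sum(len(gt) for gt in ground_truths)
    total_dist = sum(edit_distance(p, gt)
                     for p, gt in zip(predictions, ground_truths))
    return 100.0 * total_dist / total_chars

print(cer(["hello"], ["hallo"]))  # one substitution in 5 characters -> 20.0
```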
Examples of recognition errors are illustrated in Figure 6.4a.
6.1.6 Conclusions
6.2.1 Database
Since there is no other publicly available dataset for Amharic script recognition, to train and evaluate the performance of our OCR model we use the same database employed in section 6.1. In the original dataset, ADOCR, all text-line images are greyscale, and the size of each text-line was 48 by 128 pixels. Considering similar works done in the area, and to reduce computational costs during training, we resized the images to 32 by 128 pixels, as in the sketch below. Sample text-line images that are normalized to a size of 32 by 128 pixels and used for training and testing the model proposed in this section are illustrated in Figure 6.5.
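A sketch of this preprocessing step is given below: greyscale text-line images are resized from 48 × 128 to 32 × 128 pixels and scaled to [0, 1] before being fed to the network. Pillow is used here for illustration.

```python
import numpy as np
from PIL import Image

def load_text_line(path: str) -> np.ndarray:
    img = Image.open(path).convert("L")        # ensure greyscale
    img = img.resize((128, 32))                # PIL uses (width, height)
    arr = np.asarray(img, dtype=np.float32) / 255.0
    return arr[..., np.newaxis]                # (32, 128, 1) for the CNN input
```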
Figure 6.5: Sample Amharic text-line images from ADOCR database. All
images in this figure are normalized to a size of 32 by 128 pix-
els: (a) Printed text-line images written with the Power Ge’ez font
type. (b) Synthetically generated text-line images with the Visual
Ge’ez font type. (c) Synthetically generated text-line images with
the Power Ge’ez font type.
Since CNNs and RNNs are best suited for image-based problems [HKR15] and sequence recognition [GVK19] respectively, we propose a hybrid network framework, which is illustrated in Figure 6.6. In this framework, we employ three modules: the feature extractor, the sequence learner, and the transcriber. All three modules are integrated into a single framework and trained in an end-to-end fashion.
[Figure 6.6: The proposed hybrid framework — CNN layers extract features from the text-line image, a reshape operation converts them into a sequence, stacked LSTMs produce hidden states h1, ..., hT, and the output text is transcribed from these states.]
Table 6.3: Convolutional network layers of the proposed model and their corresponding parameter values for an input image of size 32 × 128 × 1.
The sequence learner is the middle layer of our framework, and it predicts the sequential output per time step. This module consists of two bidirectional LSTM layers, each with 128 hidden units and a dropout rate of 0.25, with a softmax function on top. The sequential output of each time step from the LSTM layers is fed into the softmax layer to get a probability distribution over the n + 1 possible characters. Finally, transcription into the equivalent characters is done using the CTC layer. The details of the network parameters and configuration of the proposed model are depicted in Table 6.3 (configuration of the CNN layers) and Table 6.4 (configuration of the LSTM layers).
During training, the input text-line image passes through the convolutional layers, in which several filters extract features from the input image. After the convolutional layers, we apply a reshape operation to the output and obtain a sequence of 63 vectors of 512 elements. We then feed these 63 vectors to the LSTM network and get its output, which is also a sequence of 512-element vectors. The output of the LSTM is fed into a fully connected layer with the softmax function, which has n + 1 nodes and yields a vector of 281 elements. This vector contains the probability distribution of observing character symbols at each time step. Each symbol corresponds to a label, i.e., a unique character in the ground truth, plus one blank character which is used to handle consecutive occurrences of the same character. We employed a checkpoint strategy that saves the model weights to the same file each time an improvement is observed in the validation loss. A sketch of this pipeline is given below.
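The sketch below follows the description above: a convolutional feature extractor whose output is reshaped into a sequence of 63 vectors of 512 elements, two bidirectional LSTM layers (128 hidden units, dropout 0.25), a softmax over the 281 symbols (280 characters + 1 CTC blank), and the checkpoint strategy. The exact conv/pool arrangement is an assumption chosen so the shapes match the 63 × 512 sequence in the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model, backend as K

inp = layers.Input(shape=(32, 128, 1), name="image")
x = layers.Conv2D(64, 3, padding="same", activation="relu")(inp)
x = layers.MaxPooling2D((2, 2))(x)                    # -> (16, 64, 64)
x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D((2, 1))(x)                    # -> (8, 64, 128)
x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D((2, 1))(x)                    # -> (4, 64, 256)
x = layers.Conv2D(512, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D((4, 1))(x)                    # -> (1, 64, 512)
x = layers.Conv2D(512, (1, 2), activation="relu")(x)  # -> (1, 63, 512)
x = layers.Reshape((63, 512))(x)                      # sequence of 63 vectors

x = layers.Bidirectional(layers.LSTM(128, return_sequences=True, dropout=0.25))(x)
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True, dropout=0.25))(x)
y_pred = layers.Dense(281, activation="softmax", name="softmax")(x)

model = Model(inp, y_pred)

# CTC wiring and the checkpoint strategy mentioned in the text:
labels = layers.Input(shape=(None,), name="labels")
in_len = layers.Input(shape=(1,), name="input_length")
lab_len = layers.Input(shape=(1,), name="label_length")
ctc = layers.Lambda(lambda a: K.ctc_batch_cost(a[0], a[1], a[2], a[3]))(
    [labels, y_pred, in_len, lab_len])
train_model = Model([inp, labels, in_len, lab_len], ctc)
train_model.compile(optimizer="adam", loss=lambda y_t, y_p: y_p)

checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "best_weights.h5", monitor="val_loss", save_best_only=True)
# train_model.fit(..., batch_size=200, epochs=10, callbacks=[checkpoint])
```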
Table 6.4: The recurrent network layers of the proposed model with their corresponding parameter values. The input of the Long Short-Term Memory (LSTM) layers is the squeezed output of the convolutional layers depicted in Table 6.3.
For validation, we use 7% of the training dataset, randomly selected, as proposed in section 6.1. The network was trained for 10 epochs with a batch size of 200. During testing of the proposed model, character error rates of 1.05% and 3.73% were recorded on the two test sets of the ADOCR database which were generated synthetically using the Visual Geez and Power Geez fonts, respectively. The model was also tested with the third test dataset, which consists of printed text-line images written with the Power Geez font, and a character error rate of 1.59% was obtained. All empirical results recorded during experimentation are summarized in Table 6.5.
This section first provides a detailed description of the dataset, and then presents the results obtained during experimentation. We performed repeated evaluations of our method using the ADOCR benchmark dataset (see chapter 4, section 4.3.2), and we also compared against the state-of-the-art methods on both printed and synthetically generated datasets.
Some of the printed text-line images are not properly aligned with the ground truth due to the occurrence of extra blank spaces between words and/or the merging of words during printing. In addition, in the synthetic text-line images, a character at the beginning and/or end of a word is sometimes missing, which results in misalignment with the ground truth. To improve recognition performance, it is important to annotate the data manually or to use better data annotation tools. Of the several factors, wrongly annotated Amharic text-line images and characters are the major cause of recognition errors. In general, recognition errors may occur due to misspelled characters, spurious symbols, or lost characters; sample annotation errors per text-line image are illustrated in Figure 6.7.
The proposed model works better on the synthetic test datasets generated
with the Visual Geez font type, compared to the character error rate observed
on the Power Geez font type and the printed test data. The generalization of
this model, especially on the synthetic dataset generated with the Power Geez
Figure 6.7: Sample wrongly annotated images and GT from the test dataset. (a) Synthetic text-line image; the word marked with a yellow rectangle is a sample mislabeled word where the first character (µ) in the GT, marked with a red circle, is missing in the input image but exists in the GT. (b) Printed text-line image; a punctuation mark called a full stop/period, bounded with a purple rectangle in the input image, is incorrectly labeled as two other punctuation marks called word separators, indicated by the green and red text colors in the GT.
font type, is not as good as on the Visual Geez one. This happens mainly because of the significantly larger number of text-line images in that test set and the nature of the training samples (i.e., the text-line images and the ground truth are not properly aligned due to deformed characters and characters missing at the beginning and/or end of the text-line images during data generation, but present in the ground truth). In addition, text-line images generated with the Power Geez font are relatively blurred, resulting in poorer recognition accuracy.
As depicted in Figure 6.8 and Table 6.6, the performance of the proposed model is improved. Compared to the LSTM-based Amharic OCR, the proposed model achieved better recognition performance with a smaller number of epochs. However, the proposed model takes a long time to train. This is due to the nature of end-to-end learning approaches [Gla17] that incorporate multiple and diverse network layers in a single unified framework. Therefore, the training time of the proposed model may be further improved by following other concepts such as decomposition [SSSS17] or a divide-and-conquer approach.
Comparisons between the proposed approach and other attempts on the ADOCR database [BHL+19a] are presented in Table 6.6; the proposed model shows better recognition results on all the test sets.
Figure 6.8: Learning loss comparison. (a) The training and validation losses of the LSTM-CTC model recorded over 50 epochs. (b) The training and validation losses of the proposed model recorded over 10 epochs.
6.2.6 Conclusions
RNNs trained with the CTC objective function have been employed for text-image recognition and were presented in detail in the previous chapter. CTC is a state-of-the-art approach and is widely applied to sequence learning tasks to date. Recently, another sequence learning technique has emerged from the field of NMT and has often been shown to improve performance over existing approaches. Therefore, this chapter presents further technical contributions of this thesis. These contributions are organized in two sections and presented as follows. Section 7.1 describes the Attention mechanism and its implementation details for Amharic text-image recognition, while section 7.2 shows the advantage of blending the Attention mechanism into the CTC objective function for sequence transcription in general and for Amharic text-image recognition in particular. Finally, the chapter provides summarized findings and results achieved with the two text-line image recognition approaches proposed for Amharic script.
script, which imparts valuable spatial and structural information of text-line images. In this section, we aim to push the limits of such techniques using Attention-based text-line image recognition, which is entirely based on neural networks. Unlike CTC, Attention-based models explicitly use the history of the target sequence without any conditional independence assumptions. The following section gives a detailed overview and the training procedures of the proposed Attention-based encoder-decoder OCR model.
7.1.1 Introduction
Standard OCR tasks have been investigated based on CNNs and RNNs [BHL+19a] utilizing the CTC [GLB+08] objective function. However, CTC-based architectures are subject to inherent limitations such as the conditional independence assumption, strictly monotonic input-output alignments, and an output sequence length bounded by the subsampled input length, whereas the Attention-based sequence-to-sequence model is more flexible, suits the temporal nature of text, and is able to focus on the most relevant features of the input by incorporating Attention mechanisms [BCB14].
Unlike the previous sections that use LSTM-CTC-based networks, the method proposed in this section uses the concept of the Attention mechanism, with the following major contributions:
The existing OCR models utilize either traditional or holistic techniques. Methods belonging to the first category were mainly applied before the introduction of deep learning and follow step-wise routines. In contrast, the holistic approach integrates the feature extraction and sequential transcription steps in a unified framework trained end-to-end. The details are given in chapter 2. Therefore, in this section, we only review the existing state-of-the-art techniques applied to sequence-to-sequence learning tasks.
The decoder generates the output one token at a time, using the Attention mechanism to gather context information and search for relevant parts of the encoded features. Bahdanau [BCB14] and Luong [LPM15] proposed Attention-based encoder-decoder models for machine translation. Other works, like Doetsch et al. [DZN16], apply an Attention neural network to the output of a BLSTM network operating on frames extracted from a text-line image with a sliding window, and Bluche [Blu16] proposes a similar technique for end-to-end handwritten paragraph recognition.
Finally, our system is based on the Bahdanau [BCB14] Attention network architecture, the one proposed for neural machine translation. The main difference is that we apply Attention to text-line image recognition, where the decoder network outputs character by character given the decoder history and the expected input from the Attention mechanism. Further, CNN layers are integrated into the encoder network as feature descriptors and, unlike Bahdanau's networks that use GRUs, both the encoder and decoder networks are LSTM networks. In addition, the hidden state of the decoder LSTM is randomly initialized with the Keras default weight initializer, the Xavier uniform initializer, instead of the final state of the encoder LSTM network. In the case of Bahdanau's Attention, training was done using the teacher forcing technique, while our proposed model is trained by re-injecting the previous decoder's predictions into the current decoder's input.
The convolutional layers of the encoder are adopted from the text-image recognizer proposed by Belay [BHM+20] (see section 6.2). Then, bidirectional LSTM layers, which read the sequence of convolutional features to encode the temporal context between them, are employed. The bidirectional LSTM processes the sequence in opposite directions to encode both forward and backward dependencies and capture the natural relationship of texts.
w_{j,i} = \frac{\exp(a_{j,i})}{\sum_{k=1}^{T_x} \exp(a_{j,k})} \qquad (7.2)

where a_{j,i} is the alignment score computed from the concatenation of the annotation h_i and the previous decoder state s_{j-1}.
[Figure 7.1: The proposed Attention-based encoder-decoder network. CNN layers extract features x_1, ..., x_{T_x} from the Amharic text-image; the encoder LSTM produces annotations h_1, ..., h_{T_x}; the Attention layer (FC1, FC2, softmax, and RepeatVector operations) computes context vectors that are concatenated with the decoder LSTM states s_1, ..., s_{T_y}; and softmax layers emit the output sequence y_0, ..., y_{T_y}.]
The functions g and f in Equation (7.3) are feed-forward neural networks with tanh activation, stacked consecutively. The intuition of f and g is to let the model learn the alignment weights together with the translation while training all the model layers. A sketch of this scoring scheme follows.
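The sketch below illustrates the alignment-score computation of Equations (7.2)-(7.3): the previous decoder state is repeated and concatenated with each encoder annotation, passed through two stacked tanh feed-forward layers (standing in for the f and g of the text), and the scores are soft-maxed over the input time axis to give the attention weights. Layer sizes are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

class BahdanauStyleAttention(layers.Layer):
    def __init__(self, units=64):
        super().__init__()
        self.f = layers.Dense(units, activation="tanh")  # first feed-forward layer
        self.g = layers.Dense(1, activation="tanh")      # projects to a scalar score

    def call(self, s_prev, h_enc):
        # s_prev: (batch, dec_units); h_enc: (batch, T_x, enc_units)
        T_x = tf.shape(h_enc)[1]
        s_rep = tf.tile(tf.expand_dims(s_prev, 1), [1, T_x, 1])       # repeat s_{j-1}
        scores = self.g(self.f(tf.concat([s_rep, h_enc], axis=-1)))   # a_{j,i}
        weights = tf.nn.softmax(scores, axis=1)                       # Equation (7.2)
        context = tf.reduce_sum(weights * h_enc, axis=1)              # context vector
        return context, weights

# Usage with dummy tensors:
att = BahdanauStyleAttention()
ctx, w = att(tf.random.normal((2, 128)), tf.random.normal((2, 63, 256)))
print(ctx.shape, w.shape)  # (2, 256) (2, 63, 1)
```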
5.21% on the ADOCR test datasets that were synthetically generated with the Visual Geez and Power Geez fonts respectively. We also carried out another experiment on the printed Amharic text-line images from the ADOCR test dataset and achieved a promising result with a CER of 2.54%. Sample text-line images that were wrongly recognized during evaluation of our Amharic OCR model are depicted in Figure 7.2. Characters marked by colored boxes are wrong predictions (deletion, substitution, or insertion errors). For example, the characters marked by blue boxes, such as b and r from the first and second text-line images respectively, are sample deleted characters. Other characters, such as n and ° from the second and fourth predicted texts, marked with green boxes, are insertion errors, while the character × in the third predicted text, marked by a red box, is one of the substitution errors recorded during experimentation.
The other type of error that usually affects the recognition performance of our model is characters missing either in the ground truth or in the text-line image itself. For example, the character m, marked by a violet box in the third text-line image's ground truth, is one of the characters missed during text-line image generation. Even though all visible characters in the text-line image are predicted correctly, since the CER is computed between the ground-truth texts and the predicted texts, such cases are still counted as deletion errors. Besides the wrongly predicted text-line images, the fifth text-line image in Figure 7.2 is a sample correctly predicted text-line image from the ADOCR test sets.
7.1.5 Conclusions
In chapter 6 of this thesis, CTC-based networks were proposed for the recognition of Amharic text-line images (see sections 6.1 and 6.2). In addition, in section 7.1 of this chapter, Attention-based encoder-decoder networks were introduced for Amharic OCR. In all these attempts, the recognition performance of the proposed approaches was evaluated on the ADOCR test database and promising experimental results were reported. In this section, we present a novel Attention-based approach designed by blending the Attention mechanism directly into the CTC objective function. The proposed model consists of an encoder module, an Attention module, and a transcription module in a unified framework. The new approach and implementation details are presented as follows. Section 7.2.1 reviews related work. The proposed blended Attention-CTC model is presented in section 7.2.2, and section 7.2.3 presents the details of the datasets. The last two subsections present the experimental results and conclusions respectively.
C_{vec}^{t} = \sum_{i=1}^{T_x} s_i^t h_i^{enc} \qquad (7.5)

s_i = \frac{\exp(a_i)}{\sum_{k=1}^{T_x} \exp(a_k)} \qquad (7.6)

where a_i is the attention score of h_i^{enc} at each time step t, and it can be computed using Equation (7.7).
[Figure 7.3: The blended Attention-CTC network. CNN layers extract features x_1, ..., x_{T_x} from the input text-image; encoder LSTMs produce the states h_i^{enc}; fully connected layers and softmax-1 yield the attention scores s_i used to form the context vector C_{vec}^t = \sum_i s_i^t h_i^{enc} of Equation (7.5); and softmax-2 produces the per-time-step label probabilities for the CTC transcription.]
The attention scores are computed with a feed-forward function that lets the model learn the alignment weights together with the translation while training all the model layers.
7.2.3 Database
To train and evaluate the performance of the proposed model, we use the ADOCR database introduced in [BHL+19a]. The original Amharic OCR database is composed of 337,337 Amharic text-line images, each with multiple word instances, collected from different sources. Sample Amharic text-line images from the ADOCR database and their detailed descriptions are given in chapter 4, section 4.3.2.
7.2.4 Experiments
Our model is trained with the ADOCR database. As in section 6.2, images are scaled to 32 by 128 pixels so as to minimize computation. Since there is no explicitly designated validation data in the ADOCR dataset, the same selection mechanism is applied and 7% of the training samples are randomly selected as validation samples. The blended Attention-CTC network is formulated by directly taking advantage of the Attention mechanism and the CTC network in an integrated framework and is trained in an end-to-end fashion.
During training of this model, we use two bidirectional LSTMs, each with 128 hidden units and a dropout rate of 0.25, on top of seven serially stacked convolutional layers with ReLU activation as an encoder. The decoder LSTM presented in section 7.1 is removed, and the CTC objective function blended with the Attention mechanism is used in its place. The seven convolutional layers of the encoder module were already trained in section 6.2, and the same weights are reused in this section via transfer learning. A rough sketch of the blended design is given below.
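The following is a rough sketch of the blended Attention-CTC idea in Equations (7.5)-(7.6): attention over the encoder states forms per-time-step context vectors that enrich the features, and the resulting sequence is transcribed with the usual CTC objective instead of an autoregressive decoder. This is a simplified reading of the architecture, not the exact thesis implementation; in particular, the built-in dot-product scoring of layers.Attention stands in for the FC-based scoring shown in Figure 7.3.

```python
from tensorflow.keras import layers, Model

T_x, enc_units, n_symbols = 63, 256, 281     # illustrative sizes

h_enc = layers.Input(shape=(T_x, enc_units))  # encoder output (CNN + BiLSTM states)
# Per-time-step context vectors C_vec^t = sum_i s_i^t h_i (Equation 7.5).
context = layers.Attention()([h_enc, h_enc])
blended = layers.Concatenate()([h_enc, context])          # attention-enriched features
y_pred = layers.Dense(n_symbols, activation="softmax")(blended)

model = Model(h_enc, y_pred)
# The model is then trained with the CTC loss exactly as in section 6.2,
# e.g. via K.ctc_batch_cost on (labels, y_pred, input_length, label_length).
```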
Figure 7.4: The training and validation losses of models trained with different network settings. (a) CTC loss of the LSTM-CTC model. (b) CTC loss of the CNN-LSTM-CTC model. (c) CE loss of the Attention-based encoder-decoder model. (d) CTC loss of the proposed blended Attention-CTC model.
We use a batch size of 128 with the Adam optimizer and train for 15 epochs. Once the label probabilities are obtained from the trained model, we use best path decoding [Gra08] to generate the character (C_i) that has the maximum score at each time step t (see Equation (6.1)), as sketched below. The learning losses of the proposed model and of other models trained using different network settings are depicted in Figure 7.4.
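A small sketch of best path decoding [Gra08] follows: take the arg-max symbol at each time step, then collapse repeated symbols and drop the blank. The blank is assumed here to be the last class index, as in the Keras CTC convention.

```python
import numpy as np

def best_path_decode(probs: np.ndarray, blank: int) -> list:
    """probs: (time_steps, num_classes) softmax output for one text line."""
    best = probs.argmax(axis=-1)           # most likely class per time step
    decoded, prev = [], blank
    for c in best:
        if c != prev and c != blank:       # collapse repeats, skip blanks
            decoded.append(int(c))
        prev = c
    return decoded

probs = np.array([[0.1, 0.8, 0.1],         # toy example: classes {0, 1}, blank = 2
                  [0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8],
                  [0.7, 0.2, 0.1]])
print(best_path_decode(probs, blank=2))    # -> [1, 0]
```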
7.2.5 Results
Figure 7.5: Sample text-line image with the corresponding predicted and ground-truth texts. Characters in the predicted text marked with a red rectangle are wrongly predicted by both the Attention-based Encoder-Decoder (AED) and the Blended Attention-CTC Network (BACN).
the same test dataset. A sample Amharic text-image from the ADOCR test dataset with the corresponding GT text and predictions is depicted in Figure 7.5.
7.2.6 Conclusions
This chapter presents the conclusions of the research work described in this thesis. The contributions and objectives of the thesis, outlined in chapter 1, are reviewed and their achievements are addressed. Factors that limit the efficiency of the present work are also discussed. The following section presents a summary of the most relevant conclusions drawn from this thesis, and, based on the findings of this thesis work, recommendations are made for future research on Amharic script recognition.
8.1 Conclusions
scripts in general, only a few efforts have been reported to date, which is not enough to address the complexities. Amharic, whose speakers exceed 100 million people around the globe, is one of the most underrepresented scripts in the field of NLP. Such a resource-limited but unique script presents various challenges for the development of an efficient OCR system. Some of the challenges are the large number of characters in the writing system and the existence of a large set of visually similar characters. In addition, there are no standards for font type, size, and style, which makes the script more complex to recognize. There are numerous documents in Amharic script that have been collected over centuries. For further analysis of these documents, digitization is critical. This signifies the importance of the research in this thesis.
used in Amharic script. Moreover, none of these datasets are publicly available. Hence, we have explored state-of-the-art OCR tools and techniques ranging from dataset preparation to model development, variants of artificial neural networks ranging from convolutional to recurrent networks, sequence-to-sequence learning, and alignment mechanisms ranging from CTC to the Attention mechanism. Based on the results of empirical observations and the conceptual frameworks from the theoretical foundation, several conclusions can be drawn from this thesis work.
• The shortage of training data can be addressed with artificial data; however, the data should be generated carefully to reflect scanned real-world documents as closely as possible. Experiments performed in this thesis, by training different deep-learning-based OCR models on synthetic data and testing them on real scanned Amharic documents, justify this claim. This thesis presents the procedures for synthetic data generation and real scanned data preparation. Further, the dataset used for Amharic script recognition in this thesis work is now made publicly available at http://www.dfki.uni-kl.de/˜belay/ for free.
• This thesis presented a novel CNN-based Amharic character recognition technique. The approach employed a Factored-CNN network designed based on the grapheme of basic Amharic characters in Fidel-Gebeta and achieved character recognition accuracies significantly better than the performance of existing Amharic character recognition techniques.
• Stacking CNNs before the LSTM network layers as a feature extractor and training the hybrid network with a CTC objective function in an end-to-end fashion improves recognition performance over the LSTM-CTC-based network model.
• Attention-embedded RNNs from neural machine translation tasks were adopted and employed for Amharic text-line image recognition. The recognition performance approached that of CTC-based sequence alignment.
• Integrating the Attention mechanism and the CTC objective function makes the sequence-to-sequence learning network perform better than employing either of them separately.
Experimental results show the effectiveness of our approaches for the recognition of Amharic script. The majority of the observed recognition errors are due to improper alignment between text-line images and the ground truth, while other errors are due to deformed and/or missing characters at the beginning and/or end of the text-line images introduced during data generation from the given texts. In general, the strategies followed in this thesis are promising for the recognition of large digitized Amharic document images. Therefore, the present approaches could be extended, redesigned, or complemented by new approaches for the OCR of historical and handwritten Amharic documents. Tasks that are not addressed in this thesis and possibilities for extending the current work are provided in the next section.
To advance the OCR of Amharic script, various deep learning-based schemes are proposed in this thesis. Based on the findings and progression of the current work, we recommend the following research directions as possible future work toward designing an efficient OCR system and the digitization of the Amharic language in general.
• In the present work, an attempt is made to develop OCR models and introduce a database for the recognition of Amharic script. It pushes the technology of Amharic script one step towards digitization. To reduce the barriers of this language in the digital world, we need appropriate data that can be used to analyze the linguistics of Amharic. Neural machine translation and speech recognition need notable attention in the future since they have not yet been sufficiently investigated.
• The database introduced in this thesis considers only the two commonly used fonts of Amharic script. Even though a large Amharic database was created with those fonts, there are other fonts, rarely used in the Amharic writing system, that are not included in the training set, which may affect the recognition performance of the OCR model. To facilitate OCR research and develop a versatile OCR for Amharic script, we need to develop a large database of Amharic texts with different font types and styles.
• The OCR models proposed in this thesis are based on various state-of-the-art deep learning-based techniques. The recognition performance of these models is evaluated using test datasets from the ADOCR database. In this regard, we have achieved promising results. However, compared to the centuries-old Amharic document collections, which contain images, tables, unknown fonts, and non-standard scripting, the ADOCR database covers only a limited portion of the script's variability.
Appendix A
This part of the thesis is an appendix that presents the details of the Attention mechanism and the internal workings of Attention techniques. Various architectures of encoder-decoder networks that have been commonly used for sequence-to-sequence tasks, including NMT and OCR, are also described in detail. Besides, it gives the necessary mathematical details of Attention-based networks and the basic parameters of the Attention mechanism that one needs to tune when training Attention-based encoder-decoder networks.
The concept of Attention in neural processing was first studied extensively in Neuroscience and Computational Neuroscience [MZ17, IKN98], and consequently there have been many attempts to replicate it in Machine Learning [SVL14, LPM15]. In these fields, a particular aspect of study is visual attention: many animals focus on specific parts of their visual inputs to compute adequate responses. This principle has a large impact on neural computation, as we need to select the most pertinent piece of information rather than using all available information, a large part of which is irrelevant for computing the neural response. A neural network can be considered an effort to mimic human brain actions in a simplified manner, and the Attention mechanism is likewise an attempt to implement the action of selectively concentrating on a few relevant things while ignoring others in deep neural networks.
A similar idea, focusing on specific parts of the input, has been followed in deep learning for speech recognition, neural machine translation, visual question answering, and text-image recognition. Attention mechanisms are among the most exciting recent advancements in deep learning, and they appear to be here to stay.
Figure A.1: General setting of the Attention model. The initial state s_0 is the beginning of the generated sequence, the h_i are the representations of the parts of the input sequence, and the output is a representation of the filtered sequence, with a filter putting the focus on the part relevant to the sequence currently being generated.
As illustrated in Figure A.1, the generic setting of the Attention model takes a sequence of n arguments h_1, ..., h_n (in the preceding examples, the h_i would be the hidden states of the encoder network) and a state s_0. It returns a vector c which is a summary of the h_i, focusing on the information linked to the state s_0. More formally, it returns a weighted arithmetic mean of the h_i, with the weights chosen according to the relevance of each h_i given the state s_0.
The Attention model depicted in Figure A.1 looks like a black box and does not clearly show how the summarized information c is computed; thus the whole structure of the Attention model needs to be discussed in detail, and the model should be redrawn by expanding the black-box design.
where the s_i are the softmax of the f_i projected on a learned direction, with \sum_{i=1}^{n} s_i = 1. The output c is the weighted arithmetic mean of all the h_i, where the weights represent the relevance of each variable according to the initial hidden state s_0, computed as

c = \sum_{i=1}^{n} h_i s_i \qquad (A.3)
In the seq2seq architecture, both the encoder and decoder are recurrent neural networks, usually composed of either LSTM or GRU units [CVMG+14].
The hidden state and cell state (h_0 and c_0) of the encoder are initialized randomly with small values, or they can be zero. The dimensions of both states should be the same as the number of units in the LSTM cell. The encoder reads the English sentence word by word and stores the final internal states of the LSTM, an intermediate vector, generated after the last time step. Since the output is generated only once the entire sequence has been read, the outputs of the encoder at the intermediate time steps are of no use and are therefore discarded. The final states of the encoder (in this case h_8 and c_8) carry the relevant information of the whole English sentence ”Learning is the process of gaining new understanding”.
The decoder works in a very similar fashion to the encoder; however, during the training and testing phases the decoder behaves differently, unlike the encoder, which behaves the same in both cases. The start and end of the sequence for the decoder are marked with <SOS> and <EOS> respectively.
During training, the initial states (h_0 and c_0) of the decoder are set to the final states of the encoder (h_8 and c_8). The decoder is trained to generate the output based on the information gathered by the encoder, taking <SOS> as the first input to generate the first Amharic word, and finally the decoder learns to predict <EOS>. The input at each time step is the actual ground-truth output, a scheme usually called the teacher forcing technique. The loss, however, is calculated on the predicted outputs from each time step, and the errors are backpropagated through time to update the parameters of the model.
Unlike the training phase, which uses the actual output as input at each time step, the test phase uses the predicted output produced at each time step as input for the next time step. The final states of the decoder are discarded in both the training and testing stages, as they are of no further use once the output has been obtained. A compact sketch of this setup follows.
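Below is a compact Keras sketch of the seq2seq training setup described above: the encoder's final states (h, c) initialize the decoder, and the decoder is trained with teacher forcing (the ground-truth token at each step is fed as input). Vocabulary sizes and dimensions are placeholders.

```python
from tensorflow.keras import layers, Model

src_vocab, tgt_vocab, dim = 1000, 1000, 256   # illustrative sizes

enc_in = layers.Input(shape=(None,))
enc_emb = layers.Embedding(src_vocab, dim)(enc_in)
_, h, c = layers.LSTM(dim, return_state=True)(enc_emb)   # keep only the final states

dec_in = layers.Input(shape=(None,))                      # <SOS> + target tokens
dec_emb = layers.Embedding(tgt_vocab, dim)(dec_in)
dec_out, _, _ = layers.LSTM(dim, return_sequences=True,
                            return_state=True)(dec_emb, initial_state=[h, c])
y = layers.Dense(tgt_vocab, activation="softmax")(dec_out)

model = Model([enc_in, dec_in], y)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# At test time, tokens are generated one at a time: the predicted token from
# step t is fed back as the decoder input at step t+1 until <EOS> appears.
```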
A critical and apparent disadvantage of the fixed-length context vector created in the encoding process of the seq2seq model is its incapability of remembering long sentences: often it has forgotten the first part once it completes processing the whole input. The Attention mechanism was introduced to solve this issue. Even though the first Attention mechanism was introduced by Bahdanau [BCB14], nowadays there are various types of Attention mechanisms proposed by many researchers in the deep learning community. The underlying principles are the same in all the types of Attention mechanisms proposed so far; the differences lie mainly in their architectures and computations.
Attention can be categorized as general and self-attention, hard and soft attention, and global and local attention. General attention manages and quantifies the interdependence between the input and output elements [BCB14], whereas self-attention captures the interdependency within the input elements, relating different positions of a single sequence to compute a representation of the same sequence [CDL16].
The type of Attention that attends to the entire input state space and uses all the encoder hidden states is known as global attention. In contrast, local attention attends to part of the input sequence or a patch of an image and uses only a subset of the encoder hidden states [LPM15, XBK+15]. Considering neural machine translation tasks, global attention has the drawback that it has to attend to all words on the source side for each target word, which is expensive and can potentially render it impractical for translating longer sequences.
[Figure: Bahdanau-style Attention architecture. Encoder GRU hidden states are scored against the decoder hidden state in the Attention layer; the scores are soft-maxed, the weighted states are summed into a context vector, and the context vector is concatenated with the decoder input at each GRU decoding step.]
2. Initialize the decoder hidden state: The initial hidden state of the decoder network is a modified vector of the last hidden state of the backward encoder GRU, or we can also initialize the decoder hidden states randomly as we did in the encoder network.
5. Compute the context vector: The encoder hidden states and their respective soft-maxed alignment scores are multiplied, and the results are summed up to form the context vector.
6. Decode the output: The context vector is concatenated with the previous decoder output and fed into the decoder for that time step, along with the previous decoder hidden state, to produce a new output. The process from steps 3 to 6 is repeated for each time step of the decoder until the specified maximum length of the output is reached.
2. Initialize the decoder hidden states: Once the final hidden states of the encoder unit are produced, the last hidden states of both encoder LSTMs are passed to the decoder unit as its initial hidden states.
4. Compute alignment scores: Using the new decoder hidden state and the encoder hidden states, alignment scores are computed. In this case, Luong Attention uses three types of scoring functions that use the encoder outputs and the decoder hidden state produced in the previous step to calculate the alignment scores, given below.
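For reference, the three scoring functions of Luong Attention, in their standard form from [LPM15], are as follows, where h_t is the current decoder hidden state, \bar{h}_s an encoder hidden state, and W_a and v_a learned parameters:

score(h_t, \bar{h}_s) = h_t^{\top} \bar{h}_s \qquad \text{(dot)}

score(h_t, \bar{h}_s) = h_t^{\top} W_a \bar{h}_s \qquad \text{(general)}

score(h_t, \bar{h}_s) = v_a^{\top} \tanh(W_a [h_t ; \bar{h}_s]) \qquad \text{(concat)}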
[Figure: Luong-style Attention architecture. Encoder LSTM hidden states are scored against the current decoder hidden state in the Attention layer; the soft-maxed scores weight the encoder states to form a context vector, which is concatenated with the decoder hidden state and passed through a feed-forward (FC) layer to produce the output.]
5. Softmax the alignment scores: The alignment scores for each encoder hidden state are passed through a softmax function.
6. Compute the context vector: To compute the context vector, the encoder hidden states and their corresponding alignment scores are multiplied and then summed up.
7. Produce the output: The concatenated context vector and the decoder hidden state generated in step 3 are passed through a feed-forward neural network to produce the new final output.
Steps 3 to 8 are repeated until the end of the maximum sequence length.
The Attention mechanism proposed by Luong differs from the one presented by Bahdanau in several respects. For example, the attention calculation in Bahdanau requires the output of the decoder from the prior time step, while Luong Attention uses the output of the encoder and decoder for the current time step only. Besides, Luong uses a reversed input sequence instead of bidirectional inputs, LSTM instead of GRU elements, and, unlike Bahdanau's Attention, Luong uses a dropout rate.
As with all neural network-based models, there are various network parameters that one needs to tune so that Attention-based networks perform well on a given task. All parameters of the other modules (the encoder and decoder modules) should be selected and tuned as usual; in this section, however, in addition to the tunable parameters used in those modules, the configuration of the Attention module is also described.
Parameters for the encoder and decoder networks: The number of hidden layers, the number of LSTM cells, and even the activation function of an individual layer are important parameters that need to be properly selected and tuned. However, we also need to consider the effect of employing more layers, since the learning capability of a network increases as it learns more compact representations, but more layers may also make training more difficult. We also need to properly select the type of Attention (either soft or hard Attention).

Bibliography
[AB09] Yaregal Assabie and Josef Bigun. Hmm-based handwritten amharic word
recognition with feature concatenation. In 2009 10th International Conference
on Document Analysis and Recognition, pages 961–965. IEEE, 2009. 38, 62, 63
[AB11] Yaregal Assabie and Josef Bigun. Offline handwritten amharic word recogni-
tion. Pattern Recognition Letters, 32(8):1089–1099, 2011. 72
[AHB+ 18] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson,
Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image
captioning and visual question answering. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition, pages 6077–6086, 2018.
27
[Ale97] Worku Alemu. The application of ocr techniques to the amharic script. Master’s
thesis, School of Information Science, Addis Ababa University, Addis Ababa,
1997. 32, 38, 62, 63, 72, 73
[All95] May Allam. Segmentation versus segmentation-free for recognizing arabic text.
In Document Recognition II, volume 2422, pages 228–235. International Soci-
ety for Optics and Photonics, 1995. 17
[ALT18] Direselign Addis, Chuan-Ming Liu, and Van-Dai Ta. Printed ethiopic script
recognition by using lstm networks. In 2018 International Conference on Sys-
tem Science and Engineering (ICSSE), pages 1–6. IEEE, 2018. 38, 87, 98
[AMT+ 18] Solomon Teferra Abate, Michael Melese, Martha Yifiru Tachbelie, Million
Meshesha, Solomon Atinafu, Wondwossen Mulugeta, Yaregal Assabie, Hafte
Abera, Binyam Ephrem Seyoum, Tewodros Abebe, et al. Parallel corpora for
bi-lingual english-ethiopian languages statistical machine translation. In Pro-
ceedings of the 27th International Conference on Computational Linguistics,
pages 3102–3111, 2018. 4
[Aro85] Mark Aronoff. Orthography and linguistic theory: The syntactic basis of ma-
soretic hebrew punctuation. Language, pages 28–72, 1985. 32
[Asn12] Biniam Asnake. Retrieval from real-life amharic document images. Master’s
thesis, School of Information Science, Addis Ababa University, 2012. 31
[AT17] Akm Ashiquzzaman and Abdul Kawsar Tushar. Handwritten arabic numeral
recognition using deep learning neural networks. In 2017 IEEE International
Conference on Imaging, Vision & Pattern Recognition (icIVPR), pages 1–4.
IEEE, 2017. 54, 55
[BCB14] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural ma-
chine translation by jointly learning to align and translate. arXiv preprint
arXiv:1409.0473, 2014. 22, 23, 24, 26, 90, 92, 116, 117, 119
[BCFX14] Jinfeng Bai, Zhineng Chen, Bailan Feng, and Bo Xu. Image character recogni-
tion using deep convolutional neural network learned from different languages.
In 2014 IEEE International Conference on Image Processing (ICIP), pages
2560–2564. IEEE, 2014. 54, 55
[BHL+ 19b] Birhanu Belay, Tewodros Habtegebrial, Marcus Liwicki, Gebeyehu Belay, and
Didier Stricker. Factored convolutional neural network for amharic character
image recognition. In 2019 IEEE International Conference on Image Process-
ing (ICIP), pages 2906–2910. IEEE, 2019. 19, 32, 72, 98
[BHM+ 20] Birhanu Belay, Tewodros Habtegebrial, Million Meshesha, Marcus Liwicki,
Gebeyehu Belay, and Didier Stricker. Amharic ocr: An end-to-end learning.
Applied Sciences, 10(3):1117, 2020. 32, 91, 93, 98
[BHS18] Birhanu Hailu Belay, Tewodros Amberbir Habtegebrial, and Didier Stricker.
Amharic character image recognition. In 2018 IEEE 18th International Con-
ference on Communication Technology (ICCT), pages 1179–1182. IEEE, 2018.
62, 67, 72, 98
[Blu16] Théodore Bluche. Joint line segmentation and transcription for end-to-end
handwritten paragraph recognition. In Advances in Neural Information Pro-
cessing Systems, pages 838–846, 2016. 92
[Bre08] Thomas M Breuel. The ocropus open source ocr system. In Document Recog-
nition and Retrieval XV, volume 6815, page 68150F. International Society for
Optics and Photonics, 2008. 3, 44, 55, 56
[BSS17] Erik Bochinski, Tobias Senst, and Thomas Sikora. Hyper-parameter optimiza-
tion for convolutional neural network committees based on evolutionary algo-
rithms. In 2017 IEEE International Conference on Image Processing (ICIP),
pages 3924–3928. IEEE, 2017. 19
[BT14] Henry S Baird and Karl Tombre. The evolution of document image analy-
sis. In Handbook of document image processing and recognition, pages 63–71.
Springer, 2014. 1
[BUHAAS13] Thomas M Breuel, Adnan Ul-Hasan, Mayce Ali Al-Azawi, and Faisal Shafait.
High-performance ocr for printed english and fraktur using lstm networks. In
2013 12th International Conference on Document Analysis and Recognition,
pages 683–687. IEEE, 2013. 54, 55
[CAC+ 04] Luke Cole, David Austin, Lance Cole, et al. Visual object recognition using
template matching. In Australian conference on robotics and automation, 2004.
16
[CCB15] Kyunghyun Cho, Aaron Courville, and Yoshua Bengio. Describing multimedia
content using attention-based encoder-decoder networks. IEEE Transactions
on Multimedia, 17(11):1875–1886, 2015. 90
[CDL16] Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-
networks for machine reading. arXiv preprint arXiv:1601.06733, 2016. 117
[CdSBJB+ 06] Paulo Rodrigo Cavalin, Alceu de Souza Britto Jr, Flávio Bortolozzi, Robert
Sabourin, and Luiz E Soares Oliveira. An implicit segmentation-based method
for recognition of handwritten strings of characters. In Proceedings of the 2006
ACM Symposium on Applied Computing (SAC), 2006.
[CH03] John Cowell and Fiaz Hussain. Amharic character recognition using a fast
signature based algorithm. In Information Visualization, 2003. IV 2003. Pro-
ceedings. 7th International Conference on, pages 384–389. IEEE, 2003. 14, 72,
98
[CL96] Richard G Casey and Eric Lecolinet. A survey of methods and strategies in
character segmentation. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 18(7):690–706, 1996. 14
[CPT+ 19] J. Cai, L. Peng, Y. Tang, C. Liu, and P. Li. Th-gan: Generative adversarial net-
work based transfer learning for historical chinese character recognition. In
2019 International Conference on Document Analysis and Recognition (IC-
DAR), pages 178–183. IEEE, Sep. 2019. 3
[CRA13] Amit Choudhary, Rahul Rishi, and Savita Ahlawat. A new character segmen-
tation approach for off-line cursive handwritten words. Procedia Computer
Science, 17:88–95, 2013. 15
[CV18] Arindam Chowdhury and Lovekesh Vig. An efficient end-to-end neural model
for handwritten text recognition. arXiv preprint arXiv:1807.07965, 2018. 27, 98
[CVMG+ 14] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau,
Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase repre-
sentations using rnn encoder-decoder for statistical machine translation. arXiv
preprint arXiv:1406.1078, 2014. 116
[DCBC15] Long Duong, Trevor Cohn, Steven Bird, and Paul Cook. Low resource depen-
dency parsing: Cross-lingual parameter sharing in a neural network parser. In
Proceedings of the 53rd Annual Meeting of the Association for Computational
Linguistics and the 7th International Joint Conference on Natural Language
Processing (Volume 2: Short Papers), pages 845–850, 2015. 63
[Dem11] Fitsum Demissie. Developing optical character recognition for ethiopic scripts,
2011. 60
[DHK13] Li Deng, Geoffrey Hinton, and Brian Kingsbury. New types of deep neural net-
work learning for speech recognition and related applications: An overview. In
2013 IEEE international conference on acoustics, speech and signal process-
ing, pages 8599–8603. IEEE, 2013. 63
[DJH01] O De Jesús and Martin T Hagan. Backpropagation through time for a general
class of recurrent network. In IJCNN’01. International Joint Conference on
Neural Networks. Proceedings (Cat. No. 01CH37222), volume 4, pages 2638–
2643. IEEE, 2001. 19
[DLY+ 19] Amit Das, Jinyu Li, Guoli Ye, Rui Zhao, and Yifan Gong. Advancing acoustic-
to-word ctc model with attention and mixed-units. IEEE/ACM Transactions on
Audio, Speech, and Language Processing, 27(12):1880–1892, 2019. 90
[DT+ 14] David Doermann, Karl Tombre, et al. Handbook of document image processing
and recognition. Springer, 2014. 1
[DZN16] Patrick Doetsch, Albert Zeyer, and Hermann Ney. Bidirectional decoder
networks for attention-based end-to-end offline handwriting recognition. In
2016 15th International Conference on Frontiers in Handwriting Recognition
(ICFHR), pages 361–366. IEEE, 2016. 92
[EF20] David M. Eberhard, Gary F. Simons, and Charles D. Fennig. Ethnologue: Lan-
guages of the World. SIL International, Dallas, Texas, United States, twenty-
third edition, 2020. 29
[Eik93] Line Eikvil. Optical character recognition. citeseer.ist.psu.edu/142042.html,
1993. 14, 16
[ETK16] Mohamed Elleuch, Najiba Tagougui, and Monji Kherallah. A novel architec-
ture of cnn based on svm classifier for recognising arabic handwritten script.
International Journal of Intelligent Systems Technologies and Applications,
15(4):323–340, 2016. 62
[GDS10] Debashis Ghosh, Tulika Dube, and Adamane Shivaprasad. Script recogni-
tion—a review. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 32(12):2142–2161, 2010. 54, 55
[GFGS06] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhu-
ber. Connectionist temporal classification: labelling unsegmented sequence
data with recurrent neural networks. In Proceedings of the 23rd international
conference on Machine learning, pages 369–376, 2006. 20, 21, 74, 84
[GLB+ 08] Alex Graves, Marcus Liwicki, Horst Bunke, Jürgen Schmidhuber, and Santi-
ago Fernández. Unconstrained on-line handwriting recognition with recurrent
neural networks. In Advances in neural information processing systems, pages
577–584, 2008. 72, 90, 98, 99
[GMH13] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recog-
nition with deep recurrent neural networks. In 2013 IEEE international con-
ference on acoustics, speech and signal processing, pages 6645–6649. IEEE,
2013. 19
[Gra08] Alex Graves. Supervised sequence labelling with recurrent neural networks.
Ph. D. Dissertation, Technical University of Munich, Germany, 2008. 102
[GSR15] Shivansh Gaur, Siddhant Sonkar, and Partha Pratim Roy. Generation of syn-
thetic training data for handwritten indic script recognition. In 2015 13th Inter-
national Conference on Document Analysis and Recognition (ICDAR), pages
491–495. IEEE, 2015. 72
[GSTBJ19] Mesay Samuel Gondere, Lars Schmidt-Thieme, Abiot Sinamo Boltena, and
Hadi Samer Jomaa. Handwritten amharic character recognition using a convo-
lutional neural network. arXiv preprint arXiv:1909.12943, 2019. 38, 72
[GVK19] Rajib Ghosh, Chirumavila Vamshi, and Prabhat Kumar. Rnn based online hand-
written word recognition in devanagari and bengali scripts using horizontal zon-
ing. Pattern Recognition, 92:203–218, 2019. 81, 91
[GVZ16] Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. Synthetic data for
text localisation in natural images. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 2315–2324, 2016. 42
[HA13] Abraham Hailu and Yaregal Assabie. Itemsets-based amharic document cate-
gorization using an extended a priori algorithm. In Language and Technology
Conference, pages 317–326. Springer, 2013. 32
[HA17] Seid Ali Hassen and Yaregal Assabie. Recognition of double sided amharic
braille documents. International Journal of Image, Graphics and Signal Pro-
cessing, 9(4):1, 2017. 62
[Hag15] Tesfahun Hagos. Ocr system for the recognition of ethiopic real life documents.
Master’s thesis, Bahir Dar Institute of Technology, 2015. 3, 55
[Han17] Awni Hannun. Sequence modeling with ctc. Distill, 2(11):e8, 2017. 21
[Her82] Herbert F. Schantz. The history of ocr, optical character recognition. Manchester
Center, VT: Recognition Technologies Users Association, 1982. 1
[HKR15] Samer Hijazi, Rishi Kumar, and Chris Rowen. Using convolutional neural net-
works for image recognition. Cadence Design Systems Inc.: San Jose, CA,
USA, pages 1–12, 2015. 81
[HM16] Birhanu Hailu and Million Meshesha. Applying image processing for malt-
barley seed identification. In the 9th Ethiopian ICT Annual Conference, 2016.
18
[HML+ 18] Shizhong Han, Zibo Meng, Zhiyuan Li, James O’Reilly, Jie Cai, Xiaofeng
Wang, and Yan Tong. Optimizing filter size in convolutional neural networks
for facial action unit recognition. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2018.
[HOT06] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algo-
rithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006. 18
[HZMJ15] Meijun He, Shuye Zhang, Huiyun Mao, and Lianwen Jin. Recognition confi-
dence analysis of handwritten chinese character with cnn. In 2015 13th Inter-
national Conference on Document Analysis and Recognition (ICDAR), pages
61–65. IEEE, 2015. 55
[HZRS16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learn-
ing for image recognition. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 770–778, 2016. 18, 72
[IFD+ 19] R. R. Ingle, Y. Fujii, T. Deselaers, J. Baccash, and A. C. Popat. A scalable hand-
written text recognition system. In 2019 International Conference on Document
Analysis and Recognition (ICDAR), pages 17–24. IEEE, Sep. 2019. 3
[IKN98] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual
attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 20(11):1254–1259, 1998. 113
[Jam07] William James. The principles of psychology, volume 1. Cosimo, Inc., 2007. 22
[KJEY20] Mohamed Ibn Khedher, Houda Jmila, and Mounim A El-Yacoubi. Automatic
processing of historical arabic documents: A comprehensive survey. Pattern
Recognition, 100:107144, 2020. 2, 54
[KSBH16] Pavel Kisilev, Eli Sason, Ella Barkan, and Sharbell Hashoul. Medical image
captioning: learning to describe medical image findings using multi-task-loss
cnn. In Deep Learning and Data Labeling for Medical Applications: proceedings
of the International Conference on Medical Image Computing and Computer-
Assisted Intervention, 2016. 63
[KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classifica-
tion with deep convolutional neural networks. In F. Pereira, C. J. C. Burges,
L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Pro-
cessing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012. 18
[LBBH98] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-
based learning applied to document recognition. Proceedings of the IEEE,
86(11):2278–2324, 1998. 18, 43
[LFZ+ 15] Tianyi Liu, Shuangsang Fang, Yuehui Zhao, Peng Wang, and Jun Zhang.
Implementation of training convolutional neural networks. arXiv preprint
arXiv:1506.01195, 2015. 18
[LGF+ 07] Marcus Liwicki, Alex Graves, Santiago Fernández, Horst Bunke, and Jürgen
Schmidhuber. A novel approach to on-line handwriting recognition based on
bidirectional long short-term memory networks. In Proceedings of the 9th In-
ternational Conference on Document Analysis and Recognition, ICDAR 2007,
2007. 19, 20, 72
[LNNN17] Nam-Tuan Ly, Cuong-Tuan Nguyen, Kha-Cong Nguyen, and Masaki Naka-
gawa. Deep convolutional recurrent network for segmentation-free offline hand-
written japanese text recognition. In 2017 14th IAPR International Conference
on Document Analysis and Recognition (ICDAR), volume 7, pages 5–9. IEEE,
2017. 72, 91, 98
[LO16] Chen-Yu Lee and Simon Osindero. Recursive recurrent nets with attention mod-
eling for ocr in the wild. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 2231–2239, 2016. 98
[LTD+ 17] Linghui Li, Sheng Tang, Lixi Deng, Yongdong Zhang, and Qi Tian. Image
caption with global-local attention. In Proceedings of the thirty-first AAAI con-
ference on artificial intelligence, pages 4133–4139, 2017. 27
[LV12] Hong Lee and Brijesh Verma. Binary segmentation algorithm for english cur-
sive handwriting recognition. Pattern Recognition, 45(4):1306–1317, 2012. 14
[MB02] U-V Marti and Horst Bunke. The iam-database: an english sentence database
for offline handwriting recognition. International Journal on Document Analy-
sis and Recognition, 5(1):39–46, 2002. 42
[MBP15] Durjoy Sen Maitra, Ujjwal Bhattacharya, and Swapan K Parui. Cnn based com-
mon approach to handwritten character recognition of multiple scripts. In Doc-
ument Analysis and Recognition (ICDAR), 2015 13th International Conference
on, pages 1021–1025. IEEE, 2015. 62
[Mes08] Million Meshesha. Recognition and retrieval from document image collections.
Ph. D. Dissertation, International Institute of Information Technology (IIIT),
2008. 3, 5, 14, 29, 31, 37, 54, 62, 63
[Mey06] Ronny Meyer. Amharic as lingua franca in ethiopia. Lissan: Journal of African
Languages and Linguistics, 20(1/2):117–132, 2006. 3, 30
[MM18] Getahun Tadesse Mekuria. Amharic text document summarization using parser.
International Journal of Pure and Applied Mathematics, 118(24), 2018. 4, 30
[MSLB98] John Makhoul, Richard Schwartz, Christopher Lapre, and Issam Bazzi. A
script-independent methodology for optical character recognition. Pattern
Recognition, 31(9):1285–1294, 1998. 18
[MZ17] Tirin Moore and Marc Zirnsak. Neural mechanisms of selective visual attention.
Annual review of psychology, 68:47–72, 2017. 113
[Nag92] George Nagy. At the frontiers of ocr. Proceedings of the IEEE, 80(7):1093–
1100, 1992. 17
[Nag00] George Nagy. Twenty years of document image analysis in pami. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 22(1):38–62, 2000. 14
[NAL] The ethiopian national archive and library agency. http://www.nala.
gov.et/home. Accessed on 10 February 2021. 5
[NWC+ 11] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and An-
drew Y Ng. Reading digits in natural images with unsupervised feature learn-
ing. In NIPS workshop on deep learning and unsupervised feature learning,
page 5, 2011. 43
[Pal05] Mahesh Pal. Random forest classifier for remote sensing classification. Inter-
national journal of remote sensing, 26(1):217–222, 2005. 18
[PC00] U Pal and BB Chaudhuri. Automatic recognition of unconstrained off-line
bangla handwritten numerals. In International Conference on Multimodal In-
terfaces, pages 371–378. Springer, 2000. 31
[PC04] U Pal and BB Chaudhuri. Indian script character recognition: a survey. Pattern
Recognition, 37(9):1887–1899, 2004. 14
[pre] The ethiopian press agency. Will amharic be au’s lingua franca? https://
www.press.et/english/?p=2654#l. Accessed on 10 November 2019.
30
[PV17] Jason Poulos and Rafael Valle. Character-based handwritten text transcription
with attention networks. arXiv preprint arXiv:1712.04046, 2017. 27, 98
[PZLL20] Xushan Peng, Xiaoming Zhang, Yongping Li, and Bangquan Liu. Research
on image feature extraction and retrieval algorithms based on convolutional
neural network. Journal of Visual Communication and Image Representation,
69:102705, 2020. 19
[RMS09] Amjad Rehman, Dzulkifli Mohamad, and Ghazali Sulong. Implicit vs ex-
plicit based script segmentation and recognition: a performance comparison
on benchmark database. Int. J. Open Problems Compt. Math, 2(3):352–364,
2009. 15
[RRB18] Betselot Yewulu Reta, Dhara Rana, and Gayatri Viral Bhalerao. Amharic hand-
written character recognition using combined features and support vector ma-
chine. In 2018 2nd International Conference on Trends in Electronics and In-
formatics (ICOEI), pages 265–270. IEEE, 2018. 38, 62, 63, 72, 73
[RSWP18] Christian Reul, Uwe Springmann, Christoph Wick, and Frank Puppe. State of
the art optical character recognition of 19th century fraktur scripts using open
source engines. arXiv preprint arXiv:1810.03436, 2018. 73
[RTAS20] António H Ribeiro, Koen Tiels, Luis A Aguirre, and Thomas Schön. Beyond
exploding and vanishing gradients: analysing rnn training using attractors and
smoothness. In International Conference on Artificial Intelligence and Statis-
tics, pages 2370–2380, 2020. 19
[SA15] Maad Shatnawi and Sherief Abdallah. Improving handwritten arabic character
recognition by modeling human handwriting distortions. ACM Transactions on
Asian and Low-Resource Language Information Processing (TALLIP), 15(1):1–
12, 2015. 56
[SG97] Zhixin Shi and Venu Govindaraju. Segmentation and recognition of connected
handwritten numeral strings. Pattern Recognition, 30(9):1501–1504, 1997. 14
[SIK+ 09] Fouad Slimane, Rolf Ingold, Slim Kanoun, Adel M Alimi, and Jean Hennebert.
A new arabic printed text image database and evaluation protocols. In 2009
10th International Conference on Document Analysis and Recognition, pages
946–950. IEEE, 2009. 42
[SJS13] P Sibi, S Allwyn Jones, and P Siddarth. Analysis of different activation func-
tions using back propagation neural networks. Journal of theoretical and ap-
plied information technology, 47(3):1264–1268, 2013. 18
[SLJ+ 15] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed,
Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabi-
novich. Going deeper with convolutions. In Proceedings of the IEEE confer-
ence on computer vision and pattern recognition, pages 1–9, 2015. 18
[SS13] Nazly Sabbour and Faisal Shafait. A segmentation-free approach to arabic
and urdu ocr. In Document Recognition and Retrieval XX, volume 8658, page
86580N. International Society for Optics and Photonics, 2013. 42
[SSSS17] Shai Shalev-Shwartz, Ohad Shamir, and Shaked Shammah. Failures of
gradient-based deep learning. In Proceedings of the 34th International Con-
ference on Machine Learning-Volume 70, pages 3067–3075. JMLR.org, 2017.
86
[TT18] Zhiqiang Tong and Gouhei Tanaka. Reservoir computing with untrained con-
volutional neural networks for image recognition. In 2018 24th International
Conference on Pattern Recognition (ICPR), pages 1289–1294. IEEE, 2018. 19
[UH16] Adnan Ul-Hasan. Generic text recognition using long short-term memory net-
works. Ph. D. Dissertation, Technical University of Kaiserslautern, 2016. 17
[UHAR+ 13] Adnan Ul-Hasan, Saad Bin Ahmed, Faisal Rashid, Faisal Shafait, and
Thomas M Breuel. Offline printed urdu nastaleeq script recognition with bidi-
rectional lstm networks. In 2013 12th International Conference on Document
Analysis and Recognition, pages 1061–1065. IEEE, 2013. 73
[VMN+ 16] Andreas Veit, Tomas Matera, Lukas Neumann, Jiri Matas, and Serge Belongie.
Coco-text: Dataset and benchmark for text detection and recognition in natural
images. arXiv preprint arXiv:1601.07140, 2016. 43
[Wer90] Paul J Werbos. Backpropagation through time: what it does and how to do it.
Proceedings of the IEEE, 78(10):1550–1560, 1990. 19
[WHK+ 17] Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R Hershey, and Tomoki
Hayashi. Hybrid ctc/attention architecture for end-to-end speech recognition.
IEEE Journal of Selected Topics in Signal Processing, 11(8):1240–1253, 2017.
90
[Wio06] Anais Wion. The national archives and library of ethiopia: six years of ethio-
french cooperation (2001-2006). 2006. 4
[WSW+ 17] Qi Wu, Chunhua Shen, Peng Wang, Anthony Dick, and Anton van den Hen-
gel. Image captioning and visual question answering based on attributes and
external knowledge. IEEE Transactions on Pattern Analysis and Machine Intel-
ligence, 40(6):1367–1381, 2017. 20
[WYCL17] Yi-Chao Wu, Fei Yin, Zhuo Chen, and Cheng-Lin Liu. Handwritten chinese
text recognition using separable multi-dimensional recurrent neural network. In
2017 14th IAPR International Conference on Document Analysis and Recogni-
tion (ICDAR), volume 1, pages 79–84. IEEE, 2017. 72, 98
[XBK+ 15] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan
Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural
image caption generation with visual attention. In International conference on
machine learning, pages 2048–2057, 2015. 23, 24, 117, 118, 119
[YBJL12] Aiquan Yuan, Gang Bai, Lijing Jiao, and Yajie Liu. Offline handwritten en-
glish character recognition based on convolutional neural network. In 2012 10th
IAPR International Workshop on Document Analysis Systems, pages 125–129.
IEEE, 2012. 54, 55, 91
[YNDT18] Rikiya Yamashita, Mizuho Nishio, Richard Kinh Gian Do, and Kaori Togashi.
Convolutional neural networks: an overview and application in radiology. In-
sights into imaging, 9(4):611–629, 2018. 19
[YZX+ 18] Min Yang, Wei Zhao, Wei Xu, Yabing Feng, Zhou Zhao, Xiaojun Chen, and Kai
Lei. Multitask learning for cross-domain image captioning. IEEE Transactions
on Multimedia, 2018. 63
[ZCW10] Shusen Zhou, Qingcai Chen, and Xiaolong Wang. Hit-or3c: an opening recog-
nition corpus for chinese characters. In Proceedings of the 9th IAPR Interna-
tional Workshop on Document Analysis Systems, pages 223–230, 2010. 43
[ZDD18] Jianshu Zhang, Jun Du, and Lirong Dai. Track, attend, and parse (tap): An end-
to-end framework for online handwritten mathematical expression recognition.
IEEE Transactions on Multimedia, 21(1):221–233, 2018. 20, 27, 98
[ZHZ17] Haifeng Zhao, Yong Hu, and Jinxia Zhang. Character recognition via a compact
convolutional neural network. In 2017 International conference on digital im-
age computing: techniques and applications (DICTA), pages 1–6. IEEE, 2017.
55, 56
[ZLLT14] Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. Facial land-
mark detection by deep multi-task learning. In European Conference on Com-
puter Vision, pages 94–108. Springer, 2014. 63
[ZY17] Yu Zhang and Qiang Yang. A survey on multi-task learning. arXiv preprint
arXiv:1707.08114, 2017. 64
[ZYDS17] Geoffrey Zweig, Chengzhu Yu, Jasha Droppo, and Andreas Stolcke. Advances
in all-neural speech recognition. In 2017 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), pages 4805–4809. IEEE,
2017. 20
Curriculum Vitae
Work Experience
Education
October 2017 – Now: PhD in Computer Science, TU Kaiserslautern, Germany
Thesis: “Deep Learning for Amharic Text-Image Recognition: Algorithm, Dataset
and Application”
Journal articles
Conference papers