Handwritten Nepali Character Recognition and Narration System Using Deep CNN

TRIBHUVAN UNIVERSITY
INSTITUTE OF ENGINEERING
PURWANCHAL CAMPUS
HANDWRITTEN NEPALI CHARACTER RECOGNITION AND

NARRATION SYSTEM USING DEEP CNN
By:
Anish Baral (39803)
Diwash Khanal (39812)
Prabin Baskota (39820)
Pradeep Dahal (39821)
A PROJECT WAS SUBMITTED TO THE DEPARTMENT OF ELECTRONICS AND

COMPUTER ENGINEERING IN PARTIAL FULLFILLMENT OF THE
REQUIREMENT FOR THE BACHELOR’S DEGREE IN COMPUTER ENGINEERING
DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING

DHARAN, NEPAL
November, 2019
COPYRIGHT
The author has agreed that the Library, Department of Electronics and Computer
Engineering, Purwanchal Campus, Institute of Engineering may make this report freely
available for inspection. Moreover, the author has agreed that permission for extensive
copying of this project report for scholarly purpose may be granted by the supervisors who
supervised the project work recorded herein or, in their absence, by the Head of the
Department wherein the project report was done. It is understood that the recognition will be
given to the author of this report and to the Department of Electronics and Computer
Engineering, Purwanchal Campus, Institute of Engineering in any use of the material of this
project report. Copying or publication or the other use of this report for financial gain without
approval of to the Department of Electronics and Computer Engineering, Purwanchal
Campus, Institute of Engineering and author's written permission is prohibited.
Request for permission to copy or to make any other use of the material in this report in
whole or in part should be addressed to:
Head
Department of Electronics and Computer Engineering
Purwanchal Campus, Institute of Engineering
Dharan, Nepal
iii
ACKNOWLEDGEMENT
First of all, we are very grateful to the Department of Electronics and Computer Engineering,
IOE Purwanchal Campus for providing us the opportunity to undertake this project by
including it as a part of curriculum in B.E. Computer Engineering.
We are extremely thankful to Er. Bimal Ghimire, our Project Supervisor for providing us with
his invaluable guidance and support throughout every phase of the project development. His
useful suggestions and continuous motivation are sincerely acknowledged.
Also, we express the sincere thanks to our colleagues and respected seniors for their positive
advices and support and to all those who helped us directly or indirectly to complete this
project on time.
-Anish Baral (39803)
-Diwash Khanal (39812)
-Prabin Baskota (39820)
-Pradeep Dahal (39821)
iv
ABSTRACT
This report represents the details on a project entitled “HANDWRITTEN NEPALI
CHARACTER RECOGNITION AND NARRATION SYSTEM USING DEEP CNN” as a
part of the curriculum for the final year project of B.E. in Computer Engineering. The report
discusses various fundamentals and an implementation technique to build the offline system
of handwritten Nepali digit and character recognition and presents the results of the system
being implemented.
Our project mainly focuses on recognizing the digits and characters in an image. So, basically
this project is a way to digitize the information in an image for the purpose of convenient
retrieval and efficient processing of data. Dataset Preparation, Image processing of a certain
level and a Convolutional Neural Network as a classifier are the three main areas, in which our
project is completely relying on. The prime focus of this project is firstly to design a
convolutional neural network with suitable parameters and train it with a dataset of our own.
Finally, predicting the classes in an input image by processing the image into a desired format
and then feeding it into the trained neural network. The project also compares two of the most
widely used optimizers used in image classification ie. Adam and NAdam.
Keywords: Devanagari Handwritten Character Dataset; Image processing; Computer

Vision; Deep learning; Deep Convolutional Neural Network; Optical Character
Recognition; Adam; NAdam
v
TABLE OF CONTENTS
COPYRIGHT ................................................................................................................................. iii
ACKNOWLEDGEMENT .............................................................................................................iv
ABSTRACT ..................................................................................................................................... v
TABLE OF CONTENTS ...............................................................................................................vi
LIST OF FIGURES ..................................................................................................................... viii
LIST OF TABLES ..........................................................................................................................ix
LIST OF ABBREVIATIONS ........................................................................................................ x
1. INTRODUCTION ....................................................................................................................... 1
1.1Background: ........................................................................................................................... 1
1.2 Problem Statement ............................................................................................................... 1
1.3 Objectives: ............................................................................................................................. 2
1.3.1 Project objectives ......................................................................................................... 2
1.3.2 Academic objectives .................................................................................................... 2
1.4 Scope of work: ...................................................................................................................... 2
2. LITERATURE REVIEW ........................................................................................................... 4
2.1 History of Handwriting Recognition ................................................................................. 4

2.2 Different approaches on Handwriting Recognition: ........................................................ 5
2.2.1 Gradient Based Learning Approach .......................................................................... 5
2.2.2 De Campos et. al. (2009) ............................................................................................ 6
2.2.3 Diagonal Based Feature Extraction ........................................................................... 7
2.3 Handwritten Digits and Character Recognition System: ................................................ 7
2.3.1 Image Acquisition ........................................................................................................ 8
2.3.2 Image Preprocessing .................................................................................................... 9
2.3.3 Image Segmentation .................................................................................................. 10
2.3.4 Feature Extraction ...................................................................................................... 10
2.3.5 Classification .............................................................................................................. 11
3. METHODOLOGY .................................................................................................................... 14
3.1 General Description: .......................................................................................................... 14

3.2 System Design: ................................................................................................................... 15
3.3 Dataset Preparation ............................................................................................................ 20
3.3.1 Data Collection ........................................................................................................... 20
3.3.2 Training and Test Data Generation.......................................................................... 20
vi
3.4 Image Processing ................................................................................................................ 21
3.4.1 Image acquisition ....................................................................................................... 21
3.4.2 Image Preprocessing .................................................................................................. 22
3.4.3 Segmentation .............................................................................................................. 24
3.4.4 Feature Extraction ...................................................................................................... 25
3.5 Training Convolution Neural Network ........................................................................... 25
3.5.1 CNN Architecture ...................................................................................................... 25
3.5.2 Training ....................................................................................................................... 28
3.6 Recognition ......................................................................................................................... 31
4. IMPLEMENTATION ............................................................................................................... 32
4.1 Software Development Life Cycle ................................................................................... 32

4.2 Requirement Analysis ........................................................................................................ 33
4.2.1 Functional Requirements .......................................................................................... 33
4.2.2 Non-Functional Requirements ................................................................................. 34
4.3 Feasibility Study ................................................................................................................. 34
4.3.1 Technical Feasibility: ................................................................................................ 34
4.3.2 Economic Feasibility ................................................................................................. 34
4.3.3 Schedule Feasibility ................................................................................................... 35
4.4 Tools and Technologies ..................................................................................................... 35
4.4.1 Python and Text Editor: ............................................................................................ 35
4.4.2 NumPy ......................................................................................................................... 36
4.4.3 OpenCV2 .................................................................................................................... 37
4.4.4 Tensorflow .................................................................................................................. 37
4.4.5 Keras ............................................................................................................................ 38
4.4.6 Flask ............................................................................................................................. 38
4.4.6 PlaySound ................................................................................................................... 38
4.5 GUI Development .............................................................................................................. 38
5. RESULT ..................................................................................................................................... 40
6. OUTPUT ..................................................................................................................................... 46
7. CONCLUSION .......................................................................................................................... 49
8. LIMITATIONS AND FURTHER WORK: ........................................................................... 50
8.1 Limitations .......................................................................................................................... 50

8.2 Future Enhancements ......................................................................................................... 50
9. REFERENCES........................................................................................................................... 51
vii
LIST OF FIGURES
Figure 2.1: LeNet-5 Architecture ............................................................................................................5
Figure 2.2: System Pipeline View ...........................................................................................................8
\
Figure 2.3: Digitization of Image ......................................................................................................... 9
Figure 3.1: Schematic diagram of the total system ......................................................................... 15
Figure 3.2: Use-Case Diagram for Dataset Preparation Module ................................................. 16
Figure 3.3: Use Case For Image Processing Module...................................................................... 16
Figure 3.4: Use-Case Diagram For CNN Training ......................................................................... 17
Figure 3.5: Use-Case Diagram for Recognition Module ............................................................... 17
Figure 3.6: Use Case Diagram for Total System ............................................................................. 18
Figure 3.7: Level-1 DFD for Image Processing ............................................................................... 18
Figure 3.8: Level 1 DFD for training Convolutional Neural Network ...................................... 19
Figure 3.9: Level-0 DFD for recognition module............................................................................ 19
Figure 3.10: Level-0 DFD of Complete System .............................................................................. 20
Figure 3.11: Sample of image loaded for image processing ........................................................ 21
Figure 3.12: Histogram plot of sample image loaded..................................................................... 22
Figure 3.13: Gray scaled sample image data..................................................................................... 23
Figure 3.14: Histogram of Gray scaled sample image ................................................................... 23
Figure 3.15: Histogram plot of Otsu-binarized sample data image ............................................ 24
Figure 3.16: CNN Architecture used ................................................................................................... 25
Figure 3.17: Convolutional Layer ........................................................................................................ 26
Figure 3.18: Pooling layer and Feature mapping ............................................................................. 26
Figure 3.19: Dense and Drop Out Mechanism ................................................................................. 27
Figure 4.1: Incremental Software Development Model................................................................. 32
Figure 4.2: Graphical User Interface using Flask ............................................................................ 39
Figure 5.1: Snapshot of loading data loading before initiating training .................................... 41
Figure 5.2: Snapshot of Training with Adam optimizer ................................................................ 42
Figure 5.3: Snapshot of Training with NAdam optimizer............................................................. 42
Figure 5.4: Model Summary after saving model ............................................................................. 42
Figure 5.5: Plot between model accuracy and epochs with Adam optimizer .......................... 43
Figure5.6: Plot between model loss function and epochs with Adam optimizer ................... 43
viii
Figure 5.7: Plot between model accuracy and epochs with NAdam optimizer ...................... 44
Figure 5.8: Plot between model loss function and epochs with Nadam optimizer ................ 44
Figure 6.1: Image loaded for recognition ........................................................................................... 46
Figure 6.2: Recognized Image Snapshot ............................................................................................ 46
LIST OF TABLES
Table 5.1: System Specification Table ............................................................................. 40
Table 5.2: Training Configuration Table .......................................................................... 41
ix
LIST OF ABBREVIATIONS
AI: Artificial Intelligence
ANN: Artificial Neural Network
CNN: Convolutional Neural Network
CPU: Central Processing Unit
DFD: Data Flow Diagram
GPU: Graphical Processing Unit
GUI: Graphical User Interface
HCR: Handwritten Character Recognition
OCR: Optical Character Recognition
PCR: Printed Character Recognition
ReLU: Rectified Linear Unit
SGD: Stochastic Gradient Descent
Adam: Adaptive Moment Estimation
NAdam: Nesterov-accelerated Adaptive Moment Estimation
x
1. INTRODUCTION
1.1Background:
Artificial Intelligence (AI) has acquired the interest of many researchers these days. This is
a field with the vast areas of study and application. Among which machine learning is an
important branch of AI. Many researchers even believe that without machine learning we
could never mimic the exact working of human brain. Although till the date, AI is still
considered to be an evolving field of computer science, there are already a lot of successful
implementations made. Among which handwriting recognition is a famous classification
problem of machine learning. Handwriting recognition has gained a lot of attention in the
field of pattern recognition and machine learning due to its application in various fields.
Character classification is an important part in many computer vision problems like Optical
character recognition, license Plate recognition, etc. It mainly deals with training thousands
of images of a certain pattern and predict the output of test data given afterwards. The
accuracy rate of learning depends mainly upon numbers of data trained and selection of
learning algorithm.
Development of a recognition system is an emerging need for digitizing handwritten Nepali
documents that use Devanagari characters. Optical Character Recognition systems are least
explored for Devanagari characters. The proposed system will implement Deep CNN for
training Nepali (Devanagari) characters and integers. Introduction of multilayer perceptron
network has been a milestone in many classification tasks in computer vision. But,
performance of such a network has always been greatly dependent on the selection of good
representing features. Deep Neural Networks on the other hand do not require any feature to
be explicitly defined, instead they work on the raw pixel data generating the best features
and using them to classify the inputs into different classes. The proposed system is expected
to recognize the given test image and narrate the pronunciation of recognized characters
based on trained data.
1.2 Problem Statement
Even though there has been much advancement in the field of Character Recognition and
OCR, there are not many systems that can interpret handwritten Devanagari Characters
easily. Also, for visually impaired people it is difficult to read Devanagari text as
implementation of Braille Script in Devanagari is difficult and may be costly to implement.
The Proposed system is expected to eliminate above problems in a small scale by
implementing Devanagari OCR and Text-To-Speech in the same system.
1
1.3 Objectives:
1.3.1 Project objectives
The prime objective of the project being proposed is to design and build a system that a
basically recognizes and narrates the pronunciation of the digits and characters in an image.
The typical objectives are listed below:
  implement two different optimizers for image classification and to carry out the
comparison between the performance of the two optimizers.
 To make use of domain specific model and algorithm in field of handwriting
recognition.
 To develop a system that can recognize digits and characters in an image.
 To understand the basics of image processing.
 To get knowledge on various handwriting recognition approaches.
1.3.2 Academic objectives
Academically, the project is primarily focused on fulfilling the discipline of an engineering

student as a computer engineer working on a project and gain experience as a team
throughout the different phases of a project. Some Typical academic objectives of the project
are:
 To fulfill the requirements of the major project of bachelor’s in computer
engineering.
 To design and complete a functional project that integrates various course concepts.
 To develop various skills related to project management like teamwork, resource
management, documentation and time management.
 To get hands-on experience of working in a project as a teamwork.
 To learn about and become familiar with the professional engineering practices.
1.4 Scope of work:
Analyzing our current project work, it’s not professional by any means to deploy it as a
complete system for actual application. However, the idea of this project, do have several
scopes which are discussed below:
2
Narration System for blind people: For blind people, in the absence of Braille, it is a
great problem to read written texts. Our system can be solution to this problem due to
recognized character narration system which will read aloud the characters recognized.
Teaching Nepali Language to Kids: A fun way to teach kids about their mother tongue
can be narrating the pronunciation of characters/words along with the structure. This can
help kids to learn both structure and pronunciation of Nepali characters.
Writer Recognition System: It is the system to identifying the author of a given text. It
may look like a little different problem but can be solved with a similar kind of
implementation and could be of great use to the government investigations department.
Handwritten Address Interpretation System: This system has already been
implemented successfully in USA for the first time, to interpret the postal address with the
help of Zip codes and street numbers written in the envelope.
Bank- Check Processing System: Verification of account numbers and signatures in a
cheque can be automated by using the similar concept of machine learning as that could be
a professional system to be used.
3
2. LITERATURE REVIEW
2.1 History of Handwriting Recognition
An early notable attempt in the area of character recognition research is by Grimsdale in

1959. The origin of a great deal of research work in the early sixties was based on an
approach known as analysis-by-synthesis method suggested by Eden in 1968. The great
importance of Eden's work was that he formally proved that all handwritten characters are
formed by a finite number of schematic features, a point that was implicitly included in
previous works. This notion was later used in all methods in syntactic (structural) approaches
of character recognition [1].
However, the Hubel-Wiesel experiment in 1962 was a breakthrough to understand the

working of visual cortex in brain. The classic experiment is the fundamental to our
understanding of how neurons along the visual pathway extract increasingly complex
information from the pattern of light cast on the retina to construct an image. For one, they
showed that there is a topographical map in the visual cortex that represents the visual field,
where nearby cells process information from nearby visual fields. Moreover, their work
determined that neurons in the visual cortex are arranged in a precise architecture. Cells with
similar functions are organized into columns, tiny computational machines that relay
information to a higher region of the brain, where a visual image is formed. In all, their work
revealed how visual cortical neurons encoded image features, the fundamental properties of
objects that help us build our perception of the world around us [2].
This inspired the advancement in the existing artificial neural networks and a deep learning
network called Convolutional Neural Network (CNN) which was proposed by Yann LeCun and
his team in 1998 [3]. CNNs do take a biological inspiration from the visual cortex. The visual
cortex has small regions of cells that are sensitive to specific regions of the visual field. This
idea was expanded upon by a fascinating experiment by Hubel and Wiesel where they showed
that some individual neuronal cells in the brain responded (or fired) only in the presence of
edges of a certain orientation. For example, some neurons fired when exposed to vertical
edges and some when shown horizontal or diagonal edges. Hubel and Wiesel found out that
all of these neurons were organized in a columnar architecture and that together, they were
able to produce visual perception. This idea of specialized components inside of a system
4
having specific tasks (the neuronal cells in the visual cortex looking for specific
characteristics) is one that machines use as well, and is the basis behind CNNs [2], [3].
For Devanagari characters recognition, some works have been done which includes the
research paper of S. Acharya et.al on 2015 [7] and of A. K. Pant et.al on 2012 [8] which are
the main two building blocks on the field of Devanagari Characters recognition. Also, many
researchers are currently working on advancements of it.
2.2 Different approaches on Handwriting Recognition:
2.2.1 Gradient Based Learning Approach
Multilayer neural network trained with backpropagation algorithm and Stochastic Gradient
Descent (SGD) as learning algorithm is considered to be the one of the highly successful
method for handwritten character recognition [3]. This is the original inspiration and basis
for the improvements in accuracy we made to the character dataset. The paper describes the
process they used to achieve up to a 99.1% accuracy on the MNIST dataset, using both a 3-
layer CNN and a 5-layer network that failed to outperform the former. Additionally, the
paper discusses the problem of segmentation, and how it cannot be decoupled from the
recognition of isolated characters. Essentially, making the decision to segment an image with
multiple characters before the recognition of individual characters is not optimal, as the
process should be parallelized to test multiple hypotheses at the same time. In terms of
feature extraction, LeCun et. al. [3] argues that humans cannot possibly capture all relevant
information from images and even coming close requires expert knowledge on the subject,
so instead they resorted to utilizing Gradient-Based Learning to learn the useful features for
the images.
Figure 2.1: LeNet-5 Architecture
5
Convolutional networks combine three architectural ideas to ensure some degree of shift,
scale, and distortion invariance: local receptive fields, shared weights (or weight replication),
and spatial or temporal sub-sampling. Figure above shows the architecture of LeNet-5, a
convolutional neural network proposed by LeCun et al. in 1998 [3], which was used
commercially for reading bank checks. A convolutional layer is composed of several feature
maps with different weight vectors so that multiple features can be extracted at each location.
The receptive fields of contiguous units in a feature map are centered on correspondingly
contiguous units in the previous layer. Therefore, receptive fields of neighboring unit
overlap. All the units in a feature map share the same set of weights and the same bias so
they detect the same feature at all possible locations on the input. A sequential
implementation of a feature map would scan the input image with a single unit that has a
local receptive field and store the states of this unit at corresponding locations in the feature
map. This operation is equivalent to a convolution, followed by an additive bias and
squashing function, hence the name convolutional network.
An interesting property of convolutional layers is that if the input image is shifted, the feature
map output will be shifted by the same amount, but if not shifted the feature map output will be
left unchanged. This property is at the basis of the strength of convolutional networks to shifts
and distortions of the input. A large degree of invariance to geometric transformations of the
input can be achieved with this progressive reduction of spatial resolution compensated by a
progressive increase of the richness of the representation (the number of feature maps). The
convolution/subsampling combination, inspired by Hubel and Wiesel’s notions of "simple" and
"complex" cells [2], was implemented in Fukushima's Noncognition [4], though no globally
supervised learning procedure such as backpropagation was available then.
Starting with LeNet-5 [3], CNN have typically had a standard structure-stacked
convolutional layers (optionally followed by contrast normalization and max-pooling) are
followed by one or more fully connected layers. All the weights are [4] learned with
backpropagation. Variants of this basic design are prevalent in the image classification
literature and have yielded the best results to-date on MNIST and most notably on the
ImageNet classification challenge [5]. For larger datasets such as ImageNet [5], the recent
trend has been to increase the number of layers and layer size, while using dropout to address
the problem of overfitting.
2.2.2 De Campos et. al. (2009)
This paper introduces the Chars74K dataset that were used to train and evaluate the
convolutional network. Some of the features that De Campos et. al. use to train their
6
classifiers include geometric blur, spin image, as well as affine transformations. With a
maximum accuracy of 55.25% achieved by the paper, we can immediately see that the
problem of character recognition in addition to simply digits is a much harder problem that
potentially requires a different approach [6].
2.2.3 Diagonal Based Feature Extraction
In [9], diagonal feature extraction has been proposed for offline character recognition. It is
based on Artificial Neural Network (ANN) model. Two approaches using 54 features and
69 features are chosen to build this ANN based recognition system. To compare the
recognition efficiency of the proposed diagonal method of feature extraction, the neural
network recognition system is trained using the horizontal and vertical feature extraction
methods. It is found that the diagonal method of feature extraction yields the recognition
accuracy of 97.8 % for 54 features and 98.5% for 69 features.
From reviewing literature, there are three main strategies for segmentation, plus numerous hybrid
approaches. These three approaches are dissection, recognition based, and holistic methods.
Dissection is the process of cutting the image into meaningful components and passed
individually into our classifier. When using dissection, the segmentation becomes the most
crucial step in the recognition process. By far the most common approach used by researchers,
this approach tries to find the presence of ligatures (interconnections between characters) and
cut the word image through these ligatures. Recognition based segmentation searches the image
for components that match a predetermined alphabet. These approaches take advantage of the
Hidden Markov Model, bypassing the need for complex dissection algorithms. This approach
has also been called “segmentation-free” recognition.
2.3 Handwritten Digits and Character Recognition System:
The idea behind handwriting recognition is to provide a means to digitize the text in an image
into ASCII text. There exist many approaches to achieve this goal. The most simple and easy
technique is to build a model for the digits and characters that needs to be recognized.
Success in handwriting recognition depends on extracting and modeling the large variety of
datasets dependent characteristics which can effectively distinguish each digits and
characters from another. The handwriting recognition system may be viewed as collection
of several modules as shown in figure below.
7
Figure 2.2: System Pipeline View
2.3.1 Image Acquisition
The first step in any OCR system is to capture text data and transform it into a digital form.
The recognition systems differ in how they acquire their input. There are two different ways:
online and offline systems.
Online systems are the real-time systems that recognize the text while the user is writing it,
e.g. a digital tablet. The tablet captures the (x, y) coordinates of the pen location while it is
moving. This generates one dimensional vector of these points. This vector depends on the
tablet resolution (points/inch) and the sampling rate (point/second).
The online systems have a high recognition performance where each character is represented
by a vector of points that are sorted by the time factor; time dependent. The user of this
system can see directly the output of the recognition system and verify the results. The
system is limited to recognize handwritten text only. Thus, online systems make use of the
8
digitizers which directly captures writing through the order of the strokes, speed, pen- up
and pen- down information.
Offline systems recognize the text after it has been written or printed on pages. Most
interesting text is already printed in documents or books and the need to convert it into an
electronic media gives a great value to the offline recognition systems. Unlike online
systems, off–line systems do not have information dependent on the time factor. Each page
of text is represented by a two-dimensional array of pixel values. The system may acquire
the input text using scanners. Further additional operations can be performed to enhance the
scanned images. These operations are thresholding and noise elimination as described below
[10].
2.3.2 Image Preprocessing
The raw data, depending on the data acquisition type, is subjected to a number of preliminary
processing steps to make it usable in the descriptive stages of character analysis. In above
figure a set of handwritten character is taken. In preprocessing image is taken and is
converted to gray scale image. The gray scale image is then converted to binary image. This
process is called digitization of image.
Figure 2.3: Digitization of Image
Basically, any scanner is not perfect; the scanned image may have some noise. This noise
may be due to some unnecessary information available in the image. Various steps are
following in preprocessing technique:
Noise removal: Optical scanning devices introduces some noises like, disconnected line
segments, bumps and gaps in lines, filled loops etc. It is necessary to remove all these noise
elements prior to the character recognition been developed by various researchers [11].
Normalization: The main component of the pre-processing stage is normalization, which

attempts to remove some of the variations in the images, which do not affect the identity of
9
the word. Handwritten image normalization from a scanned image includes several steps,
which usually begin with image cleaning, skew correction, line detection, slant and slope
removal and character size normalization [11].
Compression: Space domain techniques are required for compression. Two important
techniques are thresholding and thinning. Thresholding reduces the storage requirements and
increases the speed of processing by converting the gray-scale or color images to binary
image by taking a threshold value. Thinning extracts the shape information of the characters
[11].
2.3.3 Image Segmentation
In the segmentation stage, an image consisting of a sequence of characters is decomposed into

sub-images of individual characters. The main goal is to divide an image into parts that have a
strong correlation with objects or areas of the real world contained in the image.
Segmentation is very important for recognition system. Segmentation is an important stage
because the extent one can reach in separation of words, lines, or characters directly affects
the recognition rate of the script. Image segmentation is the process of assigning a label to
every pixel in an image such that pixels with the same label share certain visual
characteristics. In character recognition techniques, the segmentation is the most important
process. Segmentation is done to make the separation between the individual characters of
an image. The goal of segmentation is to simplify and/or change the representation of an
image into something that is more meaningful and easier to analyze. Image segmentation is
typically used to locate objects and boundaries in images. Script segmentation is done by
executing the following operations: Line segmentation, Word segmentation and Character
segmentation [12].
2.3.4 Feature Extraction
Feature Extraction is the crucial steps for character recognition. The range of accuracy
obtained is dependent on the features we provide to the learning algorithm. Feature
extraction is the method to retrieve the most significant data from the raw data. The main
objective of feature extraction is to extract a set of features, which maximizes the recognition
rate. Feature extraction is the heart of any pattern recognition application. Feature extraction
techniques like Principle Component Analysis (PCA), Linear Discriminant Analysis (LDA),
Independent Component Analysis (ICA), Chain Code (CC), Scale Invariant Feature
Extraction (SIFT), zoning, Gradient based features, Histogram might be applied to extract
the features of individual characters.
10
These features are used to train the system. The features extracted are given as input to the
classification stage and the output of this stage is a recognized character. The selection of
the combination of feature-classifier contributes to the performance of the system. Several
research works have been focusing toward evolving such methods to reduce the processing
time and providing higher recognition.
Feature extraction based on three types of feature:
1.) Statistical Feature
These features are derived from statistical distribution points. They provide high speed and
low complexity and take care of style variation. Zoning, characteristic loci, crossing and
distance are the main statistical features [13].
2.) Structural Features
Structural features are based on topological and geometrical properties of character, such as,
aspect ratio, cross point, loops, branch points, strokes, inflection between two points,
horizontal curve on top or bottom, etc. The representation of this type may also encode some
knowledge about the structure of the object or may provide some knowledge as to what sort
of components make up that object [13].
3.) Global transformation and series expansion
A continuous signal generally contains more information than needs to be represented for
the purpose of classification. This may be true for discrete approximations of continuous
signals as well. One way to represent a signal is by a linear combination of a series of simpler
well-defined functions. The coefficients of the linear combination give a compact encoding
known as transformation or/and series expansion. Deformations like translation and
rotations are invariant under global transformation and series expansion. Gabor
transformation, Fourier transformation and wavelet transformation are common transform
and series expansion method used in character recognition method [13].
2.3.5 Classification
The classification phase is the decision making, part of the recognition system. The
performance of a classifier based on the quality of the features. This stage uses the features
extracted in the previous stage to identify the character. When input image is presented to
HCR system, its features are extracted and given as an input to the trained classifier like
11
artificial neural network or support vector machine. Classifiers compare the input feature
with stored pattern and find out the best matching class for input.
There are four classification method:

a.) Template Matching: The simplest way of character recognition is based on matching
the stored prototypes against the character or word to be recognized. Generally speaking,
matching operation determines the degree of similarity between two vectors (group of pixels,
curvature, shapes etc.) in the feature space. Matching techniques can be studied in three
classes Direct Matching, Deformable Templates and Elastic Matching and Relaxation
Matching [14].
b.) Statistical Techniques: Statistical decision theory is concerned with statistical decision
functions and a set of optimality criteria, which maximizes the likelihood of the observed
pattern given the model of a certain class. The purpose of the statistical methods is to
determine to which category the given pattern belongs. By making observations and
measurement processes, a set of numbers is prepared, which is used to prepare a
measurement vector. Statistical classifiers are automatically trainable. The K-NN rule is a
non-parametric recognition method. This method compares an unknown pattern to a set of
patterns that have been already labeled with class identities in the training stage. A pattern
is identified to be of the class of pattern, to which it has the closest distance. Another
common statistical method is to use Bayesian classification. A Bayesian classifier assigns a
pattern to a class with the maximum a posteriori probability. Besides these methods other
statistical methods are: Quadratic Discriminant Function (QDF), Linear Discriminant
Function (LDF), Euclidean distance, cross correlation, Mahanalobis distance, Regularized
Discriminant Analysis (RDA) [15].
c.) Structural Techniques: The recursive description of a complex pattern in terms of

simpler patterns based on the shape of the object was the initial idea behind the creation of
structural pattern recognition Structural methods are good for classifying handwritten texts.
This type of classifier, classifies the input patterns on the basis of components of the
characters and the relationship among these components. Firstly, the primitives of the
character are identified and then strings of the primitives are checked on the basis of pre-
decided rules. Generally, a character is represented as a production rules structure, whose
left-hand side represents character labels and whose right-hand side represents string of
primitives. The right-hand side of rules is compared to the string of primitives extracted from
a word. So, classifying a character means finding a path to a leaf [15].
12
d.) Neural Networks: Artificial Neural Network is a widely used classifier in the pattern
recognition field. ANN are a non-linear system and may be characterized according to a
particular network topology, characteristics of the artificial networks and learning algorithms
used. A neural network is defined as a computing architecture that consists of massively
parallel interconnection of adaptive 'neural' processors. Because of its parallel nature, it can
perform computations at a higher rate compared to the classical techniques. It can easily
adapt to changes in the data and learn the input signals characteristics, because of its adaptive
nature. A neural network consists of many nodes. The output from one node is fed to another
one in the network and the final decision depends on the complex interaction of all nodes. In
spite of the different underlying principles, it can be shown that most of the neural network
architectures are equivalent to statistical pattern recognition methods. Several approaches
exist for training of neural networks. These include the SGD with backpropagation, adaptive
learning rate method, Hebbian learning methods and so on. They cover binary and
continuous valued input, in addition with supervised and unsupervised learning. On the other
hand, neural network architectures can be classified into two major groups, namely, feed-
forward and feedback (recurrent) networks. The most familiar neural networks used in the
systems are the multilayer perceptron of the feed forward networks, and the Kohonen's Self
Organizing Map (SOM) of the feedback networks. To improve the accuracy of neural
networks in handwritten digits and character recognition, deeper neural networks with
convolving properties are suitably the best classifier. They are the Convolutional Neural
Networks [3], [16].
13
3. METHODOLOGY
3.1 General Description:

The system developed through this project can be viewed with four cases:
1. Dataset Preparation
2. Image Processing
3. CNN Training
4. Recognition
In the first case, the data of handwritten sample images are collected and converted into
trainable dataset. This involves series of image processing steps after collecting the
handwritten samples and then extracting the features from it. Thus, extracted features of each
images are properly labelled and used as dataset to train and validate the network.
In the second case, the user provides the images and the system perform processing to get
segmented digits and characters as output from the provided image. Image processing is used
while preparing dataset and also during the recognition phase. Every real-world images that
are to be recognized are passed through the exact same image processing steps that were
applied to the images while preparing the dataset.
The third case is about CNN training. After the datasets are prepared, the CNN is built using
appropriate algorithms and then trained until we obtain a best hypothesis that can recognize
images with highest accuracy. The system is taught about the various classes and the way
they are represented.
Finally, based in that the fourth case is used by the system to make prediction. The trained
classifier, in this case the CNN is used in a recognition module to get the approximations
about the digits and characters that are present in the image to be recognized.
14
3.2 System Design:
The Schematic Diagram of the developed system is given below:
Figure 3.1: Schematic diagram of the total system
The level-0 Data Flow Diagram (DFD) for overall system and further expanded DFD into
level-1 for different processes are shown below.
15
Figure 3.2: Use-Case Diagram for Dataset Preparation Module
Figure 3.3: Use Case For Image Processing Module
16
Figure 3.4: Use-Case Diagram For CNN Training
Figure 3.5: Use-Case Diagram for Recognition Module
17
Figure 3.6: Use Case Diagram for Total System
Figure 3.7: Level-1 DFD for Image Processing
18
Figure 3.8: Level 1 DFD for training Convolutional Neural Network
Figure 3.9: Level-0 DFD for recognition module
19
Figure 3.10: Level-0 DFD of Complete System
3.3 Dataset Preparation
3.3.1 Data Collection
Data collection in this project refer to obtaining pictures of the handwritten alphabets and
digits. Since, this is the deep learning-based project so, huge amount of training data
incorporating the different varieties of handwriting needed to be collected. Thus, we took the
data samples, ‘Devanagari Handwritten Character Dataset (DHCD)’ collected by [7].
DHCD is a combined dataset comprising 2000 images each of size (32*32) for each
devanagari digits and characters which makes 92000 total images. Around 200 images for
validation and testing were collected by project members. The images are gray scaled and
are in png format.
3.3.2 Training and Test Data Generation
After collecting the image data, the images were forwarded for image processing steps. To
get the better results the training data that is to be fed to the convolutional neural network
must be very precise and clean. Thus, to ensure the precision of data, we applied series of
image processing steps.
Firstly, the raw RGB image data was converted into two-dimensional grayscale image. Then,
a gaussian filter/blur was applied to wipe the noisy or anomalous data in the images. After
applying the gaussian filter the images were then binarized using Ostu’s thresholding
algorithm. Thus, obtained bilevel image was then segmented to get the individual digits or
20
characters image. Finally, each of the images were organized into the specific folders each
representing a class. As per the folders the class labels were assigned to the images present
in the organized folders. These data were used for the training of the network.
The test data were prepared from the handwritten, image samples of we four team members
of this project in a similar manner as training data, using the same dataset creation module.
For this project, total 92000 image data were prepared, 2000 images for each class. Out of
which, 64400 images, 1400 of each class were used as training dataset and 27600 images,
600 of each class were used as test/validation data. Apart from these 92000 images,
hundreds of other images were prepared by project team members for testing purpose.
3.4 Image Processing
3.4.1 Image acquisition
In Image acquisition, the recognition system acquires a scanned image as an input image.
The image should have a specific format such as JPEG, PNG etc. This image is acquired
through a suitable digital input device. However, in this project we have not access to camera
input, so the image input is possible by browsing from file directory.
Figure 3.11: Sample of Image loaded for Image Processing
21
Figure 3.12: Histogram of sample image loaded
3.4.2 Image Preprocessing
The pre-processing is a series of operations performed on the scanned input image. It

essentially enhances the image rendering it suitable for segmentation. The various tasks
performed on the image in pre-processing stages are gray scaling, gaussian filtering and
thresholding the images using otsu-binarization. Binarization process converts a gray scale
image into a binary image using global thresholding technique. All these preprocessing
operations are performed to produce the image suitable for segmentation. Generally, the
process of dilating an image is included in the preprocessing stage but we have used it after
segmentation to the individual images for better result.
Image Preprocessing involves following series of steps:
3.4.2.1 Grayscale Conversion
In this step, BGR (blue, green & red) image is converted to Grayscale image. BGR image
actually consist of 3 images (1 for each channel) varying from 0 to 255 and blended with
alpha channel which contains transparency information. Higher the alpha value, more
opaque will be the image. BGR to Grayscale conversion can be made by:
RGB[A] to Gray: Y←0. 299·R+0.587·G+0.114·B
Figure below shows the output from Grayscale Conversion ie. Gray scaled image for the
sample data loaded.
22
Figure 3.13: Gray scaled sample image
Figure 3.14: Histogram of Gray scaled sample image
The figure shows the histogram of image after grayscale conversion for the input taken
above.
3.4.2.2 Otsu’s Thresholding
Thresholding is the process of converting a grayscale image to binary image. It works on the
principle that replace each pixel with black pixel if it has image intensity I less than some
constant T otherwise replace with white pixel. In this way, you can have only two level of
intensity in your image turning it to binary image.
Otsu's thresholding method involves iterating through all the possible threshold values and
calculating a measure of spread for the pixel levels each side of the threshold, i.e. the pixels
that either fall in foreground or background.
Otsu’s method searches for the threshold that minimizes the intra-class variance (the
variance within the class), defined as a weighted sum of variances of the two classes:
𝜎𝜔2 (𝑡) = 𝜔0 (𝑡)𝜎02 (𝑡) + 𝜔1 (𝑡)𝜎12 (𝑡) (3.1)
23
Where weights ω0 and ω1 are probability of two classes separated by a threshold t, and σ20
and σ12 are variances of two classes
Figure 3.15: Histogram plot of Otsu-binarized sample data image
This figure shows the histogram of the image after Otsu’s thresholding. From this plot we
can see that majority of the pixel have value 0 representing black and few of them are 255
representing white since the image are binarized and inversed to make segmentation and
classification convenient.
3.4.3 Segmentation
In this step, individual images respective to individual digit or alphabet are separated from
the bilevel image containing multiple digits or alphabets. For this first of the regions of
contours are detected in the image then bounding rectangle is drawn respective to each
contour which is cropped out of the image.
After performing all the above given steps, the image is then resized to 32x32 pixel which is
converted to matrix rows and column equal to 28 and is fed into the network. The stated
steps of image processing is used during both the training and the recognition of the
handwritten digits or alphabets.
24
3.4.4 Feature Extraction
As a feature to be fed into the network the gray value of each pixel position is fed in the
network for each bilevel images in the training set.
3.5 Training Convolution Neural Network
Convolution neural network is a class of deep feed-forward artificial neural network. A CNN
consist of one input, one output and multiple hidden layers. The hidden layers are either
convolution, pooling or fully connected.
3.5.1 CNN Architecture
The architecture used in this project consists of similar architecture as shown above. We have
used 32x32 input layer, 64 convolutional layers from 3x3 receptive field, respective number of
pooling layers from 2x2 convolution neuron, 64 convolutional layers, 64 pooling layers, 64
convolutional layers, 64 pooling layers, a 46 neuron fully connected layers and a output layer.
Figure 3.16: CNN Architecture used
3.5.1.1 Convolutional layer
Convolutional layers apply a convolution operation to the input, passing the result to the
next layer. Each convolutional neuron processes data only for its receptive field. During
forward pass a number of filter are activated through those receptive field. Receptive field
25
is small region of neurons which maps to a single convolution neuron. The output from the
filters in Convolutional layer is called feature map.
Figure 3.17: Convolutional Layer
3.5.1.2 Pooling layer
Pooling layers are usually used immediately after convolutional layers. The pooling layer
takes feature map from convolutional layer and prepares condensed feature map. For
instance, each unit in the pooling layer may summarize a region of (say) 2×22×2 neurons in
the previous layer. As a concrete example, one common procedure for pooling is known as
max-pooling. In max pooling, a pooling unit simply outputs the maximum activation in the
2×2 input region, as illustrated in the following diagram:
Figure 3.18: Pooling layer and Feature mapping
26
3.5.1.3 Fully connected Layer
Fully connected layers connect every neuron in one layer to every neuron in another layer.
It is in principle the same as the traditional multi-layer perceptron neural network. It is
usually used before the output layer in the network.
3.5.1.4 Dense and Dropout
Dropout is a regularization technique, which aims to reduce the complexity of the model
with the goal to prevent overfitting. Using “dropout", certain units (neurons) can be
deactivated in a layer with a certain probability p from a Bernoulli distribution (typically
50%). As a consequence, the neural network will learn different, redundant representations;
the network can’t rely on the particular neurons and the combination (or interaction) of these
to be present. Another nice side effect is that training will be faster.
Dense layer is simply a layer where each unit or neuron is connected to each neuron in the
next layer.
Dense and dropout layer help to regularize the network whose working is shown in below
figure:
Figure 3.19: Dense and Drop Out Mechanism
27
3.5.2 Training
The training of CNN is done by using the following parameters:
3.5.2.1 Weight Initialization
Deep CNN has huge amounts of parameters and its loss function is non-convex [15], which
makes it very difficult to train. To achieve a fast convergence in training and avoid the
vanishing gradient problem, a proper network initialization is one of the most important
prerequisites [15]. The bias parameters can be initialized to zero, while the weight parameters
should be initialized carefully to break the symmetry among hidden units of the same layer.
If the network is not properly initialized, e.g., each layer scales its input by k, the final output
will scale the original input by kL where L is the number of layers. In this case, the value of
k > 1 leads to extremely large values of output layers while the value of k
< 1 leads a diminishing output value and gradients. Krizhevsky et al [5] initialize the weights
of their network from a zero-mean Gaussian distribution with standard deviation 0.01 and
set the bias terms of the second, fourth and fifth convolutional layers as well as all the fully-
connected layers to constant one. In the CNN, we initialized the weights from a zero mean
gaussian distribution with standard deviation 1.
3.5.2.2 Loss Function
Loss function, often known as cost function, is a function returning how well the neural
network is doing. Lower its value closer is predicted output to actual output. Here in this
design we have chosen ‘Categorical cross entropy’ as this projects’s cost function. Cross
Entropy between two events p and q over underlying set of events measures the average
number of bits needed to identify an event drawn from the set. Categorical cross entropy is
given the following formula:
1
𝐻(𝑦, ý) = ∑𝑖 𝑦𝑖 𝑙𝑜𝑔 ý = − ∑𝑖 𝑦𝑖 𝑙𝑜𝑔ý (3.2)
28
3.5.2.3 Optimizer
Optimization is the process of finding the set of parameters that helps to minimize the loss
function. For this task of optimization, we have actually taken two optimizers,Adaptive
Moment Estimation (Adam) and Nesterov Adam (NAdam) and compared the results.
Adam
Adam is an adaptive learning rate optimization algorithm that’s been designed specifically
for training deep neural networks. Adam can be looked at as a combination of RMSprop and
Stochastic Gradient Descent with momentum. It uses the squared gradients to scale the
learning rate like RMSprop and it takes advantage of momentum by using moving average
of the gradient instead of gradient itself like SGD with momentum.
Adam is an adaptive learning rate method, which means, it computes individual learning
rates for different parameters. Its name is derived from adaptive moment estimation, and the
reason it’s called that is because Adam uses estimations of first and second moments of
gradient to adapt the learning rate for each weight of the neural network. Here, N-th moment
of a random variable is defined as the expected value of that variable to the power of n. More
formal
𝑚𝑛 = 𝐸[𝑋 𝑛 ] (3.3)
To estimate the moments, Adam utilizes exponentially moving averages, computed on the
gradient evaluated on a current mini batch:
𝑚𝑡 = 𝛽1 𝑚𝑡−1 + (1 − 𝛽1 )𝑔𝑡 (3.4)
𝑣𝑡 = 𝛽2 𝑣𝑡−1 + (1 − 𝛽1 )𝑔2 𝑡 (3.5)

Where m and v are moving averages, g is gradient on current mini-batch, and betas — new
introduced hyper-parameters of the algorithm. They have really good default values of 0.9
and 0.999 respectively. Almost no one ever changes these values. The vectors of moving
averages are initialized with zeros at the first iteration.[18]
Nadam
As discussed above, Adam can be looked at as a combination of RMSprop and Stochastic
Gradient Descent with momentum. While Adam has been widely used as a standard
momentum method optimizer, Nesterov Adam or shortly NAdam can increase accuracy over
even momentum method. The difference between Momentum method and Nesterov
Accelerated Gradient is in gradient computation phase. In Momentum method, the gradient
was computed using current parameters θ𝑡
1
𝑔 = 2 ∑𝑛𝑖=1▽𝜃 Լ(𝑥 𝑖 , 𝑦 𝑖 , 𝜃𝑡 ) (3.6)
29
whereas in Nesterov Accelerated Gradient, we apply the velocity vt to the parameters θ to
compute interim parameters θ̃. We then compute the gradient using the interim parameters
θ̃ = 𝜃𝑡 + 𝛼𝑣𝑡 (3.7)
1
𝑔𝑁𝐴𝐺 = 𝑛 ∑𝑛𝑖=1▽𝜃 Լ(𝑥 𝑖 , 𝑦 𝑖 , 𝜃𝑡 ) (3.8)
NAdam (Nesterov Adam) thus combines Adam and NAG. In order to incorporate NAG into
Adam, we need to modify its momentum term 𝑚𝑡 .
Let us recall the momentum update rule using our current notation:
𝑔𝑡 =▽𝜃𝑡 𝐽(𝜃𝑡 ) (3.9)
𝑚𝑡 = ϒ𝑚𝑡−1 + ɳ𝑔𝑡 (3.10)
𝜃𝑡+1 = 𝜃𝑡 − 𝑚𝑡 (3.11)
where J is our objective function, γ is the momentum decay term, and η is our step size.
Expanding the third equation above yields:
𝜃𝑡+1 = 𝜃𝑡 − (ϒ𝑚𝑡−1 + ɳ𝑔𝑡 ) (3.12)
This demonstrates again that momentum involves taking a step in the direction of the
previous momentum vector and a step in the direction of the current gradient.
NAG then allows us to perform a more accurate step in the gradient direction by updating
the parameters with the momentum step before computing the gradient. We thus only need
to modify the gradient 𝑔𝑡 to arrive at NAG:
𝑔𝑡 =▽𝜃𝑡 𝐽(𝜃𝑡 − ϒ𝑚𝑡−1 ) (3.13)
𝑚𝑡 = ϒ𝑚𝑡−1 + ɳ𝑔𝑡 (3.14)
𝜃𝑡+1 = 𝜃𝑡 − 𝑚𝑡 (3.15)
Here, rather than utilizing the previous momentum vector 𝑚𝑡−1 as in the equation of the
expanded momentum update rule above, we now use the current momentum vector 𝑚𝑡 to
look ahead. In order to add Nesterov momentum to Adam, we can thus similarly replace the
previous momentum vector with the current momentum vector.[19]
3.5.2.4 Activation Function
Activation function is the function that defines the output of the node for the sets of given
input or sets of inputs. In activation function, it is often known as transfer function.
Activation function used here are:
a. ReLU:
A Rectified Linear Unit (ReLU) has output 0 if the input is less than 0, and raw output
otherwise. That is, if the input is greater than 0, the output is equal to the input.
30
f(x)= max(x,0) (3.16)
ReLU is used as activation function for each Convolutional layer used.
b. Softmax:
The softmax function squashes the outputs of each unit to be between 0 and 1, just like a
sigmoid function. But it also divides each output such that the total sum of the outputs is
equal to 1. The output of the softmax function is equivalent to a categorical probability
distribution, it tells you the probability that any of the classes are true. Mathematically, it can
be represented as:
𝑒 𝑧𝑗
𝜎(𝑧)𝑗 = ∑𝑘 𝑧𝑘
(3.17)
𝑘=1 𝑒
Softmax is used as activation for the Fully connected layer.
3.5.2.5 Regularization
As stated above the dropout and the dense layer acts as the regularization body in the
network.
3.5.2.6 Parameters
The CNN has to be fed with different constructive parameters during its initialization. Those
parameters include number of output classes, batch-size and number of epochs to run during
training. Epoch is a single pass through the entire training set, followed by testing of the
verification set. It was set 7 i.e. training was done 7 times passing through entire training set.
The batch size is the number of training samples your training will use in order to make one
update to the model parameters. Ideally you would use all the training samples to calculate
the gradients for every single update, however that is not efficient. The batch size simply
put, will simplify the process of updating the parameters. Batch size was set to 512. And the
number of output classes is the number of classifications that could be made by the system.
It was set 46.
3.6 Recognition
Recognition is final stage of the system. In this stage, we load the saved CNN model and saved
weights into .hdf5 file. Now this loaded model can be used to recognize the input image after
passing image through the image processing phase.
31
4. IMPLEMENTATION
4.1 Software Development Life Cycle
A software development methodology or system development methodology is a framework that

is used to structure, plan, and control the process of developing a software system. It is the
practice of using selected process techniques to improve the quality of a software development
effort. The documented collection of policies, processes and procedures used by a development
team or organization to practice software engineering is called its software development
methodology (SDM) or system development life cycle (SDLC). For the timely
and successful implementation of any project one must follows suitable software
development model. For the purpose of this project we followed an incremental software
building model as this project consists of several functional components to be developed in
an incremental manner. In incremental model, the whole requirement is divided into various
builds. During each iteration, the development module goes through the requirements,
design, implementation and testing phases. Each subsequent release of the module adds
function to the previous release. The process continues till the complete system is ready as
per the requirement. It starts with a simple implementation of a subset of the software
requirements and iteratively enhances the evolving versions until the full system is
implemented. At each iteration, design modifications are made and new functional
capabilities are added. The basic idea behind this method is to develop a system through
repeated cycles and in smaller portions at a time.
Figure 4.1: Incremental Software Development Model
32
In this software development model, major requirements are defined; however, some
functionalities or requested enhancements may evolve with time. Some working
functionality can be developed quickly and early in the life cycle. Results are obtained early
and periodically. Parallel development can be planned. Less costly to change the
scope/requirements. Risk analysis is better, and it supports changing requirements. Partial
systems are built to produce the final system.
By following this model, we focused on building several components of the system in an

incremental basis and finally those components are merged together to form a total
functional system of Handwritten Nepali Character Recognition and Narration System using
Deep CNN.
4.2 Requirement Analysis
Requirements analysis, also called requirements engineering, is the process of determining

user expectations for a new system that is going to be build. These features, called
requirements, must be quantifiable, relevant and detailed. In software engineering, such
requirements are often called functional specifications. Requirements analysis is an
important aspect of project management. Requirements analysis involves frequent
communication with system users to determine specific feature expectations, resolution of
conflict or ambiguity in requirements as demanded by the various users or groups of users,
avoidance of feature creep and documentation of all aspects of the project development
process from start to finish. Energy should be directed towards ensuring that the final system
or product conforms to client needs rather than attempting to mold user expectations to fit
the requirements. Requirements analysis is a team effort that demands a combination of
hardware, software and human factors engineering expertise as well as skills in dealing with
people.
4.2.1 Functional Requirements
Functional requirements explain what has to be done by identifying the necessary task, action
or activity that must be accomplished. In this project the core functional requirements can be
as depicted in below:
1. The system must be able to recognize and narrate the digits and characters in an image
taken from any camera sources.
33
2. The system must be able to run on different cross platforms.
4.2.2 Non-Functional Requirements
Non-functional requirements are requirements that specify criteria that can be used to judge
the operation of a system, rather than specific behaviors. The non-functional requirements
of the project are listed below:
1. The training of the image samples must be effective and efficient.
2. Mechanism of training sample generation must be easy and fast.
3. The integration of all the system modules that compose a system must be easier and
effective.
4.3 Feasibility Study
Image Classification is one of the hottest topic in the current field of technology and science.
Many researches have been carried out in this field from several decades ago to till today and
many of them are still under study to optimize the study. The HCR system have been
utilized in several sectors and have proved their importance in today's technological world.
By undertaking this project our attempt is to make use of image recognition in a simple to
digitize the piece of texts in an image We have undergone through several feasibility studies
to make sure that the project is feasible and be developed. Some study topics are discussed
below:
4.3.1 Technical Feasibility:
HCR have been utilized under several platforms and several development approaches have
been developed. Development of new artificial intelligence and pattern matching models
have made it very simpler for implementation of HCR. Similarly, today's powerful
computing processors and easy data collection software makes it more technically feasible.
4.3.2 Economic Feasibility
The project is economically feasible to begin with as no expensive hardware and software
components are required. Similarly, all the tools and techniques to be used are open source
and are easily available free of cost. Data collection is done among us and other individuals
which is economically feasible.
34
4.3.3 Schedule Feasibility
To develop the project a proper timeline has been projected to complete relevant portion of
the project in scheduled time period. Most of the necessary resources are searched on the
web and are available to begin research in time. Also, all the related software packages are
easily available which makes if more feasible.
4.4 Tools and Technologies
4.4.1 Python and Text Editor:
The entire work of this project is coded in Python programming language. Python is a widely
used high-level, general-purpose, interpreted, dynamic programming language. Its design
philosophy emphasizes code readability, and its syntax allows programmers to express
concepts in fewer lines of code than possible in other high level languages. Python supports
multiple programming paradigms, including object-oriented, imperative and functional
programming or procedural styles. It features a dynamic type system and automatic memory
management and has a large and comprehensive standard library.
The entire project was coded in ‘Sublime Text’ text editor. Sublime Text is a shareware
cross-platform source code editor with a Python application programming interface (API).
It natively supports many programming languages and markup languages, and functions can
be added by users with plugins, typically community-built and maintained under free-
software licenses.
Thus written Python codes were run in python environment under windows operating system
using command line.
Python allows the simple representation of large data matrix and vectors and also supports
external libraries as required. Due to this feature of python, we were able to implement the
algorithm successfully with minimum lines of codes. Thus, the major reasons behind the use
of python programming language for this project are discussed below in.
• Software quality: For many, Python's focus on readability, coherence, and software
quality in general sets it apart from other tools in the scripting world. Python code is designed
to be readable, and hence reusable and maintainable much more so than traditional scripting
languages. The uniformity of Python code makes it easy to understand, even if you did not
write it. In addition, Python has deep support for more advanced software reuse mechanisms,
such as object-oriented programming (OOP) and function programming.
35
• Developer productivity: Python boosts developer productivity many times beyond
compiled or statically typed languages such as C, C++, and Java. Python code is typically
one-third to one-fifth the size of equivalent C++ or Java code. That means there is less to
type, less to debug, and less to maintain after the fact. Python programs also run immediately,
without the lengthy compile and link steps required by some other tools, further boosting
programmer speed.
Program Portability: Most Python programs run unchanged on all major computer
platforms. Porting Python code between Linux and Windows, for example, is usually just a
matter of copying a script's code between machines. Moreover, Python offers multiple
options for coding portable graphical user interfaces, database access programs, web-based
systems, and more. Even operating system interfaces, including program launches and
directory processing are as portable is Python as they can possibly be.
• Support Libraries: Python comes with a large collection of pre-built and portable
functionality, known as the standard library. This library supports an array of application-
level programming tasks, from text pattern matching to network scripting. In addition,
Python can be extended with both homegrown libraries and a vast collection of third-party
application support software. Python 's third-party domain offers tools for website
construction, numeric programming, serial port access, game development, and much. The
NumPy extension, for instance, has been described as a free and more powerful equivalent
to the Matlab numeric programming system.
• Component Integration: Python scripts can easily communicate with other parts of an
application, using a variety of integration mechanisms. Such integrations allow Python to be
used as a product customization and extension tool. Today, Python code can invoke C and
C++ libraries, can be called from C and C++ programs, can integrate with Java and .NET
components, can communicate over frameworks such as COM and Silverlight, can interface
with devices over serial ports, and can interact over networks with interfaces like SOAP,
XML-RPC, and CORBA. It is not a standalone tool.
4.4.2 NumPy
NumPy is the high-performance numeric programming extension for python. It is the core
library for scientific computing in Python. It is a Python library that provides a
multidimensional array object, various derived objects (such as masked arrays and matrices),
and an assortment of routines for fast operations on arrays, including mathematical, logical,
shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra,
36
basic statistical operations, random simulation and much more. It contains among other
things:
• a powerful N-dimensional array object

• sophisticated (broadcasting) functions
• tools for integrating C/C++ and Fortran code
• useful linear algebra, Fourier transform, and random number capabilities
Besides its obvious scientific uses, NumPy can also be used as an efficient multidimensional
container of generic data. Arbitrary datatypes can be defined. This allows NumPy to
seamlessly and speedily integrate with a wide variety of databases.
4.4.3 OpenCV2
OpenCV (Open Source Computer Vision Library) is released under a BSD license and hence
it’s free for both academic and commercial use. It has C++, C, Python and Java interfaces
and supports Windows, Linux, Mac OS, iOS and Android. OpenCV was designed for
computational efficiency and with a strong focus on real-time applications in the field of
computer vision and image processing. It is written in optimized C/C++, the library can take
advantage of multi-core processing and enabled with OpenCL, it can take advantage of the
hardware acceleration of the underlying heterogeneous compute platform. OpenCV supports
the Deep Learning frameworks, TensorFlow, Torch/PyTorch and Caffe. In our project
opencv2 has been used for the various image processing purposes.
4.4.4 Tensorflow
TensorFlow is an open source software library for numerical computation using data flow
graphs. Nodes in the graph represent mathematical operations, while the graph edges
represent the multidimensional data arrays (tensors) communicated between them. The
flexible architecture allows us to deploy computation to one or more Central Processing Unit
(CPU) or Graphical Processing Unit (GPU) in a desktop, server, or mobile device with a
single Application Programming Interface (API). TensorFlow was originally developed by
researchers and engineers working on the Google Brain Team within Google’s Machine
Intelligence research organization for the purposes of conducting machine learning and deep
neural networks research, but the system is general enough to be applicable in a wide variety
of other domains as well.
37
4.4.5 Keras
Keras is a high-level neural networks API, written in Python and capable of running on top
of TensorFlow, CNTK, or Theano. It enables fast experimentation.
4.4.6 Flask
Flask is a lightweight WSGI web application framework. It is designed to make getting

started quick and easy, with the ability to scale up to complex applications. It began as a
simple wrapper around Werkzeug and Jinja and has become one of the most popular Python
web application frameworks.
Flask offers suggestions but doesn’t enforce any dependencies or project layout. It is up to
the developer to choose the tools and libraries they want to use. There are many extensions
provided by the community that make adding new functionality easy.
4.4.6 PlaySound
PlaySound is a module provided by python which plays saved audio files when run. This project uses
PlaySound module for playing the correct pronunciation of recognized handwritten Nepali digits and
characters.
4.5 GUI Development
For the development of a simple user interface we used a Light WSGI web application
framework of Python, Flask. The GUI of this project is simple and easy to understand. There
is a button for user to load the image from where the uploaded character is to be detected.
The loaded image will be shown in the window labeled as Input Image. Then there’s another
button named Recognize, which deploys the model upon input image and displays the output
under the image. Subsequently, the playsound module plays the narration of the recognized
digit or character.
38
Figure 4.2: GUI using Flask
39
5. RESULT
The main target of this project was to create a system that could recognize and narrate
pronunciation of the digits and characters in an image. Among all the processes the most
important task was to obtain a stable model with highest level of accuracy by training the
convolutional neural network with our own training data set.
So, to achieve the objective of this project, we implemented both Adam and NAdam as an
optimizer, categorical cross-entropy function as the loss function in a convolutional neural
network with an architecture of [(Conv => ReLU => maxpooling)*2 => dropout => conv =>
ReLU => dropout => flatten => dense => output].
The training of the developed system was performed on a computer system without running
any other external applications during the entire period of training with the specifications as
mentioned below in the specification table:
Table 5.1: System Specification Table
S.N. Particulars System Specifications
1. Computer Model Dell Inspiron 5577
2.80GHz Intel ® Core™

2. Computer Processor
I7-7700HQ CPU
8.00GB (7.86GB
3. Installed RAM Usable)
64bit Windows 10 Pro

operating system, x64-
Computer System Type
4. based processor
For the training of the system being developed, the parameters required were configured as
mentioned below in the training configuration table:
40
Table 5.2: Training Configuration Table
S.N. Particulars Applied Configurations
1. Number of Epochs 7
2. Batch Size 512
3. Optimizer Adam, Nadam
4. Loss Function Categorical Cross Entropy
5. Learning Rate 0.1
6. Number of Training Image Samples 64400
7. Number of Validation Image Samples 27600
8. Total Classes 46
The snapshots of loading datasets before training, and training completions are
mentioned below:
Figure 5.1: A snapshot of dataset loading before initializing training
41
Figure 5.2: Snapshot of Training with Adam Optimizer
Figure 5.3: Snapshot of Training with NAdam Optimizer

With the data training completed, the summary of the trained model was taken which is as
shown in figure below:
Figure 5.4: Model Summary after saving Model
42
The final analytics of the obtained results is shown below in the plot between model loss
and total number of epochs and also the plot between model accuracy and total number of
epochs, for both Adam and NAdam optimizers for which the network was trained.
Figure 5.5: Plot between model accuracy and epochs with Adam Optimizer
Figure 5.6: Plot between model loss function and epochs with Adam Optimizer
43
Figure 5.7: Plot between model Accuracy and epochs with NAdam Optimizer
Figure 5.8: Plot between model loss function and epochs with NAdam optimizer
The training took total time of 15 minutes with both Adam and NAdam optimizer and
produced an excellent stable training and validation accuracy of 92.68% and 95.66%
respectively while the training and validation losses were 0.2388 and 0.1498 with Adam
Optimizer while it produced better stable training and validation accuracy of 95.09% and
44
97.07% respectively with NAdam optimizer. The training and validation losses with NAdam
optimizer were 0.1575 and 0.0974 respectively.
45
6. OUTPUT
After obtaining the, the stable hypothesis, we used the model to recognize the handwritten
Nepali digits and characters sample from our friends and colleagues. Some images were
recognized with 100% accuracy whereas some were mis-predicted. We also used the system
to recognize images in MS- paint. The overall performance of the system for total images
used to be identified was always above 90% accurate. Some of the samples of completely
recognized images are shown below.
Figure 6.1: Image loaded for recognition
Figure 6.2: Recognized Image Snapshot
46
47
The snapshots of the images being recognized shows the predicting capacity of our
model. All the three images included are handwritten in plain white paper and loaded into
the model. Thus, not only during the validation highest accuracy was obtained but the
accuracy also seems pretty good during recognition phase.
48
7. CONCLUSION
This project work is a successful output of the course called ‘Major Project’ considered as
the partial fulfillment of B.E. Computer Engineering at IOE. The main objective of the
project was to recognize the handwritten Nepali digits and characters in an image and to
narrate the correct pronunciation of it. Another main objective was to compare the two image
classification optimizers namely Adam and Nadam. The system was implemented in Python
programming language with Flask using static webpage with HTML/CSS/Javascript and its
performance was tested on real images.
This system is one kind of AI program. The knowledge of image processing and
Convolutional Neural Network that implements the OCR technology has been used to
extract the digits from an image and finally high level of accuracy i.e. 95.66% of validation
accuracy with Adam and 97.07% with Nadam optimizer during training and the system
performs very well with the other real worlds images as well, in recognition phase.
This project work has been a great achievement for us even though some limitations still
exist. Finally, the overall result was satisfactory, and we successfully carried out this project
as a part of our course work and also developed the hands-on experience of working in a
project.
49
8. LIMITATIONS AND FURTHER WORK:
Handwriting Recognition has been quite an interesting field today. Many technical teams
around the world are still working together to get the satisfactory result. Several researches
have been done and some are still ongoing in this field and due to advancement in technology
and efficient new models it has made possible to make further enhancement in this field to
get more accurate results. As other projects our project has also got limitations and the
enhancements that can be made in future.
8.1 Limitations
Some of the particular limitations of our project are:
1. Offline operation: The current system is designed only for the offline operation right
now i.e. while deployed in web, it is accessible only through localhost now.
2. Narrow application domain: The present system is focused only in implementation of

handwritten digits and character recognition in web environment. But no other post
applications steps such as Desktop or Mobile apps are implemented.
3. Narrow recognition domain: Currently, the system recognizes only the handwritten
Nepali alphabets and digits but not any other characters or words. It is unable to recognize
vowel characters and joined characters. Similarly, due to less amount of dataset for the deep
neural network, the recognition accuracy is not sufficient enough.
8.2 Future Enhancements
The main objective of this project was to obtain a model with highest accuracy that can
predict the real-world images. But yet the model we have obtained is not perfectly trained.
So, the basic tasks yet to be done are as:
1. Increasing the number and quality of training dataset.

2. Performing the training session by altering different parameters and switching between
the network architecture, different activation functions, loss functions and optimizers.
3. Getting different analytical results of the system such as confusion matrix analysis.
4. Improving the model so that it can recognize not only consonants and numerals but
vowels, joined characters and words also.
5. Recognition in real time with drawing through mouse pointer.
50
9. REFERENCES
[1] R. L. Grimsdale and J. M. Bullingham, "Character Recognition by Digital Computer
using a Special Flying-Spot Scanner," The Computer Journal, vol. 4, no. 2, p. 129–
136, 1959.
[2] D. H. Hubel and T. N. Wiesel, "Receptive fields, binocular interactions and functional
architecture in the cat visual cortex," Journal of Physiology, vol. 160, pp. 106-154,
1962.
[3] Y. Lecun, L. Bottau, Y. Benjio and P. Haffner, "Gradient Based Learning Applied to
Document Recognition," Proceedings of the IEEE, vol. 86, p. 2278–2324, 1998.
[4] K. Fukushima, "Neocognitron: A Self-organizing Neural Network Model for a

Mechanism of Pattern Recognition Unaffected by Shift in Position," Biol. Cybernetics,
vol. 36, pp. 192-202, 1980.
[5] A. Krizhevsky, I. Sutskever and G. Hinton, "Imagenet classification with deep

convolutional neural network," Advances in Neural Information Processing Systems,
p. 1106–1114, 2012.
[6] T. E. de Campos, B. R. Babu and M. Varma, "Character recognition in natural images,"

Lisbon, February 2009.
[7] S. Acharya, A.K Pant, P.K Gyawali, “Deep Learning Based Large-Scale Handwritten
Devanagari Character Recognition” 9th International Conference on Software,
Knowledge, Information Management and Applications (SKIMA) pp 1-6, 2015
[8] A. K. Pant, S. P. Pandey and S. P. Joshi “Off-line Nepali Handwritten Character

Recognition using Multilayer Perceptron and Radial Basis Function neural networks,”
Third Asian Himalayas International Conference on Internet (AH-ICI), pp 1-5, 2012
[9] J. Pradeep, E. Srinivasan and S. Himavathi, "Diagonal Based Feature Extraction For
Handwritten Alphabets Recognition System Using Neural Network," International
Journal of Computer Science & Information Technology (IJCSIT), vol. 3, no. 1, Feb
2011.
[10] R. Plamondon and S. N. Srihari, "Online and offline handwritten character recognition:
A Comprehensive Survey," IEEE transaction on pattern analysis and machine
intelligence, vol. 22, no. 1, pp. 63-84, 2000.
51
[11] Y. Alginahi, "Preprocessing Techniques in Character Recognition," 2010.
[12] J. Pradeep, E. Srinivasan and S. Himawathi, "Perfomance analysis of hybrid feature

extraction technique for recognizing english handwritten character," IEEE, pp.
46734805, 8 12 2012.
[13] M. 'Arif Mohamad, D. Nasien, H. Hassan and H. Haron, "A Review on Feature
Extraction and Feature Selection for Handwritten Character Recognition," (IJACSA)
International Journal of Advanced Computer Science and Applications, vol. 6, no. 2,
pp. 204-212, 2015.
[14] N. Perveen, D. Kumar and I. Bhardwaj, "An Overview on Template Matching

Methodologies and its Applications," International Journal of Research in Computer
and Communication Technology, vol. 2, no. 10, pp. 989-995, 2013.
[15] R. Verma and D. J. Ali, "A-Survey of Feature Extraction and Classification

Techniques in OCR Systems," International Journal of Computer Applications &
Information Technology, vol. 1, no. 3, 2012.
[16] M. A. Nielson, Neural Networks and Deep Learning, Determination Press, 2015.
[17] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous and Y. LeCun, "The loss

surfaces of multilayer networks," AISTATS, 2015.
[18] Kingma, D. P., & Ba, J. L, “Adam: a Method for Stochastic Optimization.”
International Conference on Learning Representations, pp1–13, 2015.
[19] Dozat, T, “Incorporating Nesterov Momentum into Adam.”, ICLR Workshop 2013–
2016, 2016
52

Handwritten Nepali Character Recognition and Narration System Using Deep CNN

Uploaded by

Copyright:

Available Formats

Handwritten Nepali Character Recognition and Narration System Using Deep CNN

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Handwritten Nepali Character Recognition and Narration System Using Deep CNN

Uploaded by

Copyright:

Available Formats

TRIBHUVAN UNIVERSITY

HANDWRITTEN NEPALI CHARACTER RECOGNITION AND

A PROJECT WAS SUBMITTED TO THE DEPARTMENT OF ELECTRONICS AND

DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING

-Anish Baral (39803)

-Diwash Khanal (39812)

-Prabin Baskota (39820)

-Pradeep Dahal (39821)

Keywords: Devanagari Handwritten Character Dataset; Image processing; Computer

COPYRIGHT ................................................................................................................................. iii

TABLE OF CONTENTS ...............................................................................................................vi

LIST OF FIGURES ..................................................................................................................... viii

LIST OF TABLES ..........................................................................................................................ix

LIST OF ABBREVIATIONS ........................................................................................................ x

2.1 History of Handwriting Recognition ................................................................................. 4

3.1 General Description: .......................................................................................................... 14

4.1 Software Development Life Cycle ................................................................................... 32

8. LIMITATIONS AND FURTHER WORK: ........................................................................... 50

8.1 Limitations .......................................................................................................................... 50

1.2 Problem Statement

Academically, the project is primarily focused on fulfilling the discipline of an engineering

1.4 Scope of work:

2.1 History of Handwriting Recognition

An early notable attempt in the area of character recognition research is by Grimsdale in

However, the Hubel-Wiesel experiment in 1962 was a breakthrough to understand the

2.2 Different approaches on Handwriting Recognition:

2.2.1 Gradient Based Learning Approach

Figure 2.1: LeNet-5 Architecture

2.2.2 De Campos et. al. (2009)

2.2.3 Diagonal Based Feature Extraction

2.3 Handwritten Digits and Character Recognition System:

2.3.1 Image Acquisition

2.3.2 Image Preprocessing

Figure 2.3: Digitization of Image

Normalization: The main component of the pre-processing stage is normalization, which

2.3.3 Image Segmentation

In the segmentation stage, an image consisting of a sequence of characters is decomposed into

2.3.4 Feature Extraction

Feature extraction based on three types of feature:

1.) Statistical Feature

2.) Structural Features

3.) Global transformation and series expansion

There are four classification method:

c.) Structural Techniques: The recursive description of a complex pattern in terms of

3.1 General Description:

The Schematic Diagram of the developed system is given below:

Figure 3.1: Schematic diagram of the total system

Figure 3.3: Use Case For Image Processing Module

Figure 3.5: Use-Case Diagram for Recognition Module

Figure 3.7: Level-1 DFD for Image Processing

Figure 3.9: Level-0 DFD for recognition module

3.3 Dataset Preparation

3.3.1 Data Collection

3.3.2 Training and Test Data Generation

3.4 Image Processing

3.4.1 Image acquisition

Figure 3.11: Sample of Image loaded for Image Processing

3.4.2 Image Preprocessing

The pre-processing is a series of operations performed on the scanned input image. It