Rajasekaran and Deekshatulu (1977). This same approach was followed later in Rao and Ajitha (1995). The first attempt, to our knowledge, to use neural networks for Telugu OCR was in Sukhaswami, Seetharamulu and Pujari (1995). They train multiple neural networks, pre-classify an input image based on its aspect ratio, and feed it to the corresponding network. This reduces the number of classes that each sub-network needs to learn, but it is likely to increase the error rate, as a failure in pre-classification is not recoverable. The neural network employed is a Hopfield net on a down-sampled vectorized image. Later work on Telugu OCR primarily followed the featurization-classification paradigm. Combinations like ink-based features with nearest class centroid (Negi, Bhagvati and Krishna, 2001); ink-gradients with nearest neighbours (Lakshmi and Patvardhan, 2002); principal components with support vector machines (SVM) (Jawahar, Kumar and Kiran, 2003); and wavelet features with Hopfield nets (Pujari et al., 2004) were used. More recent work in this field (Kumar et al., 2011) is centred around improving the supporting modules like segmentation, skew-correction and language modelling.
This paper improves on previous work in a few significant ways. While previous work was restricted to using only a handful of fonts, we develop a robust font-independent OCR system by using training data from fifty fonts in four styles. This data (along with the rest of the OCR program) is publicly released to act as a benchmark for future research. The training and test data are big and diverse enough to let one get reliable estimates of accuracy. Our classifier achieves a near-human classification rate. We also integrate a much more advanced language model, which also helps us recover broken letters.
In our work, we break from the above-mentioned featurize-and-classify paradigm. We employ a convolutional neural network (CNN), which learns the two tasks in tandem. In addition, a CNN also exploits the correlation between adjacent pixels in the two-dimensional space (LeCun et al., 1998). Originally introduced for digit classification (a sub-task of OCR), CNNs have been adapted to classify arbitrary colour images from even a thousand classes (Krizhevsky, Sutskever and Hinton, 2012). This was aided in part by better regularization techniques like training-data augmentation (Simard, Steinkraus and Platt, 2003) and dropout (Hinton et al., 2012), and by increased computing power. We take this opportunity to go on an excursion into deep learning, and try to demystify the process of training big and deep neural networks. We not only solve the image classification problem at hand but also integrate it into the larger framework for text recognition.
The rest of the paper is organized as follows. We introduce the problem in Section 2, describing the Telugu language and its calligraphy.
Calligraphy. The consonant forms the heart of the syllable and is written at the center, between the top and bottom lines. The consonant is augmented by a vowel, which is written as the ascender. As an example, let us consider the moderately complex word sakarma rendered in Figure 1. While ka is rendered as one glyph, rma is written in two: ra is written at the center, and a vowel-less m is written below the baseline, disconnected from ra. In a few syllables, like pa and sa, the vowel is detached from the base.
Figure 1. The word (sakarma) rendered in Telugu and in English with Telugu styling. The 3 in Telugu corresponds to the vowel a, usually attached to the consonant. The first syllable (sa) is written in two parts, (ka) in one, and (rma) is split into (ra) and (m).
Figure 2. A sample Telugu poem. The detected line-separations are in red, top-lines in blue and bottom-lines in green. The marginal and the best-fitting harmonic are shown to the right. Lines are detected by looking for sudden jumps in the marginal and using the wavelength of the harmonic as a heuristic.
Figure 3. Glyphs extracted from a text image. Each connected component, shown in a different colour, is scaled before being fed to the Neural Network.
bold, italic and bold-italic). This results in nearly 160 unique renderings per glyph (after removing duplicates). Hence, in all, we have 160 × 460 ≈ 73,000 labelled samples. Figure 5 shows all the renderings for a sample class. We think these cover the range of modern type-settings in Telugu very well. However, we still could not find the computer equivalent for some of the main fonts used extensively in the early days of printing. This leaves us vulnerable to the problem of data drift, where the distribution of the training and test datasets does not completely capture that of the real-world problem. We need to consider this while regularizing our classifier.
Figure 4. The word jarjara. The letter (ja), when rendered above the base line, has its usual interpretation. But when rendered below, it results in the loss of a vowel, here a.
Some glyphs cannot be rendered stand-alone, since they exist only as parts of multi-glyph syllables. For example, the consonant combiner (+m) in Figure 1 can only be a part of a syllable like (rma). So we render random text on to a digital image and extract the glyphs back. Since we know the rendered text, we know the labels of the glyph segments obtained. More importantly, we also know the location information for each glyph, i.e. its position relative to the top and base lines. We expect to use this information for better classification, since the same glyph, when placed at different locations relative to the top and base lines, falls into different classes (Figure 4 shows an example). We now have training data that can be used to train any machine learning algorithm of interest.
Figure 5. The 167 unique renderings of the letter (go). This illustrates the expected spread in the distribution of glyphs. Similarly, we have nearly as many unique renderings for each of the 460 classes.
Figure 6. The original binary image of the letter (ja) is shown at the top, the three
sample convolution kernels at the center, and the corresponding outputs at the bottom.
with the assumption that for any indices out of range, A is zero. A non-linearity R is applied after the convolution operation C_W to give the final operation \bar{C}_W of the convolutional layer:

(5.3)  \bar{C}_W(A) = R(C_W(A))
5.1.2. Pool Layer. A typical convolution layer outputs many more maps than it takes in. There is also high correlation between adjacent output values in a map. Hence it makes sense to scale the maps down, generally by a factor of two in each co-ordinate. This operation is called pooling. Typically, pooling is done over a 2×2 grid, and the maximum of the four pixels in the grid is extracted as output. Taking the maximum (as opposed to the average) gives the neural network translational invariance, which is key to good classification. We can increase the number of maps four-fold at each convolutional layer and decrease the area of the image-maps by a factor of four at the succeeding pool layer. This preserves the size of the image while at the same time transforming it to a different, more useful, representation.
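As a concrete illustration, here is a minimal NumPy sketch of 2×2 max-pooling over a stack of maps (the layer sizes and variable names here are ours, not from the paper's implementation):

```python
import numpy as np

def max_pool(maps):
    """2x2 max-pooling: halve each map's height and width,
    keeping the largest of every four neighbouring pixels."""
    n, h, w = maps.shape
    blocks = maps.reshape(n, h // 2, 2, w // 2, 2)
    return blocks.max(axis=(2, 4))

# a pool layer after a convolution layer: 6 maps of 48x48 become 6 maps of 24x24
maps = np.random.rand(6, 48, 48)
print(max_pool(maps).shape)  # (6, 24, 24)
```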
5.1.4. Output Layer. The last of the fully-connected layers is the output layer F^S_W. It has as many nodes as the number of classes, K, in the problem:

(5.7)  F^S_W : \mathbb{R}^{n_2} \to \mathbb{R}^K

(5.8)  F^S_W(A) = S(WA)

where

(5.9)  S_k(A) = \frac{e^{A_k}}{\sum_{j=1}^{K} e^{A_j}}
This is the same transformation used in the multilogit model, and it produces positive values that sum to one. The K-vector obtained from the matrix multiplication is exponentiated to make it all positive and then normalized to belong to the K-simplex. These K values are interpreted as the class probabilities. In summary, one can think of the convolutional and pool layers as the feature-extraction phase of the CNN and the fully connected layers as a vanilla neural network classifier on top.
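The softmax S of (5.9) is a one-liner; here is a numerically stable sketch (subtracting the maximum before exponentiating is a standard trick, not something discussed in the paper):

```python
import numpy as np

def softmax(z):
    """Map a K-vector of scores to the K-simplex, as in (5.9)."""
    e = np.exp(z - z.max())  # subtracting max() avoids overflow; result unchanged
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())  # positive values summing to one
```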
Let X, Y denote a random image and its class label, respectively. Then, according to our model, the likelihood is given by

(5.10)  P(Y = y \mid X = x) = p_y(x; W) = S_y(A(x)),

where A(x) is the input to the softmax layer (for a given set of network parameters W = {W_i}). Neural networks are trained to maximize the log of the above likelihood over the training data D to find the optimum network parameters W^*:

(5.11)  L = \sum_{(x,y) \in D} \log(p_y(x; W))

(5.12)  W^* = \operatorname{argmax}_W L
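In code, the objective (5.11) is a sum of log-probabilities; a minimal sketch using the softmax above, where net(x, W) is a stand-in for the network up to the softmax input (our placeholder, not a function from the paper's released code):

```python
import numpy as np

def log_likelihood(data, W, net):
    """Sum of log p_y(x; W) over labelled pairs (x, y), as in (5.11).
    `net(x, W)` is assumed to return A(x), the softmax-layer input."""
    total = 0.0
    for x, y in data:
        probs = softmax(net(x, W))
        total += np.log(probs[y])
    return total  # training seeks W* = argmax_W of this quantity, as in (5.12)
```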
[Figure: the four activation functions compared in this paper (leaky rectifier, rectifier, sigmoid and tanh), plotted over the interval [−1, 1].]
Figure 9. A traditional convolutional neural network where a single-channel 48×48 input is subjected to three convolution-pool operations (resulting in 6×24×24, 16×12×12 and 30×6×6 maps, respectively). The final 30×6×6 tensor is flattened to 1080 nodes and is fully connected to a hidden layer with 500 units, followed by a multi-class logistic layer over the 457 classes.
5.3. Our Neural Network. The glyphs obtained from segmentation are scaled up or down to a 48×48 square and fed to the CNN. We preserve the aspect ratio while scaling. In addition to these 2304 binary pixel values, we have two numbers representing the locations of the top and base lines, which we will later incorporate into the network. We use a traditional architecture, based on LeCun et al. (1998), as our reference point to compare various design choices and regularizations. It has three pairs of convolution-pool (conv-pool) layers, all of which employ a 3×3 convolution kernel and a 2×2 max-pooling window. These are followed by two fully connected layers. We use leaky ReLUs in place of the tanh activations of LeCun et al. (1998). The last layer's soft-max activation yields the final class probabilities. This model can be written as
(5.18)  p(x; W) = F^S_{W_5} \circ F_{W_4} \circ P_2 \circ C_{W_3} \circ P_2 \circ C_{W_2} \circ P_2 \circ C_{W_1}(x)
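The composition (5.18) translates directly into code. Below is a minimal NumPy sketch of the forward pass, reusing the max_pool and softmax sketches above; conv_layer is a naive 'same'-padded 3×3 convolution followed by a leaky-ReLU non-linearity. The layer widths follow Figure 9, while the random weights are stand-ins for trained parameters:

```python
import numpy as np

def leaky_relu(z, slope=0.01):  # the slope is our assumption; the paper does not state it
    return np.where(z > 0, z, slope * z)

def conv_layer(maps, kernels):
    """Naive 3x3 'same' convolution: (n_in, H, W) -> (n_out, H, W), then leaky ReLU."""
    n_out, n_in, _, _ = kernels.shape
    _, h, w = maps.shape
    padded = np.pad(maps, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((n_out, h, w))
    for o in range(n_out):
        for i in range(n_in):
            for dy in range(3):
                for dx in range(3):
                    out[o] += kernels[o, i, dy, dx] * padded[i, dy:dy + h, dx:dx + w]
    return leaky_relu(out)

def forward(x, K1, K2, K3, W4, W5):
    """(5.18): three conv-pool pairs, one hidden layer, softmax output."""
    a = max_pool(conv_layer(x[None], K1))  # 1x48x48 -> 6x24x24
    a = max_pool(conv_layer(a, K2))        # -> 16x12x12
    a = max_pool(conv_layer(a, K3))        # -> 30x6x6
    a = leaky_relu(W4 @ a.reshape(-1))     # flatten to 1080, hidden layer of 500
    return softmax(W5 @ a)                 # class probabilities over 457 classes

rng = np.random.default_rng(0)
x = (rng.random((48, 48)) < 0.2).astype(float)
params = (rng.normal(0, .1, (6, 1, 3, 3)), rng.normal(0, .1, (16, 6, 3, 3)),
          rng.normal(0, .1, (30, 16, 3, 3)), rng.normal(0, .1, (500, 1080)),
          rng.normal(0, .1, (457, 500)))
print(forward(x, *params).shape)  # (457,)
```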
Figure 10. The letter (go) goes through the slim network from Table 1. The last two images (at the bottom right) are the input and output of the softmax classification layer.
Figure 11. Intermediate features and final transformations for various input images (extracted from the fourth and the last hidden layers, respectively, of the slim network). Notice how similar-looking images have similar representations in the transformed space.
One can make the network as small as 100K parameters while still achieving test error rates below 1.5%. For these deeper networks (with only the one fully connected layer), half of the parameters are in this last classification layer. The rest of the network, with its six to eight convolutional layers, uses the other half of the parameters. Thus the network parameters are equally split between the classification and feature-extraction tasks. However, the convolution operation is computationally expensive and can significantly increase the time required to classify a glyph. But as we are in the sub-millisecond range, this is acceptable. Even the deepest neural networks, once trained, classify much faster than a kNN classifier. Also, in the context of the larger Telugu OCR problem, the Viterbi algorithm employed by the language model is the main speed bottleneck.
While it might seem natural to use the Fast Fourier Transform (FFT) to perform the convolution operation, it does not give any significant improvement in performance, as the kernel size is small and fixed at three. While direct convolution needs 9n operations, the FFT would require n log(n) + n operations, which is not a significant reduction in complexity. The FFT also has additional memory requirements to store the transformation, and back-propagating through it is not as straightforward.
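A quick back-of-the-envelope check of these operation counts, for one 24×24 intermediate map (taking the FFT's n log n with a base-2 log and unit constant, which is of course only a rough proxy):

```python
import numpy as np

n = 24 * 24                                    # pixels in one intermediate map
print("direct 3x3:", 9 * n)                    # 5184 multiply-adds
print("FFT-based: ", int(n * np.log2(n) + n))  # ~5857 -- no real savings here
```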
5.5. Design Choices. The fourteen-layered neural network from Table 1 (line 7) requires the specification of nearly a hundred hyper-parameters. It takes quite a bit of experimentation to obtain a set of hyper-parameters that leads to good prediction. The slim architecture from Table 1 has been carefully hand-crafted to improve both on the size and the speed of the traditional architecture. One could spend countless hours tuning these hyper-parameters. This difficulty in picking the right architecture and hyper-parameters gives neural networks an air of mystery. Here we briefly point out some of the important design choices in an attempt to demystify the process.
In our dataset of binary images, nearly one-fifth of the pixels are ink and the rest are background. If we consider ink to be one and background zero, the mean of the image is at 20%. One can instead consider the reverse and increase the mean to 80%. A higher mean helps keep the ReLU units activated (i.e. in their linear, as opposed to flat, region). This is a non-issue for leaky ReLUs, which are never really flat. In addition to more successful training instances, the leaky ReLUs also improve performance by about 25%. Using the traditional tanh activation not only increases the computational complexity but also reduces performance by 25%. Table 2 summarizes these effects.
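To make the "flat region" point concrete: the plain rectifier's gradient is exactly zero for negative inputs, while the leaky variant (as in the leaky_relu sketched earlier) keeps a small slope there, so its units always pass some gradient back. A tiny demonstration (the 0.01 slope is our illustrative choice, not a value from the paper):

```python
import numpy as np

z = np.linspace(-1, 1, 5)                # [-1, -0.5, 0, 0.5, 1]
relu_grad = (z > 0).astype(float)        # 0 on the negative side: "flat"
leaky_grad = np.where(z > 0, 1.0, 0.01)  # never zero: the unit can always recover
print(relu_grad)   # [0.   0.   0.   1. 1.]
print(leaky_grad)  # [0.01 0.01 0.01 1. 1.]
```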
Table 2
The effect of various design choices on test and training errors. We use the traditional architecture from line 6 of Table 1. We report median rates over eleven training attempts. The standard error of the error estimates is 0.1%.
5.6. Regularization. Given that we are fitting a {0, 1}^{48×48} → (0, 1)^{457} model using over a million parameters, it is very easy to overfit, making regularization a key component of the training process. Two forms of regularization seem to work well for this problem: dropout (Hinton et al., 2012) and input distortion. Both of them distort the input to a classifier, albeit in very different ways.
Figure 12. Three distortions applied to the seven original images in the top row. Each
distortion is a combination of a random amount of translation, rotation, zoom, elastic
deformation and noise.
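A sketch of how one such random distortion can be generated with scipy.ndimage; all magnitudes below (angle, shift, zoom, elastic strength, noise rate) are illustrative guesses, not the paper's actual settings:

```python
import numpy as np
from scipy import ndimage

def distort(img, rng):
    """One random distortion in the spirit of Figure 12."""
    h, w = img.shape
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    theta = np.deg2rad(rng.uniform(-5, 5))  # small random rotation
    scale = 1.0 + rng.uniform(-0.1, 0.1)    # random zoom
    ty, tx = rng.uniform(-2, 2, size=2)     # random translation
    cy, cx = (h - 1) / 2, (w - 1) / 2
    yc, xc = yy - cy, xx - cx
    src_y = (np.cos(theta) * yc + np.sin(theta) * xc) / scale + cy + ty
    src_x = (-np.sin(theta) * yc + np.cos(theta) * xc) / scale + cx + tx
    # elastic deformation (Simard et al., 2003): smoothed random displacement field
    dy = ndimage.gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma=4) * 3
    dx = ndimage.gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma=4) * 3
    out = ndimage.map_coordinates(img, [src_y + dy, src_x + dx], order=1)
    flip = rng.random((h, w)) < 0.02        # salt & pepper noise: flip a few pixels
    return np.where(flip, 1 - out, out)

rng = np.random.default_rng(0)
sample = (rng.random((48, 48)) < 0.2).astype(float)
print(distort(sample, rng).shape)  # (48, 48)
```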
The idea behind dropout (Hinton et al., 2012) is that we encourage the network to learn features that are useful by themselves rather than features that are useful in the presence of one another. By dropping out a random half of the features at each gradient-descent step, we prevent co-adaptation of features. This helps generalization error. In other words, we are very weakly training a huge number of models and averaging them at test time.
While dropout can be applied at any layer of the CNN, it does not make much sense to use dropout at the convolution layers. If dropout is applied at the input layer, it is the same as salt & pepper noise. Dropout works best when applied to the finally extracted features, just before feeding them to the softmax layer. We hypothesize that dropping out a feature (that has been extracted by eight convolutional layers) is equivalent to distorting the input image significantly. Thus dropout and input distortions have similar effects. This is also apparent in Table 3, which summarizes performance using various forms of regularization.
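A minimal sketch of (inverted) dropout applied at that final feature layer; the rescaling by 1/(1 − rate) keeps the expected activation unchanged so that test time needs no correction (this inverted formulation is a common convention, not necessarily the paper's):

```python
import numpy as np

def dropout(features, rate, rng, train=True):
    """Zero out a random fraction `rate` of features during training."""
    if not train:
        return features                    # test time: use the full feature vector
    keep = rng.random(features.shape) >= rate
    return features * keep / (1.0 - rate)  # rescale survivors to preserve the mean

rng = np.random.default_rng(0)
f = np.ones(10)
print(dropout(f, 0.5, rng))  # roughly half the entries zeroed, the rest doubled
```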
Table 3
The effect of various forms of regularization on test and training errors. We take the traditional architecture from Table 1 and remove one form of regularization from it to see how its absence affects performance. Median rates over eleven trials are reported. The standard error of the error estimates is 0.1%.
Figure 13. The effect of dropout on test error. Each dot represents one training attempt. Solid lines show median performance for a given dropout rate over eleven such attempts. In the presence of input distortion, little to no dropout is required. In its absence, however, 40-80% dropout helps.
Table 4
Improved test error with location information for networks from Table 1
Figure 14. Training and Test error as a function of passes through the data.
The network trains for an hour, going over the entire dataset fifty times, on a system with an Nvidia GTX Titan Black GPU, to give a test error rate of 2%. Training is five times slower without the GPU. We usually train the network with 100 passes over the training data. Convergence is faster with leaky ReLUs than with ReLUs, as the former have gradients of larger magnitudes. For the same reason, convergence is faster with ReLUs than with tanh activations.
Further, we can incorporate biases into the OCR system according to the
frequency of occurrence of various glyphs, letters, words, etc.
Given a sequence of input images x = (x_1, x_2, \ldots, x_t), where x_i \in \{0, 1\}^{48 \times 48}, we need to find y = (y_1, y_2, \ldots, y_t), the sequence of output labels that maximizes P(y|x).
First consider the term P(x|y). In reality, there is a font-based dependence across glyph renderings {x_i} given the corresponding labels {y_i}, but we make the independence assumption that

(6.2)  P(x|y) = \prod_i P(x_i | y_i).

While this simplifies the math, it only makes the problem harder. Additionally, it is applicable to our case, as we do not incorporate font-based dependencies across glyphs in our neural network (say, by maintaining a state as we move from glyph to glyph). Now,
(6.5)  P(x_i | y_i) = \frac{P(y_i | x_i) P(x_i)}{P(y_i)} \propto \frac{P(y_i | x_i)}{P(y_i)}
In our case, the training data contains an equal number of instances per class (although some classes are more than a thousand times as likely as others). Therefore, P(y_i) = 1/457 is the same for all i, and hence it drops out of (6.5). Now, plugging in \hat{P}(y_i|x_i) as an estimate for P(y_i|x_i) in (6.4), we obtain this simple formula as an estimate for P(y|x):

(6.6)  \hat{P}(y|x) \propto \hat{P}(y) \prod_i \hat{P}(y_i | x_i)
Here, \hat{P}(y) is an estimate for P(y). Once we find \hat{P}(y) using the n-gram model detailed below, we find the most probable sequence of glyphs as \hat{y} = \operatorname{argmax}_y \hat{P}(y|x):

(6.7)  \hat{P}(y) = \prod_{i=1}^{t} \hat{P}(y_i | y_{i-1}, y_{i-2}, \ldots, y_{i-(n-1)})
glyphs. This renders the trigram incidence matrix sparse. We take the top five candidates for each glyph, with corresponding scores \hat{P}(y_i|x_i), and run the Viterbi algorithm to get the most likely sequence of glyphs according to the formula

(6.8)  \hat{P}(y|x) \propto \prod_{i=1}^{t} \hat{P}(y_i | y_{i-1}, y_{i-2}) \hat{P}(y_i | x_i)
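A sketch of how such trigram probabilities can be estimated by counting over a large label corpus; the smoothing floor eps is our assumption, since the paper does not specify how unseen trigrams are handled:

```python
from collections import Counter, defaultdict

def trigram_model(corpus, vocab_size=457, eps=1e-6):
    """Estimate P(y_i | y_{i-2}, y_{i-1}) from a sequence of glyph labels."""
    counts = defaultdict(Counter)
    for a, b, c in zip(corpus, corpus[1:], corpus[2:]):
        counts[(a, b)][c] += 1  # context (a, b) followed by c
    def prob(c, a, b):
        ctx = counts[(a, b)]
        total = sum(ctx.values())
        return (ctx[c] + eps) / (total + eps * vocab_size)  # floor-smoothed estimate
    return prob

p_tri = trigram_model(list("the cat sat on the mat"))
print(p_tri("e", "t", "h"))  # P(e | t, h)
```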
Combining broken glyphs. The initial graph is linear, and hence each node has at most one child. The graph is parsed bottom-up, starting from the last node. At a given node, we consider the two consecutive edges from self to child and from child to grandchild, and check whether the two edges need to be combined. This is repeated over all grandchildren of all children. Although each node has only one child to begin with, each time we decide to add a new edge, a grandchild of the current node becomes a child (like r and n combining to give m in Figure 15). This way, more than two pieces can be combined to give a candidate glyph (like o, r, n combining to give om in the same figure), as in the sketch below.
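A sketch of this bottom-up recovery pass; should_combine (the test for whether two pieces belong together, e.g. based on their gap and combined width) and merge_images are hypothetical helpers, since the paper does not spell out its exact criterion:

```python
def add_combined_edges(children, images, should_combine, merge_images):
    """Augment a linear segmentation graph with edges for re-joined pieces.
    children[u] lists the nodes reachable from u by one edge; images[(u, v)]
    is the glyph image carried by edge u -> v."""
    for u in reversed(range(len(children))):  # parse bottom-up from the last node
        for v in list(children[u]):           # edge: self -> child
            for w in list(children[v]):       # edge: child -> grandchild
                if should_combine(images[(u, v)], images[(v, w)]):
                    images[(u, w)] = merge_images(images[(u, v)], images[(v, w)])
                    children[u].append(w)     # the grandchild becomes a child
```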
Figure 15. Segmentation graph for a word that could be corn or com. Starting with the
black edges, given by the segmentation module, the green edges are to be added by our
recovery mechanism.
6.3. Viterbi Decoder. Once we have our final segmentation graph, which is a directed acyclic graph (DAG), we generate the recognition graph (LeCun et al., 1998). For each arc in the segmentation graph, we take the corresponding image, pass it through our classifier, and get the top M candidates (and their probabilities). We build a new graph by replacing each arc in the segmentation graph by M weighted arcs, each carrying a different candidate character. It seems that M = 5 is sufficient in practice. Figure 16 shows an example with M = 3 matches per image.
We now need to find the path in the recognition graph that corresponds to the highest probability as defined by (6.8). Note that (6.8) has two terms
Figure 16. Recognition graph for the image in Figure 15. Each arc is replaced by three arcs corresponding to the top three matches for the image. The corresponding percentage probabilities are shown as edge weights.
per glyph: one is the probability of a candidate label given the image; the other is the n-gram probability of this candidate label given the previous n − 1. The former is incorporated in the recognition graph; the latter is not. Finding the strongest path in the recognition graph would correspond to picking the top match for each glyph. Doing so would not incorporate any linguistic information. Hence we need to pick, among all the paths in the DAG, the one that has the highest path probability as defined by (6.8). This could be computationally expensive: for a string of thirty glyphs with five matches per glyph, we have 5^30 (over 10^20) paths to consider. To solve this seemingly intractable problem, we use the famous Viterbi algorithm (Viterbi, 1967), an efficient dynamic-programming algorithm that finds the shortest path in a graph.
Before we can use the Viterbi algorithm, we need to augment the recognition graph with n-gram probabilities. The length of an edge is not just the likelihood of the glyph on it; instead, we need to multiply it by the n-gram probability of the current character candidate given the candidates for the previous n − 1. Each of the M edges (between a node and one of its children) in the recognition graph is replaced by M^{n−1} edges, where n is the order of the n-gram. Thus, now, between a node and one of its children, we have M^n edges. We call this the n-gram graph. Figure 17 shows one such graph with M = 2, n = 2. We run the Viterbi algorithm on this graph to get the most likely path, which in our example from Figure 15 is corn.
Figure 17. The n-gram graph, shown here with two matches per image (M = 2) and bigram probabilities (n = 2). For our application, we use M = 5, n = 3, leading to a much more complicated graph. The Viterbi algorithm finds the strongest path in this graph. The c, o in the edge label "co 9,85" denote c as a candidate from the previous image and o from the current one; 85 is the percentage probability of o given the image, and 9 is the percentage probability of the bigram co. The edge strength is 0.09 × 0.85 = 0.0765. "_" represents the beginning of a sentence.
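A sketch of the trigram (n = 3) decoder: dynamic programming over states (y_{i−1}, y_i), working in log space for numerical stability. Here cand, p_img and p_tri are assumed inputs (the candidate lists, CNN scores and trigram model); the paper's actual implementation may differ:

```python
import numpy as np

def viterbi_decode(cand, p_img, p_tri, start="_"):
    """cand[i]: list of label candidates for glyph i; p_img[i][k]: CNN
    probability of cand[i][k]; p_tri(c, a, b): trigram P(c | a, b)."""
    best, back = {}, [{} for _ in cand]
    for k, c in enumerate(cand[0]):          # initialize with the sentence start
        best[(start, c)] = np.log(p_tri(c, start, start) * p_img[0][k])
    for i in range(1, len(cand)):
        new = {}
        for k, c in enumerate(cand[i]):
            for (a, b), s in best.items():   # extend every surviving state
                score = s + np.log(p_tri(c, a, b) * p_img[i][k])
                if (b, c) not in new or score > new[(b, c)]:
                    new[(b, c)], back[i][(b, c)] = score, (a, b)
        best = new
    state = max(best, key=best.get)          # best final state (y_{t-1}, y_t)
    path = [state[1]]
    for i in range(len(cand) - 1, 0, -1):    # trace the winning path backwards
        state = back[i][state]
        path.append(state[1])
    return path[::-1]
```

Each position examines at most M^3 transitions (M^2 states times M candidates), so the thirty-glyph example above costs roughly 30 × 125 operations instead of 5^30 path evaluations.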
While, under our model, \hat{P}(y_i|x_i) is asymptotically consistent for P(y_i|x_i), there are a few considerations. Firstly, a huge model like ours is prone to overfit the values of \hat{P}(y_i|x_i) to the training data. While we do regularize the network to prevent overfitting, we might have to further tune the network parameters. This is because we are using the learned network as a component in a bigger model that includes a language prior. To this end, we recalibrate the learned probabilities by further penalizing the weights of the softmax layer.
The class probability for the k-th class is given by

(6.9)  P(k|x) = \frac{e^{\beta_k^T A}}{\sum_{j=1}^{K} e^{\beta_j^T A}},

where A is the input to the softmax layer (for a given image x) and the {\beta_k} define the weight matrix W of (5.8). One could scale all the coefficients \beta_j by a fixed quantity \lambda, and the predictions of the neural network would not change; only the confidences in them would. Define the biased probabilities P_\lambda(k|x) as

(6.10)  P_\lambda(k|x) = \frac{e^{\lambda \beta_k^T A}}{\sum_{j=1}^{K} e^{\lambda \beta_j^T A}}.
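This recalibration is a one-liner in code: shrinking all softmax weights by a common factor λ < 1 flattens the predicted distribution without changing the argmax (a minimal sketch; the variable names are ours):

```python
import numpy as np

def biased_probs(A, W, lam):
    """P_lambda(k | x) of (6.10): softmax of the scaled scores lam * W A."""
    z = lam * (W @ A)
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
W, A = rng.normal(size=(5, 8)), rng.normal(size=8)
print(biased_probs(A, W, 1.0).round(3))  # the network's own confidences
print(biased_probs(A, W, 0.5).round(3))  # flatter, less confident; same argmax
```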
Figure 18. The six sample texts, (a)-(f), used in the end-to-end evaluation.
from a modern corpus of 43M Unicode Telugu characters obtained from the internet. For this test, we do not recalibrate the probabilities as described in the previous section.
We get perfect recovery for the very well printed and scanned Nudi text that uses a modern font. However, it is not a realistic specimen. To better evaluate performance on our target documents, we test on fonts that have not been seen by the neural network. It can be seen from Table 5 that the network is not susceptible to data drift, as the error rates are not higher for unseen fonts (of Ksa and Annamaya). Our implementation of the recognition graph also gives us a significant improvement in performance, as it extends the scope of the problem to text with erasure of ink. We recover well from a scenario where as many as one in four glyphs is broken (as in Nannayya). We also see that the n-gram model further reduces the error rates for Telugu texts. However, it introduces more errors for the Sanskrit text Ramayana that is written in Telugu script. This is understandable, given that we use a corpus of modern Telugu to learn the trigram probabilities. One could adjust the recalibration parameter \lambda (of Section 6.3) according to how similar the scanned text is expected to be to modern internet Telugu. All in all, we think our system gives satisfactory performance over diverse-looking documents. While this performance might not be enough to replace a human transcriber, it is good enough to enable a search facility for a digital corpus of scanned texts.
employed by humans are far more complex (as studied in greater detail in Natural Language Processing).
Our framework does not generalize to scripts like Devanagari and Arabic. These are written with words as connected components, whereas our system relies heavily on being able to segment the printed text. Modern deep-learning techniques like Recurrent Neural Networks with Connectionist Temporal Classification (CTC) (Graves et al., 2009) overcome this design limitation: they do not need the data to be segmented in a way such that there is a one-to-one correspondence between input samples and output labels. While CTC broadens the scope of the problem, it is more difficult to train. It might also need a more complicated language model. It would be interesting to compare a CTC-based framework with ours as a reference.
References.
Abu-Mostafa, Y. S. (1990). Learning from hints in neural networks. Journal of Complexity 6 192–198.
Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N. and Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D. and Bengio, Y. (2010). Theano: a CPU and GPU Math Expression Compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy).
Bishop, C. M. (1995). Training with noise is equivalent to Tikhonov regularization. Neural Computation 7 108–116.
Bloomberg, D. (2007). Leptonica: An open source C library for efficient image processing, analysis and operation.
Ciresan, D. and Schmidhuber, J. (2013). Multi-column deep neural networks for offline handwritten Chinese character classification. arXiv preprint arXiv:1309.0261.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2 303–314.
Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H. and Schmidhuber, J. (2009). A novel connectionist system for unconstrained handwriting recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on 31 855–868.
He, K., Zhang, X., Ren, S. and Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. arXiv preprint arXiv:1502.01852.
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
Jawahar, C., Kumar, M. P. and Kiran, S. R. (2003). A bilingual OCR for Hindi-Telugu documents and its applications. In Seventh International Conference on Document Analysis and Recognition (ICDAR 2003) 408. IEEE.
Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 1097–1105.
Kumar, P. P., Bhagvati, C., Negi, A., Agarwal, A. and Deekshatulu, B. L. (2011). Towards improving the accuracy of Telugu OCR systems. In Document Analysis and Recognition (ICDAR), 2011 International Conference on 910–914. IEEE.
Lakshmi, C. V. and Patvardhan, C. (2002). A multi-font OCR system for printed Telugu text. In Language Engineering Conference, 2002. Proceedings 7–17. IEEE.
LeCun, Y., Bottou, L., Bengio, Y. and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 2278–2324.
Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10) 807–814.
Negi, A., Bhagvati, C. and Krishna, B. (2001). An OCR system for Telugu. In Document Analysis and Recognition, 2001. Proceedings. Sixth International Conference on 1110–1114. IEEE.
Pujari, A. K., Naidu, C. D., Rao, M. S. and Jinaga, B. (2004). An intelligent character recognizer for Telugu scripts using multiresolution analysis and associative memory. Image and Vision Computing 22 1221–1227.
Rajasekaran, S. and Deekshatulu, B. (1977). Recognition of printed Telugu characters. Computer Graphics and Image Processing 6 335–360.
Rao, P. and Ajitha, T. (1995). Telugu script recognition: a feature based approach. In Document Analysis and Recognition, 1995. Proceedings of the Third International Conference on 1 323–326. IEEE.
Simard, P. Y., Steinkraus, D. and Platt, J. C. (2003). Best practices for convolutional neural networks applied to visual document analysis. In Seventh International Conference on Document Analysis and Recognition (ICDAR 2003) 2 958–963. IEEE Computer Society.
Simard, P. Y., LeCun, Y. A., Denker, J. S. and Victorri, B. (1998). Transformation invariance in pattern recognition: tangent distance and tangent propagation. In Neural Networks: Tricks of the Trade 239–274. Springer.
Sukhaswami, M., Seetharamulu, P. and Pujari, A. K. (1995). Recognition of Telugu characters using neural networks. International Journal of Neural Systems 6 317–357.
Viterbi, A. J. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. Information Theory, IEEE Transactions on 13 260–269.
Wager, S., Wang, S. and Liang, P. S. (2013). Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems 351–359.