
An Attempt To Recognize Handwritten Tamil Character Using Kohonen SOM


Int. J. of Advanced Networking and Applications
Volume: 01 Issue: 03 Pages: 188-192 (2009)

R.Indra Gandhi
Research Scholar, Department of Computer Science, Mother Teresa Women’s University,
Kodaikanal - 624 102, TN, India
Email: shambhavi.rajesh@gmail.com
Dr.K.Iyakutti
CSIR Emeritus Scientist, School of Physics,
Madurai Kamaraj University, Madurai – 625 02, TN, India
Email: iyakutti@yahoo.co.in

ABSTRACT
This paper presents a new approach to Tamil character recognition based on the Kohonen neural network Self-Organizing Map (SOM) algorithm, which provides much higher performance than a traditional neural network. The approach has two steps. Step 1: a pre-classification stage reduces the number of possible candidates for an unknown handwritten Tamil character to a subset of the total character set. This amounts to clustering, in that the algorithm tries to group similar characters together. Step 2: members of the pre-classified group are further analyzed using a statistical classifier for final recognition. A recognition rate of around 79.9% was achieved for the first choice and more than 98.5% for the top three choices. The results show that the proposed Kohonen SOM algorithm yields promising output and is a feasible alternative to other existing techniques.

Keywords: Handwritten character, SOM, Baseline, Statistical, Structural, Crux, Meticulous and Sobel edge detection.

Date of Submission: August 08, 2009 Revised: November 11, 2009 Accepted: November 18, 2009

1. INTRODUCTION
During the last four decades, the field of character recognition has received significant attention from research workers in diverse areas such as the conversion of handwritten or printed documents to an editable soft format, recognition of postal addresses for automated postal systems, data and word processing, data acquisition from bank checks, and processing of archived institutional records. Most of the work done in the field of character recognition is confined to Roman [1], English [2,3], Urdu [4,5], and Chinese / Japanese languages [6,7,8]. More recently, some efforts have been reported in the literature for Devanagari [9,10], Bangla [11,18], Telugu [12,13,14], and Tamil [15,16,17] scripts. Most character recognition techniques are problem-oriented: techniques are devised for the recognition of a particular script depending upon the nature and complexity of its characters. Broadly speaking, the features can be physical, topological, mathematical or statistical in nature. The strategies used for recognition can be broadly classified into structural, statistical and hybrid.

Structural techniques use qualitative measurements as features. Statistical techniques use quantitative measurements. In the hybrid approach, these two techniques are combined at appropriate stages, first for the representation of characters and then for utilizing them in recognition. In this paper we use a hybrid technique in which structural properties of the text line are used for the first stage of preliminary classification. A statistical classifier then recognizes the unknown character as one of the members of the pre-classified group.

2. TAMIL LANGUAGE
Tamil is one of the 16 major national languages and is spoken by South Indians. Most Tamil letters have circular shapes, partially because they were originally carved with needles on palm leaves, a technology that favored rounded shapes. The writing of Tamil is a combination of alphabetical and syllabic systems. The Tamil script is used to write the Tamil language in Tamil Nadu state of India, Sri Lanka, Singapore and parts of Malaysia, as well as to write minority languages such as Badaga [19]. Compared to other Indian languages it has a relatively small number of pure consonants and vowels.

The Tamil alphabet has thirty basic letters, of which 18 are consonants and 12 are vowels. In addition there are 216 combinations of consonants and vowels, which are either compound letters or syllables. From a character recognition point of view, only 67 symbols have to be identified to recognize all 247 letters (the 30 basic letters, the 216 combinations and the special character aytham) [17]. We have considered these 67 symbols of the Tamil alphabet for our study.
3. TENTATIVE SYSTEM
Most recognition systems are composed of two basic subparts: feature extraction and classification. Feature extraction deals with basic operations such as acquisition, noise reduction, scaling and segmentation. Classification, on the other hand, is the recognition itself. The aim of preliminary classification is to reduce the number of possible candidates for an unknown character to a subset of the total character set.

4. DATA COLLECTION
Data samples were collected from different writers on documents of any size. First, the input data are resized to 250 X 250 pixels to satisfy the procedure, regardless of whether the input is an image of a single character or a word. The system was trained with both computer-generated images and scanned images of text, whether a single character or a word. In preprocessing, noise is removed from the image by a spatial filter. It should be noted that no skew correction was done, so the scanning process is expected to be of high quality. The quality of the image is a major factor in the performance of the system.

5. SEGMENTATION
The text area, which may consist of multiple lines, is extracted from the document and then segmented. Each line is segmented into individual words, and finally each word is segmented into individual characters. Line segmentation is based on the horizontal projection profile, whose zero-valued stretches correspond to the horizontal gaps between text lines. Each text line is identified using two reference lines known as the upper line and the lower line. They correspond to the minimum and maximum zero-value positions adjoining a text line, respectively (see Fig. 1). The first derivative of the horizontal projection profile is calculated for each segmented text line. The lines drawn across the two peaks in Fig. 1 indicate the two baselines.

Fig.1 Reference Line Identifications

A pre-formatted paper for the collection of handwriting was used to guide the writer and simplify the process of reference line extraction. Each document has the four reference lines printed on it. However, these lines are completely eliminated during the binarization of the image and have no effect on the segmentation. After the reference lines have been found, words and characters are extracted using the vertical projection profile of each text line. Word boundaries and character boundaries are distinguishable since the former are much wider than the latter. Once all the characters have been segmented, the minimum bounding box of each character is identified, eliminating the white space around it. The upper and lower boundary values of the minimum bounding box, along with the four reference lines, are sent to the next stage for preliminary classification.

6. PRELIMINARY CLASSIFICATION
The aim of this classification is to reduce the number of possible characters for an unknown character from the known set (see Fig. 2).

Fig.2 Crux and Exhaustive character

The characters are categorized into two groups. Characters that lie between the two baselines are categorized into the crux character group, while characters that cross a baseline form the exhaustive group. The exhaustive group is further divided into two sub-groups for easy recognition.
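As a rough illustration of the projection-profile segmentation and the baseline-based grouping described in Sections 5 and 6, the following Python sketch may help. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`runs`, `segment_lines`, `segment_chars`, `ink_rows`, `preclassify`) are hypothetical, the image is assumed to be a binary NumPy array with ink = 1 and row 0 at the top, the two baselines are assumed to be already known, and word-level segmentation is omitted. The labels "ascending" and "descending" stand for the two exhaustive sub-groups (characters crossing the upper and the lower baseline, respectively).

```python
import numpy as np

def runs(profile):
    """Return (start, end) pairs, end exclusive, of consecutive non-zero entries."""
    on = np.concatenate(([0], (np.asarray(profile) > 0).astype(int), [0]))
    edges = np.flatnonzero(np.diff(on))
    return list(zip(edges[::2].tolist(), edges[1::2].tolist()))

def segment_lines(page):
    """Text lines are runs of rows whose horizontal projection (row sum) is non-zero."""
    return runs(page.sum(axis=1))

def segment_chars(line_img):
    """Characters are runs of columns whose vertical projection (column sum) is non-zero."""
    return runs(line_img.sum(axis=0))

def ink_rows(char_img):
    """Top and bottom rows (inclusive) of the minimum bounding box."""
    rows = np.flatnonzero(char_img.sum(axis=1))
    return int(rows[0]), int(rows[-1])

def preclassify(top, bottom, upper_base, lower_base):
    """Assumed rule: test the upper baseline first, then the lower one."""
    if top < upper_base:
        return "ascending"    # exhaustive, crosses the upper baseline
    if bottom > lower_base:
        return "descending"   # exhaustive, crosses the lower baseline
    return "crux"             # lies between the two baselines

# Synthetic one-line page with three block "characters" (illustrative data only).
page = np.zeros((12, 12), dtype=int)
page[4:7, 1:3] = 1   # occupies rows 4-6
page[2:7, 4:6] = 1   # occupies rows 2-6
page[4:9, 7:9] = 1   # occupies rows 4-8
upper_base, lower_base = 4, 6   # assumed to come from reference line identification

labels = []
for r0, r1 in segment_lines(page):
    line = page[r0:r1]
    for c0, c1 in segment_chars(line):
        top, bottom = ink_rows(line[:, c0:c1])
        labels.append(preclassify(top + r0, bottom + r0, upper_base, lower_base))
print(labels)   # ['crux', 'ascending', 'descending']
```

A character that crosses both baselines would be labelled "ascending" here; that tie-break is an assumption of the sketch, since the paper does not discuss this case.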
Table 1. Primary Classification of Two groups

"Ascending exhaustive characters" are those that cross the upper baseline, and "descending exhaustive characters" are those that cross the lower baseline. Table 1 lists all the characters under consideration, classified into the above pre-classification groups. Characters belonging to other groups, such as numbers and Sanskrit-based characters, are assumed to be invalid matches and are not considered for recognition.

7. FEATURE EXTRACTION
Feature extraction is one of the most important parts of the character recognition procedure. Here, vectors are created from the binary images. All the segmented character images are scaled to a common height and width (32 X 32 pixels) using a bilinear interpolation technique. Usually some unwanted portions are included in the image. This can be corrected by the Sobel edge detection algorithm, using the Sobel mask, which makes the feature detection process easier. Moreover, median filtering is applied to the samples, which increases the efficiency of the process.

8. RECOGNITION PROCESS
The extensive work done in the pre-processing stages makes this stage much easier to carry out. Self-organizing feature maps (SOFM or SOM) are an unsupervised machine learning method that learns by self-organization and competition [20]. The main idea of the preceding stages is to make the input simple and acceptable for the Kohonen SOM, which saves a remarkable amount of time. The SOM clusters the input vectors by comparing each input vector with the neuron weight vectors according to some measure (e.g. Euclidean distance), so that the neuron whose weight vector is closest to the input vector comes out as the winning neuron. However, instead of updating only the winning neuron, all neurons within a certain neighborhood of the winning neuron are updated using the Kohonen rule [20].

The algorithm is described as follows. Suppose the training set has sample vectors $X$. Training the SOM network has the following steps:

i) Firstly, all neuron node weights $W_j(1)$, $j = 1 \dots L$, are initialized randomly. $L$ is the number of neurons in the output layer.
ii) Let $K$ be the maximum number of iteration steps. For iteration step $k = 1 \dots K$, get an input vector $X(k)$ randomly or in order.
iii) Calculate the distance between $X(k)$ and each weight vector $W_j(k)$, $j = 1 \dots L$, where $1 \dots L$ refers to the neuron nodes.
iv) Select the winner output neuron $j^*$ with minimum distance.
v) Update the weights $W_j(k+1)$ of neuron $j^*$ and its neighborhood:

$$W_j(k+1) = W_j(k) + \alpha(k+1)\,\Lambda\big(j, j^*(k+1), k+1\big)\,\big[X(k+1) - W_j(k)\big], \quad j = 1 \dots L$$

vi) If $k < K$, go to step (ii); otherwise stop.

In this algorithm, $\alpha(k)$ is a step-size function that decreases monotonically with $k$, and $\Lambda(j, j^*(k), k)$ is the neighborhood function. It is formulated as follows:

$$\Lambda\big(j, j^*(k), k\big) = \exp\left(-\frac{d^2_{j,j^*}(k)}{2\sigma^2(k)}\right)$$

where $\sigma(k)$ defines the width of the neighborhood, which decreases monotonically in time, and $d_{j,j^*}(k)$ is the Euclidean distance between the neuron $j$ to be adjusted and the winner neuron $j^*$.

9. TENTATIVE RESULT
The experimental data are divided into two distinct sets: a training set of 200 samples and a testing set of 800 samples. In the experiment, a total of 100 text lines were subjected to segmentation and reference line identification. We conducted several tests using various portions of the training data, to see how well the system represents the data it has been trained on. In all cases, every character in each text line was correctly segmented. The reference line identification was almost 98.5% accurate, resulting in only 1% pre-classification error. Results of the recognition process are given in Table 2.

The Kohonen SOM shows very good promise, especially compared to other neural network based approaches. Not only is the accuracy rate consistently higher, but the time needed to train and to recognize is also lower, as Kohonen networks do not have hidden layers.
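The training steps i) to vi) of Section 8 can be sketched in Python with NumPy as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the function names `train_som` and `best_match`, the one-dimensional map of 16 neurons, the linear decay schedules for the step size and the neighborhood width, and the toy binary input vectors are all illustrative choices that the paper does not specify. The neighborhood distance is taken here as the distance between neuron positions on the map.

```python
import numpy as np

def train_som(X, n_neurons=16, n_steps=2000, alpha0=0.5, sigma0=4.0, seed=0):
    """Train a 1-D Kohonen map; returns the weight matrix (n_neurons x dim)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    W = rng.random((n_neurons, X.shape[1]))      # step i: random initial weights
    pos = np.arange(n_neurons)                   # neuron positions on the map (assumed 1-D)
    for k in range(n_steps):                     # step ii: iterate over the K steps
        x = X[rng.integers(len(X))]              # pick an input vector at random
        frac = 1.0 - k / n_steps
        alpha = alpha0 * frac                    # step size, decreasing with k (assumed linear)
        sigma = max(sigma0 * frac, 0.5)          # neighborhood width, decreasing with k
        dist = np.linalg.norm(W - x, axis=1)     # step iii: distance to every neuron
        j_star = int(np.argmin(dist))            # step iv: winner neuron
        d2 = (pos - pos[j_star]) ** 2            # squared map distance to the winner
        h = np.exp(-d2 / (2.0 * sigma ** 2))     # Gaussian neighborhood function
        W += alpha * h[:, None] * (x - W)        # step v: update winner and neighbors
    return W                                     # step vi: loop ends after K steps

def best_match(W, x):
    """Index of the neuron whose weight vector is closest to x."""
    return int(np.argmin(np.linalg.norm(W - np.asarray(x, dtype=float), axis=1)))

# Toy binary feature vectors standing in for flattened character images.
a = [1, 1, 1, 1, 0, 0, 0, 0]
b = [0, 0, 0, 0, 1, 1, 1, 1]
X = np.array([a] * 20 + [b] * 20)
W = train_som(X)
print(W.shape, best_match(W, a), best_match(W, b))
```

In the paper's setting the input vectors would come from the 32 X 32 character images of Section 7; the eight-element vectors above are only stand-ins to keep the example small.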
Table 2. Recognition Process Result (Tentative System)

Sample Data                      Total    Test1    Test2    Test3
Tested Set     Number Tested     800      639.0    104.0    32.0
               % Tested                   79.9     92.9     96.9
Trained Set    Number Trained    200      179.0    15.0     3.0
               % Trained                  89.5     97.0     98.5

The percentage rows are cumulative over the three tests: for the tested set, 639/800 = 79.9%, (639 + 104)/800 = 92.9% and (639 + 104 + 32)/800 = 96.9%. The trained set follows the same pattern (179/200 = 89.5%, 194/200 = 97.0%, 197/200 = 98.5%).

10. CONCLUSION
We investigated a new representation for Tamil character recognition and used Kohonen SOM techniques to classify handwritten and printed Tamil characters efficiently. More effective and efficient feature detection techniques will make the system more powerful. Some problems in recognition remain. They arise during letter segmentation and with abnormally written characters, which mislead the system during recognition. Misrecognition could be avoided by using a word dictionary to look up possible character compositions. The presence of contextual knowledge will help to eliminate the ambiguity. We show that, in practice, the proposed approach produces near-optimal results and outperforms the other existing methodologies. Our future work in this regard will be to analyze the features of joined letters and to improve segmentation accuracy. The results indicate that the approach can be used for character recognition in other Indic scripts as well.

CONTRIBUTION
The algorithm presented in this paper is introduced here for the first time for Tamil character recognition. Another advantage of using this SOM model is that it captures the invariant features of the Tamil script. Unlike other neural networks it does not have any hidden layer. Only two layers are needed: one for input and the other for output. This is useful for visualizing a mapping from the higher-dimensional input space to the lower-dimensional map space.

REFERENCES
[1] C. E. Dunn and P. S. P. Wang, "Character segmentation techniques for handwritten text - a survey", in the Proceedings of 11th ICPR, Vol. 2, pp. 577-580, 1992.

[2] R. M. Bozinovic and S. N. Srihari, "Off-line cursive script word recognition", IEEE Trans. on Pattern Anal. Mach. Intell., vol. 11, no. 1, pp. 68-83, Jan. 1989.

[3] Hu, M. K. Brown and W. Turin, "HMM based on-line handwriting recognition", IEEE Trans. on Pattern Anal. Mach. Intell., vol. 18, no. 10, pp. 1039-1045, Oct. 1996.

[4] U. Pal and B. B. Chaudhuri, "Indian script character recognition: a survey", Pattern Recognition, Vol. 37(9), pp. 1887-1899.

[5] http://en.wikipedia.org/wiki/Official_languages_of_India

[6] D. Deng, K. P. Chan, and Y. Yu, "Handwritten Chinese character recognition using spatial Gabor filters and self-organizing feature maps", Proc. IEEE Inter. Confer. On Image Processing, vol. 3, pp. 940-944, Austin TX, June 1994.

[7] C-H. Chang, "Simulated annealing clustering of Chinese words for contextual text recognition", Pattern Recognition Letters, vol. 17, no. 1, pp. 57-66, 1996.

[8] H. Yamada, K. Yamamoto, and T. Saito, "A non-linear normalization method for handprinted Kanji character recognition - line density equalization", Pattern Recognition, vol. 23, no. 9, pp. 1023-1029, 1990.

[9] S. D. Connell, R. M. K. Sinha and A. K. Jain, "Recognition of unconstrained On-line Devanagari characters", in the Proceedings of 15th International Conference on Pattern Recognition (ICPR), Vol. 2, Spain, pp. 368-371, 2000.

[10] S. D. Connell and A. K. Jain, "Template-based online character recognition", Pattern Recognition, Vol. 34(1), pp. 1-14, 2001.

[11] A. K. Ray and B. Chatterjee, "Design of a nearest neighbor classifier system for Bengali character recognition", J. Inst. Elec. Telecom. Engg., Vol. 30, pp. 226-229, 1984.

[12] S. N. S Rajasekaran and B. L. Deekshatulu, "Recognition of printed Telugu characters", Computer Graphics and Image Processing (CGIP), Vol. 6, pp. 335-360, 1977.

[13] C. V. Lakshmi and C. Patvardhan, "A high accuracy OCR system for printed Telugu text", in the Proceedings of Conference on Convergent Technologies for Asia-Pacific Region (TENCON 2003), Vol. 2, pp. 725-729, 2003.

[14] A. Negi, C. Bhagvati and B. Krishna, "An OCR system for Telugu", in the Proceedings of the Sixth International Conference on Document Processing, pp. 1110-1114, 2001.

[15] P. Chinnuswamy, S.G. Khrishnamoorthy, "Recognition of handprinted Tamil characters", Pattern Recognition, vol. 12, pp. 141-152, 1980.

[16] N. Damayanthi, P. Thangavel, "Handwritten Tamil character recognition using Neural Network", Proc. The Tamil Internet 2000 Conference, Singapore, July 2000.

[17] R.M. Suresh, S. Arumugam and K.P. Aravanan, "Recognition of handwritten Tamil characters using fuzzy classificatory approach", Proc. The Tamil Internet 2000 Conference, Singapore, July 2000.

[18] B. B. Chaudhuri and U. Pal, "A complete printed Bangla OCR system", Pattern Recognition, vol. 31, no. 5, pp. 531-549, 1997.

[19] The Unicode Consortium, The Unicode Standard 3.0, Harlow: Addison Wesley publishers, 2000.

[20] T. Kohonen, "The Self-organizing map", Proc. IEEE, vol. 78, no. 9, pp. 1464-1480, 1990.

Authors Biography

Indra Gandhi Raman is a Research Scholar at Mother Teresa Women's University, Kodaikanal. She has completed her M.Sc. (Mathematics), M.C.A., M.T.M. and M.Phil. in Computer Science. Currently she is doing her research in neural networks. Her areas of interest are AI, Neural Networks, Kohonen's Neural algorithm and Software Engineering. She is very much interested in developing OCR for Tamil characters, especially for distorted characters. She can be reached at shambhavi.rajesh@gmail.com

Iyakutti Kombiah is a CSIR Emeritus Scientist, School of Physics, Madurai Kamaraj University, Madurai, India. His research interests are Computational Physics and Software Engineering. Contact him at iyakutti@yahoo.co.in
