Differentiation Among Scripts On The Basis of Histogram: Volume 2, Issue 3, May - June 2013
Abstract:
India is home to many languages, and each script has its own distinguishing features. Our main aim is to extract those features so that the different scripts can be told apart. For this we use a histogram technique. The key to this approach is the left profile and right profile generated from the histogram of each script. These profiles distinguish one script from another, and the maxima and minima (peaks) they contain provide the information needed to identify a script effectively.
component level or text line level, and global methods, where script identification is treated as a texture classification problem.
3. Multi-Script Identification
Most of the existing literature on script identification focuses either on developing new approaches or on improving existing approaches that work for a specific application or for specific script classes. We need to increase the compatibility of such systems with multi-script documents. Several important factors need attention when designing a solution to this problem.
3.1 These important factors are:
(a) complexity in calculating basic differentiating features such as the left profile and right profile;
(b) complexity in feature extraction and classification;
(c) computational speed on multi-script input.
Currently, individual approaches are designed to deal effectively with some of the factors listed above, but not all. None of them has shown the potential to become a generalized script identification scheme.
1. INTRODUCTION
The main problem in this field is that when a single document is written in several languages, existing systems are able to identify only one of them. After processing, the system produces output for that one language and ignores the others, generating junk output for them. Such a system is therefore not capable of multi-script processing.
2. Previous work
2.1 Weighted Euclidean distance and Gaussian Mixture Model
Weighted Euclidean distance and Gaussian mixture models have been used to distinguish English from Arabic, Chinese, Korean and Hindi scripts. Using cluster-based templates, an automatic script identification technique has been described by Hochberg et al. [2]. Using fractal-based texture features, Tan [3] described an automatic method for the identification of Chinese, English, Greek, Russian, Bengali and Persian text. All of the above-mentioned works deal primarily with non-Indian scripts.
2.2 Script Identification using local and Global method
Early determination of the language used in a document can greatly facilitate further processing, such as character recognition, document image indexing and translation. Many algorithms have been proposed in past years, which can generally be divided into two categories: local methods, where local features are extracted at connected
4. Contributed Work
If a multi-script document is given to an existing processing system, the system is not able to differentiate among the scripts. It produces junk output for every script except the one for which the system was built. I would like to develop a mechanism that helps the processing system identify the scripts in a multi-script document. The input scripts, shown in Fig. 1a and Fig. 1b, are:
1. English
2. Devnagari
3. Bengali
Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com Volume 2, Issue 3, May June 2013 ISSN 2278-6856
head-line in Devnagari and Bengali script. We say that the headline feature exists in a word if either of the following two conditions is satisfied: (a) the length of the longest run is greater than 70% of the width of the word; (b) the length of the longest run is greater than twice the height of the middle zone (x-height). The x-height is shown with a dotted line in Fig. 2a, and the histogram is shown in Fig. 2b.
Figure 2a. The x-height is shown with a dotted line in a Bengali and a Devnagari word.
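The two headline conditions can be illustrated with a minimal Python sketch. It assumes the word has already been binarized into a 0/1 NumPy array in which 1 marks a black (foreground) pixel, and that the x-height is supplied by the caller in pixels. The names `longest_run` and `has_headline` are hypothetical and are not taken from the paper.

```python
import numpy as np

def longest_run(row):
    """Length of the longest run of black pixels (1s) in one image row."""
    best = current = 0
    for pixel in row:
        current = current + 1 if pixel else 0
        best = max(best, current)
    return best

def has_headline(word, x_height):
    """Apply conditions (a) and (b) to a binarized word image.

    word     : 2-D 0/1 array, 1 = black pixel (assumed convention)
    x_height : height of the middle zone in pixels (assumed to be known)
    """
    width = word.shape[1]
    run = max(longest_run(row) for row in word)
    return run > 0.7 * width or run > 2 * x_height

# Toy word with a head-line: the top row is fully black.
headline_word = np.zeros((6, 10), dtype=np.uint8)
headline_word[0, :] = 1
headline_word[1:, 2] = 1

# Toy word without a head-line: only short, separated strokes.
plain_word = np.zeros((6, 10), dtype=np.uint8)
plain_word[:, 1:3] = 1
plain_word[:, 6:8] = 1

print(has_headline(headline_word, x_height=4))  # True
print(has_headline(plain_word, x_height=4))     # False
```

In the first toy word the longest run is 10 pixels, which exceeds 70% of the width (7), so condition (a) holds. In the second the longest run is 2 pixels, so neither condition holds.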
Fig. 1a and 1b. Input scripts.
Steps followed for script differentiation:
Input Scripts
Histogram Generation
Figure 2b. Histogram.
4.4 Use of Histogram
The histogram provides the left and right profiles for a multi-script document. Suppose each character is located within a rectangular boundary, a frame. The horizontal or vertical distances from any one side of the frame to the character edge form a group of parallel lines, which we call the profile. If we compute the left and right profiles of the characters in a text line, we notice distinct differences among some of the scripts. Some of the principal features used in our identification scheme are as follows.
4.5 Functions of Histogram
Here we use the concept of head-line features. If we take the longest horizontal run of black pixels over the rows of a text line, this run length for Bangla, Devnagari and Gurumukhi script is much higher than that of other scripts, because the characters in a word are connected by a head-line in these scripts. For illustration, see Fig. 3, where the row-wise maximum run is shown in the left part. This run information has been used to separate Bangla, Devnagari and Gurumukhi lines from other text lines. If we compute the horizontal projection profile of the text lines, we note some distinct features among some of the scripts. For example, in some scripts (like Malayalam, Kannada and English) we get two
Feature Extraction
4.1 Image Binarization
The method works on a binary image. A binary image is a digital image that has only two possible values for each pixel. Typically the two colors used for a binary image are black and white, though any two colors can be used. The color used for the object in the image is the foreground color, while the rest of the image is the background color. In the document scanning industry this is often referred to as bitonal.
4.2 Noise Removal
Noise is removed so that the "original" image is discernible and blurring of the image is reduced.
4.3 Histogram
A histogram is a graphical representation of the distribution of data. Here the histogram helps to find the horizontal runs of black pixels on the rows of a text line. This run length for Devnagari and Bengali script is much higher than that of English script. This is because, most of the time, the characters in a word are connected by
prominent local maxima, whereas in some other scripts (like Telugu, Urdu and Kashmiri) we obtain only one peak, as shown on the right side of Fig. 3. From the projection profile of an Urdu text line it can be noted that the peak occurs in the lower half of the text line. In text lines of English, Kannada and Malayalam scripts, one peak occurs in the upper half and the other peak occurs in the lower half of the text line.
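The peak behaviour described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the text line is a 0/1 NumPy array with 1 for black pixels, and a peak counts as "prominent" when it reaches at least 60% of the tallest row count. That threshold (`rel_height`) and all function names are hypothetical choices, not values given in the paper.

```python
import numpy as np

def horizontal_profile(line):
    """Number of black pixels in each row of a binarized text line."""
    return line.sum(axis=1)

def prominent_peaks(profile, rel_height=0.6):
    """Row indices of local maxima that reach rel_height * max(profile).

    rel_height is an assumed threshold. Flat plateaus are reported once,
    at their middle row.
    """
    values = [int(v) for v in profile]
    if not values or max(values) == 0:
        return []
    threshold = rel_height * max(values)
    peaks = []
    i, n = 0, len(values)
    while i < n:
        j = i
        while j + 1 < n and values[j + 1] == values[i]:
            j += 1  # extend over a flat plateau
        left = values[i - 1] if i > 0 else -1
        right = values[j + 1] if j + 1 < n else -1
        if values[i] >= threshold and values[i] > left and values[i] > right:
            peaks.append((i + j) // 2)
        i = j + 1
    return peaks

def peak_halves(line):
    """Label each prominent peak as lying in the upper or lower half."""
    middle = line.shape[0] / 2
    profile = horizontal_profile(line)
    return ["upper" if p < middle else "lower" for p in prominent_peaks(profile)]

def line_from_counts(counts, width=10):
    """Build a toy text line whose r-th row has counts[r] black pixels."""
    line = np.zeros((len(counts), width), dtype=np.uint8)
    for r, c in enumerate(counts):
        line[r, :c] = 1
    return line

two_peak_line = line_from_counts([2, 8, 2, 2, 2, 2, 9, 2])  # English-like toy
one_peak_line = line_from_counts([1, 1, 1, 1, 2, 9, 3, 1])  # Urdu-like toy

print(peak_halves(two_peak_line))  # ['upper', 'lower']
print(peak_halves(one_peak_line))  # ['lower']
```

The toy lines are synthetic and only mimic the shape of the profiles described in the text. On real scans the threshold would need tuning, and the profile would usually be smoothed first.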
References
[1] Jan Neumann, Hanan Samet, and Aya Soffer, "Integration of local and global shape analysis for logo classification," Pattern Recognition Letters, vol. 23, no. 12, pp. 1449-1457, 2002.
[2] J. Hochberg, P. Kelly, T. Thomas and L. Kerns, "Automatic script identification from document images using cluster-based templates," IEEE PAMI, vol. 19, pp. 176-181, 1997.
[3] T. N. Tan, "Rotation invariant texture features and their use in automatic script identification," IEEE Trans. PAMI, vol. 20, pp. 751-756, 1998.
[4] S. Sinha, U. Pal and B. B. Chaudhuri, "Wordwise Identification from Indian documents," Lecture Notes on Computer Science (LNCS-3136), Eds. S. Marinai and A. Dengel, pp. 310-321, 2004.
[5] B. B. Chaudhuri and U. Pal, "Skew angle detection of digitized Indian Script documents," IEEE PAMI, vol. 19, pp. 182-186, 1997.
[6] J. D. Hobby, "Using shape and layout information to find signatures, text, and graphics," Computer Vision and Image Understanding, vol. 80, pp. 88-110, 2000.
[7] U. Pal and B. B. Chaudhuri, "Automatic separation of machine printed and hand-written text lines," in Proceedings of the International Conference on Document Analysis and Recognition, 1999, pp. 645-648.
[8] U. Pal and B. B. Chaudhuri, "Identification of different script lines from multi-script documents," Image and Vision Computing, vol. 20, no. 13-14, pp. 945-954, 2002.
[9] T. F. Wu, C. J. Lin, and R. C. Weng, "Probability Estimates for Multi-class Classification by Pairwise Coupling," Journal of Machine Learning Research, vol. 5, pp. 975-1005, 2004.
[10] U. Pal, S. Sinha and B. B. Chaudhuri, "Multi-Script Line identification from Indian Documents," in Proc. 7th ICDAR, pp. 880-884, 2003.
Fig. 3. Different Indian script lines (from top to bottom: Devnagari, Bangla, Gurumukhi, Malayalam, Kannada, English, Tamil, Telugu, Urdu, Kashmiri, Gujrathi, Oriya) with their row-wise maximum run (left side) and horizontal profile (right side).
4.6 Left and right profile
Suppose each character is located within a rectangular boundary, a frame. The horizontal or vertical distances from any one side of the frame to the character edge form a group of parallel lines, which we call the profile. If we compute the left and right profiles of the characters in a text line, we notice distinct differences among some of the scripts. For example, in both the left and right profiles of Malayalam script, most of the characters have one transition point because of their concave shape. By transition we mean a change of the profile from increasing to decreasing or vice versa. In some other scripts, such as Gujrathi, Tamil and English, this behavior is absent. Similarly, if we consider the profile from the top, we notice one transition point in most Oriya characters. We use the left and right profile features for the identification of Malayalam script, and the top profile feature for the identification of Oriya script.
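A minimal Python sketch of the left and right profiles and of transition counting is given below. It assumes each character has already been cropped to its frame and binarized as a 0/1 NumPy array with 1 for black pixels. Rows containing no black pixel are skipped, and flat steps in the profile are ignored when counting transitions. Both are assumptions of this sketch, and the function names are hypothetical.

```python
import numpy as np

def left_profile(char):
    """Distance from the left frame edge to the first black pixel, per row."""
    return [int(np.argmax(row)) for row in char if row.any()]

def right_profile(char):
    """Distance from the right frame edge to the last black pixel, per row."""
    return [int(np.argmax(row[::-1])) for row in char if row.any()]

def count_transitions(profile):
    """Count changes between increasing and decreasing mode.

    Flat steps (equal neighbouring values) are ignored, which is an
    assumption of this sketch.
    """
    transitions = 0
    previous = 0
    for a, b in zip(profile, profile[1:]):
        direction = (b > a) - (b < a)  # +1 rising, -1 falling, 0 flat
        if direction == 0:
            continue
        if previous != 0 and direction != previous:
            transitions += 1
        previous = direction
    return transitions

# Toy concave character: both side profiles change direction once.
concave = np.array([
    [0, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
], dtype=np.uint8)

# Toy vertical stroke: the profiles are flat, so there is no transition.
stroke = np.zeros((5, 4), dtype=np.uint8)
stroke[:, 1] = 1

print(left_profile(concave), count_transitions(left_profile(concave)))    # [2, 1, 0, 1, 2] 1
print(right_profile(concave), count_transitions(right_profile(concave)))  # [0, 2, 3, 2, 0] 1
print(count_transitions(left_profile(stroke)))                            # 0
```

A top profile can be obtained in the same way by passing the transposed array (`char.T`) to `left_profile`.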
5. Conclusions
Separation or identification of different scripts is a very important step. Here we use a histogram approach that focuses on differentiating features. Using the histogram is intended to give a more efficient and robust method for script identification, with higher accuracy. The main idea of using the horizontal projection profile (HPP) is to exploit head-line features so that multilingual script identification can be completed properly, without producing junk output for non-common languages.