Abstract
Without a publicly available database, research on script identification cannot advance, nor can new methods be compared fairly with the state of the art. To bridge this gap, we present a database of eleven Indic scripts from thirteen official languages for script identification in multi-script document images. The database comprises 39K words distributed equally across languages (i.e., 3K words per language). We also study three pertinent features, spatial energy (SE), wavelet energy (WE) and the Radon transform (RT), along with their possible combinations, using three classifiers: multilayer perceptron (MLP), fuzzy unordered rule induction algorithm (FURIA) and random forest (RF). In our tests, using all features, MLP performs best, with bi-script accuracies of 99.24% (keeping Roman common) and 98.38% (keeping Devanagari common), and a tri-script accuracy of 98.19% (keeping both Devanagari and Roman common).
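To make the scale of the study concrete, the following is a minimal sketch that enumerates the feature combinations and classifier pairings named in the abstract and checks the database size. It assumes that every non-empty subset of the three features is paired with each classifier; the abstract does not state that the full grid was evaluated, and the function names `feature_subsets` and `experiment_grid` are hypothetical.

```python
from itertools import combinations

# Abbreviations are taken from the abstract; the layout is illustrative only.
FEATURES = ("SE", "WE", "RT")         # spatial energy, wavelet energy, Radon transform
CLASSIFIERS = ("MLP", "FURIA", "RF")  # the three classifiers named in the abstract

LANGUAGES = 13
WORDS_PER_LANGUAGE = 3_000


def feature_subsets(features):
    """Return every non-empty combination of the given features."""
    return [
        combo
        for size in range(1, len(features) + 1)
        for combo in combinations(features, size)
    ]


def experiment_grid(features, classifiers):
    """Pair each feature subset with each classifier (assumed full grid)."""
    return [
        (subset, clf)
        for subset in feature_subsets(features)
        for clf in classifiers
    ]


total_words = LANGUAGES * WORDS_PER_LANGUAGE
grid = experiment_grid(FEATURES, CLASSIFIERS)

if __name__ == "__main__":
    print(f"total words: {total_words}")
    for subset, clf in grid:
        print("+".join(subset), "->", clf)
```

Under this assumption there are seven feature subsets and 21 feature-classifier configurations; the best result reported in the abstract corresponds to the pairing of all three features with MLP.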
Copyright information
© 2017 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Obaidullah, S.M., Santosh, K.C., Halder, C., Das, N., Roy, K. (2017). Word-Level Thirteen Official Indic Languages Database for Script Identification in Multi-script Documents. In: Santosh, K., Hangarge, M., Bevilacqua, V., Negi, A. (eds) Recent Trends in Image Processing and Pattern Recognition. RTIP2R 2016. Communications in Computer and Information Science, vol 709. Springer, Singapore. https://doi.org/10.1007/978-981-10-4859-3_2
DOI: https://doi.org/10.1007/978-981-10-4859-3_2
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-4858-6
Online ISBN: 978-981-10-4859-3
eBook Packages: Computer Science (R0)