Word-Level Thirteen Official Indic Languages Database for Script Identification in Multi-script Documents

  • Conference paper
Recent Trends in Image Processing and Pattern Recognition (RTIP2R 2016)

Abstract

Without a publicly available database, research cannot advance, and new methods cannot be compared fairly with the state of the art. To bridge this gap, we present a database of eleven Indic scripts covering thirteen official languages, intended for script identification in multi-script document images. The database comprises 39K words distributed equally across languages (3K words per language). We also study three features, spatial energy (SE), wavelet energy (WE) and the Radon transform (RT), along with their possible combinations, using three classifiers: multilayer perceptron (MLP), fuzzy unordered rule induction algorithm (FURIA) and random forest (RF). In our tests, with all features combined, MLP performs best, with bi-script accuracies of 99.24% (keeping Roman common) and 98.38% (keeping Devanagari common), and a tri-script accuracy of 98.19% (keeping both Devanagari and Roman common).
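
To make the wavelet energy (WE) feature more concrete, the following is a minimal sketch of how sub-band energies might be computed for a word image. The abstract does not state which wavelet family, decomposition level or normalisation is used, so the single-level orthonormal Haar transform here is an assumption chosen for brevity, and the names `wavelet_energy` and `BANDS` are hypothetical.

```python
from typing import Dict, List

Image = List[List[float]]

# Sub-band names for a single-level 2D decomposition (illustrative labels).
BANDS = ("approx", "horizontal", "vertical", "diagonal")


def wavelet_energy(img: Image) -> Dict[str, float]:
    """Return the energy (sum of squared coefficients) of each sub-band
    of a single-level orthonormal Haar transform.

    Assumption: Haar wavelet, one level. The paper's actual settings
    are not given in the abstract.
    """
    rows, cols = len(img), len(img[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must be even")

    energy = {name: 0.0 for name in BANDS}
    # Process the image in non-overlapping 2x2 blocks.
    for r in range(0, rows, 2):
        for c in range(0, cols, 2):
            a, b = img[r][c], img[r][c + 1]
            d, e = img[r + 1][c], img[r + 1][c + 1]
            coeffs = {
                "approx": (a + b + d + e) / 2.0,      # local average
                "horizontal": (a + b - d - e) / 2.0,  # row-to-row change
                "vertical": (a - b + d - e) / 2.0,    # column-to-column change
                "diagonal": (a - b - d + e) / 2.0,    # diagonal change
            }
            for name in BANDS:
                energy[name] += coeffs[name] ** 2
    return energy


if __name__ == "__main__":
    # Toy 4x4 binary "word image": 1 = ink, 0 = background.
    word = [
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [1, 1, 1, 1],
        [0, 0, 0, 0],
    ]
    print(wavelet_energy(word))
```

Because the transform is orthonormal, the four energies sum to the total energy of the input image. A feature vector of this kind could then be concatenated with the SE and RT features before being passed to a classifier.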

Author information

Corresponding author

Correspondence to Sk Md Obaidullah.

Copyright information

© 2017 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Obaidullah, S.M., Santosh, K.C., Halder, C., Das, N., Roy, K. (2017). Word-Level Thirteen Official Indic Languages Database for Script Identification in Multi-script Documents. In: Santosh, K., Hangarge, M., Bevilacqua, V., Negi, A. (eds) Recent Trends in Image Processing and Pattern Recognition. RTIP2R 2016. Communications in Computer and Information Science, vol 709. Springer, Singapore. https://doi.org/10.1007/978-981-10-4859-3_2

  • DOI: https://doi.org/10.1007/978-981-10-4859-3_2

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-4858-6

  • Online ISBN: 978-981-10-4859-3

  • eBook Packages: Computer Science, Computer Science (R0)