Automatic Indic script identification from handwritten documents: page, block, line and word-level approach

Obaidullah, Sk Md; Santosh, K. C.; Halder, Chayan; Das, Nibaran; Roy, Kaushik

doi:10.1007/s13042-017-0702-8


Original Article
Published: 17 July 2017

Volume 10, pages 87–106, (2019)

International Journal of Machine Learning and Cybernetics

Sk Md Obaidullah¹,
K. C. Santosh ORCID: orcid.org/0000-0003-4176-0236²,
Chayan Halder³,
Nibaran Das⁴ &
…
Kaushik Roy³


Abstract

Script identification has been a well-studied problem in the literature over the last decade, and several methods for automatic script identification have been reported. Each of these methods treats a document at one level only (page, block, line or word), but none offers an experimental or empirical justification for choosing that level. To address this, we have carried out a multi-level script identification experiment, i.e., the same document is considered at four different levels, namely page, block, line and word. Two types of features are considered, script-dependent and script-independent, and both are computed at each level to categorize the scripts. The experiment is conducted on a newly created handwritten multi-script, multi-level dataset in which, on average, 5 blocks, 7.5 lines and 15 words are generated from a single page (440 pages, 2200 blocks, 3300 lines and 6600 words in total). Finally, we address two major issues: (1) finding an optimal level of work, i.e. page, block, line or word level, and (2) providing a qualitative measure of the feature set at the particular level of work considered.
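The dataset figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the dictionary keys and the helper name `per_page_averages` are assumptions made for this example, not names from the article. It divides each reported total by the number of pages to recover the stated per-page averages.

```python
# Totals reported in the abstract (one entry per level of work).
totals = {"page": 440, "block": 2200, "line": 3300, "word": 6600}


def per_page_averages(counts):
    """Return the average number of units per page for each level.

    Hypothetical helper for illustration only; it is not part of the
    authors' method.
    """
    pages = counts["page"]
    return {level: n / pages for level, n in counts.items()}


averages = per_page_averages(totals)
for level, avg in averages.items():
    print(f"{level}: {avg:g} per page")
```

The averages obtained this way match the 5 blocks, 7.5 lines and 15 words per page given in the abstract.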



Author information

Authors and Affiliations

Department of Computer Science and Engineering, Aliah University, Kolkata, India
Sk Md Obaidullah
Department of Computer Science, University of South Dakota, Vermillion, SD, 57069, USA
K. C. Santosh
Department of Computer Science, West Bengal State University, Kolkata, India
Chayan Halder & Kaushik Roy
Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
Nibaran Das


Corresponding author

Correspondence to K. C. Santosh.


About this article

Cite this article

Obaidullah, S.M., Santosh, K.C., Halder, C. et al. Automatic Indic script identification from handwritten documents: page, block, line and word-level approach. Int. J. Mach. Learn. & Cyber. 10, 87–106 (2019). https://doi.org/10.1007/s13042-017-0702-8


Received: 20 November 2016
Accepted: 05 July 2017
Published: 17 July 2017
Issue Date: 31 January 2019
DOI: https://doi.org/10.1007/s13042-017-0702-8
