Abstract
Script identification has been a well-studied problem over the last decade, and several methods for automatic script identification have been reported. Each of these methods treats a document at a single level (page, block, line or word), but none provides experimental or empirical evidence for choosing that level. To address this, we carry out a multi-level script identification experiment in which the same document is considered at four levels: page, block, line and word. Two types of features, script dependent and script independent, are computed at each level to categorize the scripts. The experiment is conducted on a newly created handwritten multi-script, multi-level dataset in which a single page yields, on average, 5 blocks, 7.5 lines and 15 words (440 pages, 2200 blocks, 3300 lines and 6600 words in total). Finally, we address two major issues: (1) finding an optimal level of work, i.e., page, block, line or word, and (2) providing a qualitative measure of the feature set at the level of work considered.
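The dataset totals quoted above follow directly from the per-page averages. As a quick consistency check, the arithmetic can be sketched as below; the names `PAGES`, `UNITS_PER_PAGE` and `level_totals` are illustrative assumptions and do not come from the paper, and only the numbers are taken from the abstract.

```python
# Page count and per-page averages as stated in the abstract.
# The variable and function names are hypothetical, chosen for illustration.
PAGES = 440
UNITS_PER_PAGE = {"block": 5, "line": 7.5, "word": 15}


def level_totals(pages, units_per_page):
    """Return the number of samples at each level, including the page level."""
    totals = {"page": pages}
    for level, average in units_per_page.items():
        # Averages may be fractional (7.5 lines per page), so round the product.
        totals[level] = round(pages * average)
    return totals


totals = level_totals(PAGES, UNITS_PER_PAGE)
for level, count in totals.items():
    print(f"{level}: {count}")
```

The computed counts match the totals reported in the abstract, so the averages and totals are mutually consistent.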
Cite this article
Obaidullah, S.M., Santosh, K.C., Halder, C. et al. Automatic Indic script identification from handwritten documents: page, block, line and word-level approach. Int. J. Mach. Learn. & Cyber. 10, 87–106 (2019). https://doi.org/10.1007/s13042-017-0702-8
DOI: https://doi.org/10.1007/s13042-017-0702-8