Abstract
Without publicly available datasets, methods cannot be compared fairly or reliably, and this is especially true in handwritten document recognition (HDR). Within HDR, document recognition for Indic scripts is still at an early stage compared with that for scripts such as Roman and Arabic. In this paper, we present a page-level handwritten document image dataset (PHDIndic_11) covering 11 official Indic scripts: Bangla, Devanagari, Roman, Urdu, Oriya, Gurumukhi, Gujarati, Tamil, Telugu, Malayalam and Kannada. PHDIndic_11 comprises 1458 document text-pages written by 463 individuals from various parts of India. We also report benchmark results for handwritten script identification (HSI). Besides script identification, the dataset can be used in many other document image analysis applications, such as script sentence recognition/understanding, text-line segmentation, word segmentation/recognition, word spotting, separation of handwritten and machine-printed text, and writer identification.
Cite this article
Obaidullah, S.M., Halder, C., Santosh, K.C. et al. PHDIndic_11: page-level handwritten document image dataset of 11 official Indic scripts for script identification. Multimed Tools Appl 77, 1643–1678 (2018). https://doi.org/10.1007/s11042-017-4373-y
DOI: https://doi.org/10.1007/s11042-017-4373-y