research-article

BYANJON: A Ground Truth Preparation System for Online Handwritten Bangla Documents

Published: 12 August 2021

Abstract

The work reported in this article presents a ground truth generation scheme for online handwritten Bangla documents at the text-line, word, and stroke levels. The aim of the proposed scheme is twofold: first, to build a document-level database that future researchers can use in this field; second, to provide ground truth information that helps other researchers evaluate the performance of their algorithms for text-line extraction, word extraction, word segmentation, stroke recognition, and word recognition. The scheme starts with text-line extraction from the online handwritten Bangla documents, proceeds to word extraction from the text-lines, and ends with segmentation of those words into basic strokes. After word segmentation, the basic strokes are assigned appropriate class labels using a modified distance-based feature extraction procedure and an MLP (Multi-layer Perceptron) classifier. The Unicode representation of each word is then generated from its sequence of stroke labels. XML files store the stroke-, word-, and text-line-level ground truth information for the corresponding documents. The proposed system is semi-automatic, and each step (text-line extraction, word extraction, word segmentation, and stroke recognition) is implemented using different algorithms. Thus, the proposed ground truth generation procedure substantially reduces manual intervention by cutting the number of mouse clicks required to extract text-lines and words from the document and to segment the words into basic strokes. The integrated stroke recognition module also reduces the manual labor needed to assign appropriate stroke labels. The system is freely available and can be accessed at https://byanjon.herokuapp.com/.
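
The three-level ground truth described above (text-line, word, stroke) maps naturally onto a nested XML tree. The following is a minimal sketch of that idea, not the BYANJON file format: the abstract does not give the schema, so the element names (document, textline, word, stroke), the attribute names, the stroke labels "S1" and "S2", and the helper build_ground_truth are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def build_ground_truth(doc_id, text_lines):
    """Build a nested XML tree: document -> textline -> word -> stroke.

    text_lines is a list of lines; each line is a list of words; each word is
    a dict with a 'unicode' string and a list of 'strokes'; each stroke is a
    dict with a class 'label' and a list of (x, y) pen 'points'.
    All tag and attribute names here are hypothetical.
    """
    root = ET.Element("document", id=doc_id)
    for line_idx, words in enumerate(text_lines):
        line_el = ET.SubElement(root, "textline", index=str(line_idx))
        for word_idx, word in enumerate(words):
            word_el = ET.SubElement(
                line_el, "word", index=str(word_idx), unicode=word["unicode"]
            )
            for stroke_idx, stroke in enumerate(word["strokes"]):
                stroke_el = ET.SubElement(
                    word_el, "stroke", index=str(stroke_idx), label=stroke["label"]
                )
                # Store the pen trajectory as space-separated "x,y" pairs.
                stroke_el.text = " ".join(f"{x},{y}" for x, y in stroke["points"])
    return root


# One text-line holding one word made of two labeled strokes (toy data).
sample = [[
    {
        "unicode": "\u0995",  # Bangla letter KA, used only as an example
        "strokes": [
            {"label": "S1", "points": [(0, 0), (3, 4)]},
            {"label": "S2", "points": [(3, 4), (6, 0)]},
        ],
    }
]]

root = build_ground_truth("doc-001", sample)
print(ET.tostring(root, encoding="unicode"))
```

Nesting strokes inside words and words inside text-lines keeps each level addressable on its own, so an evaluation script for, say, word extraction can read only the word elements and ignore the stroke data.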

Cited By

  • (2022) A Two-Stage Deep Feature Selection Method for Online Handwritten Bangla and Devanagari Basic Character Recognition. SN Computer Science 3:4. DOI: 10.1007/s42979-022-01157-2. Online publication date: 30-Apr-2022

    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 20, Issue 6
    November 2021
    439 pages
    ISSN:2375-4699
    EISSN:2375-4702
    DOI:10.1145/3476127
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 12 August 2021
    Accepted: 01 April 2021
    Revised: 01 March 2021
    Received: 01 August 2020
    Published in TALLIP Volume 20, Issue 6

    Author Tags

    1. Online handwriting recognition
    2. word segmentation
    3. text-line extraction
    4. stroke extraction
    5. ground truth preparation
    6. Bangla script

    Qualifiers

    • Research-article
    • Refereed
