Abstract: An OCR system is proposed that can read two Indian language scripts: Bangla and Devnagari (Hindi), the most popular ones in the Indian subcontinent. These scripts, having the same origin in the ancient Brahmi script, have many features in common and hence a single sys- ...
IEEE Transactions on Pattern Analysis and Machine …, Jan 1, 2002
We present an omnifont, unlimited-vocabulary OCR system for English and Arabic. The system is based on hidden Markov models (HMM), an approach that has proven to be very successful in the area of automatic speech recognition. We focus on two aspects of the OCR system. First, we address the issue of how to perform OCR on omnifont and multi-style data, such as plain and italic, without the need to have a separate model for each style. The amount of training data from each style, which is used to train a single model, becomes an important issue in the face of the conditional independence assumption inherent in the use of HMMs. We demonstrate mathematically and empirically how to allocate training data among the different styles to alleviate this problem. Second, we show how to use a word-based HMM system to perform character recognition with unlimited vocabulary. The method includes the use of a trigram language model on character sequences. Using all these techniques, we have achieved character error rates of 1.1 percent on data from the University of Washington English Document Image Database and 3.3 percent on data from the DARPA Arabic OCR Corpus.
A complete Optical Character Recognition (OCR) system for printed Bangla, the fourth most popular script in the world, is presented. This is the first OCR system among all script forms used in the Indian sub-continent. The problem is difficult because (i) there are about 300 basic, ...
Papers by Tamer Alomari