As the recognition rates and speeds of optical character recognition (OCR) systems steadily improve, the focus of the OCR problem, and with it research interest, is shifting from recognizing isolated, high-quality characters to reading cursive scripts and degraded documents. In recognizing such texts, a major undertaking is segmenting cursive words into characters and isolating merged characters. In OCR systems that recognize cursive text, the segmentation subsystem becomes the pivotal stage: a sizable portion of processing is devoted to it, and a considerable share of recognition errors is attributed to it. The most notable feature of Arabic writing is its cursiveness, which also poses the most difficult problem for recognition algorithms.
In this work, we describe the design and implementation of a system that is automatically trainable and that recognizes noisy and cursive words. To recognize a word, the system does not segment it into symbols (character shapes) in advance; rather, it detects a set of "shape primitives" on the word and then matches the regions of the word (represented by the detected primitives) to a set of symbol models. A spatial arrangement of the symbol models matched to regions of the word then becomes the description of the recognized word. Since the number of potential arrangements of all symbol models is large, the system imposes a set of word-structure and spatial-consistency constraints. It searches the space of arrangements that satisfy these constraints and tries to maximize the a posteriori probability of the arrangement of symbol models.
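The search described above can be illustrated with a deliberately small sketch. It is not the system's implementation: the symbol names, the primitive names, the probability tables, the one-dimensional left-to-right layout, and the `recognize` function are all assumptions made for illustration. Here the only constraint is that matched regions tile the primitive sequence in order, without gaps or overlaps, and each arrangement is scored by prior times likelihood, which ranks arrangements in the same order as the posterior.

```python
import math
from functools import lru_cache

# Hypothetical symbol models: P(region of primitives | symbol).
# All names and numbers below are illustrative assumptions.
SYMBOL_MODELS = {
    "A": {("stroke",): 0.9, ("stroke", "dot"): 0.1},
    "B": {("curve", "dot"): 0.8, ("curve",): 0.2},
    "C": {("curve",): 0.7, ("curve", "stroke"): 0.3},
}
PRIOR = {"A": 0.4, "B": 0.3, "C": 0.3}  # hypothetical P(symbol)
MAX_SPAN = 2  # longest region (in primitives) a single symbol may explain


def recognize(primitives):
    """Return (symbols, log_score) for the best-scoring arrangement.

    An arrangement is valid only if its regions cover every detected
    primitive exactly once, in order (a stand-in for the spatial
    consistency constraints). Returns ([], -inf) if none is valid.
    """
    prims = tuple(primitives)
    n = len(prims)

    @lru_cache(maxsize=None)
    def best(i):
        # Best (log score, symbols) explaining prims[i:].
        if i == n:
            return 0.0, ()
        top = (-math.inf, ())
        for j in range(i + 1, min(n, i + MAX_SPAN) + 1):
            region = prims[i:j]
            for symbol, model in SYMBOL_MODELS.items():
                likelihood = model.get(region)
                if likelihood is None:
                    continue  # this symbol model cannot explain the region
                tail_score, tail = best(j)
                if tail_score == -math.inf:
                    continue  # the rest of the word has no valid arrangement
                score = math.log(PRIOR[symbol]) + math.log(likelihood) + tail_score
                if score > top[0]:
                    top = (score, (symbol,) + tail)
        return top

    score, symbols = best(0)
    return list(symbols), score


if __name__ == "__main__":
    print(recognize(["stroke", "curve", "dot"]))
```

No character boundaries are fixed before recognition: the boundaries fall out of whichever arrangement scores best. A real system would work with two-dimensional regions and richer word-structure constraints than this ordered tiling.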
Large-scale experimentation with the system on isolated characters reveals that it has a recognition rate of 99.7% for synthetically degraded symbols and 94.1% for scanned symbols. Experimentation on isolated words reveals that the system has a recognition rate of 99.4% for noise-free words, 95.6% for synthetically degraded words, and 73% for scanned words.
The main theoretical contribution of this work is in laying the foundation for a segmentation-free approach to Arabic word recognition. Recognition is based on maximizing the probability of the word given the detected primitives. The system is designed to minimize training effort and is extensible, since training determines the symbols the system recognizes.
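One way to write this recognition criterion, as a sketch in assumed notation: let $W$ range over candidate words (arrangements of symbol models) and let $D$ denote the set of detected primitives. Neither symbol appears in the abstract, and the abstract does not say how the individual terms are estimated; the second and third forms follow from Bayes' rule alone.

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid D)
        \;=\; \arg\max_{W} \frac{P(D \mid W)\,P(W)}{P(D)}
        \;=\; \arg\max_{W} P(D \mid W)\,P(W)
```

The last step holds because $P(D)$ does not depend on $W$, so dropping it leaves the maximizing word unchanged.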
Index Terms
- A segmentation-free approach to text recognition with application to Arabic text