A segmentation-free approach to text recognition with application to Arabic text

October 1996

Author:
Badr H. Al-Badr
Univ. of Washington

Publisher:

University of Washington
Computer Science Dept. Fr-35 112 Sieg Hall Seattle, WA
United States

Order Number: UMI Order No. GAX95-37297

Abstract

As the recognition rates and speeds of optical character recognition (OCR) systems steadily improve, the problem of OCR, and with it research interest, is shifting from recognizing isolated, high-quality characters to reading cursive scripts and degraded documents. In recognizing such texts, a major undertaking is segmenting cursive words into characters and isolating merged characters. In OCR systems that recognize cursive text, the segmentation subsystem becomes the pivotal stage: a sizable portion of processing is devoted to it, and a considerable share of recognition errors is attributed to it. The most notable feature of Arabic writing is its cursiveness, which also poses the most difficult problem for recognition algorithms.

In this work, we describe the design and implementation of a system that is automatically trainable and that recognizes noisy and cursive words. To recognize a word, the system does not segment it into symbols (character shapes) in advance; rather, it recognizes the input word by detecting a set of "shape primitives" in the word. It then matches the regions of the word (represented by the detected primitives) to a set of symbol models. A spatial arrangement of symbol models that are matched to regions of the word then becomes the description of the recognized word. Since the number of potential arrangements of all symbol models is large, the system imposes a set of word-structure and spatial-consistency constraints. It searches the space of arrangements that satisfy these constraints and tries to maximize the a posteriori probability of the arrangement of symbol models.
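The search described above can be illustrated with a toy sketch. It is not the thesis's implementation: the symbol models, primitive names, priors, and detection probabilities below are all hypothetical, the word is reduced to a one-dimensional sequence of primitives, and the only spatial-consistency constraint assumed is that matched regions tile the word without gaps or overlaps. Under those assumptions, a memoized search finds the arrangement with the highest log posterior score (log prior plus log likelihood) without segmenting the word in advance.

```python
import math
from functools import lru_cache

# Hypothetical symbol models: the primitives each symbol is expected to
# produce, left to right. These names are illustrative, not from the thesis.
SYMBOL_MODELS = {
    "A": ("stroke", "loop"),
    "B": ("loop",),
    "C": ("stroke", "dot"),
    "D": ("stroke",),
}

# Hypothetical prior probability of each symbol.
SYMBOL_PRIOR = {"A": 0.4, "B": 0.2, "C": 0.3, "D": 0.1}

# Assumed detector noise: a primitive matches its model with P_MATCH.
P_MATCH = 0.9
P_MISMATCH = 0.1


def region_log_likelihood(region, model):
    """Log likelihood of a region's primitives under one symbol model."""
    if len(region) != len(model):
        return -math.inf
    return sum(
        math.log(P_MATCH if seen == expected else P_MISMATCH)
        for seen, expected in zip(region, model)
    )


def recognize(primitives):
    """Return (log score, symbols) of the best arrangement of symbol models.

    Arrangements are restricted to those whose regions tile the whole
    primitive sequence, which stands in for the spatial constraints.
    """
    primitives = tuple(primitives)
    n = len(primitives)

    @lru_cache(maxsize=None)
    def best_from(start):
        if start == n:
            return 0.0, ()
        best_score, best_symbols = -math.inf, ()
        for symbol, model in SYMBOL_MODELS.items():
            end = start + len(model)
            if end > n:
                continue  # the model would extend past the end of the word
            local = math.log(SYMBOL_PRIOR[symbol]) + region_log_likelihood(
                primitives[start:end], model
            )
            tail_score, tail_symbols = best_from(end)
            score = local + tail_score
            if score > best_score:
                best_score, best_symbols = score, (symbol,) + tail_symbols
        return best_score, best_symbols

    return best_from(0)


score, symbols = recognize(["stroke", "loop", "stroke", "dot"])
print(symbols, round(score, 3))
```

Because the toy constraints make the problem decomposable, the search here is exact; the abstract says only that the system "tries to maximize" the posterior, so the actual search strategy may differ.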

Large-scale experimentation with the system on isolated characters reveals that it has a recognition rate of 99.7% for synthetically degraded symbols and 94.1% for scanned symbols. Experimentation on isolated words reveals that the system has a recognition rate of 99.4% for noise-free words, 95.6% for synthetically degraded words, and 73% for scanned words.

The main theoretical contribution of this work is in laying the foundation for a segmentation-free approach to Arabic word recognition. Recognition is based on maximizing the probability of the word given the detected primitives. The system is designed to minimize training effort and is extensible, since training determines the symbols the system recognizes.
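The recognition criterion stated above can be written compactly. The notation is an assumption made here for illustration and does not come from the abstract: W is a candidate word (an arrangement of symbol models), D is the set of detected primitives, and the second equality is Bayes' rule with the constant P(D) dropped.

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid D)
        \;=\; \arg\max_{W} P(D \mid W)\, P(W)
```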

Index Terms

A segmentation-free approach to text recognition with application to Arabic text
1. Applied computing
  1. Document management and text processing
2. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
  2. Computer graphics
    1. Image manipulation
      1. Image processing

Recommendations

Offline arabic handwritten text recognition: A Survey

Research in offline Arabic handwriting recognition has increased considerably in the past few years. This is evident from the numerous research results published recently in major journals and conferences in the area of handwriting recognition. Features ...

Automatic processing of Arabic text
IIT'09: Proceedings of the 6th international conference on Innovations in information technology

Automatic recognition of printed and handwritten documents remains an active area of research. Arabic is one of the languages that present special problems. Arabic is cursive and therefore necessitates a segmentation process to determine the boundaries ...

Two template matching approaches to Arabic, Amharic and Latin isolated characters recognition

With the establishment of commercial OCR systems for Latin text, recent research efforts have been directed at the design of recognition systems for non-Latin scripts, such as Japanese, Cyrillic, Chinese, Hindi, Tibetan, and in particular Arabic. The ...
