OCR Using Image Processing
Abstract
OCR stands for Optical Character Recognition: the mechanical or electronic translation of
images containing text into editable text. It is most often used to convert handwritten text (captured by a
scanner or by other means) into machine-encoded text. Human beings recognize many objects in this manner; our eyes
are the "optical mechanism." But while the brain "sees" the input, the ability to comprehend these
signals varies from person to person according to many factors.
Digitization of text documents is often combined with the process of optical character recognition
(OCR). Recognizing a character is a natural and easy task for human beings, but building a machine
or electronic device that performs character recognition is difficult. Recognizing characters remains one of
those tasks that humans do better than computers and other electronic devices.
Keywords: Character Recognition System, Camera Captured Document Images, Handheld Device,
Image Segmentation.
Introduction
An Optical Character Recognition system enables us to convert a PDF file or a file of scanned images directly into a computer
file and to edit that file using MS Word or WordPad. Examples of issues that need to be dealt with in
character recognition of heritage documents are:
Degradation of paper, which often results in a high occurrence of noise in the digitized images,
or fragmented (broken) characters.
Characters that are not machine-written. If they are set manually, this results in, for instance,
varying space sizes between characters or (accidentally) touching characters.
An example of OCR applied to a scanned image taken from the web; the recognized text, with its
recognition errors preserved, is reproduced below:
Culpeper’s Midwife Exlarged.5 .lieonthoughtThn Infant drew in his Nouriflhment by his whole Body;
because it is rare and fpungy, as aSpunge fucks in Water o every Side; and so he thought it fucked
Blood, not only from the Mother’s Veins, but also from the Womb. Democrats and Epicurus, recorded
by Plutarch, that the Child fucked in the Nourishment by its Mouth. And also Hippocrates, Lib. de
Principiis, affirms, that the Child fucked both Nourishment and Brea:h by its Mouth from the Mo-ther
when le breathed,(thought in his other Treatises he feernsto deny
Figure 2: Output text after processing the given input image
A number of research works on mobile OCR systems can be found. Laine et al. [7] developed a
system for English capital letters only. First, the captured image is skew-corrected by searching for a
line having the highest number of consecutive white pixels and by maximizing a given alignment
criterion. Then, the image is segmented based on X-Y tree decomposition and recognized by
measuring Manhattan-distance-based similarity for a set of centroid-to-boundary features. However,
this work addresses only English capital letters, and the accuracy obtained is not satisfactory for
real-life applications.
In the current work, a character recognition system is presented for recognizing English
characters extracted from camera-captured, image/graphics-embedded text documents such as business
card images.
Process/Methodology
The process of OCR involves several steps including Image Scanning, Pre-Processing, Segmentation,
Feature Extraction, Post-Processing and Classification.
Figure: The OCR processing pipeline (Image Scanning → Pre-Processing → Segmentation → Feature Extraction → Post-Processing → Classification → Editable Text).
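The stages in this pipeline can also be read as one program. The sketch below, in Python, is purely illustrative: the stage functions are passed in as parameters and their names (preprocess, segment_characters, extract_features, classify, correct_with_lexicon) are hypothetical placeholders for the steps described in the following subsections, not functions from any particular library.

```python
import cv2


def ocr_pipeline(image_path, preprocess, segment_characters,
                 extract_features, classify, correct_with_lexicon, lexicon):
    """Illustrative end-to-end flow; each callable stands for one stage below."""
    # Image scanning: load the scanned page as a grey-level image.
    page = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Pre-processing: denoise, binarize and clean up the page.
    binary = preprocess(page)
    # Segmentation: isolate individual character images.
    characters = segment_characters(binary)
    # Feature extraction and classification for each character.
    text = "".join(classify(extract_features(c)) for c in characters)
    # Post-processing: correct the raw output against a lexicon.
    return correct_with_lexicon(text, lexicon)
```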
Image Scanning:- An image scanner is a digital device used to scan images, pictures, printed text and
objects and convert them into digital images. Image scanners are used in a variety of domestic and
industrial applications such as design, reverse engineering, orthotics, gaming and testing. The most
widely used type of scanner in offices and homes is the flatbed scanner.
The process of capturing the document with the help of a scanner is known as image scanning; it
produces the digital image of the handwritten or printed text.
Pre-Processing: - Pre-processing is required for coloured, binary or grey-level images containing
text. Most OCR algorithms work on binary or grey-level images, because computation on coloured
images is more difficult. Images may contain a background, a watermark or other content
different from the text, making it difficult to extract the text from the scanned image, so pre-processing
helps remove these difficulties. The result of pre-processing is a binary image
containing text only. To achieve this, several steps are needed: first, image enhancement
techniques to remove noise or correct the contrast in the image; second, thresholding (described
below) to remove the background containing scenes, watermarks and/or noise; third, page
segmentation to separate graphics from text; fourth, character segmentation to separate characters
from each other; and finally, morphological processing to enhance the characters in cases where
thresholding and/or other pre-processing techniques eroded parts of the characters or added pixels to
them. This method is used widely in various character recognition implementations.
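As a concrete illustration of part of this chain, the fragment below uses OpenCV (an assumed choice; the paper does not name a library) to denoise a grey-level page and apply a morphological closing that reconnects strokes eroded by binarization. The parameter values are arbitrary examples, not values taken from the paper.

```python
import cv2


def clean_page(gray):
    """Denoise, binarize and morphologically repair a grey-level page."""
    # Remove speckle noise typical of degraded or scanned paper.
    denoised = cv2.fastNlMeansDenoising(gray, None, h=10)
    # Binarize; the thresholding step itself is discussed in the next subsection.
    _, binary = cv2.threshold(denoised, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Morphological closing fills small gaps in broken character strokes.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
    return cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
```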
Thresholding: Thresholding is the process of converting a grayscale input image to a bi-level image by
using an optimal threshold. The purpose of thresholding is to extract those pixels from an image
which represent an object (either text or other line-image data such as graphs or maps). Although the
information of interest is binary, the input pixels represent a range of intensities. Thus the objective of binarization is to
mark pixels that belong to true foreground regions with a single intensity and background regions
with a different intensity.
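To make the notion of an "optimal threshold" concrete, the sketch below implements Otsu's method (one common choice; the paper does not commit to a specific thresholding algorithm) directly with NumPy: it selects the grey level that maximizes the between-class variance and then marks foreground and background pixels with single intensities.

```python
import numpy as np


def otsu_binarize(gray):
    """gray: 2-D uint8 array. Returns a 0/255 bi-level image."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    best_t, best_var = 0, 0.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1
        between_var = w0 * w1 * (mu0 - mu1) ** 2  # between-class variance
        if between_var > best_var:
            best_var, best_t = between_var, t
    # Dark ink on light paper: pixels below the threshold become foreground (0).
    return np.where(gray < best_t, 0, 255).astype(np.uint8)
```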
Feature Extraction: - The objective of this step is to extract the differentiating features from the matrices of digitized characters. A number of features have been
proposed in the literature, on the basis of which an OCR system recognizes characters.
According to C. Y. Suen (1986), the features of a character can be classified into two classes: global or
statistical features and structural or topological features. Global features are obtained from the
arrangement of points constituting the character matrix. These features can be detected more easily
than topological features and are less affected by noise or distortion. A number of techniques are
used for feature extraction; some of these are: moments, zoning, projection histograms, n-tuples,
crossings and distances.
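Two of the listed techniques, zoning and projection histograms, are simple enough to sketch directly. In the illustrative fragment below a segmented character is assumed to have been normalized to a 32 x 32 binary matrix; the zone grid and image size are arbitrary choices, not values prescribed by the paper.

```python
import numpy as np


def zoning_features(char_img, zones=(4, 4)):
    """char_img: 32x32 binary (0/1) character matrix.
    Returns the ink density of each zone as a flat feature vector."""
    zh, zw = char_img.shape[0] // zones[0], char_img.shape[1] // zones[1]
    feats = [char_img[r*zh:(r+1)*zh, c*zw:(c+1)*zw].mean()
             for r in range(zones[0]) for c in range(zones[1])]
    return np.array(feats)


def projection_features(char_img):
    """Row and column projection histograms (counts of ink pixels)."""
    return np.concatenate([char_img.sum(axis=1), char_img.sum(axis=0)])
```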
Classification using the k-nearest neighbour algorithm: - Classification determines the region of feature space in
which an unknown pattern falls. The k-nearest neighbour algorithm (k-NN) is a method for classifying
objects based on the closest training examples in the feature space. It is
amongst the simplest of all machine learning algorithms: an object is classified by a majority
vote of its neighbours, with the object being assigned to the class most common amongst its k nearest
neighbours (k is a positive integer, typically small). If k = 1, the object is simply assigned to the
class of its nearest neighbour. Generally, we calculate the Euclidean distance between the test point
and all the reference points in order to find the k nearest neighbours, then arrange the distances in
ascending order and take the reference points corresponding to the k smallest Euclidean distances. The
test sample is then given the same class label as the majority of its k nearest
(reference) neighbours.
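The description above maps directly onto a few lines of NumPy. The following sketch is a generic k-NN classifier over feature vectors such as those produced in the previous step; it is an illustration, not the paper's exact implementation. For character recognition, train_labels would hold the character class of each reference feature vector.

```python
import numpy as np
from collections import Counter


def knn_classify(test_vec, train_vecs, train_labels, k=3):
    """Assign test_vec the majority label of its k nearest training vectors."""
    # Euclidean distance from the test point to every reference point.
    dists = np.linalg.norm(train_vecs - test_vec, axis=1)
    # Indices of the k smallest distances (ascending order).
    nearest = np.argsort(dists)[:k]
    # Majority vote among the k nearest neighbours.
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```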
Post-processing
OCR accuracy can be increased if the output is constrained by a lexicon – a list of words that are
allowed to occur in a document. This might be, for example, all the words in the English language, or
a more technical lexicon for a specific field. This technique can be problematic if the document
contains words not in the lexicon, like proper nouns. Tesseract uses its dictionary to influence the
character segmentation step, for improved accuracy.
The output stream may be a plain text stream or file of characters, but more sophisticated OCR
systems can preserve the original layout of the page and produce, for example, an annotated PDF that
includes both the original image of the page and a searchable textual representation. "Near-neighbour
analysis" can make use of co-occurrence frequencies to correct errors, by noting that certain words are
often seen together. For example, "Washington, D.C." is generally far more common in English than
"Washington DOC". Knowledge of the grammar of the language being scanned can also help
determine whether a word is likely to be a verb or a noun, for example, allowing greater
accuracy. The Levenshtein distance algorithm has also been used in OCR post-processing to further
optimize results from an OCR API.
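As an illustration of Levenshtein-distance-based correction against a lexicon, the sketch below replaces each OCR word with its closest lexicon entry when the edit distance is small; the cutoff of 2 edits is an arbitrary example, not a value from the paper.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def correct_word(word, lexicon, max_dist=2):
    """Return the closest lexicon entry, or the word itself if none is close."""
    best = min(lexicon, key=lambda w: levenshtein(word, w))
    return best if levenshtein(word, best) <= max_dist else word
```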
Editable Text: - Editable text here means any document or file which can be edited on a computer, like
any other MS Word or WordPad file. It consists of the text which was given as input in the form of an image, and it
is the output of the OCR system, with accuracy of up to 99%.
Applications: -
1. Vehicle monitoring through automated real-time alerts for unauthorized/barred/stolen vehicles.
2. Industrial inspection.
3. Document imaging.
Figure: System architecture of the application software communicating with the computer, server and appliance.
2. Printed invoice OCR recognition: - Tests consisted of the recognition of two values: invoice
number and date. The invoice number was printed in a fixed area; the date might be printed in one of a few
areas (but each document contained only one date and it had a fixed format). During page
processing, each of these date areas was recognized; next, all of them were searched for a
string in the proper format, and the first string found was taken as the date. The number of
processed documents was 1000, and all invoice numbers were recognized correctly. Twenty
dates were unrecognized, but these were partly or completely obscured by extra handwritten
notes or stamps. The next test concerned layout recognition. Invoices contain elements such
as tables, images and stamps. The layout was properly recognized in 50% of documents; 10% of documents
had small defects such as a missing line in a table. In 40% of documents the layout recognition
was weak: some reconstructions had more pages than the original, some images and stamps were located
in different places, and some tables were incomplete.
Conclusion
This paper has described an OCR system for offline handwritten character recognition. Such systems have
the ability to yield excellent results. The paper discusses handwritten character recognition in detail,
covers the various concepts involved, and aims to encourage further advances in the area. Recognition
accuracy depends directly on the nature of the material to be read and on its quality. Pre-processing
techniques used on document images as an initial step in character recognition systems
were presented. The feature extraction step of optical character recognition is the most important; it
can be used with existing OCR methods, especially for English text. The system offers an advantage
in its scalability: although it is currently configured to read a predefined set of
document formats (English documents), it can be configured to recognize new types. Future
research aims at new applications such as online character recognition on mobile devices,
extraction of text from video images, extraction of information from security documents and
processing of historical documents. Recognition is often followed by a post-processing stage. We
expect that, with post-processing, the accuracy will be even higher and the system could then be
implemented directly on mobile devices. Implementing the presented system with post-processing on
mobile devices is also planned as part of our future work.
References