
1720 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 27, NO. 11, NOVEMBER 2005

Texture for Script Identification


Andrew Busch, Member, IEEE, Wageeh W. Boles, Member, IEEE, and
Sridha Sridharan, Senior Member, IEEE

Abstract—The problem of determining the script and language of a document image has a number of important applications in the
field of document analysis, such as indexing and sorting of large collections of such images, or as a precursor to optical character
recognition (OCR). In this paper, we investigate the use of texture as a tool for determining the script of a document image, based on
the observation that text has a distinct visual texture. An experimental evaluation of a number of commonly used texture features is
conducted on a newly created script database, providing a qualitative measure of which features are most appropriate for this task.
Strategies for improving classification results in situations with limited training data and multiple font types are also proposed.

Index Terms—Script identification, wavelets and fractals, texture, document analysis, clustering, classification and association rules.

1 INTRODUCTION

As the world moves ever closer to the concept of the "paperless office," more and more communication and storage of documents is performed digitally. Documents and files that were once stored physically on paper are now being converted into electronic form in order to facilitate quicker additions, searches, and modifications, as well as to prolong the life of such records. A great proportion of business documentation and communication, however, still takes place in physical form and the fax machine remains a vital tool of communication worldwide. Because of this, there is a great demand for software which automatically extracts, analyzes, and stores information from physical documents for later retrieval. All of these tasks fall under the general heading of document analysis, which has been a fast-growing area of research in recent years.

A very important area in the field of document analysis is that of optical character recognition (OCR), which is broadly defined as the process of recognizing either printed or handwritten text from document images and converting it into electronic form. To date, many algorithms have been presented in the literature to perform this task, with some of these having been shown to perform to a very high degree of accuracy in most situations, with extremely low character-recognition error rates [1]. However, such algorithms rely extensively on a priori knowledge of the script and language of the document in order to properly segment and interpret each individual character. While in the case of Latin-based languages such as English, German, and French, this problem can be overcome by simply extending the training database to include all character variations, such an approach will be unsuccessful when dealing with differing script types. At best, the accuracy of such a system will be necessarily reduced by the increased number of possible characters. In addition, many script types do not lend themselves to traditional methods of character segmentation, an essential part of the OCR process and, thus, must be handled somewhat differently. For all of these reasons, the determination of the script of the document is an essential step in the overall goal of OCR.

Previous work has identified a number of approaches for determining the script of a printed document. A number use character-based features or connected component analysis [2], [3]. The paradox inherent in such an approach is that it is sometimes necessary to know the script of the document in order to extract such components. In addition to this, the presence of noise or significant image degradation can also significantly affect the location and segmentation of these characters, making them difficult or impossible to extract. In such conditions, a method of script recognition which does not require such segmentation is required. Texture analysis techniques are a logical choice for solving such a problem as they give a global measure of the properties of a region, without requiring analysis of each individual component of the script. Printed text of different scripts is also highly preattentively distinguishable, a property which has long been considered a sign of textural differences [4].

In Section 2, we provide an overview of the problem of script recognition, highlighting its importance in a number of applications. The idea of using texture analysis techniques to determine the script of printed text is then investigated in Section 3, showing the rationale behind such an approach and the work done in this area to date. Due to the nature of most texture features, document images must be normalized to ensure accuracy, and algorithms for such preprocessing stages, including binarization, skew detection and correction, and normalization of text, are presented in Section 4. Details of the final texture features extracted from the normalized images are then given in Section 5, with experimental evaluation of each such feature set performed in Section 6, using a newly constructed document image database containing a total of 10 different scripts. Section 7 outlines a technique for improving classifier performance by utilizing the common properties of printed text when training a Gaussian mixture model classifier.

A. Busch is with the School of Microelectronic Engineering, Griffith University Nathan Campus, Nathan, QLD 4111, Australia. E-mail: a.busch@griffith.edu.au.
W.W. Boles and S. Sridharan are with the School of Engineering Systems, Queensland University of Technology, GPO Box 2343, Brisbane, QLD 4001, Australia. E-mail: {w.boles, s.sridharan}@qut.edu.au.
Manuscript received 1 Mar. 2004; revised 10 Dec. 2004; accepted 14 Dec. 2004; published online 14 Sept. 2005.
Recommended for acceptance by M. Pietikainen.
For information on obtaining reprints of this article, please send e-mail to: tpami@computer.org, and reference IEEECS Log Number TPAMI-0112-0304.
0162-8828/05/$20.00 © 2005 IEEE. Published by the IEEE Computer Society.

It is proposed that by taking advantage of this a priori information, less training data will be required to adequately describe

each class density function, leading to increased overall performance. The problem of multifont script recognition is addressed in Section 8. This is of much importance in practical applications since the various fonts of a single script type can differ significantly in appearance and are often not adequately characterized in feature space if modeling with a single class is attempted. A method for clustering the features and using multiple classes to describe the distribution is proposed here, with experimental results showing that this technique can achieve significant improvement when automatically dealing with many font types.

2 SCRIPT AND LANGUAGE RECOGNITION

Although a large number of OCR techniques have been developed in recent years, almost all existing work in this field assumes that the script and/or language of the document to be processed is known. Although it is certainly possible to train an OCR system with characters from many languages to obtain a form of language independence, the performance of such a classifier would naturally be lower than one trained solely with the script and/or language of interest. Using specialized classifiers for each language is also advantageous in that it allows for the introduction of language- and script-specific knowledge when performing other required tasks such as document segmentation and character separation. Using such specialized classifiers in a multilingual environment requires an automated method of determining the script and language of a document.

A number of different techniques for determining the script of a document have been proposed in the literature. Spitz has proposed a system which relies on specific, well-defined pixel structures for script identification [2]. Such features include locations and numbers of upward concavities in the script image, optical density of text sections, and the frequency and combination of relative character heights. This approach has been shown to be successful at distinguishing between a small number of broad script types (Latin-based, Korean, Chinese, and Japanese) and moderately effective at determining the individual language in which the text is written. Results when using a wider variety of script types (Cyrillic, Greek, Hebrew, etc.) are not presented, nor is any attempt made to define the conditions for an unknown script type.

Preprocessing of document images is required before this technique is applied. The purpose of this is to form a binary image, that is, an image composed entirely of white and black pixels only. By convention, white pixels are considered background and black pixels are considered text. Following this, connected components [5] are extracted from the binary representation. For each component extracted in this way, information such as the position, bounding box, and lists of pixel runs is stored. Representing the image in this way provides an efficient data structure on which to perform image processing operations, such as the calculation of upward concavities and optical density required in the latter stages of the process. For many script types, calculating connected components will also separate individual characters, although this is not always true in the general case. Further work has attempted to address this problem by segmenting individual connected components at this stage [6]. This is required primarily for languages that possess a high degree of connectivity between characters, such as types of Indian script, and certain fonts such as italics which enhance connectivity of consecutive characters.

Suen et al. also apply two extra segmentation algorithms at this point, in order to remove extremely large connected components and noise [3]. Large components are considered to be those with bounding boxes more than five times the average size. It is thought that these correspond to nontextual regions of the input image and can safely be discarded. Any bounding boxes with dimensions smaller than the text font stroke are considered noise and also removed. Practical experiments have shown some problems when using these techniques, as important image features such as the dots on "i" characters and accent marks are erroneously removed. If character segmentation is not accurate, entire lines of text may also be removed, as they are considered to be a single large connected component.

Before feature extraction, it is necessary to define various regions within a line of text. Four horizontal lines define the boundaries of three significant zones on each text line. These lines are the top-line, x-height, baseline, and bottom-line. The top-line is the absolute highest point of the text region, and typically corresponds to the height of the largest characters in the script, for example "A." The x-height line is the height of the smaller characters of the script, such as "a" and "e," and is not defined for all scripts, for example, Chinese. The baseline is defined as the lowest point of the majority of the characters within a script, excluding those which are classified as descenders. Finally, the bottom-line is the absolute lowest point of any character within the script, for example the "g" and "q" characters. The three zones encompassed by these lines are known as the descender zone, x-zone, and ascender zone. Fig. 1 shows the locations of each of these lines and regions for an English text sample.

Fig. 1. Commonly labeled text regions for an English sample.

Calculating the positions of these lines is performed using vertical projection profiles of the connected components. By projecting the lowest position, top position, and pixels for each connected component, the positions of each line can then be determined. The peak in the lowest position profile is taken as the baseline position, since the majority of characters in any known language will have their lowest point here. Searching upwards from this point, the peak in the top position profile is then labeled as the x-height, although this may lead to inaccurate x-line positioning when lines of text with a large number of capital letters and/or punctuation symbols are present [2]. The positions of the top and bottom lines are found simply by searching for the highest and lowest projected positions, respectively, and removing any values which are excessive. Having determined the positions of these lines, they are then used as a reference point for all location information for individual connected components.

The primary feature used in the script recognition algorithm of Spitz is the location and frequency of upward concavities in the text image. An upward concavity is present at a particular location if two runs of black pixels appear on a single scan line of the image and there exists a run on the line below which spans the distance between these two runs.
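This run-based test can be sketched directly (a minimal illustration, not Spitz's implementation; the nested-list image format with 1 = black is an assumption):

```python
def black_runs(row):
    """Return (start, end) column pairs of consecutive black (1) pixels."""
    runs, start = [], None
    for x, v in enumerate(row + [0]):          # sentinel 0 closes a final run
        if v and start is None:
            start = x
        elif not v and start is not None:
            runs.append((start, x - 1))
            start = None
    return runs

def upward_concavities(image):
    """Count upward concavities: two black runs on one scan line with a
    run on the line below spanning the gap between them."""
    count = 0
    for y in range(len(image) - 1):
        above = black_runs(image[y])
        below = black_runs(image[y + 1])
        for (s1, e1), (s2, e2) in zip(above, above[1:]):  # adjacent run pairs
            # some run below must reach from the left run to the right run
            if any(s <= e1 and e >= s2 for s, e in below):
                count += 1
    return count
```

A histogram of the vertical positions of such detections, taken relative to the baseline, then yields the distributional signature discussed next.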

Once found, the position of each upward concavity in relation to the baseline of the character is noted and a histogram of these positions for the entire document constructed. Analysis of such histograms for Latin-based languages shows a distinctly bimodal distribution, with the majority of upward concavities occurring either slightly above the baseline or slightly below the x-height line. In contrast to this, Han-based scripts exhibit a much more uniform distribution, with the modal value typically evenly spaced between the baseline and x-height. Using this information, it is possible to accurately distinguish Latin-based and Han-based scripts using a simple measure of variance. No histograms were provided for other script types such as Greek, Hebrew, or Cyrillic, so it is unknown how such a method will perform for these script types. In constant use, this method has never been observed to incorrectly classify Latin or Han-based scripts [7].

As well as this technique, a number of other approaches to automatic script recognition have been proposed. Hochberg used textual symbols extracted from individual characters to classify text regions, with template matching used to classify each such symbol [8]. Pal and Chaudhuri use properties of individual text lines to distinguish between five scripts, including the challenging Devanagari and Bangla scripts [9]. This approach relies on extensive knowledge of the distinctive properties of these scripts, making it unsuitable for applications where unknown scripts are to be added and recognized.

3 TEXTURE ANALYSIS FOR SCRIPT RECOGNITION

The work presented in the previous section has shown excellent results in identifying a limited number of script types in ideal conditions. In practice, however, such techniques have a number of disadvantages which in many cases make identification of some scripts difficult. The detection of upward concavities in an image is highly susceptible to noise and image quality, with poor quality and noisy images having high variances in these attributes. Experiments conducted on noisy, low-resolution, or degraded document images have shown that classification performance drops to below 70 percent for only two script classes of Latin-based and Han. The second disadvantage of the technique proposed by Spitz is that it cannot effectively discriminate between scripts with similar character shapes, such as Greek, Cyrillic, and Latin-based scripts, even though such scripts are easily visually distinguished by untrained observers.

Determination of script type from individual characters is also possible using OCR technology. This approach, as well as others which rely on the extraction of connected components, requires accurate segmentation of characters before their application, a task which becomes difficult for noisy, low-resolution, or degraded images. Additionally, certain script types, for example, Sanskrit, do not lend themselves well to character segmentation, and require special processing. This presents a paradox in that to extract the characters for script identification, the script, in some cases, must already be known. Using global image characteristics to recognize the script type of a document image overcomes many of these limitations. Because it is not necessary to extract individual characters, no script-dependent processing is required. The effects of noise, image quality, and resolution are also limited to the extent that they impair the visual appearance of a sample. In most cases, the script of the document can still be readily preattentively determined by a human observer regardless of such factors, indicating that the overall texture of the image is maintained. For these reasons, texture analysis appears to be a good choice for the problem of script identification from document images.

Previous work in the use of texture analysis for script identification has been limited to the use of Gabor filterbanks [10], [11]. While this work has shown that texture can provide a good indication of the script type of a document image, other texture features may give better results for this task, and we investigate this possibility in this work.

4 PREPROCESSING OF IMAGES

In general, blocks of text extracted from typical document images are not good candidates for the extraction of texture features. The varying degrees of contrast in gray-scale images and the presence of skew and noise could all potentially affect such features, leading to higher classification error rates. Additionally, the large areas of white space, unequal character, word, and line spacings, and line heights can also have a significant effect on these features. In order to reduce the impact of these factors, the text blocks from which texture features are to be extracted must undergo a significant amount of preprocessing. The individual steps which are performed in this stage are binarization, deskewing, and block normalization.

The segmentation and extraction of text regions from a document image is a difficult problem which has received significant attention in the literature [12], [13], [14], [15], [16]. This stage of processing, however, is beyond the scope of this paper, and manual extraction of text regions is performed for all experiments. Such an approach is consistent with previous work in this field, with manual extraction of text regions used by a number of authors [2], [11].

Binarization can be described as the process of converting a gray-scale image into one which contains only two distinct tones, that is, black and white. This is an essential stage in many of the algorithms used in document analysis, especially those that identify connected components, that is, groups of pixels which are connected to form a single entity. Although document images are typically produced with a high level of contrast for ease of reading, scanning artifacts, noise, paper defects, colored regions, and other image characteristics can sometimes make this a nontrivial task, with many possible solutions presented to date in the literature. In general, a decision threshold is used to determine the final binary value of each pixel, with the actual threshold value determined either globally for the entire image [17], [18], [19] or locally for different regions of the image [20], [21], [22]. Iterative approaches to binarization, which allow for better performance in the presence of textured and other irregular backgrounds, have also been proposed [23].

For the purposes of this evaluation, all of the images used are of high contrast with no background shading effects. Because of this, a global thresholding approach provides an adequate means of binarization, and the method proposed by Otsu in [17] is used.

4.1 Skew Detection and Correction

Knowing the skew of a document is necessary for many document analysis tasks. Calculating projection profiles, for example, requires knowledge of the skew angle of the image to a high precision in order to obtain an accurate result.
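This dependence can be illustrated with a naive brute-force search (an illustrative sketch only, not one of the Fourier-domain techniques this paper relies on; the shear approximation of rotation, the binary nested-list image format, and the candidate-angle grid are all assumptions): the projection profile of correctly aligned text has the highest variance, with sharp peaks at text lines and deep valleys between them.

```python
import math

def shear_profile(image, angle_deg):
    """Horizontal projection profile of a binary image (1 = black) after
    an approximate rotation by angle_deg, implemented as a column shear."""
    h, w = len(image), len(image[0])
    t = math.tan(math.radians(angle_deg))
    offset = int(abs(t) * w) + 1           # room for sheared-out rows
    prof = [0] * (h + 2 * offset)
    for y in range(h):
        for x in range(w):
            if image[y][x]:
                prof[y + int(x * t) + offset] += 1
    return prof

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((v - m) ** 2 for v in xs) / len(xs)

def estimate_skew(image, angles):
    """Return the candidate angle whose profile variance is largest:
    aligned text concentrates black pixels into few profile bins."""
    return max(angles, key=lambda a: variance(shear_profile(image, a)))
```

In practice such an exhaustive search is slow, which is one motivation for the spectral methods described next.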

In practical situations, the exact skew angle of a document is rarely known, as scanning errors, different page layouts, or even deliberate skewing of text can result in misalignment. In order to correct this, it is necessary to accurately determine the skew angle of a document image or of a specific region of the image, and, for this purpose, a number of techniques have been presented in the literature.

Postl [24] found that the maximum-valued position in the Fourier spectrum of a document image corresponds to the angle of skew. However, this finding was limited to those documents that contained only a single line spacing, thus the peak was strongly localized around a single point. When variant line spacings are introduced, a series of Fourier spectrum maxima are created in a line that extends from the origin. Also evident is a subdominant line that lies at 90 degrees to the dominant line. This is due to character and word spacings, and the strength of such a line varies with changes in language and script type. Peake and Tan expand on this method, breaking the document image into a number of small blocks, and calculating the dominant direction of each such block by finding the Fourier spectrum maxima [25]. These maximum values are then combined over all such blocks and a histogram formed. After smoothing, the maximum value of this histogram is chosen as the approximate skew angle. The exact skew angle is then calculated by taking the average of all values within a specified range of this approximate. There is some evidence that this technique is invariant to document layout and will still function even in the presence of images and other noise [25]. A number of other techniques for the estimation of the skew angle have also been proposed [26], [27], [28].

Expanding on the work of Peake and Tan, Lowther et al. use the Radon transform to accurately locate the peak of the Fourier spectrum of the document image [29]. In order to remove DC components and the higher weightings of the diagonals, a circular mask is first applied to the spectrum. Experimental results have shown that this technique provides superior accuracy in estimating the skew angle of a wide range of documents, with the correct skew angle of over 96 percent of test documents determined to within 0.25 degrees [29], and almost all within 1 degree. Because of these results, this technique has been used to detect and correct the skew of all text regions in each of the experiments outlined in Section 6.

4.2 Normalization of Text Blocks

Extraction of texture features from a document image requires that the input images exhibit particular properties. The images must be of the same size, resolution, orientation, and scale. Line and word spacing, character sizes and heights, and the amount of white space surrounding the text, if any, can also affect texture features. In order to minimize the effects of such variations and provide a robust texture estimate, our system attempts to normalize each text region before extracting texture features. This process will also remove text regions that are too small to be characterized adequately by texture features.

An effective algorithm for overcoming these problems and normalizing each region of text has been developed, based on the work done by Peake and Tan [10]. After binarization, deskewing, and segmentation of the document image, a number of operations are performed on each region in order to give it a uniform appearance.

First, horizontal projection profiles are taken for each segment. By detecting valleys in these profiles, the positions of line breaks, as well as the height of each line and line space, are calculated, assuming that the text is correctly aligned following deskewing. An example of a typical projection profile obtained in this manner is shown in Fig. 2.

Fig. 2. Example of projection profile of text segment. (a) Original text and (b) projection profile.

Having detected the lines of text, the average height of the lines in the region is then calculated and those that are either significantly larger or smaller than this average are discarded. Investigation of many regions has found that such lines often represent headings, captions, footnotes, or other nonstandard text, and as such may have an undesirable effect on the resulting texture features if they are retained. The remaining lines are then normalized by the following steps:

1. Each line is scaled to convert it to a standard height. Although 15 pixels has been found to provide good results in our experiments, larger or smaller values may be more appropriate in situations where the expected input resolution of the document images is significantly higher or lower.
2. Normalization of character and word spacings. Often, modern word processing software expands spaces between words and even characters to completely fill a line of text on a page, leading to irregular and sometimes large areas of white space. Tabulation and other formatting techniques may also cause similar problems. By traversing the line and ensuring that each space does not exceed a specified distance (two-thirds of the standard height), this white space can be removed.
3. Removal and padding of short lines. After performing the above operations on each line, the length of the longest line is determined, and each of the others padded to extend them to this length to avoid large areas of white space at the ends of lines. To accomplish this, the line is repeated until the desired length is achieved. Clearly, for lines which are very short, such repetition may lead to peaks in the resulting spatial frequency spectrum of the final image, and hence lines which do not satisfy a minimum length, expressed as a percentage of the longest line, are simply removed.

Following normalization, lines must be recombined to construct the final block of text. When performing this stage of processing, it is important that the line spacings are constant to avoid significant white space between lines.
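The valley-based line detection and height filtering used in this procedure can be sketched as follows (a simplified sketch; the `min_height` guard and the 50 percent height tolerance are illustrative assumptions, not values from the paper):

```python
def text_lines(profile, min_height=2):
    """Split a horizontal projection profile (black-pixel count per row)
    into text lines: a line is a maximal run of nonzero rows, i.e. a peak
    between two valleys. Returns (top, bottom) row index pairs."""
    lines, start = [], None
    for y, v in enumerate(profile + [0]):      # sentinel 0 closes a final line
        if v and start is None:
            start = y
        elif not v and start is not None:
            if y - start >= min_height:        # ignore one-row noise specks
                lines.append((start, y - 1))
            start = None
    return lines

def filter_by_height(lines, tol=0.5):
    """Discard lines much taller or shorter than the average height, which
    often correspond to headings, captions, or footnotes."""
    heights = [b - t + 1 for t, b in lines]
    avg = sum(heights) / len(heights)
    return [ln for ln, h in zip(lines, heights) if abs(h - avg) <= tol * avg]
```

The surviving lines would then be rescaled to the standard height and recombined as described above.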

Fig. 3. Example of text normalization process on an English document image. (a) Original image and (b) normalized block.

Fig. 4. Example of text normalization process on a Chinese document image. (a) Original image and (b) normalized block.

Due to differences in the nature of various scripts, analysis of the line is required in order to determine the limits and relative frequencies of character heights. For example, Latin-based scripts comprise mostly characters of a small height, such as "a," "r," and "s," with a small but significant number of characters which protrude above and/or below these heights, for example, "A," "h," "p," and "j." In contrast to this, almost all characters in some other scripts, such as Chinese, have identical heights and, thus, require a somewhat larger line spacing in order to maintain a uniform appearance. Determination of which class a sample of text belongs to can be easily made by an examination of the projection profiles of each line. In order to obtain a uniform appearance over all script types, this information is taken into account when normalizing the line spacings. To allow for a more uniform appearance, samples of text with uniform or near-uniform character heights are combined using a larger line spacing.

An example showing the effect of the entire normalization process applied to a typical text segment is shown in Fig. 3. From this example, it can be seen that the original image, which is somewhat noisy and contains many large regions of whitespace, highly variable line spacing, and nonstandard text in the form of equations, is transformed into a block of relatively uniform appearance. Closer inspection reveals the existence of repeated sections; however, preattentively this is not apparent. The algorithm described above works equally well on all tested scripts and languages, which is clearly an important property for this application. Fig. 4 shows the results obtained after processing a Chinese document image. Note in this example the increased line spacings due to the equal height of characters in the Chinese script.

5 TEXTURE FEATURE EXTRACTION

From each block of normalized text, the following texture features are evaluated for the purpose of script identification.

5.1 Gray-Level Co-Occurrence Matrix Features

Gray-level co-occurrence matrices (GLCMs) are used to represent the pairwise joint statistics of the pixels of an image and have been used for many years as a means of characterizing texture [30]. For a gray-scale image quantized to R discrete levels, such matrices contain R × R elements and can be defined for an image I as

    P_d(i, j) = |{((r, s), (t, v)) : I(r, s) = i, I(t, v) = j}| / NM,   (1)

where | · | represents the cardinality of a set. Due to the variable parameter d, the set of co-occurrence matrices is arbitrarily large. In practice, the most relevant correlations occur at short distances and, thus, the values of d are typically kept small, and expressed in the form (d, θ), with d representing the lineal distance in pixels, and θ the angle between them. Typically, θ is restricted to the values {0°, 45°, 90°, 135°}, and d limited to a small range of values. It is also possible to modify the GLCM somewhat to ensure diagonal symmetry of the matrix. This is achieved by the transformation

    P'_(d,θ) = P_(d,θ) + P^T_(d,θ).   (2)

For a typical image with R ≥ 8, the size of the resulting GLCMs makes their direct use unwieldy, and statistical features such as correlation, entropy, energy, and homogeneity are instead extracted and used to characterize the image [30]. Due to the binary nature of the document images from which the features are extracted, the extraction of such features is unnecessary and indeed counterproductive. Since there are only two gray levels, the matrices will be of size 2 × 2, meaning that it is possible to fully describe each matrix with only three unique parameters due to the diagonal symmetry property. Using these values directly is feasible and has experimentally been shown to give better results than attempting to extract the co-occurrence features of such a small matrix. Using values of d = {1, 2} and θ = {0°, 45°, 90°, 135°} leads to a total of 24 features.

5.2 Gabor Energy Features

The energy of the output of a bank of Gabor filters has been previously used as features for identifying the script of a document image, with good results shown for a small set of test images [10], [11]. In this work, both even and odd symmetric filters are used; they are described by

    g_e(x, y) = exp(−(1/2)(x²/σ_x² + y²/σ_y²)) cos(2π u_0 (x cos θ + y sin θ)),   (3)

    g_o(x, y) = exp(−(1/2)(x²/σ_x² + y²/σ_y²)) sin(2π u_0 (x cos θ + y sin θ)),   (4)

where x and y are the spatial coordinates, u_0 the frequency of the sinusoidal component of the Gabor filter, and σ_x and σ_y the widths of the Gaussian envelope along the principal axes, typically with σ_x = σ_y.

the frequencies of the Gaussian envelope along the principal axes, typically with σ_x = σ_y. In the experimental results presented in [11], a single value of u_0 = 16 was used, with 16 orientation values spaced equidistantly between 0 and 2π, giving a total of 16 filters. By combining the energies of the outputs of the even and odd symmetric filters for each such orientation, a feature vector of the same dimensionality is created. To obtain rotation invariance, this vector is transformed via the Fourier transform, and the first four resulting coefficients used for classification. Since skew detection and correction has been performed on the test images to be used in these experiments, such a transformation is not required, and the features will be used unmodified. By combining the energies of the odd and even symmetric filters for each resolution and orientation, a total of 16 features are obtained using this method. While these features have shown good performance on a small number of script types [11], using only a single frequency does not provide the necessary discrimination when a large set of scripts and fonts are used. To overcome this, an additional 16 filters with a frequency of u_0 = 8 are employed, giving a final dimensionality of 32.

5.3 Wavelet Energy Features

The wavelet transform has emerged over the last two decades as a formal, concise theory of signal decomposition and has been used to good effect in a wide range of disciplines and practical applications. A discrete, two-dimensional form of the transform can be defined as [31]

A_j = [H_x ∗ [H_y ∗ A_{j−1}]↓_{2,1}]↓_{1,2} (5)
D_{j1} = [G_x ∗ [H_y ∗ A_{j−1}]↓_{2,1}]↓_{1,2} (6)
D_{j2} = [H_x ∗ [G_y ∗ A_{j−1}]↓_{2,1}]↓_{1,2} (7)
D_{j3} = [G_x ∗ [G_y ∗ A_{j−1}]↓_{2,1}]↓_{1,2}, (8)

where A_j and D_{jk} are the approximation and detail coefficients at each resolution level j, H and G are the low and high-pass filters, respectively, and ↓_{x,y} represents downsampling along each axis by the given factors. The energies of each detail band of this transform, calculated by

E_{jk} = (1/MN) Σ_{m=1}^{M} Σ_{n=1}^{N} D_{jk}(m, n)², (9)

where M and N represent the size of each detail image, have been used by many authors as a texture feature vector [32], [33]. Although such features are relatively primitive in nature, their wide use and simple nature make them an ideal point of reference with which to compare more sophisticated approaches.

These features can be directly extracted from a region of normalized text, giving a total of 3J features, where J is the total number of decomposition levels used in the transform. In the evaluation conducted in this paper, a value of J = 4 is used, leading to a feature dimensionality of 12. The choice of analyzing wavelet is also of importance when extracting such features. Although the literature has presented results using a number of different analyzing wavelets, the family of biorthogonal spline wavelets [34] is a popular choice due to their symmetry, compact support, smoothness, and regularity properties. Previous work has shown that these wavelets perform well in texture characterization problems, and we have used a second-order wavelet of this type in all work presented in this paper [35].

5.4 Wavelet Log Mean Deviation Features

Previous work in the field of texture classification has shown that by applying a nonlinear function to the coefficients of the wavelet transform, a better representation of naturally textured images can be obtained [36]. By using a logarithmic transform and extracting the mean deviation of these values rather than the energy, significant improvements in overall classification accuracy were obtained when compared to the standard wavelet energy signatures, at negligible increase in computational cost. These features, named the wavelet log mean deviation features, are defined as [36]

LMD_{jk} = (1/MN) Σ_{m=1}^{M} Σ_{n=1}^{N} log(|D_{jk}(n, m)| / (S_j λ) + 1), (10)

where λ is a constant specifying the degree of nonlinearity in the transform, and S_j represents the estimated maximum value of the coefficients at resolution level j. Although the optimal value of λ is highly dependent upon the textures being used, previous work has found that a value of λ = 0.001 performs well over a wide variety of natural textures. The total number of features obtained in this manner is equal to that of the wavelet energy features; thus, when using four levels of decomposition, a dimensionality of 12 is again obtained.

5.5 Wavelet Co-Occurrence Signatures

By extracting second-order features of the wavelet coefficients, represented by the co-occurrence features at small distances, it is possible to significantly improve the classification of natural textures [35]. Such features are extracted from each of the wavelet detail images in a similar manner to the GLCM features described previously, with linear quantization used to transform the near-continuous wavelet coefficients to discrete form. In order to avoid overly sparse matrices, an undecimated form of the wavelet transform is used in place of the standard two-dimensional FWT, providing greater spatial resolution, less sparse co-occurrence matrices at low resolutions, and translation invariance [37]. These features are known as the wavelet co-occurrence features.

The nonlinear transform described above can also be used when calculating the wavelet co-occurrence features by modifying the quantization function, with experimental results showing significantly reduced overall error rates when used to classify a variety of natural textures [36]. This is most easily achieved by modifying the quantization function q(x) such that, for a desired number of levels I,

q_1(x) = { round(α log(x / (S_j λ) + 1)),    x ≥ 0
         { −round(α log(|x| / (S_j λ) + 1)),  x < 0, (11)

where

α = (I − 1) / log(1/λ + 1). (12)

Once the wavelet coefficients are quantized using (11), co-occurrence matrices are formed in an identical manner to the construction of the GLCMs described previously. From such matrices, the following co-occurrence features are extracted: energy, entropy, inertia, local homogeneity, contrast, cluster shade, cluster prominence, and information measure of correlation, as shown in Table 1 [30]. These features are known as the wavelet log co-occurrence features.
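As a concrete illustration, the energy signature of (9) and the log mean deviation signature of (10) for a single detail subband can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation; the function names and the fallback per-subband estimate of S_j are our own illustrative choices.

```python
import numpy as np

def wavelet_energy(D):
    """Energy signature of one detail subband, as in (9)."""
    D = np.asarray(D, dtype=float)
    return np.sum(D ** 2) / D.size

def log_mean_deviation(D, lam=0.001, S=None):
    """Wavelet log mean deviation signature of one subband, as in (10).

    lam is the nonlinearity constant (0.001 is the value reported to
    work well); S estimates the maximum coefficient magnitude at this
    resolution level.
    """
    D = np.abs(np.asarray(D, dtype=float))
    if S is None:
        S = D.max()  # illustrative per-subband estimate of S_j
    return np.sum(np.log(D / (S * lam) + 1.0)) / D.size
```

Applying both functions to the three detail subbands of a four-level decomposition yields the 12-dimensional energy and log mean deviation feature vectors described above.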
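The co-occurrence matrices that underlie both the GLCM features and the wavelet (log) co-occurrence features can be sketched as follows. This is a minimal sketch assuming integer-quantized input; only a few of the Table 1 features are shown, and all names are illustrative rather than taken from the original implementation.

```python
import numpy as np

def cooccurrence_matrix(img, d=1, theta=0.0, levels=2):
    """Normalized co-occurrence matrix of an integer-quantized image
    for displacement d at orientation theta (radians)."""
    dy = int(round(d * np.sin(theta)))
    dx = int(round(d * np.cos(theta)))
    C = np.zeros((levels, levels))
    rows, cols = img.shape
    for y in range(rows):
        for x in range(cols):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < rows and 0 <= x2 < cols:
                C[img[y, x], img[y2, x2]] += 1
    return C / C.sum()  # normalize to joint probabilities

def cooccurrence_features(P):
    """A few of the Table 1 features of a normalized matrix P."""
    i, j = np.indices(P.shape)
    nz = P[P > 0]
    return {
        "energy": float(np.sum(P ** 2)),
        "entropy": float(-np.sum(nz * np.log(nz))),
        "inertia": float(np.sum((i - j) ** 2 * P)),
        "local_homogeneity": float(np.sum(P / (1.0 + (i - j) ** 2))),
    }
```

For the binarized text images, levels = 2; for wavelet coefficients, the subband is first quantized (linearly, or logarithmically via (11)) before the matrices are formed.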
TABLE 1
Co-Occurrence Features Extracted from GLCMs

Extracting these features for the first four resolution levels of the wavelet decomposition gives a total of 96 features for both the linear and logarithmic quantized cases.

5.6 Wavelet Scale Co-Occurrence Signatures

The wavelet scale co-occurrence signatures have been recently shown to provide unique texture information by describing the relationships between scales of the wavelet transform, allowing improved modeling of visual texture features which contain information on many scales and orientations [38]. A scale co-occurrence matrix is defined as [36]

S_{ji}(k, l) = (1/NM) |{(u, v) : q_1(D_{ji}(u, v)) = k, q_2(A_j(u, v)) = l}|, (13)

where A_j(u, v) is the approximation image at resolution level j, D_{ji}(u, v) are the three detail images, q_1(x) and q_2(x) are the quantization functions for the detail and approximation coefficients, respectively, and (k, l) ∈ {1 ... I}, where I is the number of discrete quantization levels used. The logarithmic quantization function described in (11) is used for the detail coefficients, while linear quantization has been shown to be more suitable for the approximation data. From each of the scale co-occurrence matrices, a number of the features described in Table 1 are extracted. These features have been shown to perform well on a variety of naturally textured images, with lower overall classification errors when applied to some texture databases.

6 CLASSIFICATION RESULTS

The proposed algorithm for automatic script identification from document images was tested on a database containing eight different script types (Latin, Chinese, Japanese, Greek, Cyrillic, Hebrew, Sanskrit, and Farsi). Examples of these images are shown in Fig. 5. Each such image was binarized, deskewed, and normalized using the algorithms described above, and 200 segments, each 64 × 64 pixels in size, extracted for each script class. This sample size corresponds to roughly four lines of printed text, typically containing two or three words on each line. Although higher accuracies could be obtained by using larger areas or even complete regions, we have used these small regions to simulate situations where only a limited amount of text is available. The images obtained from this process were then divided into two equal groups to create the training and testing sets, ensuring that samples taken from the same image were placed into the same group.

In order to improve classification accuracy and reduce the dimensionality of the feature space, feature reduction is performed prior to classification by means of linear discriminant analysis [39]. This technique maps the feature space to one of lower dimensionality while maximizing the Fisher criterion, a measure of class separability. For a training set of N classes and a feature dimensionality of M, this analysis will return an M → (N − 1) mapping, representing the N − 1 hyperplanes necessary to segment the feature space linearly. To illustrate the effectiveness of this technique, the results obtained both with and without performing this step are shown in Table 2.

Classification of the samples is performed using a Gaussian mixture model (GMM) classifier, which attempts to model each class as a combination of Gaussian distributions in feature space [39], and is trained using a version of the expectation maximization (EM) algorithm. Due to the large range of possible scripts, a dynamic method of determining a classifier topology for each class is required. For this purpose, we have chosen to use the Bayes information criterion (BIC), which can be approximated for each candidate topology T_i by [40], [41], [42]

BIC(T_i) = log p(X_i | T_i, Θ̂_i) − (γ/2) K_i log N_i, (14)

where X_i is the set of N_i training observations, Θ̂_i the parametric form of the trained classifier, K_i the number of free parameters in the model, and γ a scaling factor. From (14), it can be seen that this criterion is made up of the likelihood of the model given the training data minus a penalty factor which increases linearly with model complexity.

The overall classification error rates for each of the texture features are shown in Table 2. It can be seen that the wavelet log co-occurrence features significantly outperform any of the other features for script classification, with an overall error rate of only 1 percent. This result is relatively consistent with those reported for natural textures [36], indicating that local relationships between wavelet coefficients are an excellent basis for representing texture. The relative increase in performance of these features compared to those extracted with linear quantization is again consistent with previously published results.

The scale co-occurrence features did not perform as well on the binary script images as has been previously reported for natural textures [38], with only a slightly reduced error rate when compared to the wavelet energy features. The GLCM features showed the worst overall performance, from which it can be concluded that pixel relationships at small distances are insufficient to characterize the script of a document image. The poor performance of the features proposed by Spitz can be attributed to the fact that they
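The BIC-based topology selection of (14) amounts to scoring each candidate mixture order by its training log-likelihood minus a complexity penalty, and keeping the best. The sketch below assumes diagonal-covariance mixtures for the parameter count; the function names and the scoring interface are illustrative only.

```python
import math

def bic(log_likelihood, num_params, num_obs, scale=1.0):
    """Bayes information criterion of (14): likelihood minus a penalty
    that grows with the number of free model parameters."""
    return log_likelihood - 0.5 * scale * num_params * math.log(num_obs)

def gmm_free_params(num_mixtures, dim):
    """Free parameters of a diagonal-covariance GMM:
    (M - 1) weights plus M means and M variance vectors of size dim."""
    return (num_mixtures - 1) + num_mixtures * dim * 2

def select_topology(candidates, dim, num_obs):
    """candidates: (log_likelihood, num_mixtures) pairs of trained models;
    returns the pair with the highest BIC."""
    return max(candidates,
               key=lambda c: bic(c[0], gmm_free_params(c[1], dim), num_obs))
```

A more complex model must therefore buy its extra parameters with a sufficiently large gain in log-likelihood, which keeps the per-class topologies compact.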
Fig. 5. Examples of document images used for training and testing. (a) English, (b) Chinese, (c) Greek, (d) Cyrillic, (e) Hebrew, (f) Hindi, (g) Japanese,
and (h) Persian.
were only designed to distinguish between Latin and Han-based scripts and cannot effectively discriminate script pairs such as Greek and Latin or Persian and Devanagari. Table 3 shows the distribution of the errors among the various script classes for the wavelet log co-occurrence features. In order to give a meaningful distribution, linear discriminant analysis was not used when generating these results. The Chinese script shows the lowest overall error rate for these features, with the largest errors arising from misclassifications between the Cyrillic and Greek scripts.

TABLE 2
Script Recognition Results for Each of the Feature Sets with and without Feature Reduction

7 ADAPTIVE GMMS FOR IMPROVED CLASSIFIER PERFORMANCE

Printed text, regardless of the script, has a distinct visual texture and is easily recognized as such by a casual observer. With such a commonality between all script classes, it is possible to use this a priori knowledge to improve the modeling of each individual texture class. This is done by training a global model using all available training data, then adapting this model for each individual class, rather than creating each model independently. By doing this, a more robust representation can be obtained, somewhat overcoming the blind nature of the learning algorithm. It is also possible to train a class using fewer training observations, since an initial starting point for the model is already available. This technique has been used with great success in applications where the general form of a model can be estimated using prior information, such as the modeling of speech and speakers, and is known as maximum a posteriori (MAP) adaptation [43].

7.1 MAP Adaptation

When estimating the parametric form of a classifier, the maximum likelihood (ML) estimate is defined as the parameter set Θ̂ such that [44], [39]

Θ̂ = arg max_Θ l(Θ), (15)
TABLE 3
Confusion Matrix for the Wavelet Log Co-Occurrence Features

where l(Θ) is the likelihood of the training observations for that parametric form Θ, defined as

l(Θ) = p(o | Θ). (16)

Given these definitions, the ML framework can be thought of as finding a fixed but unknown set of parameters Θ̂. In contrast to this, the maximum a posteriori (MAP) approach assumes Θ to be a random vector with a known distribution, with an assumed correlation between the training observations and the parameters Θ [45]. From this assumption, it becomes possible to make a statistical inference of Θ using only a small set of adaptation data o, and prior knowledge of the parameter density g(Θ). The MAP estimate therefore maximizes the posterior density such that

Θ_MAP = arg max_Θ g(Θ | o) (17)
      = arg max_Θ l(o | Θ) g(Θ). (18)

Since the parameters of a prior density can also be estimated from an existing set of parameters Θ_0, the MAP framework also provides an optimal method of combining Θ_0 with a new set of observations o.

In the case of a Gaussian distribution, the MAP estimations of the mean m̃ and variance σ̃² can be obtained using the framework presented above, given prior distributions g(m) and g(σ²), respectively. If the mean alone is to be estimated, this can be shown to be given by [46]

m̃ = (Tσ² / (σ_x² + Tσ²)) x̄ + (σ_x² / (σ_x² + Tσ²)) μ, (19)

where T is the total number of training observations, x̄ is the mean of those observations, σ_x² is the variance of the observation density, and μ and σ² are the mean and variance, respectively, of the conjugate prior of m. From (19), it can be seen that the MAP estimate of the mean is a weighted average of the conjugate prior mean μ and the mean of the training observations. As T → 0, this estimate will approach the prior μ, and as T → ∞, it will approach x̄, which is the ML estimate.

Using MAP to estimate the variance parameter, with a fixed mean, is accomplished in a somewhat simpler manner. Typically, a fixed prior density is used, such that

g(σ²) = { constant, σ² ≥ σ²_min
        { 0,        otherwise, (20)

where σ²_min is estimated from a large number of observations, in the case of our application the entire database of script images. Given this simplified density function, the MAP estimate of the variance is then given by

σ̃² = { S_x,     S_x ≥ σ²_min
     { σ²_min,  otherwise, (21)

where S_x is the variance of the training observations. This procedure is often known as variance clipping and is effective in situations where limited training data does not allow for an adequate estimate of the variance parameter.

In the current application of script recognition, the prior parameter density g(Θ) can be estimated using a global model of script, trained using all available data. This choice is justified by the observation that printed text, in general, regardless of script type, has a somewhat unique appearance, and as such the texture features obtained should be relatively well clustered in feature space. Training for each of the individual textures is then carried out by adapting this global model for each individual script class. In order to create more stable representations and limit computational expense, only the mean parameters and weights of these mixtures are adapted during this process, using (19).

7.2 Classification Results

Using the same training data as the previous experiment and the MAP approach outlined above, a global script model was created for each of the feature sets and adapted separately for each script class. The optimal number of mixtures for these models was again determined dynamically using the Bayes information criterion (BIC) described in Section 6. The overall classification results from this experiment are shown in Table 4. These results show a small improvement in overall classifier error when compared to those of Table 2, due to the more robust model obtained by utilizing prior information.

It is important to note that in these experiments a relatively large amount of training data (100 samples per class) is used, resulting in models which are stable and well-defined. In situations where less training data is available, it is expected that results will be somewhat poorer, and the benefit of using MAP adaptation to create a starting point
TABLE 4
Script Recognition Results for Various Feature Sets Using MAP Adaptation with Large Training Sets

TABLE 5
Script Recognition Results with and without MAP Adaptation for Various Texture Features for Small Training Sets
for each model more clearly illustrated. To test this hypothesis, the amount of training data was reduced to only 25 samples per class, and the experiment above repeated. The overall classification error rates obtained with and without using MAP are shown in Table 5. These results more clearly indicate the benefits of the MAP adaptation process, with error rates significantly reduced for each of the feature sets when compared to using models trained independently using the ML algorithm.

8 MULTIFONT SCRIPT RECOGNITION

Within a given script there typically exists a large number of fonts, often of widely varying appearance. Because of such variations, it is unlikely that a model trained on one set of fonts will consistently correctly identify an image of a previously unseen font of the same script. To overcome this limitation, it is necessary to ensure that an adequate number of training observations from each font to be recognized is provided in order that a sufficiently complex model is developed.

In addition to requiring large amounts of training data, creating a model for each font type necessitates a high degree of user interaction, with a correspondingly higher chance of human error. In order to reduce this level of supervision, an ideal system would automatically identify the presence of multiple fonts in the training data and process this information as required.

8.1 Clustered LDA

The linear discriminant function described previously attempts to transform the feature space such that the interclass separation is maximized while the intraclass separation is minimized, by finding the maximum of the cost function tr(C S_w⁻¹ S_b C′). While this function is optimal in this sense, it does make a number of strong assumptions regarding the nature of the distributions of each class in feature space. All classes are assumed to have equal covariance matrices, meaning that the resulting transform will be optimal only in the sense of separation of the class means. Additionally, since the function is linear, multimodal distributions cannot be adequately partitioned in some circumstances. Fig. 6 shows a simplistic synthetic example of this case, where two classes are clearly well separated in feature space, yet have the same mean and therefore cannot be effectively separated by a linear discriminant function. When analyzing scripts containing multiple fonts, it is common to encounter such multimodal distributions in feature space within a particular script, as the texture features extracted from different fonts can vary considerably due to the unique characteristics of each.

To overcome this limitation of LDA, we propose to perform automatic clustering on the data prior to determining the discriminant function, and assign a separate class label to each individual cluster. Training and classification is then performed on this extended set of classes, and the final decision mapped back to the original class set. Although this leads to less training data for each individual subclass, using the adaptation technique presented in the previous section can somewhat alleviate this problem.

The k-means clustering algorithm is a fast, unsupervised, nondeterministic, iterative method for generating a fixed number of disjoint clusters. Each data point is randomly assigned to one of k initial clusters, such that each cluster has approximately the same number of points. In each subsequent iteration of the algorithm, the distance from each point to each of the clusters is calculated using some metric, and the point is moved into the cluster with the lowest such distance. Commonly used metrics are the Euclidean distance to the centroid of the clusters or a weighted distance which considers only the closest n points. The algorithm terminates when no points are moved in a single iteration. As the final result is highly dependent on the initialization of the clusters, the algorithm is often repeated a number of times, with each solution scored according to some evaluation function.

Fig. 6. Synthetic example of the limitations of LDA. The two multimodal distributions, although well separated in feature space, have identical means and, hence, an effective linear discriminant function cannot be determined.
TABLE 6
Script Recognition Error Rates for Scripts Containing Multiple Fonts When Trained with a Single Model

TABLE 7
Script Recognition Error Rates for Scripts Containing Multiple Fonts When Clustering Is Used to Create Multiple Models
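The clustered-LDA class expansion can be sketched as follows: each script's training features are split into subclasses by a clustering routine, subclass labels are used for LDA and classifier training, and a predicted subclass is mapped back to its parent script. The names and the cluster_fn interface here are illustrative assumptions; in the paper the clustering routine is k-means.

```python
import numpy as np

def expand_classes(features_by_class, cluster_fn, k=10):
    """Split each script class into up to k subclasses and return the
    stacked features, subclass labels, and a subclass -> script map."""
    X, sub_labels, parent = [], [], {}
    next_id = 0
    for script, feats in features_by_class.items():
        feats = np.asarray(feats, dtype=float)
        labels = cluster_fn(feats, min(k, len(feats)))
        for lab in np.unique(labels):
            parent[next_id + int(lab)] = script
        X.append(feats)
        sub_labels.append(labels + next_id)
        next_id += int(labels.max()) + 1
    return np.vstack(X), np.concatenate(sub_labels), parent

def to_script(subclass_id, parent):
    """Map a classifier decision on a subclass back to its script."""
    return parent[int(subclass_id)]
```

LDA then operates on the expanded label set, so each font cluster receives its own separating hyperplanes, while the final decision is still reported at the script level.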
Determining the optimal number of clusters is a problem which has been previously addressed in the literature [47], [48], [49]. However, for the purposes of multifont script recognition, using a fixed number of clusters has been shown to provide adequate results at significantly reduced computational cost. In the experiments in the following section, 10 clusters are used in all cases, as this number was found to be generally sufficient to describe the font variations present within all of the tested scripts. Although the majority of classes can in fact be represented adequately using fewer than this number of clusters, using more clusters does not significantly degrade performance.

8.2 Classification Results

To illustrate the limitations of using a single model in a multifont environment, experiments using a number of fonts from each script class were conducted. A total of 30 fonts were present in the database, with 10 from Latin script, four each from Chinese, Japanese, and Persian, and three each from Sanskrit, Hebrew, Greek, and Cyrillic. 100 training and testing samples were extracted from each font type.

Each of the scripts was first trained as a single class using the MAP classification system proposed above. From the results shown in Table 6, it can be seen that large errors are introduced, with the most common misclassification occurring between fonts of the Latin and Greek scripts. Interestingly, these results show that the simpler texture features do not suffer the same performance degradation as the more complex features, with the wavelet energy signatures showing the lowest overall classification error of 12.3 percent.

The proposed clustering algorithm is implemented by using k-means clustering to partition each class into 10 regions. Each subclass is then assigned an individual label, and LDA and classification are performed as normal. The results of this experiment are shown in Table 7, with the wavelet log co-occurrence features again providing the lowest overall error rate of 2.1 percent. Although the error rates for each of the feature sets are slightly higher than the single-font results of Table 5, a vast improvement is achieved when compared to the results obtained using a single model only.

9 CONCLUSIONS AND FUTURE WORK

This paper has shown the effectiveness of texture analysis techniques in the field of document processing and, more specifically, the problem of automatic script identification. A number of texture features were evaluated for the purpose of script recognition, including GLCM, Gabor filterbank energies, and a number of wavelet transform-based features. By using such features, it is not necessary to extract individual script components, making them ideal for degraded and noisy documents or situations where such segmentation is not possible. The amount of text required for accurate recognition is also quite small, with as little as five words sufficient in some cases. When classifying scripts containing a single font, experimental results have shown that texture features can outperform other script recognition techniques, with the wavelet log co-occurrence features giving the lowest overall classification error rate.

In order to provide a more stable model of each script class, as well as to reduce the need for excessive training data, a technique was proposed whereby MAP adaptation is used to create a global script model. Because of the strong interclass correlations which exist between the extracted features of script textures, this approach was found to be well-suited to the application of automatic script identification. Experimental results showed a small increase in overall classification performance when using large training sets, and significant improvement when limited training data is available.

Using a single model to characterize multiple fonts within a script class has been shown to be inadequate, as the fonts within a script class can vary considerably in appearance, often resulting in a multimodal distribution in feature space. To overcome this problem, a technique whereby each class is automatically segmented using the k-means clustering algorithm before performing LDA is presented. By doing this, a number of subclasses are automatically generated and trained without the need for any user intervention. Experiments performed on a multifont script database have shown that this technique can successfully identify scripts containing multiple fonts and styles.

REFERENCES

[1] I. Bazzi, R. Schwartz, and J. Makhoul, "An Omnifont Open-Vocabulary OCR System for English and Arabic," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 21, no. 6, pp. 495-504, June 1999.
[2] A.L. Spitz, "Determination of the Script and Language Content of Document Images," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 3, pp. 235-245, Mar. 1997.
[3] C. Suen, N.N. Bergler, B. Waked, C. Nadal, and A. Bloch, "Categorizing Document Images into Script and Language Classes," Proc. Int'l Conf. Advances in Pattern Recognition, pp. 297-306, 1998.
[4] B. Julesz, "Visual Pattern Discrimination," IRE Trans. Information Theory, vol. 8, pp. 84-92, 1962.
[5] C. Ronse and P. Devijver, Connected Components in Binary Images: The Detection Problem. Research Studies Press, 1984.
[6] D.S. Lee, C.R. Nohl, and H.S. Baird, "Language Identification in Complex, Unoriented, and Degraded Document Images," Proc. IAPR Workshop Document Analysis and Systems, pp. 76-98, 1996.
[7] A.L. Spitz and M. Ozaki, "Palace: A Multilingual Document Recognition System," Proc. Int'l Assoc. for Pattern Recognition Workshop Document Analysis Systems, pp. 16-37, 1995.
[8] J. Hochberg, "Automatic Script Identification from Images Using Cluster-Based Templates," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 2, pp. 176-181, Feb. 1997.
[9] U. Pal and B.B. Chaudhuri, "Automatic Identification of English, Chinese, Arabic, Devnagari and Bangla Script Line," Proc. Sixth Int'l Conf. Document Analysis and Recognition, pp. 790-794, 2001.
[10] G. Peake and T. Tan, "Script and Language Identification from Document Images," Proc. Workshop Document Image Analysis, vol. 1, pp. 10-17, 1997.
[11] T. Tan, "Rotation Invariant Texture Features and Their Use in Automatic Script Identification," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 7, pp. 751-756, July 1998.
[12] M. Acharyya and M.K. Kundu, "Document Image Segmentation Using Wavelet Scale-Space Features," IEEE Trans. Circuits and Systems for Video Technology, vol. 12, no. 12, pp. 1117-1127, 2002.
[13] P. Clark and M. Mirmedhi, "Combining Statistical Measures to Find Image Text Regions," Proc. 15th Int'l Conf. Pattern Recognition, pp. 450-453, 2000.
[14] N. Jin and Y.Y. Tang, "Text Area Localization under Complex-Background Using Wavelet Decomposition," Proc. Sixth Int'l Conf. Document Analysis and Recognition, pp. 1126-1130, 2001.
[15] H. Li, D. Doermann, and O. Kia, "Automatic Text Detection and Tracking in Digital Video," IEEE Trans. Image Processing, vol. 9, no. 1, pp. 147-156, 2000.
[16] V. Wu, R. Manmatha, and E.M. Riseman, "Finding Text in Images," Proc. Second ACM Int'l Conf. Digital Libraries, 1997.
[17] N. Otsu, "A Threshold Selection Method from Gray-Level Histograms," IEEE Trans. Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62-66, 1979.
[18] J. Kittler and J. Illingworth, "Minimum Error Thresholding," Pattern Recognition, vol. 19, pp. 41-47, 1986.
[19] J.N. Kapur, P.K. Sahoo, and A.K.C. Wong, "A New Method for Gray-Level Picture Thresholding Using the Entropy of the Histogram," Computer Vision, Graphics, and Image Processing, vol. 29, pp. 273-285, 1985.
[20] Y. Liu, R. Fenich, and S.N. Srihari, "An Object Attribute Thresholding Algorithm for Document Image Binarization," Proc. Int'l Conf. Document Analysis and Recognition, pp. 278-281, 1993.
[21] J. Yang, Y. Chen, and W. Hsu, "Adaptive Thresholding Algorithm and Its Hardware Implementation," Pattern Recognition Letters, vol. 15, pp. 141-150, 1994.
[22] J. Sauvola and M. Pietikainen, "Adaptive Document Image Binarization," Pattern Recognition, vol. 33, pp. 225-236, 2000.
[23] Y. Liu and S.N. Srihari, "Document Image Binarization Based on Texture Features," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 5, pp. 540-544, May 1997.
[24] W. Postl, "Detection of Linear Oblique Structures and Skew Scan
[31] I. Daubechies, "The Wavelet Transform, Time-Frequency Localization and Signal Analysis," IEEE Trans. Information Theory, vol. 36, pp. 961-1005, 1990.
[32] T. Chang and C.C. Kuo, "Texture Segmentation with Tree-Structured Wavelet Transform," Proc. IEEE Int'l Symp. Time-Frequency and Time-Scale Analysis, vol. 2, p. 577, 1992.
[33] H. Greenspan, S. Belongie, R. Goodman, and P. Perona, "Rotation Invariant Texture Recognition Using a Steerable Pyramid," Proc. 12th Int'l Conf. Pattern Recognition, vol. 2, pp. 162-167, 1994.
[34] M. Unser, A. Aldroubi, and M. Eden, "A Family of Polynomial Spline Wavelet Transforms," Signal Processing, vol. 30, pp. 141-162, 1993.
[35] G. Van de Wouwer, P. Scheunders, and D. Van Dyck, "Statistical Texture Characterization from Discrete Wavelet Representations," IEEE Trans. Image Processing, vol. 8, no. 4, pp. 592-598, 1999.
[36] A. Busch, W.W. Boles, and S. Sridharan, "Logarithmic Quantization of Wavelet Coefficients for Improved Texture Classification Performance," Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing, 2004.
[37] S.G. Mallat, "Zero-Crossings of a Wavelet Transform," IEEE Trans. Information Theory, vol. 37, pp. 1019-1033, 1991.
[38] A. Busch and W.W. Boles, "Texture Classification Using Wavelet Scale Relationships," Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing, vol. 4, pp. 3484-3487, 2002.
[39] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification. New York: John Wiley & Sons, Inc., 2001.
[40] R.E. Kass and A.E. Raftery, "Bayes Factors," J. Am. Statistical Assoc., vol. 90, pp. 773-795, 1994.
[41] J. Olivier and R. Baxter, "MML and Bayesianism: Similarities and Differences," Technical Report 206, Monash Univ., Australia, 1994.
[42] G. Schwarz, "Estimating the Dimensionality of a Model," Ann. Statistics, vol. 6, no. 2, pp. 461-464, 1978.
[43] D.A. Reynolds, "Comparison of Background Normalization Methods for Text-Independent Speaker Verification," Proc. EUROSPEECH, vol. 2, pp. 963-970, 1997.
[44] K. Fukunaga, Introduction to Statistical Pattern Recognition, second ed. San Diego: Academic Press, 1990.
[45] C. Lee and J. Gauvain, "Bayesian Adaptive Learning and MAP Estimation of HMM," Automatic Speech and Speaker Recognition: Advanced Topics, Boston: Kluwer Academic, pp. 83-107, 1996.
[46] C.-H. Lee, C.-H. Lin, and B.-H. Juang, "A Study on Speaker Adaptation of the Parameters of Continuous Density Hidden Markov Models," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 39, no. 4, pp. 806-814, 1991.
[47] H.-S. Rhee and K.-W. Oh, "A Validity Measure for Fuzzy Clustering and Its Use in Selecting Optimal Number of Clusters," Proc. Fifth IEEE Int'l Conf. Fuzzy Systems, vol. 2, pp. 1020-1025, 1996.
[48] K.S. Younis, M.P. DeSimio, and S.K. Rogers, "A New Algorithm for Detecting the Optimal Number of Substructures in the Data," Proc. IEEE Aerospace and Electronics Conf., vol. 1, pp. 503-507, 1997.
[49] I. Gath and A.B. Geva, "Unsupervised Optimal Fuzzy Clustering," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 11, no. 7, pp. 773-780, July 1989.

Andrew Busch received a double degree in electronic engineering and information technology and the PhD degree in engineering from the Queensland University of Technology, Australia,
in Digitized Documents,” Proc. Int’l Conf. Pattern Recognition, in 1998 and 2004, respectively. He currently
pp. 687-689, 1986. works as a lecturer within the school of Micro-
[25] G. Peake and T. Tan, “A General Algorithm for Document Skew electronic Engineering at Griffith University,
Angle Estimation,” Proc. Int’l Conf. Image Processing, vol. 2, pp. 230- Australia. His research interests include texture
233, 1997. classification, multiresolution signal analysis,
[26] B.B. Chaudhuri and U. Pal, “Skew Angle Detection of Digitized document analysis, and the use of imagery for
Indian Script Documents,” IEEE Trans. Pattern Analysis and biometric authentication. He is a member of IEEE.
Machine Intelligence, vol. 19, no. 2, pp. 703-712, Feb. 1997.
[27] H.S. Baird, “The Skew Angle of Printed Documents,” Document
Image Analysis, L. O’Gorman and R. Kasturi, eds., IEEE CS Press,
pp. 204-208, 1995.
[28] A. Vailaya, H.J. Zhang, and A.K. Jain, “Automatic Image
Orientation Detection,” Proc. Int’l Conf. Image Processing, vol. 2,
pp. 600-604, 1999.
[29] S. Lowther, V. Chandran, and S. Sridharan, “An Accurate Method
for Skew Determination in Document Images,” Digital Image
Computing Techniques and Applications, vol. 1, pp. 25-29, 2002.
[30] R.M. Haralick, K. Shanmugam, and I. Dinstein, “Textural Features
for Image Classification,” IEEE Trans. Systems, Man, and Cyber-
netics, vol. 3, pp. 610-621, 1973.
Wageeh W. Boles is an associate professor at the School of Engineering Systems, Queensland University of Technology (QUT), Australia. Professor Boles obtained the BSc degree in electrical engineering from Assiut University, Egypt, and the MSc and PhD degrees in electrical engineering from the University of Pittsburgh. He also obtained a graduate certificate in education (higher education) from QUT. He held the academic positions of assistant professor at Penn State University, then lecturer, senior lecturer, and associate professor at Queensland University of Technology. From 1999 to 2004, he held the position of assistant dean (Teaching and Learning) at the Faculty of Built Environment and Engineering, QUT. Professor Boles has obtained numerous competitive research and teaching development grants and has more than 100 publications. He initiated and maintained a unique, active, and pioneering research effort in developing new image processing techniques and adapting them to various applications such as biometric human identification using iris and palm images, object recognition, and texture analysis. He is very passionate about his teaching and has published in the areas of work integrated learning and the study and utilization of learners’ cognitive styles in the design and implementation of computer-based learning solutions. Professor Boles was awarded two Outstanding Teaching Assistant Medals from the University of Pittsburgh in 1987/1988. In 1999, he received the QUT Award for Outstanding Academic Contribution in Teaching and Leadership and the Faculty’s Teaching Excellence Award, and was one of only three University nominees to the Australian Awards for University Teaching. He also won the national Engineers Australia-Australasian Association for Engineering Education (EA-AAEE) Award for Excellence in Teaching and Learning in Engineering Education in 2004. He has been a member of the Executive of the Australasian Association for Engineering Education (AaeE) since 2001 and a member of the IEEE since 1984. He is also a member of the IEEE Computer Society.

Sridha Sridharan received the BSc degree in electrical engineering and the MSc degree in communication engineering from the University of Manchester Institute of Science and Technology (UMIST), United Kingdom, and the PhD degree in the area of signal processing from the University of New South Wales, Australia. He is a fellow of the Institution of Engineers, Australia, a senior member of the IEEE, and the chairman of the IEEE Queensland Chapter in Signal Processing and Communication. He is currently with the Queensland University of Technology (QUT), where he is a professor in the School of Electrical and Electronic Systems Engineering. Professor Sridharan is the leader of the Research Program in Speech, Audio, Image, and Video Technologies (SAIVT) at QUT and a deputy director of the Information Security Institute (ISI) at QUT. In 1997, he received the award of Outstanding Academic of QUT in the area of research and scholarship.