Texture For Script Identification
Abstract—The problem of determining the script and language of a document image has a number of important applications in the
field of document analysis, such as indexing and sorting of large collections of such images, or as a precursor to optical character
recognition (OCR). In this paper, we investigate the use of texture as a tool for determining the script of a document image, based on
the observation that text has a distinct visual texture. An experimental evaluation of a number of commonly used texture features is
conducted on a newly created script database, providing a qualitative measure of which features are most appropriate for this task.
Strategies for improving classification results in situations with limited training data and multiple font types are also proposed.
Index Terms—Script identification, wavelets and fractals, texture, document analysis, clustering, classification and association rules.
Once found, the position of each upward concavity in relation to the baseline of the character is noted and a histogram of these positions for the entire document constructed. Analysis of such histograms for Latin-based languages shows a distinctly bimodal distribution, with the majority of upward concavities occurring either slightly above the baseline or slightly below the x-height line. In contrast to this, Han-based scripts exhibit a much more uniform distribution, with the modal value typically evenly spaced between the baseline and x-height. Using this information, it is possible to accurately distinguish Latin-based and Han-based scripts using a simple measure of variance. No histograms were provided for other script types such as Greek, Hebrew, or Cyrillic, so it is unknown how such a method will perform for these script types. In constant use, this method has never been observed to incorrectly classify Latin or Han-based scripts [7].
As well as this technique, a number of other approaches to automatic script recognition have been proposed. Hochberg used textual symbols extracted from individual characters to classify text regions, with template matching used to classify each such symbol [8]. Pal and Chaudhuri use properties of individual text lines to distinguish between five scripts, including the challenging Devanagari and Bangla scripts [9]. This method uses a rule-based approach relying on extensive knowledge of the distinctive properties of these scripts, making it unsuitable for applications where unknown scripts are to be added and recognized.
3 TEXTURE ANALYSIS FOR SCRIPT RECOGNITION
The work presented in the previous section has shown excellent results in identifying a limited number of script types in ideal conditions. In practice, however, such techniques have a number of disadvantages which in many cases make identification of some scripts difficult. The detection of upward concavities in an image is highly susceptible to noise and image quality, with poor quality and noisy images having high variances in these attributes. Experiments conducted on noisy, low-resolution, or degraded document images have shown that classification performance drops to below 70 percent for only two script classes of Latin-based and Han. The second disadvantage of the technique proposed by Spitz is that it cannot effectively discriminate between scripts with similar character shapes, such as Greek, Cyrillic, and Latin-based scripts, even though such scripts are easily visually distinguished by untrained observers.
Determination of script type from individual characters is also possible using OCR technology. This approach, as well as others which rely on the extraction of connected components, requires accurate segmentation of characters before their application, a task which becomes difficult for noisy, low-resolution, or degraded images. Additionally, certain script types, for example, Sanskrit, do not lend themselves well to character segmentation, and require special processing. This presents a paradox in that to extract the characters for script identification, the script, in some cases, must already be known. Using global image characteristics to recognize the script type of a document image overcomes many of these limitations. Because it is not necessary to extract individual characters, no script-dependent processing is required. The effects of noise, image quality, and resolution are also limited to the extent that they impair the visual appearance of a sample. In most cases, the script of the document can still be readily preattentively determined by a human observer regardless of such factors, indicating that the overall texture of the image is maintained. For these reasons, texture analysis appears to be a good choice for the problem of script identification from document images.
Previous work in the use of texture analysis for script identification has been limited to the use of Gabor filterbanks [10], [11]. While this work has shown that texture can provide a good indication of the script type of a document image, other texture features may give better results for this task, and we investigate this possibility in this work.
4 PREPROCESSING OF IMAGES
In general, blocks of text extracted from typical document images are not good candidates for the extraction of texture features. The varying degrees of contrast in gray-scale images and the presence of skew and noise could all potentially affect such features, leading to higher classification error rates. Additionally, the large areas of white space, unequal character, word, and line spacings, and line heights can also have a significant effect on these features. In order to reduce the impact of these factors, the text blocks from which texture features are to be extracted must undergo a significant amount of preprocessing. The individual steps which are performed in this stage are binarization, deskewing, and block normalization.
The segmentation and extraction of text regions from a document image is a difficult problem which has received significant attention in the literature [12], [13], [14], [15], [16]. This stage of processing, however, is beyond the scope of this paper, and manual extraction of text regions is performed for all experiments. Such an approach is consistent with previous work in this field, with manual extraction of text regions used by a number of authors [2], [11].
Binarization can be described as the process of converting a gray-scale image into one which contains only two distinct tones, that is, black and white. This is an essential stage in many of the algorithms used in document analysis, especially those that identify connected components, that is, groups of pixels which are connected to form a single entity. Although document images are typically produced with a high level of contrast for ease of reading, scanning artifacts, noise, paper defects, colored regions, and other image characteristics can sometimes make this a nontrivial task, with many possible solutions presented to date in the literature. In general, a decision threshold is used to determine the final binary value of each pixel, with the actual threshold value determined either globally for the entire image [17], [18], [19] or locally for different regions of the image [20], [21], [22]. Iterative approaches to binarization, which allow for better performance in the presence of textured and other irregular background, have also been proposed [23].
For the purposes of this evaluation, all of the images used are of high contrast with no background shading effects. Because of this, a global thresholding approach provides an adequate means of binarization, and the method proposed by Otsu in [17] is used.
4.1 Skew Detection and Correction
Knowing the skew of a document is necessary for many document analysis tasks. Calculating projection profiles, for example, requires knowledge of the skew angle of the
Fig. 3. Example of text normalization process on an English document image. (a) Original image and (b) normalized block.
Fig. 4. Example of text normalization process on a Chinese document image. (a) Original image and (b) normalized block.
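As a rough illustration of the global thresholding used for binarization in Section 4, the following Python sketch selects a threshold by maximizing the between-class variance of the gray-level histogram, which is the criterion of Otsu's method [17]. The function names, the 256-level assumption, and the convention that dark pixels (text) map to 1 are illustrative choices, not details taken from this paper.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that best separates the pixels into two classes.

    pixels: iterable of integer gray values in [0, levels).
    Pixels with value <= t form the dark class, the rest the bright class.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    w0 = 0      # number of pixels in the dark class so far
    sum0 = 0    # sum of gray values in the dark class so far
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, no separation to score
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Proportional to the between-class variance (constant factor dropped).
        score = w0 * w1 * (mu0 - mu1) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def binarize(image, threshold):
    """Map a 2-D list of gray values to 1 (dark, assumed text) or 0 (background)."""
    return [[1 if p <= threshold else 0 for p in row] for row in image]
```

A global threshold of this kind is only expected to be adequate for high-contrast images without background shading, which is the situation described for the images used in this evaluation.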
differences in the nature of various scripts, analysis of the line is required in order to determine the limits and relative frequencies of character heights. For example, Latin-based scripts consist mostly of characters of a small height, such as “a,” “r,” and “s,” with a small but significant number of characters which protrude above and/or below these heights, for example, “A,” “h,” “p,” and “j.” In contrast to this, almost all characters in some other scripts, such as Chinese, have identical heights and, thus, require a somewhat larger line spacing in order to maintain a uniform appearance. Determination of which class a sample of text belongs to can be easily made by an examination of the projection profiles of each line. In order to obtain a uniform appearance over all script types, this information is taken into account when normalizing the line spacings. To allow for a more uniform appearance, samples of text with uniform or near-uniform character heights are combined using a larger line spacing.
An example showing the effect of the entire normalization process applied to a typical text segment is shown in Fig. 3. From this example, it can be seen that the original image, which is somewhat noisy and contains many large regions of whitespace, highly variable line spacing, and nonstandard text in the form of equations, is transformed into a block of relatively uniform appearance. Closer inspection reveals the existence of repeated sections; preattentively, however, this is not apparent. The algorithm described above works equally well on all tested scripts and languages, which is clearly an important property for this application. Fig. 4 shows the results obtained after processing a Chinese document image. Note in this example the increased line spacings due to the equal height of characters in the Chinese script.
5 TEXTURE FEATURE EXTRACTION
From each block of normalized text, the following texture features are evaluated for the purpose of script identification.
5.1 Gray-Level Co-Occurrence Matrix Features
Gray-level co-occurrence matrices (GLCMs) are used to represent the pairwise joint statistics of the pixels of an image and have been used for many years as a means of characterizing texture [30]. For a gray-scale image quantized to $R$ discrete levels, such matrices contain $R \times R$ elements and can be defined for an image $I$ as
$$P_d(i,j) = \frac{\left|\{((r,s),(t,v)) : I(r,s) = i,\ I(t,v) = j\}\right|}{NM}, \tag{1}$$
where $|\cdot|$ represents the cardinality of a set. Due to the variable parameter $d$, the set of co-occurrence matrices is arbitrarily large. In practice, the most relevant correlations occur at short distances and, thus, the values of $d$ are typically kept small, and expressed in the form $(d, \theta)$, with $d$ representing the lineal distance in pixels, and $\theta$ the angle between them. Typically, $\theta$ is restricted to the values $\{0^\circ, 45^\circ, 90^\circ, 135^\circ\}$, and $d$ limited to a small range of values. It is also possible to modify the GLCM somewhat to ensure diagonal symmetry of the matrix. This is achieved by the transformation
$$P_{(d,\theta)} = P_{(d,\theta)} + P_{(-d,\theta)}. \tag{2}$$
For a typical image with $R \geq 8$, the size of the resulting GLCMs makes their direct use unwieldy, and statistical features such as correlation, entropy, energy, and homogeneity are instead extracted and used to characterize the image [30]. Due to the binary nature of the document images from which the features are extracted, the extraction of such features is unnecessary and indeed counterproductive. Since there are only two gray levels, the matrices will be of size $2 \times 2$, meaning that it is possible to fully describe each matrix with only three unique parameters due to the diagonal symmetry property. Using these values directly is feasible and has been experimentally shown to give better results than attempting to extract the co-occurrence features of such a small matrix. Using values of $d = \{1, 2\}$ and $\theta = \{0^\circ, 45^\circ, 90^\circ, 135^\circ\}$ leads to a total of 24 features.
5.2 Gabor Energy Features
The energy of the output of a bank of Gabor filters has been previously used as features for identifying the script of a document image, with good results shown for a small set of test images [10], [11]. In this work, both even and odd symmetric filters are used; they are described by
$$g_e(x,y) = e^{-\frac{1}{2}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2}\right)} \cos\left(2\pi u_0 (x\cos\theta + y\sin\theta)\right) \tag{3}$$
$$g_o(x,y) = e^{-\frac{1}{2}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2}\right)} \sin\left(2\pi u_0 (x\cos\theta + y\sin\theta)\right), \tag{4}$$
where $x$ and $y$ are the spatial coordinates, $u_0$ the frequency of the sinusoidal component of the Gabor filter, and $\sigma_x$ and $\sigma_y$ the widths (standard deviations) of the Gaussian envelope along the principal axes, typically with $\sigma_x = \sigma_y$. In the experimental results presented in [11], a single value of $u_0 = 16$ was used, with 16 orientation values spaced equidistantly between 0 and $2\pi$, giving a total of 16 filters. By combining the energies of the outputs of the even and odd symmetric filters for each such orientation, a feature vector of the same dimensionality is created. To obtain rotation invariance, this vector is transformed via the Fourier transform, and the first four resulting coefficients used for classification. Since skew detection and correction has been performed on the test images to be used in these experiments, such a transformation is not required, and the features will be used unmodified. By combining the energies of the odd and even symmetric filters for each resolution and orientation, a total of 16 features are obtained using this method. While these features have shown good performance on a small number of script types [11], using only a single frequency does not provide the necessary discrimination when a large set of scripts and fonts are used. To overcome this, an additional 16 filters with a frequency of $u_0 = 8$ are employed, giving a final dimensionality of 32.
5.3 Wavelet Energy Features
The wavelet transform has emerged over the last two decades as a formal, concise theory of signal decomposition and has been used to good effect in a wide range of disciplines and practical applications. A discrete, two-dimensional form of the transform can be defined as [31]
$$A_j = \left[H_x * \left[H_y * A_{j-1}\right]_{\downarrow 2,1}\right]_{\downarrow 1,2} \tag{5}$$
$$D_{j1} = \left[G_x * \left[H_y * A_{j-1}\right]_{\downarrow 2,1}\right]_{\downarrow 1,2} \tag{6}$$
$$D_{j2} = \left[H_x * \left[G_y * A_{j-1}\right]_{\downarrow 2,1}\right]_{\downarrow 1,2} \tag{7}$$
$$D_{j3} = \left[G_x * \left[G_y * A_{j-1}\right]_{\downarrow 2,1}\right]_{\downarrow 1,2}, \tag{8}$$
where $A_j$ and $D_{jk}$ are the approximation and detail coefficients at each resolution level $j$, $H$ and $G$ are the low and high-pass filters, respectively, and $\downarrow x,y$ represents downsampling along each axis by the given factors. The energies of each detail band of this transform, calculated by
$$E_{jk} = \frac{\sum_{m=1}^{M}\sum_{n=1}^{N} D_{jk}(m,n)}{MN}, \tag{9}$$
where $M$ and $N$ represent the size of each detail image, have been used by many authors as a texture feature vector [32], [33]. Although such features are relatively primitive in nature, their wide use and simple nature make them an ideal point of reference with which to compare more sophisticated approaches.
These features can be directly extracted from a region of normalized text, giving a total of $3J$ features, where $J$ is the total number of decomposition levels used in the transform. In the evaluation conducted in this paper, a value of $J = 4$ is used, leading to a feature dimensionality of 12. The choice of analyzing wavelet is also of importance when extracting such features. Although the literature has presented results using a number of different analyzing wavelets, the family of biorthogonal spline wavelets [34] is a popular choice due to their symmetry, compact support and smoothness, and regularity properties. Previous work has shown that these wavelets perform well in texture characterization problems, and we have used a second-order wavelet of this type in all work presented in this paper [35].
5.4 Wavelet Log Mean Deviation Features
Previous work in the field of texture classification has shown that by applying a nonlinear function to the coefficients of the wavelet transform, a better representation of naturally textured images can be obtained [36]. By using a logarithmic transform and extracting the mean deviation of these values rather than the energy, significant improvements in overall classification accuracy were obtained when compared to the standard wavelet energy signatures, at negligible increase in computational cost. These features, named the wavelet log mean deviation features, are defined as [36]
$$LMD_{jk} = \frac{\sum_{m=1}^{M}\sum_{n=1}^{N} \log\left(\frac{|D_{jk}(n,m)|}{\delta S_j} + 1\right)}{MN}, \tag{10}$$
where $\delta$ is a constant specifying the degree of nonlinearity in the transform, and $S_j$ represents the estimated maximum value of the coefficients at resolution level $j$. Although the optimal value of $\delta$ is highly dependent upon the textures being used, previous work has found that a value of $\delta = 0.001$ performs well over a wide variety of natural textures. The total number of features obtained in this manner is equal to that of the wavelet energy features; thus, when using four levels of decomposition, a dimensionality of 12 is again obtained.
5.5 Wavelet Co-Occurrence Signatures
By extracting second-order features of the wavelet coefficients, represented by the co-occurrence features at small distances, it is possible to significantly improve the classification of natural textures [35]. Such features are extracted from each of the wavelet detail images in a similar manner to the GLCM features described previously, with linear quantization used to transform the near-continuous wavelet coefficients to discrete form. In order to avoid overly sparse matrices, an undecimated form of the wavelet transform is used in place of the standard two-dimensional FWT, providing greater spatial resolution, less sparse co-occurrence matrices at low resolutions, and translation invariance [37]. These features are known as the wavelet co-occurrence features.
The nonlinear transform described above can also be used when calculating the wavelet co-occurrence features by modifying the quantization function, with experimental results showing significantly reduced overall error rates when used to classify a variety of natural textures [36]. This is most easily achieved by modifying the quantization function $q(x)$ such that for a desired number of levels $I$,
$$q_1(x) = \begin{cases} \operatorname{round}\left[\alpha \log\left(\frac{x}{\delta S_j} + 1\right)\right], & x \geq 0 \\ -\operatorname{round}\left[\alpha \log\left(\frac{|x|}{\delta S_j} + 1\right)\right], & x < 0, \end{cases} \tag{11}$$
where
$$\alpha = \frac{I - 1}{\log(1/\delta + 1)}. \tag{12}$$
Once the wavelet coefficients are quantized using (11), co-occurrence matrices are formed in an identical manner to the construction of the GLCMs described previously. From such matrices, the following co-occurrence features are extracted: energy, entropy, inertia, local homogeneity, contrast, cluster shade, cluster prominence, and information measure of correlation, as shown in Table 1 [30]. These features are known as the wavelet log co-occurrence features.
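The logarithmic nonlinearity shared by (10) and (11) can be sketched in Python as follows. This is a minimal illustration rather than the implementation used for the experiments: it assumes natural logarithms, a caller-supplied estimate `s_j` of the maximum coefficient magnitude at level j, `delta` as the nonlinearity constant, and the scale factor of (12) computed internally. The function names are hypothetical.

```python
import math


def log_mean_deviation(detail, s_j, delta=0.001):
    """Log mean deviation of one detail image, in the spirit of (10).

    detail: 2-D list of wavelet detail coefficients.
    s_j:    estimated maximum coefficient magnitude at this level (assumed given).
    """
    rows, cols = len(detail), len(detail[0])
    total = sum(math.log(abs(d) / (delta * s_j) + 1.0)
                for row in detail for d in row)
    return total / (rows * cols)


def log_quantize(x, s_j, levels, delta=0.001):
    """Signed logarithmic quantization of one coefficient, in the spirit of (11).

    The scale factor follows (12), so that |x| == s_j maps to levels - 1.
    """
    alpha = (levels - 1) / math.log(1.0 / delta + 1.0)
    q = round(alpha * math.log(abs(x) / (delta * s_j) + 1.0))
    return q if x >= 0 else -q
```

Because the logarithm compresses large magnitudes, small coefficients are spread over more quantization levels than they would be under linear quantization, which is the motivation given for the transform.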
Extracting these features for the first four resolution levels of the wavelet decomposition gives a total of 96 features for both the linear and logarithmic quantized cases.
TABLE 1
Co-Occurrence Features Extracted from GLCMs
5.6 Wavelet Scale Co-Occurrence Signatures
The wavelet scale co-occurrence signatures have been recently shown to provide unique texture information by describing the relationships between scales of the wavelet transform, allowing improved modeling of visual texture features which contain information on many scales and orientations [38]. A scale co-occurrence matrix is defined as [36]
$$S_{ji}(k,l) = \frac{\left|\{(u,v) : q_1(D_{ji}(u,v)) = k,\ q_2(A_j(u,v)) = l\}\right|}{NM}, \tag{13}$$
where $A_j(u,v)$ is the approximation image at resolution level $j$, $D_{ji}(u,v)$ are the three detail images, $q_1(x)$ and $q_2(x)$ are the quantization functions for the detail and approximation coefficients, respectively, and $(k,l) \in \{1 \ldots I\}$, where $I$ is the number of discrete quantization levels used. The logarithmic quantization function described in (11) is used for the detail coefficients, while linear quantization has been shown to be more suitable for the approximation data. From each of the scale co-occurrence matrices, a number of the features described in Table 1 are extracted. These features have been shown to perform well on a variety of naturally textured images, with lower overall classification errors when applied to some texture databases.
6 CLASSIFICATION RESULTS
The proposed algorithm for automatic script identification from document images was tested on a database containing eight different script types (Latin, Chinese, Japanese, Greek, Cyrillic, Hebrew, Sanskrit, and Farsi). Examples of these images are shown in Fig. 5. Each such image was binarized, deskewed, and normalized using the algorithms described above, and 200 segments, each $64 \times 64$ pixels in size, extracted for each script class. This sample size corresponds to roughly four lines of printed text, typically containing two or three words on each line. Although higher accuracies could be obtained by using larger areas or even complete regions, we have used these small regions to simulate situations where only a limited amount of text is available. The images obtained from this process were then divided into two equal groups to create the training and testing sets, ensuring that samples taken from the same image were placed into the same group.
Fig. 5. Examples of document images used for training and testing. (a) English, (b) Chinese, (c) Greek, (d) Cyrillic, (e) Hebrew, (f) Hindi, (g) Japanese, and (h) Persian.
In order to improve classification accuracy and reduce the dimensionality of the feature space, feature reduction is performed prior to classification by means of linear discriminant analysis [39]. This technique maps the feature space to one of lower dimensionality while maximizing the Fisher criterion, a measure of class separability. For a training set of $N$ classes and a feature dimensionality of $M$, this analysis will return an $M \to (N - 1)$ mapping, representing the $N - 1$ hyperplanes necessary to segment the feature space linearly. To illustrate the effectiveness of this technique, the results obtained both with and without performing this step are shown in Table 2.
Classification of the samples is performed using a Gaussian mixture model (GMM) classifier, which attempts to model each class as a combination of Gaussian distributions in feature space [39], and is trained using a version of the expectation maximization (EM) algorithm. Due to the large range of possible scripts, a dynamic method of determining a classifier topology for each class is required. For this purpose, we have chosen to use the Bayes information criterion (BIC), which can be approximated for each candidate topology $T_i$ by [40], [41], [42]
$$BIC(T_i) = \log p(X_i \mid T_i, \hat{\theta}_i) - \lambda \frac{K_i}{2} \log N_i, \tag{14}$$
where $X_i$ is the set of $N_i$ training observations, $\hat{\theta}_i$ the parametric form of the trained classifier, $K_i$ the number of free parameters in the model, and $\lambda$ a scaling factor. From (14), it can be seen that this criterion is made up of the likelihood of the model given the training data minus a penalty factor which increases linearly with model complexity.
The overall classification error rates for each of the texture features are shown in Table 2. It can be seen that the wavelet log co-occurrence features significantly outperform any of the other features for script classification, with an overall error rate of only 1 percent. This result is relatively consistent with those reported for natural textures [36], indicating that local relationships between wavelet coefficients are an excellent basis for representing texture. The relative increase in performance of these features compared to those extracted with linear quantization is again consistent with previously published results.
The scale co-occurrence features did not perform as well on the binary script images as has been previously reported for natural textures [38], with only a slightly reduced error rate when compared to the wavelet energy features. The GLCM features showed the worst overall performance, from which it can be concluded that pixel relationships at small distances are insufficient to characterize the script of a document image. The poor performance of the features proposed by Spitz can be attributed to the fact that they were only designed to distinguish between Latin and Han-based scripts and cannot effectively discriminate script pairs such as Greek and Latin or Persian and Devanagari.
Table 3 shows the distribution of the errors among the various script classes for the wavelet log co-occurrence features. In order to give a meaningful distribution, linear discriminant analysis was not used when generating these results. The Chinese script shows the lowest overall error rate for these features, with the largest errors arising from misclassifications between the Cyrillic and Greek scripts.
TABLE 2
Script Recognition Results for Each of the Feature Sets with and without Feature Reduction
7 ADAPTIVE GMMS FOR IMPROVED CLASSIFIER PERFORMANCE
Printed text, regardless of the script, has a distinct visual texture and is easily recognized as such by a casual observer. With such a commonality between all script classes, it is possible to use this a priori knowledge to improve the modeling of each individual texture class. This is done by training a global model using all available training data, then adapting this model for each individual class, rather than creating each model independently. By doing this, a more robust representation can be obtained, somewhat overcoming the blind nature of the learning algorithm. It is also possible to train a class using fewer training observations, since an initial starting point for the model is already available. This technique has been used with great success in applications where the general form of a model can be estimated using prior information, such as the modeling of speech and speakers, and is known as maximum a posteriori (MAP) adaptation [43].
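The adaptation described above rests on the MAP estimate of a Gaussian mean, which appears later as (19): a weighted average of the prior mean and the mean of the class observations. The following Python sketch illustrates that estimate for a single scalar feature. It is a simplified illustration, assuming a known component variance `sigma2` and a prior mean and variance taken from a global model; the function and parameter names are hypothetical and it does not cover the adaptation of mixture weights.

```python
def map_mean(observations, prior_mean, prior_var, sigma2):
    """MAP estimate of a Gaussian mean with a Gaussian prior on the mean.

    observations: list of scalar training values for one class.
    prior_mean, prior_var: mean and variance of the prior (assumed to come
        from the global model).
    sigma2: variance of the Gaussian being adapted (assumed known).
    """
    t = len(observations)
    if t == 0:
        return prior_mean  # no data: the estimate is the prior mean
    x_bar = sum(observations) / t
    # Weight on the data grows with t, so more data means less reliance
    # on the prior.
    weight = t * prior_var / (sigma2 + t * prior_var)
    return weight * x_bar + (1.0 - weight) * prior_mean
```

With few observations the estimate stays close to the global model, and with many observations it approaches the sample mean, which is the behavior that makes adaptation useful when training data is limited.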
TABLE 3
Confusion Matrix for the Wavelet Log Co-Occurrence Features
Þ is the likelihood of the training observations for
where lð constant 2 2min
gð2 Þ ¼ ð20Þ
that parametric form defined as 0 otherwise;
where 2min is estimated from a large number of observa-
Þ ¼ pðoj
lð Þ: ð16Þ
tions, in the case of our application the entire database of
Given these definitions, the ML framework can be thought script images. Given this simplified density function, the
of as finding a fixed but unknown set of parameters ^ . In MAP estimate of the variance is then given by
contrast to this, the maximum a posterior (MAP) approach
assumes to be a random vector with a known distribution, 2 Sx Sx 2min
~ ¼ ð21Þ
0 otherwise;
with an assumed correlation between the training observa-
tions and the parameters [45]. From this assumption, it where Sx is the variance of the training observations. This
becomes possible to make a statistical inference of using procedure is often known as variance clipping and is effective
only a small set of adaption data o, and prior knowledge of in situations where limited training data does not allow for
the parameter density gð Þ. The MAP estimate therefore an adequate estimate of the variance parameter.
maximizes the posterior density such that In the current application of script recognition, the prior
parameter density gð Þ can be estimated using a global
MAP ¼ arg max gð
joÞ ð17Þ model of script, trained using all available data. This choice
is justified by the observation that printed text, in general,
¼ arg max lðoj
Þ: ð18Þ regardless of script type, has a somewhat unique appear-
ance and as such the texture features obtained should be
Since the parameters of a prior density can also be relatively well clustered in feature space. Training for each
estimated from an existing set of parameters 0 , the MAP of the individual textures is then carried out by adapting
framework also provides an optimal method of combining this global model for each individual script class. In order to
0 with a new set of observations o. create more stable representations and limit computational
In the case of a Gaussian distribution, the MAP expense, only the mean parameters and weights of these
estimations of the mean m ~ and variance ~2 can be obtained mixtures are adapted during this process, using (19).
using the framework presented above, given prior distribu- 7.2 Classification Results
tions of gðmÞ and gð2 Þ, respectively. If the mean alone is to
Using the same training data as the previous experiment
be estimated, this can be shown to be given by [46] and the MAP approach outlined above, a global script
model was created for each of the feature sets and adapted
T 2 2
~ ¼
x þ ; ð19Þ separately for each script class. The optimal number of
2 þ T 2 2 þ T 2 mixtures for these models was again determined dynami-
where T is the total number of training observations, x is cally using the Bayes information criterion (BIC) described
the mean of those observations, and and 2 are the mean in Section 6. The overall classification results from this
and variance, respectively, of the conjugate prior of m. From experiment are shown in Table 4. These results show a
(19), it can be seen that the MAP estimate of the mean is a small improvement in overall classifier error when com-
pared to those of Table 2, due to the more robust model
weighted average of the conjugate prior mean and the
obtained by utilizing prior information.
mean of the training observations. As T ! 0, this estimate It is important to note that in these experiments a
will approach the prior , and as T ! 1, it will approach x, relatively large amount of training data (100 samples per
which is the ML estimate. class) is used, resulting in models which are stable and well-
Using MAP to estimate the variance parameter, with a defined. In situations where less training data is available, it
fixed mean, is accomplished in a somewhat simpler is expected that results will be somewhat poorer, and the
manner. Typically, a fixed prior density is used, such that benefit of using MAP adaptation to create a starting point
Script Recognition Results for Various Feature Sets Script Recognition Results with and without MAP Adaptation for
Using MAP Adaptation with Large Training Sets Various Texture Features for Small Training Sets
for each model more clearly illustrated. To test this hypothesis, the amount of training data was reduced to
only 25 samples per class, and the experiment above repeated. The overall classification error rates obtained
with and without using MAP are shown in Table 5. These results more clearly indicate the benefits of the MAP
adaptation process, with error rates significantly reduced for each of the feature sets when compared to using
models trained independently using the ML algorithm.

8 MULTIFONT SCRIPT RECOGNITION
Within a given script there typically exists a large number of fonts, often of widely varying appearance.
Because of such variations, it is unlikely that a model trained on one set of fonts will consistently identify
an image of a previously unseen font of the same script. To overcome this limitation, it is necessary to ensure
that an adequate number of training observations from each font to be recognized is provided, so that a
sufficiently complex model can be developed.

In addition to requiring large amounts of training data, creating a model for each font type necessitates a
high degree of user interaction, with a correspondingly higher chance of human error. In order to reduce this
level of supervision, an ideal system would automatically identify the presence of multiple fonts in the
training data and process this information as required.

8.1 Clustered LDA
The linear discriminant function described previously attempts to transform the feature space such that the
interclass separation is maximized, while minimizing the intraclass separation, by finding the maximum of the
cost function $\operatorname{tr}(C S_w^{-1} S_b C')$. While this function is optimal in this sense, it does
make a number of strong assumptions regarding the nature of the distributions of each class in feature space.
All classes are assumed to have equal covariance matrices, meaning that the resulting transform will be optimal
only in the sense of separation of the class means. Additionally, since the function is linear, multimodal
distributions cannot be adequately partitioned in some circumstances. Fig. 6 shows a simplistic synthetic
example of this case, where two classes are clearly well separated in feature space but have the same mean and
therefore cannot be effectively separated by a linear discriminant function. When analyzing scripts containing
multiple fonts, it is common to encounter such multimodal distributions in feature space within a particular
script, as the texture features extracted from different fonts can vary considerably due to the unique
characteristics of each.

Fig. 6. Synthetic example of the limitations of LDA. The two multimodal distributions, although well separated
in feature space, have identical means and, hence, an effective linear discriminant function cannot be
determined.

To overcome this limitation of LDA, we propose to perform automatic clustering on the data prior to determining
the discriminant function, and assign a separate class label to each individual cluster. Training and
classification are then performed on this extended set of classes, and the final decision is mapped back to the
original class set. Although this leads to less training data for each individual subclass, using the
adaptation technique presented in the previous section can somewhat alleviate this problem.

The k-means clustering algorithm is a fast, unsupervised, nondeterministic, iterative method for generating a
fixed number of disjoint clusters. Each data point is randomly assigned to one of k initial clusters, such that
each cluster has approximately the same number of points. In each subsequent iteration of the algorithm, the
distance from each point to each of the clusters is calculated using some metric, and the point is moved into
the cluster with the lowest such distance. Commonly used metrics are the Euclidean distance to the centroid of
the clusters or a weighted distance which considers only the closest n points. The algorithm terminates when no
points are moved in a single iteration. As the final result is highly dependent on the initialization of the
clusters, the algorithm is often repeated a number of times, with each solution scored according to some
evaluation function.
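The k-means procedure described in Section 8.1 can be illustrated with a minimal sketch in plain Python. It is not the implementation used in the experiments. The function names (`kmeans_once`, `kmeans`), the squared Euclidean distance to the cluster centroid, the use of the within-cluster sum of squared distances as the evaluation function, the iteration cap, and the reseeding of empty clusters are all assumptions made for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroids(points, labels, k, rng):
    """Mean of each cluster; an empty cluster is reseeded at a random point (assumption)."""
    centres = []
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if members:
            centres.append(tuple(sum(x) / len(members) for x in zip(*members)))
        else:
            centres.append(rng.choice(points))
    return centres


def kmeans_once(points, k, rng, max_iter=100):
    # Random initial assignment with near-equal cluster sizes:
    # shuffle the point indices and deal them out round-robin.
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for position, index in enumerate(order):
        labels[index] = position % k

    for _ in range(max_iter):  # iteration cap is an added safeguard (assumption)
        centres = centroids(points, labels, k, rng)
        moved = 0
        for i, p in enumerate(points):
            nearest = min(range(k), key=lambda c: sq_dist(p, centres[c]))
            if nearest != labels[i]:
                labels[i] = nearest
                moved += 1
        if moved == 0:  # terminate when no point changes cluster
            break

    # Score the solution: within-cluster sum of squared distances (assumed criterion).
    centres = centroids(points, labels, k, rng)
    score = sum(sq_dist(p, centres[lab]) for p, lab in zip(points, labels))
    return labels, centres, score


def kmeans(points, k, restarts=10, seed=0):
    """Repeat from different random starts and keep the lowest-scoring solution."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[2])


if __name__ == "__main__":
    blob_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
    blob_b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
    labels, centres, score = kmeans(blob_a + blob_b, k=2)
    print(labels, centres, score)
```

Dealing shuffled indices round-robin is one simple way to obtain a random initial assignment in which every cluster starts with approximately the same number of points. The restart loop reflects the sensitivity to initialization noted in the text.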
Table 6. Script Recognition Error Rates for Scripts Containing Multiple Fonts When Trained with a Single Model
Table 7. Script Recognition Error Rates for Scripts Containing Multiple Fonts When Clustering Is Used to Create Multiple Models
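The clustered LDA scheme of Section 8.1 (cluster each class, give each cluster its own subclass label, train on the subclasses, and map each decision back to the original class) can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the system evaluated in the paper. All function names are hypothetical. The synthetic data (two classes with two modes each and a shared mean, in the spirit of Fig. 6) is invented. The k-means step uses a farthest-point initialisation rather than the random balanced assignment described in the text, the classifier is a nearest-projected-mean rule rather than the MAP-adapted models of the previous section, and two clusters per class are used instead of 10 because the toy data has two modes.

```python
import numpy as np


def kmeans_labels(X, k, rng, n_iter=100):
    """Lloyd-style k-means with farthest-point initialisation (an assumption)."""
    centres = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(X[d2.argmax()])
    centres = np.array(centres)
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # no point moved
            break
        labels = new_labels
        for c in range(k):
            if np.any(labels == c):  # an emptied cluster keeps its old centre
                centres[c] = X[labels == c].mean(axis=0)
    return labels


def lda_fit(X, y):
    """Return class labels, class means and a projection built from Sw^-1 Sb."""
    classes = np.unique(y)
    dim = X.shape[1]
    overall_mean = X.mean(axis=0)
    Sw, Sb = np.zeros((dim, dim)), np.zeros((dim, dim))
    means = []
    for c in classes:
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        means.append(mu)
        Sw += (Xc - mu).T @ (Xc - mu)        # within-class scatter
        diff = (mu - overall_mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)      # between-class scatter
    evals, evecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    keep = np.argsort(-evals.real)[: min(dim, len(classes) - 1)]
    return classes, np.array(means), evecs.real[:, keep]


def lda_predict(model, X):
    """Assign each sample to the class whose projected mean is nearest."""
    classes, means, W = model
    Z, M = X @ W, means @ W
    d2 = ((Z[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
    return classes[d2.argmin(axis=1)]


def clustered_lda_fit(X, y, k, rng):
    """Split every class into k subclasses, then train LDA on the subclass labels."""
    sub = np.empty(len(y), dtype=int)
    parent = []  # parent[s] is the original class of subclass s
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        sub[idx] = len(parent) + kmeans_labels(X[idx], k, rng)
        parent.extend([c] * k)
    return lda_fit(X, sub), np.array(parent)


def clustered_lda_predict(model, X):
    """Classify into subclasses, then map each decision back to its original class."""
    lda, parent = model
    return parent[lda_predict(lda, X)]


def make_data(rng, n_per_mode=100):
    """Invented toy data: two bimodal classes that share the same overall mean."""
    modes = {0: [(-4, -4), (4, 4)], 1: [(-4, 4), (4, -4)]}
    X, y = [], []
    for label, centres in modes.items():
        for centre in centres:
            X.append(rng.normal(centre, 1.0, size=(n_per_mode, 2)))
            y += [label] * n_per_mode
    return np.vstack(X), np.array(y)


def demo(seed=0):
    rng = np.random.default_rng(seed)
    X_train, y_train = make_data(rng)
    X_test, y_test = make_data(rng)
    plain = lda_fit(X_train, y_train)
    clustered = clustered_lda_fit(X_train, y_train, k=2, rng=rng)
    plain_acc = float(np.mean(lda_predict(plain, X_test) == y_test))
    clustered_acc = float(np.mean(clustered_lda_predict(clustered, X_test) == y_test))
    return plain_acc, clustered_acc


if __name__ == "__main__":
    print("plain LDA accuracy: %.3f, clustered LDA accuracy: %.3f" % demo())
```

On data of this kind the single-model discriminant has almost no between-class scatter to work with, since the class means coincide, whereas the subclass means are well separated. The mapping back to the original classes is a simple table lookup (`parent`), so the extended label set never has to be exposed to the user.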
Determining the optimal number of clusters is a problem which has been previously addressed in the literature
[47], [48], [49]. However, for the purposes of multifont script recognition, using a fixed number of clusters
has been shown to provide adequate results at significantly reduced computational cost. In the experiments in
the following section, 10 clusters are used in all cases, as this number was found to be generally sufficient
to describe the font variations present within all of the tested scripts. Although the majority of classes can
in fact be represented adequately using fewer than this number of clusters, using more clusters does not
significantly degrade performance.

8.2 Classification Results
To illustrate the limitations of using a single model in a multifont environment, experiments using a number of
fonts from each script class were conducted. A total of 30 fonts were present in the database, with 10 from
Latin script, four each from Chinese, Japanese, and Persian, and three each from Sanskrit, Hebrew, Greek, and
Cyrillic. 100 training and testing samples were extracted from each font type.

As a baseline, each of the scripts was trained as a single class using the MAP classification system proposed
above. From the results shown in Table 6, it can be seen that large errors are introduced, with the most common
misclassification occurring between fonts of the Latin and Greek scripts. Interestingly, these results show
that the simpler texture features do not suffer the same performance degradation as the more complex features,
with the wavelet energy signatures showing the lowest overall classification error of 12.3 percent.

The proposed clustering algorithm is implemented by using k-means clustering to partition each class into
10 regions. Each subclass is then assigned an individual label, and LDA and classification are performed as
normal. The results of this experiment are shown in Table 7, with the wavelet log co-occurrence features again
providing the lowest overall error rate of 2.1 percent. Although the error rates for each of

A number of texture features were evaluated for the purpose of script recognition, including GLCM, Gabor
filterbank energies, and a number of wavelet transform-based features. By using such features, it is not
necessary to extract individual script components, making them ideal for degraded and noisy documents or
situations where such segmentation is not possible. The amount of text required for accurate recognition is
also quite small, with as little as five words sufficient in some cases. When classifying scripts containing a
single font, experimental results have shown that texture features can outperform other script recognition
techniques, with the wavelet log co-occurrence features giving the lowest overall classification error rate.

In order to provide a more stable model of each script class, as well as to reduce the need for excessive
training data, a technique was proposed whereby MAP adaptation is used to create a global script model. Because
of the strong interclass correlations which exist between the extracted features of script textures, this
approach was found to be well suited to the application of automatic script identification. Experimental
results showed a small increase in overall classification performance when using large training sets, and a
significant improvement when limited training data is available.

Using a single model to characterize multiple fonts within a script class has been shown to be inadequate, as
the fonts within a script class can vary considerably in appearance, often resulting in a multimodal
distribution in feature space. To overcome this problem, a technique whereby each class is automatically
segmented using the k-means clustering algorithm before performing LDA is presented. By doing this, a number of
subclasses are automatically generated and trained without the need for any user intervention. Experiments
performed on a multifont script database have shown that this technique can successfully identify scripts
containing multiple fonts and styles.
Award for Outstanding Academic Contribution in Teaching and Leader-
ship, the Faculty’s Teaching Excellence Award, and was one of only
three University nominees to the Australian Awards for University
Teaching. He also won the national Engineers Australia—Australasian
Association for Engineering Education (EA-AAEE) Award for Excellence
in Teaching and Learning in Engineering Education, in 2004. Professor
Boles has been a member of the Executive of the Australasian Association for
Engineering Education, AaeE, since 2001 and a member of the IEEE
since 1984. He is also a member of the IEEE Computer Society.