Indexing of Handwritten Historical Documents - Recent Progress
R. Manmatha
Toni M. Rath
Center for Intelligent Information Retrieval
Computer Science Department
University of Massachusetts Amherst
Abstract
Indexing and searching collections of handwritten
archival documents and manuscripts has always been
a challenge because handwriting recognizers do not
perform well on such noisy documents. Given a collection of documents written by a single author (or a
few authors), one can apply a technique called word
spotting. The approach is to cluster word images
based on their visual appearance, after segmenting
them from the documents. Annotation can then be
performed for clusters rather than documents.
Given segmented pages, matching handwritten
word images in historical documents is a great challenge due to the variations in handwriting and the
noise in the images. We describe investigations
into a number of different matching techniques for
word images. These include shape context matching,
SSD correlation, Euclidean Distance Mapping and
dynamic time warping. Experimental results show
that dynamic time warping works best and gives an
average precision of around 70% on a test set of
2000 word images (from ten pages) from the George
Washington corpus.
Dynamic time warping is relatively expensive, and we describe approaches to speeding up the computation so that the approach scales. Our immediate goal is to process a set of 100 page images, with a longer-term goal of processing all 6000 available
pages.
1 Introduction
Libraries contain an enormous number of handwritten historical documents. Such collections are interesting to a wide range of people, whether historians, students, or simply curious readers. Efficient access to such collections (e.g. on digital media or on the Internet) requires an index, similar to the one in the back of a book. Such indexes are usually created by
manual transcription and automatic index generation from a digitized version. While this approach
may be feasible for small numbers of documents, the
cost of this approach is prohibitive for large collections, such as the manuscripts of George Washington
with well over 6000 pages.
Using Optical Character Recognition (OCR) as
an automatic approach may seem like an obvious
choice, since this technology has advanced enough
to make commercial applications (e.g. tablet PCs)
possible. However, OCR techniques have only been
successful in the online domain, where the pen position and possibly other features are recorded during writing, and in offline applications (recognition
from images) with very limited lexicons, such as
automatic check processing (26 words allowed in the legal amount field). For general historical documents, however, with their large lexicons and usually greatly degraded image quality (faded ink, ink bleed-through, smudges, etc.), traditional OCR techniques are not adequate. Figure 1 shows part of a page
from the George Washington collection (this page is
of relatively good quality).
Previous work [23] has shown the difficulties that
even high-quality handwriting recognizers have with
historical documents: the authors aligned a page
from the Thomas Jefferson collection with a perfect
transcription of the document. The transcription
was used to generate a limited lexicon for each word
hypothesis in the document. With a limited lexicon, the recognizer only has to decide which of a few candidate words matches the word image. However, even for very small lexicon sizes of at most 11
(ASCII) words per word hypothesis in the image,
only 83% of the words on a page could be correctly
aligned with the transcription.
The wordspotting idea [12] has been proposed as
an alternative to OCR solutions for building indexes
of handwritten historical documents, which were
produced by a single author: this ensures that identical words, which were written at different times,
will have very similar visual appearances. This fact
can be exploited by clustering words into groups
with image matching techniques. Ideally, each cluster would then contain all instances of the same word, so that a whole cluster can be annotated at once.

2 Matching Techniques
2.1 Pixel-by-Pixel Matching
1. XOR[8]: aligns the template and candidate images and counts the pixels that differ in the resulting difference image.
2. SSD[8]: translates template and candidate image relative to each other to find the minimum
cost (= matching cost) based on the Sum of
Squared Differences.
3. EDM[8, 13]: Euclidean Distance Mapping.
This technique is similar to XOR, but difference pixels in larger groups are penalized more
heavily, because they are likely to result from
structural differences between the template and
the candidate image, not from noise.
Early versions of the above algorithms were proposed in [12, 13]. Kane et al. [8] improved them by using extensive normalization techniques that align the images, and also conducted more systematic experiments. The above algorithms (including the normalization techniques) are detailed in [8].
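As an illustration, the XOR and SSD matching costs can be sketched as follows. This is a minimal NumPy sketch on word images; the function names, the padding scheme, and the shift range are our own choices, and the extensive normalization of [8] is omitted.

```python
import numpy as np

def xor_cost(template, candidate):
    """Difference-image (XOR) cost: pad both binary images to a
    common size and count the pixels where they disagree."""
    h = max(template.shape[0], candidate.shape[0])
    w = max(template.shape[1], candidate.shape[1])
    a = np.zeros((h, w), dtype=bool)
    b = np.zeros((h, w), dtype=bool)
    a[:template.shape[0], :template.shape[1]] = template
    b[:candidate.shape[0], :candidate.shape[1]] = candidate
    return int(np.logical_xor(a, b).sum())

def ssd_cost(template, candidate, max_shift=2):
    """SSD cost: minimum sum of squared differences over small
    relative translations of template and candidate."""
    th, tw = template.shape
    padded = np.zeros((th + 2 * max_shift, tw + 2 * max_shift))
    ch, cw = min(candidate.shape[0], th), min(candidate.shape[1], tw)
    padded[max_shift:max_shift + ch, max_shift:max_shift + cw] = candidate[:ch, :cw]
    best = np.inf
    for dy in range(2 * max_shift + 1):
        for dx in range(2 * max_shift + 1):
            window = padded[dy:dy + th, dx:dx + tw]
            best = min(best, float(((template - window) ** 2).sum()))
    return best
```

EDM differs from XOR in that it would additionally weight each difference pixel by a distance-transform value, so that large connected difference regions are penalized more heavily than isolated noise pixels.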
2.2 Feature-Oriented Matching
2.3 Dynamic Time Warping Matching

Dynamic time warping (DTW) is more flexible than linear scaling: in the matching algorithm that we describe here, image columns are aligned and compared using DTW. In our framework, each image column is represented by a time series sample point. Figure 2 shows an example alignment of two time series using dynamic time warping.
(Figure 2: two time series aligned by dynamic time warping; each plot shows feature value against sample index.)

The accumulated warping cost is computed with the recurrence

DTW(i, j) = min{DTW(i-1, j), DTW(i-1, j-1), DTW(i, j-1)} + d(i, j),   (4)

where the distance between the d-dimensional column feature vectors x_i and y_j is

d(i, j) = sum_{k=1}^{d} (x_{i,k} - y_{j,k})^2.   (5)
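A direct implementation of the recurrence in Eqs. (4) and (5) might look as follows; this is an unoptimized sketch without the band constraints or pruning discussed later, and the function name is ours.

```python
import numpy as np

def dtw(X, Y):
    """Dynamic time warping distance between two feature series.
    X: (n, d) array, Y: (m, d) array; each row is the feature
    vector of one image column (cf. Eqs. 4 and 5)."""
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = float(((X[i - 1] - Y[j - 1]) ** 2).sum())        # Eq. (5)
            D[i, j] = min(D[i - 1, j], D[i - 1, j - 1], D[i, j - 1]) + d  # Eq. (4)
    return D[n, m]
```

Because the inner step takes the minimum over the three predecessor cells, a column of one image may be aligned with several consecutive columns of the other, which is exactly the flexibility that linear scaling lacks.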
2.3.1 Features

Figure 4: Projection profile feature (range-normalized and inverted).
1. Projection Profile: the number of ink pixels in each image column (see Figure 4).
2/3. Word Profiles: for every image, we can extract an upper and lower profile of the contained word, by going along the top (bottom)
of the enclosing bounding box, and recording
the distance to the nearest ink pixel in the
current image column. Identifying ink pixels
is currently realized by a thresholding technique, which we have found to be sufficient
for our purposes. For more sophisticated foreground/background separation, see [11]. Together, these features capture the shape of the
word outline (see Figure 5).
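The projection-profile and word-profile features can be sketched like this. The fixed threshold of 128 is an illustrative stand-in for the paper's thresholding step, and the range normalization shown in Figure 4 is omitted.

```python
import numpy as np

def profiles(img, ink_threshold=128):
    """Column features from a grayscale word image (dark ink on a
    light background): projection profile plus upper and lower
    word profiles."""
    ink = img < ink_threshold               # True where a pixel is ink
    proj = ink.sum(axis=0)                  # ink pixels per column
    h, w = ink.shape
    upper = np.full(w, h, dtype=float)      # distance from the top edge
    lower = np.full(w, h, dtype=float)      # distance from the bottom edge
    for c in range(w):
        rows = np.flatnonzero(ink[:, c])
        if rows.size:
            upper[c] = rows[0]              # first ink pixel from the top
            lower[c] = h - 1 - rows[-1]     # first ink pixel from the bottom
    return proj, upper, lower
```

Stacking the three profiles column-wise yields the d-dimensional feature vectors that the DTW distance of Eq. (5) compares.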
2.3.2 Lower Bounding

A lower-bounding function lb underestimates the true matching distance merr:

lb(Q, C) <= merr(Q, C).   (6)

Such a lower-bounding function can be used for finding the time series in a collection that has the lowest distance merr to a given query series.
The approach is still to sequentially scan the database for the best matching series, but lb is used to compare the query Q to a candidate C: if lb(Q, C) is greater than the distance merr(Q, M) to the currently best matching series M, it is not necessary to evaluate merr(Q, C), since

merr(Q, M) < lb(Q, C) <= merr(Q, C).   (7)
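The pruning logic of Eq. (7) amounts to a sequential scan that skips the expensive distance whenever the lower bound already exceeds the best distance found so far. A generic sketch, with function names of our own choosing:

```python
def nearest_series(query, collection, merr, lb):
    """Sequential scan with lower-bound pruning (cf. Eq. 7):
    the exact distance merr is evaluated only when the cheap
    lower bound lb cannot rule the candidate out."""
    best, best_dist = None, float("inf")
    for cand in collection:
        if lb(query, cand) >= best_dist:
            continue              # lb(Q,C) >= merr(Q,M): C cannot win
        d = merr(query, cand)
        if d < best_dist:
            best, best_dist = cand, d
    return best, best_dist
```

The correctness of the skip depends only on lb never overestimating merr; the tighter the bound, the more candidates are pruned.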
2.4 Matching with Point Correspondences
The image matching approach based on point correspondences first identifies image corners in the input images using the Harris detector [6]. The similarity between these points is then determined by correlating their intensity neighborhoods using the sum of squared differences measure. Finally, the recovered correspondences are used to calculate a measure of similarity between the input word images.
2.4.1 Recovering Corner Correspondences
correlation windows without slowing down the implementation. Using larger SSD windows can help
prevent false matches, because of the added context
that is taken into account when comparing image
regions.
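Given corner locations (e.g. from the Harris detector [6]), the neighborhood correlation step might be sketched as follows. The greedy one-directional matching, the window size, and the zero-padding at the borders are simplifications of ours, not the exact procedure of [18].

```python
import numpy as np

def match_corners(img_a, corners_a, img_b, corners_b, win=3):
    """For each corner (row, col) in image A, pick the corner in
    image B whose (2*win+1)x(2*win+1) intensity neighborhood has
    the smallest sum of squared differences."""
    pa = np.pad(img_a.astype(float), win)   # zero-pad so border patches fit
    pb = np.pad(img_b.astype(float), win)
    matches = []
    for (ya, xa) in corners_a:
        patch_a = pa[ya:ya + 2 * win + 1, xa:xa + 2 * win + 1]
        costs = [float(((patch_a - pb[yb:yb + 2 * win + 1,
                                      xb:xb + 2 * win + 1]) ** 2).sum())
                 for (yb, xb) in corners_b]
        matches.append(((ya, xa), corners_b[int(np.argmin(costs))]))
    return matches
```

Enlarging `win` corresponds to the larger SSD windows discussed above: more surrounding context makes false matches less likely, at a higher computational cost.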
2.4.2 Similarity Measure
dist(A, B) = ( sum_i sqrt((x_{a_i} - x_{b_i})^2 + (y_{a_i} - y_{b_i})^2) ) / #correspondences,   (8)
where A is the query image, and B a candidate image; (xai , yai ) and (xbi , ybi ) are the coordinates of a
pair of corresponding feature points, in A and B respectively. Essentially, we are calculating the mean
Euclidean distance of corresponding feature points.
Additionally, we penalize every point in image A that does not have a correspondence in image B by multiplying the average distance with a weight.
Thus, the fewer corresponding feature points are
found in the candidate image B, the larger the distance between B and A.
More details on the point correspondence technique can be found in [18].
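The resulting distance, Eq. (8) combined with the penalty for unmatched points, can be sketched as follows. The exponential form of the penalty and its value of 1.5 are illustrative assumptions, not the setting used in [18].

```python
import math

def correspondence_distance(pairs, n_points_a, penalty=1.5):
    """Mean Euclidean distance between corresponding corner points
    (cf. Eq. 8), multiplied by a penalty weight for every corner of
    the query image A that found no match in candidate image B."""
    if not pairs:
        return float("inf")
    mean = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
    unmatched = n_points_a - len(pairs)
    return mean * penalty ** unmatched
```

With this form, the fewer corresponding feature points are found in the candidate image B, the larger the distance between B and A becomes, as the text describes.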
3 Results
3.1 Experimental Methodology
3.2 Result Discussion
Figure 8: Examples from the two test sets used in the evaluation.
Run   XOR      SSD      SLH      SC       EDM      DTW      CORR
A     54.14%   52.66%   42.43%   48.67%   72.61%   73.71%   73.95%
B     n/a      n/a      n/a      n/a      n/a      65.34%   62.57%
C     n/a      n/a      n/a      n/a      15.05%   58.81%   59.96%
D     n/a      n/a      n/a      n/a      n/a      51.81%   51.08%

Run   SC       EDM      DTW      CORR   (corrected)
A     40.58%   67.67%   67.92%   69.69%
B     n/a      n/a      40.98%   36.23%
C     n/a      n/a      13.04%   14.84%
D     n/a      n/a      16.50%   15.49%

Table 1: Average precision scores for all test runs (XOR: matching using difference images, SSD: sum of squared differences technique, SLH: technique by Scott & Longuet-Higgins [22], SC: shape context matching [1], EDM: Euclidean distance mapping, DTW: dynamic time warping matching, CORR: recovered correspondences). The lower block shows corrected evaluation results for the four right-most columns (SC, EDM, DTW, CORR).
as well as DTW and CORR, with mean average precision values in the 40-50% range.
The general performance difference of DTW and
CORR on data sets A and B can be explained by
the pruning heuristics, which work much better on
set A than on set B: in set A, only 10% of the valid
matches are discarded in the pruning, while in set
B, almost 30% are discarded. A similar observation
can be made for the test sets C and D: on both
sets, the pruning discards around 45% of the valid
matches. This can be seen in the smaller differences
in mean average precision for both DTW and CORR
on these data sets. The reason for the high rejection
rate of valid matches by the pruning lies in the word
segmentation, which is heavily affected by the bad
quality of the document images.
4 Conclusions

We have described the word spotting approach to indexing handwritten manuscript collections. We have discussed a number of different approaches to matching
with the best performing ones being dynamic time
warping and a point correspondence based technique. Challenges remain, including the creation of
word clusters and the necessity of speeding up these
algorithms sufficiently, so that large collections can
be handled in a reasonable amount of time.
Building a system involves creating a user interface. While it is straightforward to imagine a visual
index with pictures and links to pages, it is not clear
whether users would be able to use such an index effectively. An ASCII user interface can be created by manually annotating the matched word clusters, permitting a more traditional index. Recent advances in automatic picture annotation [2, 4, 7], using machine learning and information retrieval techniques, may permit a completely different approach
to this problem through automatic annotation of
word image clusters.
Acknowledgments
We would like to thank Jamie Rothfeder and Shaolei
Feng for contributing to the work on point correspondences for word spotting. We also thank the
Library of Congress for providing the images of the
George Washington collection.
This work was supported in part by the Center
for Intelligent Information Retrieval and in part by
the National Science Foundation under grant number IIS-9909073. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the sponsor.
References
[1] S. Belongie, J. Malik and J. Puzicha: Shape
Matching and Object Recognition Using Shape
Contexts. IEEE Trans. on Pattern Analysis and
Machine Intelligence 24(4) (2002) 509-522.
[2] D. M. Blei and M. I. Jordan: Modeling Annotated Data. Technical Report UCB//CSD-02-1202, 2002.
[3] C.-H. Chen: Lexicon-Driven Word Recognition.
In: Proc. of the Third Intl Conf. on Document Analysis and Recognition 1995, Montreal,
Canada, August 14-16, 1995, pp. 919-922.
[4] P. Duygulu, K. Barnard, N. de Freitas and
D. Forsyth: Object Recognition as Machine
Translation: Learning a Lexicon for a Fixed Image Vocabulary. In: Proc. 7th European Conference on Computer Vision, Copenhagen, Denmark, May 27-June 2, 2002, vol. 4, pp. 97-112.
[5] C. Faloutsos: Multimedia IR: Indexing and
Searching. In: Modern Information Retrieval,
R. Baeza-Yates and B. Ribeiro-Neto; AddisonWesley, Reading, MA, 1999.
[6] C. Harris and M. Stephens: A Combined Corner
and Edge Detector. In: Proc. of the 4th Alvey
Vision Conf., 1988, pp. 147-151.
[7] J. Jeon, V. Lavrenko and R. Manmatha: Automatic Image Annotation and Retrieval Using
Cross-Media Relevance Models. CIIR Technical
Report MM-41, 2003.
[8] S. Kane, A. Lehman and E. Partridge: Indexing
George Washington's Handwritten Manuscripts.
Technical Report MM-34, Center for Intelligent Information Retrieval, University of Massachusetts Amherst, 2001.
[9] E. Keogh: Exact Indexing of Dynamic Time
Warping. In: Proc. of the 28th Very Large