Shape Similarity and Visual Parts
Fig. 1. Some shapes used in part B of MPEG-7 Core Experiment CE-Shape-1. Shapes
in each row belong to the same class.
The main part of the Core Experiment CE-Shape-1 was part B: similarity-
based retrieval. The data set used for this part is composed of 1400 shapes
stored as binary images. The shapes are divided into 70 classes with 20 images
in each class. In the test, each image was used as a query, and the number
of similar images (i.e., images belonging to the same class) among the top 40
matches was counted (the bull's-eye test). Since the maximum number of correct
matches for a single query image is 20, the total number of possible correct
matches is 28000.
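For concreteness, here is a minimal Python sketch of how the bull's-eye score can be computed; the precomputed dissimilarity matrix and the function name are illustrative assumptions, not part of the experiment's specification:

```python
import numpy as np

def bullseye_score(dist, labels, top=40):
    """Bull's-eye test: for every query, count how many of the `top`
    nearest shapes (including the query itself) share its class, and
    report the total relative to the maximum of 20 per query."""
    dist, labels = np.asarray(dist), np.asarray(labels)
    correct = 0
    for i in range(len(labels)):
        nearest = np.argsort(dist[i])[:top]      # the 40 closest matches
        correct += int(np.sum(labels[nearest] == labels[i]))
    return correct / (20 * len(labels))          # 1.0 = 28000 correct matches
```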
This data set has turned out to be the only one that is widely used to objectively
evaluate the performance of various shape descriptors. We now present some
of the shape descriptors with the best performance on this data set. It is not
our goal to provide a general overview of all possible shape descriptors; a good
overview can be found in the book by Costa and Cesar [4].
The shape descriptors can be divided into three main categories:
1. contour-based descriptors: the contour of a given object is mapped to some
representation from which a shape descriptor is derived;
2. area-based descriptors: the computation of the shape descriptor is based on
summing up pixel values in a digital image of the area containing the silhouette
of a given object; the shape descriptor is a vector of a certain number of
parameters derived this way (e.g., Zernike moments [13]);
3. skeleton-based descriptors: after a skeleton is computed, it is mapped to a tree
structure that forms the shape descriptor; shape similarity is then computed
by a tree-matching algorithm.
Each vertex v in P^i (except the first and the last, if the
polyline is not closed) is assigned a relevance measure that depends on v and its
two neighbor vertices u, w in P^i:

\[
K(v, P^i) = d(u, v) + d(v, w) - d(u, w), \tag{1}
\]

where d is the Euclidean distance function. Note that K measures the bending
of P^i at vertex v; it is zero when u, v, and w are collinear.
The process of discrete curve evolution (DCE) is very simple: in every evolution
step, the polyline P^{i+1} is obtained from P^i by deleting a vertex whose
relevance measure K is minimal. For end vertices of open polylines no relevance
measure is defined, since the end vertices do not have two neighbors.
Consequently, end points of open polylines remain fixed.
Note that P^{i+1} is obtained from P^i by deleting a vertex such that the
length change between P^i and P^{i+1} is minimal. Observe that the relevance
measure K(v, P^i) is not a local property with respect to the original polygon
P = P^0, although its computation is local in P^i for every vertex v. This implies
that the relevance of a given vertex v is context dependent, where the context is
given by the adaptive neighborhood of v, since the neighborhood of v in P^i can
be different from its neighborhood in P. Discrete curve evolution has also been
successfully applied in the context of video analysis to simplify video trajectories
in feature space [6, 15].
DCE can be implemented efficiently. The polyline's vertices are stored
simultaneously in a doubly-linked list and in a self-balancing tree ordered by
relevance. Setting up this structure for a polyline containing n vertices takes
O(n log n) time. A single DCE step consists of picking the least relevant vertex
(O(log n)), removing it (O(log n)), and updating its two neighbors' relevance
measures and their positions in the tree (O(log n)). As at most n vertices are
deleted, this yields an overall complexity of O(n log n). Since DCE is applied to
segmented polylines, the number of vertices is much smaller than the number of
points read from the sensor.
Basic similarity of arcs is defined in tangent space. Tangent space, also called
the turning function, is a multi-valued step function mapping a curve into the
interval [0, 2π) by representing only the angular directions of its line segments.
Furthermore, arc lengths are normalized to 1 prior to the mapping into tangent
space. This representation has previously been used in computer vision, in
particular in [1]. Denoting the mapping function by T, the similarity is defined
as follows:
\[
S_{\mathrm{arcs}}(C, D) = \left( \int_0^1 \bigl(T_C(s) - T_D(s) + \Theta_{C,D}\bigr)^2 \, ds \right) \cdot \max\left\{ \frac{l(C)}{l(D)}, \frac{l(D)}{l(C)} \right\}, \tag{2}
\]

where l(C) denotes the arc length of C. The constant Θ_{C,D} is chosen to minimize
the integral (it accounts for the different orientations of the curves) and is given by

\[
\Theta_{C,D} = -\int_0^1 \bigl(T_C(s) - T_D(s)\bigr) \, ds .
\]
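A minimal Python sketch of this measure under our reading of the formulas; the helper names are ours, and building the turning function from accumulated direction angles (to avoid 2π wrap-around jumps) is a representational choice the text does not spell out:

```python
import bisect
import math

def turning_function(poly):
    """Turning function of a polyline: a step function over normalized
    arc length [0, 1], given by breakpoints and accumulated direction
    angles (accumulation avoids 2*pi wrap-around jumps)."""
    segs = list(zip(poly, poly[1:]))
    lengths = [math.dist(p, q) for p, q in segs]
    total = sum(lengths)
    dirs = [math.atan2(q[1] - p[1], q[0] - p[0]) for p, q in segs]
    angles = [dirs[0]]
    for a, b in zip(dirs, dirs[1:]):
        turn = (b - a + math.pi) % (2 * math.pi) - math.pi  # signed turn
        angles.append(angles[-1] + turn)
    starts, s = [], 0.0
    for seg_len in lengths:
        starts.append(s)
        s += seg_len / total
    return starts, angles, total

def s_arcs(c, d):
    """Tangent-space dissimilarity of two arcs, eq. (2): the integral of
    the squared turning-function difference under the minimizing shift
    Theta, weighted by the arc-length ratio."""
    sc, ac, lc = turning_function(c)
    sd, ad, ld = turning_function(d)
    cuts = sorted(set(sc) | set(sd)) + [1.0]
    def at(starts, angles, s):           # step-function value at s
        return angles[bisect.bisect_right(starts, s) - 1]
    diffs = [at(sc, ac, a) - at(sd, ad, a) for a in cuts[:-1]]
    widths = [b - a for a, b in zip(cuts, cuts[1:])]
    theta = -sum(df * w for df, w in zip(diffs, widths))  # minimizing shift
    integral = sum((df + theta) ** 2 * w for df, w in zip(diffs, widths))
    return integral * max(lc / ld, ld / lc)
```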
Robot mapping and localization are the key problems in building truly autonomous
robots. The central method required is the matching of sensor data which, in the
typical case of a laser range finder as the robot's sensor, is called scan matching.
Fig. 4. An illustration of the query process in our shape database; from left to right:
query sketch, first result, and refined result.
Fig. 5. A regular living room perceived by a laser range finder. Each circle represents
a measured reflection point. The lack of linear features is evident; hence, more
complex, versatile features need to be employed. The cross denotes the position and
orientation of the robot.
Fig. 6. (a) This figure from the paper by Gutmann and Konolige [9] shows a partially
mapped environment. Due to the propagation of errors, the cyclic path the robot was
following is no longer cyclic. Subsequent mapping would lead to an overlap. (b) Using
shape similarity, we can detect the overlapping parts (highlighted).
Fig. 7. The process of extracting polygonal features from a scan consists of two steps:
first, polygonal lines are set up from the raw scanner data (a) (1 meter grid; the cross
denotes the coordinate system's origin). The lines are split wherever two adjacent
vertices are more than 20 cm apart. The resulting set of polygonal lines (b) is then
simplified by means of discrete curve evolution with a threshold of 50. The resulting
set of polygonal lines (c) consists of less data while still capturing the most significant
shape information. Below, the results of applying DCE with different threshold values
are shown. As can be observed, the choice of this value is not critical for the shape
information retained. Thresholds chosen: (d) 10, (e) 30, (f) 70.
Segmented polylines still contain all the information read from the LRF.
However, this data contains some noise. Therefore, we apply DCE (Section 2),
which removes noise and makes the data compact without losing valuable
shape information. The complete process of feature extraction and, most
importantly, the applicability of DCE to range finder data are illustrated in
Figure 7.
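A minimal sketch of this extraction pipeline, chaining the hypothetical dce helper from the earlier sketch; the 20 cm split distance and the DCE threshold of 50 follow Figure 7, while the unit scale of the threshold depends on the exact relevance definition and is an assumption here:

```python
import math

def extract_features(scan_points, split_dist=0.20, dce_threshold=50.0):
    """Build polygonal features from raw scan points: split into
    polylines wherever adjacent points are too far apart, then
    simplify each polyline with DCE (see the dce sketch above)."""
    polylines, current = [], [scan_points[0]]
    for p, q in zip(scan_points, scan_points[1:]):
        if math.dist(p, q) > split_dist:     # gap: start a new polyline
            polylines.append(current)
            current = [q]
        else:
            current.append(q)
    polylines.append(current)
    # drop degenerate fragments, simplify the rest
    return [dce(pl, dce_threshold) for pl in polylines if len(pl) > 2]
```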
\[
\left( \sum_{(B_i, B'_j) \in \sim} S(B_i, B'_j) \right) + C \cdot \bigl(2\,|{\sim}| - |B| - |B'|\bigr) = \min .
\]
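Finding the correspondence ∼ that minimizes this expression can be cast as a linear assignment problem. Below is a minimal sketch assuming SciPy is available, that S is a precomputed matrix of part dissimilarities, and that the constant C is read as the cost of leaving a part unmatched; the padding construction and all names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_correspondence(S, C):
    """One-to-one correspondence between parts of B (rows) and B'
    (columns) minimizing summed dissimilarity, where leaving a part
    unmatched costs C. Returns the matched index pairs."""
    n, m = S.shape
    big = 1e9                      # forbids meaningless pairings
    cost = np.full((n + m, n + m), big)
    cost[:n, :m] = S                                          # match B_i, B'_j
    cost[:n, m:] = np.where(np.eye(n, dtype=bool), C, big)    # skip a part of B
    cost[n:, :m] = np.where(np.eye(m, dtype=bool), C, big)    # skip a part of B'
    cost[n:, m:] = 0.0             # dummy-to-dummy pairs are free
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if i < n and j < m]
```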
7 Aligning Scans
Once a correspondence has been computed, the scans involved need to be aligned
in order to determine the robot's current position, from which the latest scan has
been perceived, and finally to build a global map from the perceived scans. To
align two scans, a translation and rotation (termed a displacement) must be
computed such that corresponding visual parts are placed at the same position.
The overall displacement is determined from the individual correspondences. Of
course, due to noise, this can only be fulfilled to a certain extent, as boundaries
may sometimes not be aligned perfectly and individual displacements may differ.
To define the best overall displacement, the overall error, i.e., the sum of the
differences from the individual displacements, is minimized according to the
method of least squares.
To mediate between all the possibly differing individual displacements, it is
advantageous to restrict attention to the most reliable matches. The presented
approach uses only the three best matching pairs of visual parts, selected using
a reliability criterion described in Section 7.1.
Based on the correspondence of these three matching pairs, the two complete scan
boundaries from times t and t − 1 are aligned. For each corresponding polyline
pair, we also know the correspondence of the line segments of which the polylines
are composed; these correspondences were determined in the course of computing
the similarity of the two polylines. Proceeding this way, the problem of aligning
two scans is reduced to aligning two sets of corresponding line segments. This
is tackled by computing the individual displacements that reposition the
corresponding line segments atop each other, using standard techniques. First, the
induced rotation is computed as the average of the rotational differences, and
the scans are aligned accordingly. Second, the induced translation is computed
by solving an over-determined system of linear equations. Since, due to noise,
an exact solution usually does not exist, the solution minimizing the least-squares
error is chosen.
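A minimal numpy sketch of this two-step alignment under simplifying assumptions: segments are given as numpy endpoint pairs with consistently ordered corresponding endpoints, and all names are hypothetical:

```python
import math
import numpy as np

def align(pairs):
    """Estimate the rotation and translation (the displacement) aligning
    corresponding line segments of the new scan with the reference scan.
    Each pair is ((p, q), (pr, qr)): endpoints of a new-scan segment and
    of its corresponding reference segment."""
    # 1) rotation: circular mean of the per-segment angle differences
    diffs = []
    for (p, q), (pr, qr) in pairs:
        v, vr = q - p, qr - pr
        diffs.append(math.atan2(vr[1], vr[0]) - math.atan2(v[1], v[0]))
    theta = math.atan2(sum(math.sin(d) for d in diffs),
                       sum(math.cos(d) for d in diffs))
    R = np.array([[math.cos(theta), -math.sin(theta)],
                  [math.sin(theta),  math.cos(theta)]])
    # 2) translation: a segment pair constrains t only along the
    #    reference line's normal n:  n . t = n . (m_ref - R m_new)
    A, b = [], []
    for (p, q), (pr, qr) in pairs:
        d = (qr - pr) / np.linalg.norm(qr - pr)
        n = np.array([-d[1], d[0]])          # unit normal of the reference line
        A.append(n)
        b.append(n @ ((pr + qr) / 2 - R @ ((p + q) / 2)))
    # over-determined system: take the least-squares solution
    t, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return R, t
```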
Previous sections explained how correspondences between two scans can be
detected and how an induced displacement can be computed. In principle,
incremental scan matching can be realized in a straightforward manner: for each
scan (at time t), visual parts are extracted and matched against the last scan
perceived (at time t − 1). As the boundaries are matched, they are displaced
accordingly and entered into a map. However, such an approach suffers from
accumulating noise. For example, if a wall is perceived in front of the robot with
distance noise of about 4 cm (typical for an LRF), computing a single
displacement can introduce an error of 8 cm. Such errors accumulate during
continuous matching; hence, maps resulting from several hundred scans render
themselves useless. This is reason enough for any real application to incorporate
some handling of uncertainty, e.g., by means of stochastic models.
Our way of handling the uncertainty is again based on shape similarity. Instead
of aligning all scans incrementally, i.e., scan t with respect to scan t − 1, we
align scan t with respect to a reference scan t − n for some n > 1. Scan t − n
remains the reference scan as long as the three most reliable matching visual
parts from scan t are sufficiently similar to the corresponding visual parts from
scan t − n. This reference scan allows us to keep the accumulating incremental
error down, as the reference visual parts do not change so often. The criterion
for when to change the reference scan is a threshold on the shape similarity of
the current visual parts to the reference ones.
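A sketch of this reference-scan strategy, reusing the s_arcs dissimilarity from above; best_matching_parts is a hypothetical helper standing in for the part extraction and reliability criterion of Section 7.1:

```python
def match_with_reference(scans, threshold):
    """Match each scan against a fixed reference scan; the reference
    only advances once the three most reliable matching parts are no
    longer sufficiently similar."""
    reference = scans[0]
    for prev, scan in zip(scans, scans[1:]):
        parts = best_matching_parts(reference, scan, k=3)  # hypothetical
        if max(s_arcs(a, b) for a, b in parts) > threshold:
            reference = prev        # last scan that still matched well
            parts = best_matching_parts(reference, scan, k=3)
        yield reference, parts      # parts feed the alignment (Section 7)
```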
The performance of our system is demonstrated in Figure 8(a), which shows the
map constructed from 400 scans obtained by a robot moving along the path
marked with the dashed line. For comparison, a ground-truth map of the
reconstructed indoor environment (a hallway at the University of Bremen) is
shown in Figure 8(b).
Fig. 8. (a) A map created by our approach. The robot path is marked with a dashed
line. (b) A ground truth map of the indoor environment.
8 Conclusions
References
1. E. M. Arkin, L. P. Chew, D. P. Huttenlocher, K. Kedem, and J. S. B. Mitchell. An
efficiently computable metric for comparing polygonal shapes. IEEE Trans. PAMI,
13:209–216, 1991.
2. S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using
shape contexts. IEEE Trans. PAMI, 24:509–522, 2002.
3. H. Blum. Biological shape and visual science. Journal of Theoretical Biology,
38:205–287, 1973.
4. L. da F. Costa and R. M. Cesar. Shape Analysis and Classification: Theory and
Practice. CRC Press, Boca Raton, 2001.
5. I. J. Cox. Blanche: An experiment in guidance and navigation of an autonomous
robot vehicle. IEEE Transactions on Robotics and Automation, 7(2):193–204, 1991.
6. D. F. DeMenthon, L. J. Latecki, A. Rosenfeld, and M. Vuilleumier Stückelberg.
Relevance ranking and smart fast-forward of video data by polygon simplification,
pages 49–61, 2000.
7. G. Dissanayake, H. Durrant-Whyte, and T. Bailey. A computationally efficient
solution to the simultaneous localization and map building (SLAM) problem. ICRA
2000 Workshop on Mobile Robot Navigation and Mapping, 2000.
8. J.-S. Gutmann and C. Schlegel. AMOS: Comparison of scan matching approaches
for self-localization in indoor environments. 1st Euromicro Workshop on Advanced
Mobile Robots (Eurobot), 1996.
9. J.-S. Gutmann and K. Konolige. Incremental mapping of large cyclic environments.
Int. Symposium on Computational Intelligence in Robotics and Automation
(CIRA'99), Monterey, 1999.
10. J.-S. Gutmann. Robuste Navigation mobiler Systeme. PhD thesis, University of
Freiburg, Germany, 2000.
11. D. Hähnel, D. Schulz, and W. Burgard. Map building with mobile robots in
populated environments. Int. Conf. on Intelligent Robots and Systems (IROS), 2002.
12. D. Huttenlocher, G. Klanderman, and W. Rucklidge. Comparing images using the
Hausdorff distance. IEEE Trans. PAMI, 15:850–863, 1993.
13. A. Khotanzad and Y. H. Hong. Invariant image recognition by Zernike moments.
IEEE Trans. PAMI, 12:489–497, 1990.
14. B. Kuipers. The Spatial Semantic Hierarchy. Artificial Intelligence, 119:191–233,
2000.
15. L. J. Latecki and D. de Wildt. Automatic recognition of unpredictable events in
videos. In Proc. of Int. Conf. on Pattern Recognition (ICPR), volume 2, Quebec
City, August 2002.
16. L. J. Latecki and R. Lakämper. Convexity rule for shape decomposition based on
discrete contour evolution. Computer Vision and Image Understanding, 73:441–454,
1999.
17. L. J. Latecki and R. Lakämper. Shape similarity measure based on correspondence
of visual parts. IEEE Trans. PAMI, 22(10):1185–1190, 2000.
18. L. J. Latecki and R. Lakämper. Application of planar shape comparison to object
retrieval in image databases. Pattern Recognition, 35(1):15–29, 2002.
19. L. J. Latecki and R. Lakämper. Polygon evolution by vertex deletion. In M. Nielsen,
P. Johansen, O. F. Olsen, and J. Weickert, editors, Scale-Space Theories in Computer
Vision, Proc. of Int. Conf. on Scale-Space'99, volume LNCS 1682, Corfu, Greece,
September 1999.
20. L. J. Latecki, R. Lakämper, and U. Eckhardt. Shape descriptors for non-rigid
shapes with a single closed contour. In Proc. of IEEE Conf. on Computer Vision
and Pattern Recognition, pages 424–429, South Carolina, June 2000.
21. F. Lu and E. Milios. Robot pose estimation in unknown environments by matching
2D range scans. Journal of Intelligent and Robotic Systems, 18(3):249–275, 1997.
22. F. Mokhtarian, S. Abbasi, and J. Kittler. Efficient and robust retrieval by shape
content through curvature scale space. In A. W. M. Smeulders and R. Jain, editors,
Image Databases and Multi-Media Search, pages 51–58. World Scientific Publishing,
Singapore, 1997.
23. F. Mokhtarian and A. K. Mackworth. A theory of multiscale, curvature-based
shape representation for planar curves. IEEE Trans. PAMI, 14:789–805, 1992.
24. T. Röfer. Using histogram correlation to create consistent laser scan maps. IEEE
Int. Conf. on Intelligent Robots and Systems (IROS), EPFL, Lausanne, Switzerland,
pages 625–630, 2002.
25. K. Siddiqi, A. Shokoufandeh, S. J. Dickinson, and S. W. Zucker. Shock graphs and
shape matching. Int. J. of Computer Vision, 35:13–32, 1999.
26. S. Thrun. Learning metric-topological maps for indoor mobile robot navigation.
Artificial Intelligence, 99:21–71, 1998.
27. S. Thrun. Probabilistic algorithms in robotics. AI Magazine, 21(4):93–109, 2000.
28. S. Thrun. Robot mapping: A survey. In G. Lakemeyer and B. Nebel, editors,
Exploring Artificial Intelligence in the New Millennium. Morgan Kaufmann, 2002.
29. S. Thrun, W. Burgard, and D. Fox. A real-time algorithm for mobile robot
mapping with applications to multi-robot and 3D mapping. IEEE Int. Conf. on
Robotics and Automation (ICRA), 2000.