Proceedings of the IROS 2006 workshop:
From Sensors to
Human Spatial Concepts
October 10, 2006, Beijing, China
Organisers:
Zoran Zivkovic, University of Amsterdam, The Netherlands
Ben Kröse, University of Amsterdam, The Netherlands
Henrik I. Christensen, KTH, Stockholm, Sweden
Roland Siegwart, EPFL, Lausanne, Switzerland
Raja Chatila, LAAS-CNRS, Toulouse, France
Introduction
The aim of the workshop is to bring together researchers who work on space representations appropriate for
communicating with humans and on algorithms for relating robot sensor data to human spatial concepts.
Additionally, in order to facilitate the discussion, a dataset consisting of omnidirectional camera images,
laser range readings and robot odometry is provided.
The commonly used geometric space representations are a natural choice for robot localization and navigation.
However, it is hard to communicate with a robot in terms of, for example, 2D (x, y) positions. Common human
spatial concepts include "the living room" or "the corridor between the living room and the kitchen"; more
general ones such as "a room" or "a corridor"; or more specific, object-related ones such as "behind the TV in
the living room". An appropriate space representation is needed for natural communication with the robot.
Furthermore, the robot should be able to relate its sensor readings to such human spatial concepts.
Suggested topics for the workshop include, but are not limited to the following areas:
• space representations appropriate for communicating with humans
• papers applying and analyzing the results from cognitive science about the human spatial concepts
• space representations suited to cover cognitive requirements of learning, knowledge acquisition and contextual
control
• methods for relating the robot sensor data to the human spatial concepts
• comparison and/or combination of the appearance based and geometric approach for the task of relating the
robot sensor data to the human spatial concepts
We also encourage the use of the provided dataset:
http://staff.science.uva.nl/~zivkovic/FS2HSC/dataset.htm.
Selected papers will be considered for a special issue of the Robotics and Autonomous Systems journal.
Contents
1 Cognitive Maps for Mobile Robots – An Object based Approach
  S. Vasudevan, S. Gächter, M. Berger, R. Siegwart
2 Hierarchical localization by matching vertical lines in omnidirectional images
  A.C. Murillo, C. Sagüés, J.J. Guerrero, T. Goedemé, T. Tuytelaars, L. Van Gool
3 Virtual Sensors for Human Spatial Concepts – Building Detection by an Outdoor Mobile Robot
  M. Persson, T. Duckett, A. Lilienthal
4 Learning the sensory-motor loop with neurons – recognition association prediction decision
  N. Do Huu, R. Chatila
5 Semantic Labeling of Places using Information Extracted from Laser and Vision Sensor Data
  O.M. Mozos, R. Triebel, P. Jensfelt, W. Burgard
6 Towards Stratified Spatial Modeling for Communication and Navigation
  R.J. Ross, C. Mandel, J. Bateman, S. Hui, U. Frese
7 Learning Spatial Concepts from RatSLAM Representations
  M. Milford, R. Schulz, D. Prasser, G. Wyeth, J. Wiles
8 From images to rooms
  O. Booij, Z. Zivkovic and B. Kröse
9 Robust Models of Object Geometry
  J. Glover, G. Gordon, D. Rus, N. Roy
...
Cognitive Maps for Mobile Robots – An Object based
Approach
Shrihari Vasudevan, Stefan Gächter, Marc Berger & Roland Siegwart
Autonomous Systems Laboratory (ASL)
Swiss Federal Institute of Technology Zurich (ETHZ)
8092 Zurich, Switzerland
{shrihari.vasudevan , stefan.gachter , r.siegwart}@ieee.org
Abstract - Robots are rapidly evolving from factory workhorses to robot-companions. The future of robots, as our
companions, is highly dependent on their abilities to understand,
interpret and represent the environment in an efficient and
consistent fashion, in a way that is comprehensible to humans.
This paper is oriented in this direction. It suggests a hierarchical
probabilistic representation of space that is based on objects. A
global topological representation of places with object graphs
serving as local maps is suggested. Experiments on place
classification and place recognition are also reported in order to
demonstrate the applicability of such a representation in the
context of understanding space and thereby performing spatial
cognition. Further, relevant results from user studies validating
the proposed representation are also reported. Thus the theme of
the work is – representation for spatial cognition.
Index Terms - Cognitive Spatial Representation, Robot
Mapping, Conceptualization of spaces, Spatial Cognition
I. INTRODUCTION
Robotics today is visibly and very rapidly moving
beyond the realm of factory floors. Robots are working their
way into our homes in an attempt to fulfill our needs for
household servants, pets and other cognitive robot
companions. If this “robotic-revolution” is to succeed, it is
going to warrant a very powerful repertoire of skills on the
part of the robot. Apart from navigation and manipulation, the
robot will have to understand, interpret and represent the
environment in an efficient and consistent fashion. It will also
have to interact and communicate in human-compatible ways.
Each of these is a very hard problem. These problems are
made difficult for many reasons, including the sheer amount
of information involved, the many types of data
(multi-modality) and the presence of entities in the
environment that change over time, to name a few. Adding
to all of these problems are the two simple facts that
everything is uncertain and at any time, only partial
knowledge of the environment is available.
The underlying representation of the robot is probably the
single most critical component in that it constitutes the very
foundation for all things we might expect the robot to do,
these include the many complex tasks mentioned above. Thus,
the extent to which robots will evolve from factory workhorses to robot-companions will in some ways (albeit
indirectly) be decided by the way they represent their
surroundings. This report is thus dedicated towards finding an
appropriate representation that will make today’s dream,
tomorrow’s reality.
II. RELATED WORK
Robot mapping is a relatively well researched problem,
however, with many very interesting challenges yet to be
solved. An excellent and fairly comprehensive survey of robot
mapping has been presented in [1]. Robot mapping has
traditionally been classified into two broad categories – metric
and topological. Metric mapping [2] tries to map the
environment using geometric features present in it. A related
concept in this context is that of the relative map [3] – a map
state with quantities invariant to rotation and translation of the
robot. Topological mapping [4] usually involves encoding
place related data and information on how to get from one
place to another. More recently, a new scheme has become
quite popular – the one of hybrid mapping [5, 6]. This kind of
mapping typically uses both a metric map for precision
navigation in a local space and a global topological map for
moving between places.
The one similarity between all these representations is
that all of them are navigation-oriented, i.e. all of them are
built around the single application of robot-navigation. These
maps are useful only in the navigation context and fail to
encode the semantics of the environment. The focus of this
work is to address this deficiency. Several other domains
inspire our approach towards addressing this challenge – these
include hierarchical representations of space, “high-level” †
feature extraction, scene interpretation and the notion of a
Cognitive Map.
The work presented here closely resembles those that
suggest the notion of a hierarchical representation of space.
Ref. [7] suggests one such hierarchy for environment
modeling. In [8], Kuipers put forward a “Spatial Semantic
Hierarchy” which models space in layers comprising
sensorimotor, view-based, place-related and metric
information, respectively. The work in [9] probably bears the most
similarity with the work presented in this paper. The authors
use a naive technique to perform “object recognition” and add
the detected objects to an occupancy grid map. The primary
† Objects, doors etc. are considered “high-level” features, contrasting with
lines, corners etc., which are considered “low-level” ones.
difference in the work presented here is that the proposed
representation uses objects as the functional basis – i.e. the
map is created and grown with the objects perceived.
Typically, humans seem to perceive space in terms of
high-level information such as objects, states & descriptions,
relationships etc. Thus, a human-compatible representation
would have to encode similar information. The work reported
here attempts to create such a representation using typical
household objects and doors. It also attempts to validate the
proposed representation in the context of spatial cognition.
For object recognition, a very promising approach that has
also been used in this work is the one based on SIFT [10].
In our experience, it was found to be a very effective tool for
recognizing textured objects. Several works have attempted to
model and detect doors. The explored techniques range from
modeling/estimating door parameters [11] to those that model
the door opening [12] and to those like [13], based on more
sophisticated algorithms such as boosting. Ref. [13] also
addresses the problem of scene interpretation in the context of
spatial cognition. The authors use the AdaBoost algorithm and
simple low-level scan features and vision together with hidden
Markov models to classify places.
This work takes inspiration from the way we believe
humans represent space. The term “Cognitive Map” was first
introduced by Tolman in a widely cited work, [14]. Since
then, several works in cognitive psychology and AI / robotics
have attempted to understand and conceptualize a cognitive
map. Some of the more relevant theories are mentioned in this
context. Kuipers, in [15], elicited a conceptual formulation of
the cognitive map. He suggests the existence of five different
kinds of information (topological, metric, routes, fixed
features and observations) each with its own representation.
More recently, Yeap et al. in their work [16] trace the theories
that have been put forward to explain the phenomenon of
early cognitive mapping. They classify representations as
being space based and object based. The proposed approach
in this work is primarily an object based one. Some of the
most relevant object based approaches include the
MERCATOR (Davis, 1986) and more recently RPLAN
(Kortenkamp, 1993, [17]). The former bears the closest
resemblance to some of the ideas put forward in this work. It
should be emphasized that among most previously explored
approaches classified as "object" based, either the works do
not necessarily suggest a hierarchical representation or they
do not use high-level features.
In summary, a single unified representation that is
multi-resolution, multi-purpose, probabilistic and consistent is
still a vision of the future; it is also the aspiration of this work.
The approach can be understood as an engineering solution
(as applicable to mobile robots) to the general Cognitive
Mapping problem. Although being primarily object based, the
proposed approach attempts to overcome some of the believed
limitations of purely object based (i.e. no notion of the space)
methods by incorporating some spatial elements (in this case
doors). The kinds of elements that are incorporated will be
gradually upgraded as the work is enhanced.
III. APPROACH
A. Problem Definition
This work is aimed at developing a generic representation
of space for mobile robots. Towards this aim, in this particular
work, two scientific questions are addressed - (1) How can a
robot form a high-level probabilistic representation of space?
(2) How can a robot understand and reason about a place?
The first question directly addresses the problems of high-level feature extraction, mapping and place formation. The
second question may be considered as the problem of spatial
cognition. Together, when appropriately fused, they give rise
to the hierarchical representation being sought. This
representation must consider and treat information uncertainty
in an appropriate manner. Also, in order to understand places,
the robot has to be able to conceptualize space; to be able to
classify its surroundings and to recognize it, when possible.
B. Overview
Figures 1 & 2 respectively show the mapping process and
the method used to demonstrate spatial cognition using the
created map. In an integrated system, the mapping and
reasoning processes cannot be totally separated, but they are
presented separately here to facilitate understanding of the individual
processes. Subsection C elicits the details of the perception
system – this includes the object recognition and door
detection processes. Subsection D specifies the details on how
the representation is created (process depicted in fig. 1) – both
local probabilistic object graphs and individual places. It also
addresses the issue of learning about place categories
(kitchens, offices etc.). Subsection E explains how such a
representation could be used for spatial cognition (process
depicted in fig. 2) and the manner in which the representation
is updated. All of these sections are briefly presented. For
more details, the interested reader is referred to another recent
report by the authors, [18]. The remaining parts of the paper
discuss the experiments conducted, the user study and the
conclusions drawn thereof. The main contribution of this
paper is an enhancement of the previously reported results by
the provision of relevant results from user studies, in support
of our representation, as a cognitive validation of the theory.
Fig. 1 The mapping process. High-level feature extraction is implemented as
an object recognition system. Place formation is implemented using door
detection. Beliefs are represented and appropriately treated. Together, these
are encoded to form a hierarchical representation comprising places
connected by doors, each place represented by a local probabilistic object
graph. Concepts about place categories are also learnt.
Fig. 2 The reasoning process for each place. The first step is place classification:
the robot uses the objects it perceives to classify the place into one of its
known place categories (office, kitchen etc.). The next step is place recognition:
recognizing a specific instance of a place it is aware of. Accordingly,
the map is updated or a new place is added.
C. Perception
This work deals with representing space using high-level
features. In particular, two kinds of features are used here –
typical household objects and doors. Reliable and robust
methods for high-level feature extraction are yet unavailable.
It must be emphasized that the perception component is not
the thrust of this work. Thus, established or
simplified algorithms have been used.
For this work, a SIFT based object recognition system
was developed (fig. 3) along the lines of [10]. The objects
detected are used to represent places as explained in subsection D. Doors are used in this work in the context of place
formation. A method of door detection based on line
extraction and the application of certain heuristics was used.
The sensor of choice was the laser range finder. More details
on the perception of objects and doors with regards to this
work can be found in [18].
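As an illustration of this kind of pipeline, the following minimal sketch matches SIFT descriptors between a stored object view and the current image and accepts the object when enough good matches survive a ratio test. The use of OpenCV and the particular thresholds are assumptions made here for illustration; the authors do not describe their implementation at this level of detail.

```python
import cv2

# A hedged sketch (OpenCV is an implementation choice; the thresholds are
# illustrative) of recognizing a known, textured object in a scene image by
# matching SIFT descriptors with Lowe's ratio test.
def recognize_object(object_img, scene_img, min_matches=10, ratio=0.75):
    sift = cv2.SIFT_create()
    _, des_obj = sift.detectAndCompute(object_img, None)
    _, des_scene = sift.detectAndCompute(scene_img, None)
    if des_obj is None or des_scene is None:
        return False, 0
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(des_obj, des_scene, k=2)
    # keep only matches that are clearly better than the second-best candidate
    good = [p[0] for p in knn
            if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    return len(good) >= min_matches, len(good)
```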
D. Representation
The representation put forward here is a hierarchical one
that is composed of places which are connected to each other
through doors and are themselves represented by local
probabilistic object graphs (a probabilistic graphical
representation composed of objects and relationships between
them). Objects detected in a place are used to form a relative
map for that local space. Doors are incorporated into the
representation when they are crossed and link the different
places together.
Object graphs were used by the authors in [19]. The
problem with this work is that the information encoded in the
representation was purely semantic and not “persistent” i.e.
not invariant and not re-computable based on current
viewpoint. This work addresses this drawback by drawing on
the relative mapping approach in robotics. It suggests the use
of a probabilistic relative object graph as a means of local
metric map representation of places. The metric information
encoded between objects includes distance and angle
measures in 3D space. These measures are invariant to robot
translation and rotation in the local space. Such a
representation not only encodes the inter-object semantics but
also provides for a representation that could be used in the
context of robot navigation.
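To make the idea of viewpoint-invariant relative measures concrete, the sketch below computes pairwise 3D distances between detected objects and the angles subtended at each object by pairs of the others; both depend only on the objects' relative configuration, so they do not change when the robot translates or rotates in the local space. The data layout (a dictionary from object name to 3D position) is a hypothetical choice, and the existential and precision beliefs of the actual representation are omitted.

```python
import itertools
import numpy as np

# A minimal sketch of the metric part of a relative object graph, assuming
# objects are given as name -> 3D position in the current robot frame.
def relative_object_graph(objects):
    distances = {}
    for (a, pa), (b, pb) in itertools.combinations(objects.items(), 2):
        distances[(a, b)] = float(np.linalg.norm(np.asarray(pa) - np.asarray(pb)))
    angles = {}
    for (a, pa), (b, pb), (c, pc) in itertools.combinations(objects.items(), 3):
        # angle at object a between the directions towards b and c
        u = np.asarray(pb) - np.asarray(pa)
        v = np.asarray(pc) - np.asarray(pa)
        cosang = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        angles[(a, b, c)] = float(np.arccos(np.clip(cosang, -1.0, 1.0)))
    return distances, angles
```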
The robot uses odometry to estimate its pose, which is
in turn used in the creation of the relative object graph.
A stereo camera is used to know the positions of various
objects in 3D space. As mentioned before, the representation
is probabilistic. “Existential” beliefs (discrete probability
values) are obtained from the perception system for each
object that is observed. Simultaneously, “precision” beliefs
are maintained in the form of covariance matrices. By
representing both kinds of beliefs, such a representation will
serve in the context of high level reasoning / scene
interpretation and yet be useful for lower level navigation
related tasks. As mentioned earlier, the relative spatial
information encoded includes distance and angle measures in
3D space. These also have associated existence and precision
beliefs. Details on the sensor models used and the
mathematical formulations for belief computation are
mentioned in [18]. Concepts are learnt when creating the
representation of various places. These encode the occurrence
statistics (and thus likelihood values) of different objects in
different place categories (office, kitchen, etc.). Thus, in a
future exploration task, a robot could actually understand its
environment and thereby classify its surroundings based on
the objects it perceives.
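A minimal sketch of such concept learning is given below, under the assumption that each training place is provided as a labelled list of observed objects and that the likelihood of an object in a category is taken as the fraction of training places of that category containing it; the paper's belief treatment is not reproduced here.

```python
from collections import Counter, defaultdict

# A hedged sketch of learning place concepts as object occurrence statistics.
# training_places: list of (category, [object labels seen in one place]).
def learn_concepts(training_places):
    occurrences = defaultdict(Counter)   # category -> object -> #places containing it
    n_places = Counter()                 # category -> number of training places
    for category, objects in training_places:
        occurrences[category].update(set(objects))
        n_places[category] += 1
    # likelihood of seeing an object in a category = fraction of that
    # category's training places in which the object occurred
    return {cat: {obj: occurrences[cat][obj] / n_places[cat]
                  for obj in occurrences[cat]}
            for cat in occurrences}
```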
E. Spatial Cognition (Place Classification / Recognition)
and Map Update
Place classification is done in an online incremental
fashion, with every perceived object contributing to one or
more hypotheses of previously learnt place concepts. Place
recognition is done by a graph matching procedure which
matches both the nodes and their relationships to identify a node
match. The aim is to find the maximal common set of
identically configured objects between places the robot knows
(previously mapped) and the one it currently perceives. A map
update operation (internal graph representation is updated) is
required both for handling the revisiting of places and the re-observation of objects while mapping a place. It involves the
addition / deletion of nodes and the update of their beliefs.
More details can be obtained from [18].
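For illustration, a hedged sketch of the online incremental classification step: a score per learnt place category is accumulated as objects are perceived. The log-score form and the small floor value for unseen objects are assumptions, not the authors' exact formulation.

```python
import math

# A minimal sketch of online incremental place classification using the
# object-occurrence likelihoods produced by learn_concepts above.
def classify_incrementally(observed_objects, concepts, floor=1e-3):
    scores = {cat: 0.0 for cat in concepts}
    for obj in observed_objects:
        for cat, likelihoods in concepts.items():
            # unseen objects get a small floor likelihood (an assumption)
            scores[cat] += math.log(max(likelihoods.get(obj, 0.0), floor))
    best = max(scores, key=scores.get)
    return best, scores
```

Used together with the learning sketch above, perceiving for instance "shelf" and "xerox" would already raise the "office" hypothesis, mirroring the behaviour reported in the experiments of Section IV-C.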
Fig. 3 Object recognition using SIFT features. Left image shows a mug being
recognized, right image shows a table being recognized. Objects used in this
work include cartons of different kinds, a table, a chair, a shelf & a mug.
IV. EXPERIMENTS
A. System Overview and Scenario
The robot platform shown in fig. 4 was used for this
work. The robot is equipped with several sensors including
encoders, stereo and two back-to-back laser range scanners.
The robot was driven across 5 rooms covering about 20m in
distance. The objects used (and the way they are named) for
the representation comprised different cartons (carton,
cartridge, xerox, logitech, elrob, tea), a chair (chair), a mug
(mug), a shelf (shelf), a table (table) & a book (book). The
experiments were conducted in our lab – thus, the places
visited included offices and corridors.
B. Mapping
Fig. 5 shows the path of the robot. The objects & doors
recognized are shown in the object based map depicted in fig.
6. Finally, fig. 7 illustrates the complete probabilistic object-graph representation formed as a result of the process.
The robot performed the mapping process as per
expectations. Objects and doors were recognized and the
representation was formed as per the methods described in the
previous sections. However, the robot often observed
multiple doors at the same place (due to the presence of large
cupboards) on either side of the door. Further, the robot
created multiple occurrences of the corridor, as the topological
information between places that is encoded was not used in
the experiments in this work. Also, it did not see an identical
set of objects through the corridor so as to be able to
recognize the previously visited corridor. These two issues
(fusing of doors and loop closing) will be addressed in
subsequent works.
Fig. 4 The robot platform that was used for the experiments. The encoders,
stereo vision system and laser scanners were used for this work.

Fig. 5 Map displaying the robot path. The robot traverses through 4 rooms,
crossing a corridor each time it moves from one room to another. Green/red
circles indicate the doors detected. The red circles also serve as the place
references for the place explored on crossing the door. The numbers indicate
the sequence in which the places were visited.

Fig. 6 Object based map produced as a result of exploring the test
environment (zoomed-in view). Blue squares are the place references, red
circles are the objects and the green stars are the doors.

C. Spatial Cognition – Place classification/recognition
The robot was made to traverse a previously visited place – SV (office) –
and the corridor (refer to fig. 6). The locations of the movable objects (all
but the table, shelf and the door) were changed so that a significant
configuration change of both places was observed. The robot was then made
to interpret these places.
For the first place, the robot perceived the objects in the sequence
shelf – xerox – carton – table – logitech – cartridge. Fig. 8 displays the
object map for the "unknown" place. On seeing the first two objects, the
robot successfully classified the place as an office. Subsequently the robot
attempted to match this place with its knowledge of the prior offices it had
visited. When finally crossing the door, the robot had found enough objects
(including the door) located in a matching spatial configuration to a place
it had visited before. Thus, at this point, the "unknown" place was
recognized as the place SV (office) and the internal map representation of
the robot was updated to reflect the changes to the place that the robot had
perceived. Figure 9 displays the updated internal representation of the
robot. The corridor was also successfully classified. More details can be
found in [18].

Fig. 8 First 'unknown' place at the time of place recognition. The
configuration of the objects is different from that of the same place in
Fig. 6. Note – the carton is above the table and the xerox is above the shelf.

Fig. 9 Updated internal representation of the robot after place recognition.
Fig. 7 Probabilistic object graph representation created as a result of exploring the path shown in fig. 5
V. USER STUDIES – A COGNITIVE VALIDATION OF THE
PROPOSED REPRESENTATION
A. The study – objectives and methods
The broad aim of the study was to validate the proposed
representation in a cognitive sense. The aim was to verify our
approach and to find out what other details (kinds of features /
data) the proposed representation could encode. As mentioned
before, the complete representation is beyond the scope of this
report. Thus, only results of the survey that are relevant to the
aspects of the representation proposed here are quoted. The
complete study will be reported in a more appropriate forum.
The study was performed with input from 52 people. The
people were chosen from a diverse population spanning
different nationalities, backgrounds and occupations. Both
genders were appropriately represented.
B. Relevant Results
In the tables that follow, most criteria correspond to their
literal (dictionary) meanings. The “function” of a place refers
to the typical functionality / purpose associated with a place.
“Ground materials” refer to the floor material (wooden /
carpeted /…). “Boundaries” refer to walls, doors, partitions
etc. The percentages indicate the proportion of people, out of the
total number surveyed, who replied with information
corresponding to the particular criterion for the place in
consideration.
Survey takers were asked to imagine their presence in a
living room, an office and a kitchen. They were then asked to
describe what they understood / represented about that place
in their minds. Table 1 shows the results obtained. The most
common objects identified with an office were desks, chairs,
computers etc. Living rooms were better understood in terms
of the presence of sofas, armchairs, tables etc. and finally
kitchens were typically identified with cooker, oven, sink,
fridge, utensils etc.
TABLE 1
MEANS OF REPRESENTATION OF PLACES

Criteria / Place     Living Room (%)   Office (%)   Kitchen (%)
Objects                    98              96            98
Function                   13              21            13
Boundaries                 71              48            38
People                     23              10             8
Size                       17              25            35
Ambience                   19              33            27
Luminosity                 37              37            13
Ground Material             8              15            12
Smell                       -                             4
Next, users were taken to three places in our laboratory
premises – a “standard” office, a refreshment room and lastly,
a large electronics lab-office. Survey takers were asked to
describe each place – what they saw in as much detail as
possible. The typical ways in which survey takers tend to
describe these places are summarized in Table 2.
TABLE 2
MEANS OF DESCRIPTIONS OF PLACES

Criteria / Place     Office (%)   Refreshment room (%)   Lab (%)
Objects                  100              100              100
Function                  52               90               63
Boundaries                40               10               15
Finally, users were taken from one room to another and
asked if they believed they were in a new place and the reason
for their belief. The results obtained are shown in a graphical
form below in fig. 10.
Fig. 10 Criteria to ascertain a change of place.
C. Analysis / Inference
The reason survey takers were first asked to imagine
being in a place and then taken to such a place for questioning
was to get both inputs – that of the accumulated (through
experience) representation of the place and also that obtained
from on-site scene interpretation. It was found that objects
constituted a very critical component of both a representation
and a description. People seem to understand places in terms
of the high-level features (objects) that are present in them – the
underlying philosophy of this work and the direction of our
future works as well. It was also found that boundaries (walls
/ doors / windows) constituted an important component in
describing the places and the “function” of the place
(kitchen – cooking etc.) was an important descriptive element.
The last graph seems to convey that boundary elements (such
as doors and walls) and the arrangement of objects are critical
to detecting a change of place. From an implementation
perspective, this information seems to validate our choice for
using the objects as the functional basis of the representation
and doors as the links between places. Lastly, we believe that
a transition between places occurs when there is a change of
“visibility”, a term we can now implement in terms of the
other important factors that crop up in the graph shown in
figure 10, including arrangement of objects, luminosity, size,
color and ground materials. Thus these results not only
validate the proposed representation but further provide ideas
on the future enhancements (functionality of a place etc. need
to be incorporated) to this representation and how it is formed.
VI. CONCLUSIONS & FUTURE WORK
A cognitive probabilistic representation of space based on
high level features was proposed. The representation was
experimentally demonstrated. Spatial cognition using such a
representation was shown through experiments on place
classification and place recognition. The uncertainty in all
required aspects of such a representation was appropriately
represented and treated. Relevant results from a user study
conducted were also reported, thus validating the
representation in a cognitive sense. They also suggest the next
steps towards enhancing the proposed representation.
Fusing of doors and merging of places are both required
to get a more appropriate representation of space. On the
conceptual front, the suggested representation needs to be
made richer but yet lighter and computationally efficient in
applications. A more in-depth survey focusing on specific
aspects of the representation is also warranted.
ACKNOWLEDGEMENTS
The work described in this paper was conducted within
the EU Integrated Project COGNIRON ("The Cognitive
Companion") and was funded by the European Commission
Division FP6-IST Future and Emerging Technologies under
Contract FP6-002020.

REFERENCES
[1] S. Thrun, "Robotic mapping: A survey", in G. Lakemeyer and B. Nebel,
editors, Exploring Artificial Intelligence in the New Millennium, Morgan
Kaufmann, 2002.
[2] R. Chatila and J.-P. Laumond. Position referencing and consistent world
modeling for mobile robots. In the Proceedings of the IEEE International
Conference on Robotics and Automation, 1985.
[3] A. Martinelli, A. Svensson, N. Tomatis and R. Siegwart, "SLAM Based
on Quantities Invariant of the Robot's Configuration", IFAC Symposium
on Intelligent Autonomous Vehicles, 2004.
[4] H. Choset and K. Nagatani, "Topological simultaneous localization and
mapping (SLAM): toward exact localization without explicit
localization", IEEE Transactions on Robotics and Automation, Volume:
17 Issue: 2, Apr 2001.
[5] S. Thrun, "Learning metric-topological maps for indoor mobile robot
navigation", Artificial Intelligence, Volume 99, Issue 1, February 1998,
Pages 21-71.
[6] N. Tomatis, I. Nourbakhsh and R. Siegwart, "Hybrid Simultaneous
Localization and Map Building: A Natural Integration of Topological and
Metric", Robotics and Autonomous Systems, 44, 3-14., July 2003.
[7] A. Martinelli, A. Tapus, K.O. Arras and R. Siegwart, "Multi-resolution
SLAM for Real World Navigation", In Proceedings of the 11th
International Symposium of Robotics Research, Siena, Italy, 2003.
[8] B. Kuipers, "The Spatial Semantic Hierarchy", Artificial Intelligence,
119: 191-233, May 2000.
[9] C. Galindo, A. Saffiotti, S. Coradeschi, P. Buschka, J.A. Fernández-Madrigal and J. González, "Multi-Hierarchical Semantic Maps for Mobile
Robotics", Proc. of the IEEE/RSJ Intl. Conf. on Intelligent Robots and
Systems (IROS), pp. 3492-3497. Edmonton, CA, 2005.
[10] D. G. Lowe, "Distinctive image features from scale-invariant keypoints", International Journal of Computer Vision, vol. 60, no. 2, 2004.
[11] D. Anguelov, D. Koller, E. Parker, S. Thrun, "Detecting and Modeling
Doors with Mobile Robots", Proceedings of the International Conference
on Robotics and Automation (ICRA), 2004.
[12] D. Kortenkamp, L. D. Baker, and T. Weymouth, "Using Gateways to
Build a Route Map," Proceedings IEEE/RSJ International Conference on
Intelligent Robots and Systems, 1992.
[13] C. Stachniss, Ó. Martínez-Mozos, A. Rottmann, and W. Burgard,
"Semantic Labeling of Places", In Proc. of the Int. Symposium of
Robotics Research (ISRR), San Francisco, CA, USA, 2005.
[14] E. C. Tolman. “Cognitive maps in rats and men.” Psychological Review,
55:189-208, 1948.
[15] B. J. Kuipers, "The cognitive map: Could it have been any other way?",
Spatial Orientation: Theory, Research, and Application. New York:
Plenum Press, 1983, pages 345-359.
[16] W.K. Yeap and M.E. Jefferies, "On early cognitive mapping", Spatial
Cognition and Computation 2(2) 85-116, 2001.
[17] D. Kortenkamp. “Cognitive maps for mobile robots: A representation
for mapping and navigation”. PhD Thesis, University of Michigan, 1993.
[18] S. Vasudevan, V. Nguyen and R. Siegwart, “Towards a Cognitive
Probabilistic Representation of Space for Mobile Robots”, In the
Proceedings of the IEEE International Conference on Information
Acquisition (ICIA), 20-23 August 2006, Shandong, China.
[19] A. Tapus, S. Vasudevan, and R. Siegwart, "Towards a Multilevel
Cognitive Probabilistic Representation of Space", In Proceedings of the
International Conference on Human Vision and Electronic Imaging X,
part of the IS&T/SPIE Symposium on Electronic Imaging 2005, 16-20
January 2005, CA, USA.
Hierarchical localization by matching vertical lines
in omnidirectional images
A.C.Murillo, C.Sagüés, J.J.Guerrero
T. Goedemé, T. Tuytelaars, L. Van Gool
DIIS - I3A, University of Zaragoza, Spain
{acm, csagues, jguerrer}@unizar.es
PSI-VISICS, University of Leuven, Belgium
{tgoedeme, tuytelaa, vangool}@esat.kuleuven.be
Abstract— In this paper we propose a new vision based method
for robot localization using an omnidirectional camera. The
method has three steps efficiently combined to deal with big
reference image sets; each step evaluates fewer images than the
previous one but is more complex and accurate. Given the current
uncalibrated image seen by the robot, the hierarchical algorithm
gives the possibility of obtaining appearance-based (topological)
and metric localization. Compared to other similar vision-based
localization methods, the one proposed here has the advantage
that it gets accurate metric localization based on a minimal
reference image set, using the 1D three view geometry. Moreover,
thanks to the linear wide baseline features used, the method is
insensitive to illumination changes and occlusions, while keeping
the computational load small. The simplicity of the radial line
feature used speeds up the process while it keeps acceptable
accuracy. We show experiments with two omnidirectional image
data-sets to evaluate performance of the method.
Index Terms: - topological + metric localization, hierarchical methods, radial line matching, omnidirectional images
I. INTRODUCTION
Usually robots have at their disposal a map of the environment where
they have to act (either given beforehand or automatically acquired). For some robotic tasks, such as navigation
or obstacle avoidance, the robot needs to localize itself to
correct the trajectory errors due to, for instance, odometry
inaccuracy, slipping ... At this point it is clear we need accurate
metric localization information. However, the same accuracy
in localization is not always needed. Sometimes the robot
needs just topological localization (e.g. identifying in which
room it is, because the robot is asked for some information
about a certain room), and indeed this is more intuitive
information for communicating with humans.
When working with computer vision, these maps usually
consist of a set of more or less organized reference images. In
this paper we present a vision based method for hierarchical
robot localization, combining topological and metric information, using a Visual Memory (VM). This VM consists of a
database of sorted omnidirectional reference images (we know
a priori the room they belong to and their relative positions).
Omnidirectional vision has become widespread in the last
years, and has many well-known advantages as well as extra
difficulties compared to conventional images. There are many
works using all kinds of omnidirectional images, e.g. map-based navigation with images from conic mirrors [1] or
localization based on panoramic cylindric images composed
of mosaics of conventional ones [2].
We focus our work on hierarchical localization, a quite
interesting subject because of the possibility of combining
different kinds of localization and also because its higher
efficiency allows big databases of reference images to be handled
better. The localization at different levels of accuracy with
omnidirectional images was previously proposed by Gaspar
et al. [3]. Their navigation method consists of two steps: a
fast but less accurate topological step, useful in long-term
navigation, is followed by a more accurate tracking-based step
for short distance navigation. One of the main novelties of our
proposal is that it goes all the way to metric localization using multiple
view geometry constraints. Most of the previous vision-based
localization algorithms give as final localization the location
of one of the images from the reference set, which implies either less
accuracy or a very high density of the stored reference set. For
example in [4], a hierarchical localization with omnidirectional
images is performed. It is based on the Fourier signature and
gives an area where the robot is located around reference
images from one environment. It is computationally efficient
but it seems to require a very dense database to achieve precise
localization. With the metric localization we propose, the
stored images density is only restricted by the wide-baseline
matching that the local features used are able to deal with.
Our method performs a hierarchical localization in three
steps, each one more accurate than the previous but also
computationally more expensive. From the two first steps we
get the topological localization (the room where we are), using
a method inspired by the work of Grauman et al. [5] for object
recognition using pyramid matching kernels. In the third and
final step, we obtain a metric localization relative to
the VM. For that we use local feature matches in three views
to robustly estimate a 1D trifocal tensor. The relative location
between views can be obtained with an algorithm based on
the 1D trifocal tensor for omnidirectional images [6].
The features used are scene vertical lines. They are projected
in the omni-images as radial ones, supposing vertical camera
axis, and allow us to estimate the center of projection in the
image. The 1D projective cameras have been previously used
e.g. for self calibration [7] and the 1D trifocal tensor, that
explains the geometry for three of those camera views, was
presented in [8]. Correcting radial distortion is another recent
application of the 1D tensor for omnidirectional images [9].
The line features are detailed in section II and all the steps of
the localization method are explained in section III. In section
IV we can see several experiments proving the effectiveness
of the different steps of our method with real images.
II. LINES AND THEIR DESCRIPTORS
The number of features proposed for matching has increased
largely over the last few years, where SIFT [10] has become
very popular. Yet, in this work, we have chosen vertical lines
with their support regions as features. These lines show several
advantages when working with omnidirectional images (e.g.
robustness against partial occlusion, easy and fast extraction,
and more invariance to affine geometrical deformations). Each
line is described by a set of descriptors that characterize it in
the most discriminant way possible, although it is necessary to
find a balance between invariance and discriminative power,
as the more invariant the descriptors are, the less discriminant
they become. In this section, we first explain the line extraction
process (section II-A), after which we describe the kind
of descriptors used to characterize them (section II-B).
A. Line Extraction
In this work we use lines and their Line Support Region
(LSR) as primitives for the matching. They are extracted using
our implementation of the Burns algorithm [11]. Firstly, this
extraction algorithm computes the image brightness gradient
and afterwards segments the image into line support regions, which
consist of adjacent pixels with similar gradient direction and
gradient magnitude higher than a threshold. Secondly a line
is fitted to each of those regions using a model of brightness
variation over the LSR [12]. We can see some extracted LSR
in Fig. 1. As mentioned before, we use only vertical lines. We
suppose that the optical axis of the camera is perpendicular to
the floor and that the motion is done on a plane parallel to the
floor. Under these conditions, these lines are the only ones that
keep always their straightness, being projected as radial lines
in the image. Therefore, they are quite easy to find and we
can automatically obtain from them the center of projection,
an important parameter for calibrating this kind of images. We
estimate it with a simple RANSAC-based algorithm that checks
where the radial lines are pointing (a minimal sketch follows below).
As can be seen in Fig. 1, the real center of projection does not
coincide with the center of the image, and the more accurately we
find its real coordinates, the better we can later estimate the
multi-view geometry and therefore the robot localization.
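The following is a minimal sketch (not the authors' implementation) of such a RANSAC-style estimate: candidate centres are obtained by intersecting random pairs of extracted line segments, and the candidate supported by the largest number of lines is kept.

```python
import numpy as np

# Each extracted segment is assumed to be given as (p, q), two 2D endpoints.
def line_params(p, q):
    # Implicit line a*x + b*y + c = 0 through p and q, with unit normal.
    p, q = np.asarray(p, float), np.asarray(q, float)
    d = q - p
    n = np.array([-d[1], d[0]])
    n /= np.linalg.norm(n)
    return n[0], n[1], -np.dot(n, p)

def estimate_center(segments, iters=200, tol=3.0, rng=np.random.default_rng(0)):
    lines = [line_params(p, q) for p, q in segments]
    best_center, best_inliers = None, -1
    for _ in range(iters):
        i, j = rng.choice(len(lines), size=2, replace=False)
        A = np.array([lines[i][:2], lines[j][:2]])
        b = -np.array([lines[i][2], lines[j][2]])
        if abs(np.linalg.det(A)) < 1e-9:
            continue  # nearly parallel pair, skip
        c = np.linalg.solve(A, b)  # candidate centre = intersection of the pair
        # count how many lines pass within tol pixels of the candidate centre
        inliers = sum(1 for a, bb, cc in lines
                      if abs(a * c[0] + bb * c[1] + cc) < tol)
        if inliers > best_inliers:
            best_center, best_inliers = c, inliers
    return best_center, best_inliers
```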
B. Line Descriptors
After extracting the image features, next step is to compute
a set of descriptors that characterize them as well as possible.
Most descriptors will be computed separately over each of the
sides in which the LSR is divided (Fig.1).
1) Geometric Descriptors: We obtain two geometric parameters from each vertical line. The first is its direction δ, a
boolean indicating whether the line points towards the center or not;
this direction is established depending on which side of the
line the darkest region lies. The second parameter is the line
orientation in the image (θ ∈ [0, 2π]).
Fig. 1. Left: detail of some LSRs, divided by their line (the darkest side on
the right in blue and the lighter on the left in yellow). A red triangle shows
the lines direction. Right: all the radial lines with their LSR extracted in one
image. The estimated center of projection is pointed with a yellow star (∗)
and the center of the image with a blue cross (+).
2) Color Descriptors.: We have worked with three of the
color invariants suggested in [13], based on combinations of
the generalized color moments,
M_{pq}^{abc} = \int\!\!\int_{LSR} x^p \, y^q \, [R(x, y)]^a \, [G(x, y)]^b \, [B(x, y)]^c \, dx \, dy,   (1)

with M_{pq}^{abc} being a generalized color moment of order p + q and
degree a+b+c and R(x, y), G(x, y) and B(x, y) the intensities
of the pixel (x, y) in each RGB color band centralized around
its mean. These invariants are grouped in several classes,
depending on the schema chosen to model the photometric
transformations. After studying the work where they are
defined, and trying to keep a compromise between complexity
and discriminative power, we chose as most suitable for us
those invariants defined for scale photometric transformations
using relations between couples of color bands. The definitions
of the 3 descriptors chosen are as follows:
D_S^{RG} = \frac{M_{00}^{110} \, M_{00}^{000}}{M_{00}^{100} \, M_{00}^{010}}, \quad
D_S^{RB} = \frac{M_{00}^{101} \, M_{00}^{000}}{M_{00}^{100} \, M_{00}^{001}}, \quad
D_S^{GB} = \frac{M_{00}^{011} \, M_{00}^{000}}{M_{00}^{010} \, M_{00}^{001}}   (2)
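For illustration, the following small sketch computes the moments of Eq. (1) and the three invariants of Eq. (2) over an LSR given as an array of (x, y, R, G, B) rows. The data layout is an assumption, and how the color bands are centralized around their means is left out of this sketch.

```python
import numpy as np

# lsr: (n_pixels x 5) array with rows (x, y, R, G, B).
def color_moment(lsr, p, q, a, b, c):
    x, y, R, G, B = (lsr[:, i] for i in range(5))
    # discrete approximation of the integral of Eq. (1) over the LSR pixels
    return np.sum((x ** p) * (y ** q) * (R ** a) * (G ** b) * (B ** c))

def ds_invariants(lsr):
    M = lambda a, b, c: color_moment(lsr, 0, 0, a, b, c)
    ds_rg = (M(1, 1, 0) * M(0, 0, 0)) / (M(1, 0, 0) * M(0, 1, 0))
    ds_rb = (M(1, 0, 1) * M(0, 0, 0)) / (M(1, 0, 0) * M(0, 0, 1))
    ds_gb = (M(0, 1, 1) * M(0, 0, 0)) / (M(0, 1, 0) * M(0, 0, 1))
    return ds_rg, ds_rb, ds_gb
```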
3) Intensity Frequency Descriptors.: Finally we also use as
descriptors the first seven coefficients of the Discrete Cosine
Transform, DCT , over the intensity signal (I) of the LSR of
each line. The DCT is a well known transform in the areas of
signal processing [14] and widely used for image compression.
It is possible to estimate the number of coefficients necessary
to describe a certain percentage of the image content. For
example with our test images, seven coefficients are necessary
on average to represent 99% of the intensity signal over a LSR.
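A minimal sketch of this descriptor, assuming the LSR intensities are flattened into a 1D signal and using SciPy's DCT (an implementation choice, not the authors'):

```python
import numpy as np
from scipy.fft import dct

# Keep the first few DCT coefficients of the LSR intensity signal as a
# compact frequency descriptor; n_coeffs = 7 follows the value in the text.
def dct_descriptor(lsr_intensities, n_coeffs=7):
    signal = np.asarray(lsr_intensities, dtype=float).ravel()
    return dct(signal, norm='ortho')[:n_coeffs]
```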
We have chosen 22 descriptors, but to deal with big amounts
of images it would be good to decrease this number. Using
Principal Component Analysis (PCA) we obtain several linear
combinations of the descriptors under study, each with an
associated discriminative level. If we look at the weights of
each descriptor in the linear combinations which turn out to
be most distinctive for our feature sets, we can get an idea of
which descriptors seem most representative. We thus reduced
our line descriptor vector to the 12 elements that we will use
for the Pyramidal matching:
3 color descriptors (DS RG , DS RB , DS GB ) and 2 frequency
descriptors (DCT1 , DCT2 ), computed on each side of the LSR,
and 2 geometric properties (orientation θ and direction δ).
However, for the individual line matching we propose to also
use the 5 following DCT coefficients on both sides of the LSRs,
i.e. 22 descriptors per line, because there the descriptor size is
not as critical since we deal with only 3 images.
III. HIERARCHICAL LOCALIZATION METHOD

As said before, our proposal consists of a hierarchical robot
localization in three steps. With this kind of localization, we
can deal with large databases of images in an acceptable time.
This is possible because we avoid evaluating all the images
in the entire data set in the most computationally expensive
steps. Note also that the VM already stores the extracted image
features and their descriptors. Even the matches between
adjacent images can be pre-computed. The three steps of our
method, schematized in Fig. 2, are described in the following
subsections.

Fig. 2. Diagram of the three steps of the hierarchical localization: a global
filter over the whole visual memory (constant cost per image), pyramidal
matching over the filtered reference images (linear in the number of features)
and three-view matching with estimation of the 1D radial trifocal tensor
(quadratic in the number of features), yielding first a topological and finally
a metric localization.

A. Global Descriptor Filter

In a first step, we compute three color invariant descriptors
over all the pixels in the image. The descriptor used is DS,
described previously in section II-B.2. We compare all the
images in the VM with the current one and we discard the
images with more than an established difference in those
previously normalized descriptors.

B. Pyramidal Matching

This step finds which is the most similar image to the current
one. For this purpose, the set of descriptors of each line is
used to implement a pyramid matching kernel [5]. The idea
consists of building for each image several multi-dimensional
histograms (each dimension corresponds to one descriptor),
where each line feature occupies one of the histogram bins.
The value of each line descriptor is rounded to the histogram
resolution, which gives a set of coordinates that indicates the
bin corresponding to that line.

Several levels of histograms are defined. In each level,
the size of the bins is increased by powers of two until all
the features fall into one bin. The histograms of each image
are stored in a vector (pyramid) ψ with different levels of
resolution.

The similarity between two images, the current (c) and one
of the visual memory (v), is obtained by finding the intersection
of the two pyramids of histograms

S(ψ(c), ψ(v)) = \sum_{i=0}^{L} w_i N_i(c, v),   (3)

with N_i the number of matches (LSRs that fall in the same bin
of the histograms, see Fig. 3) between images c and v in level
i of the pyramid, and w_i the weight for the matches in that level,
which is the inverse of the current bin size (2^i). This similarity
is divided by a factor determined by the self-similarity score
of each image, in order to avoid giving an advantage to images
with bigger sets of features, so the measure obtained is

S_{cv} = \frac{S(ψ(c), ψ(v))}{\sqrt{S(ψ(c), ψ(c)) \, S(ψ(v), ψ(v))}}.   (4)

Fig. 3. Example of Pyramidal Matching, with correspondences in levels 0, 1
and 2. For graphic simplification, with a descriptor of 2 dimensions.

Notice that the matches found here are not always individual
feature-feature matches, as we just count how many fall in the
same bin. The higher the level in the pyramid, the bigger the
bins, so the easier it is to get multiple coincidences in the
same bin.

Once the similarity measure between our current image and
the VM images is obtained, we choose the one with the highest
S_{cv} as the most similar. Based on the annotations of this
chosen image, we know in which room the robot currently is.
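To illustrate Eqs. (3)-(4), the following minimal sketch assumes each image is given as an array of already-scaled line-descriptor vectors; the binning details and the number of levels are assumptions, not the authors' exact implementation.

```python
from collections import Counter
import numpy as np

# Build one histogram per level; bins double in size at each level.
def build_pyramid(features, levels):
    return [Counter(tuple(np.floor(f / (2 ** i)).astype(int)) for f in features)
            for i in range(levels + 1)]

# Eq. (3): weighted sum of bin intersections, weight w_i = 1 / bin size.
def pyramid_similarity(pyr_a, pyr_b):
    return sum(sum((ha & hb).values()) / float(2 ** i)
               for i, (ha, hb) in enumerate(zip(pyr_a, pyr_b)))

# Eq. (4): normalization by the self-similarity of both images.
def normalized_score(feat_c, feat_v, levels=4):
    pc, pv = build_pyramid(feat_c, levels), build_pyramid(feat_v, levels)
    return pyramid_similarity(pc, pv) / np.sqrt(
        pyramid_similarity(pc, pc) * pyramid_similarity(pv, pv))
```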
C. Line Matching and Metric Localization
For some applications, more accurate localization information
than the room is needed. In this case, we continue
with the third step of the hierarchical method, where we
need to obtain individual matches to compute the localization
through a trifocal tensor. From the previous step, the pyramidal
matching, we could think of getting some matches. However, as
we show from several tests, it is not so convenient because
many times the matches are multiple and it is not possible to
distinguish which one goes with which, due to the discrete
size of the bins in the pyramid.
1) Line matching algorithm.: Once we have the most
similar image, we find two-view line matches between the
current and the most similar, and between that most similar and
its adjacent in the VM. The two-view line matching algorithm
we propose works as follows:
• First decide which lines are compatible with each other.
They must have the same direction (every line has a
direction assigned when extracted, depending on which side
of the line the darkest region lies).
The relative rotation between each line and a compatible one
in the other view has to be consistent. It must be similar to
the global rotation between both images obtained with the
average gradient of all image pixels. This global rotation is
not accurate at all, but if it is for instance 90 degrees, the real
rotation between matches should not be zero.
As explained, the rest of the descriptors are computed in regions on both sides of the line. To classify two lines as
compatible, at least the descriptor distances on one side of the
LSR should be under an established threshold.
• Compute a unique distance, using all the descriptors'
individual distances, between each pair of compatible lines.
Mahalanobis distances are the most commonly used; however,
a lot of training is needed to compute the required covariance
matrix with satisfying accuracy. Alternatively, we have tried
to normalize those distances dividing the descriptors by the
maximum value of each one. Theoretically, this may not be as
correct as Mahalanobis distances, but in practice it works well
and it is more adaptable to different queries. So it is enough to
normalize the values and to have the distances of the different
kinds of descriptors in similar scales, so we can sum them.
We also apply a correction or penalty to increase the distance
with the ratio of descriptors (Nd ) whose differences were over
their corresponding threshold in the compatibility test. So the
final distance we consider between two lines, i and j, will be
d_{ij} = (d_{RGB}^{Min} + d_{DCT}^{Min})(1 + N_d),   (5)

where d_{RGB}^{Min} and d_{DCT}^{Min} are the smallest distances (in color
and intensity descriptors respectively) over the two LSR sides.
A small illustrative sketch of this distance appears after the
matching steps below.
• Perform a nearest neighbour matching between compatible lines.
• Apply a topological filter to the matches that helps to
reject non-consistent matches. It is an adaptation for radial
lines in omnidirectional images of the proposal in [15], that is
based on the probability that two lines will keep their relative
position in both views. It improves the robustness of this initial
matching (although it can reject some good matches, those can
be recovered afterwards in next step, now it is more important
to have robustness).
• Run a re-matching step, that takes into account the fact
that the neighbours to a certain matched line should rotate in
the image in a similar way from 1 view to the other.
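As a concrete illustration of the combined distance of Eq. (5) followed by the nearest-neighbour step, here is a hedged sketch with a hypothetical descriptor layout (per-side color and DCT vectors); the compatibility tests, topological filter and re-matching step described above are omitted, and the per-descriptor thresholds are collapsed into a single scalar for simplicity.

```python
import numpy as np

# Each line is a dict with per-side ('L', 'R') 'rgb' and 'dct' descriptor
# vectors -- an assumed layout, not the authors' data structures.
def line_distance(l1, l2, threshold=1.0):
    d_rgb = min(np.linalg.norm(l1['rgb'][s] - l2['rgb'][s]) for s in ('L', 'R'))
    d_dct = min(np.linalg.norm(l1['dct'][s] - l2['dct'][s]) for s in ('L', 'R'))
    # Nd: fraction of descriptor components whose difference exceeds the threshold
    diffs = np.abs(np.concatenate([l1['rgb']['L'] - l2['rgb']['L'],
                                   l1['dct']['L'] - l2['dct']['L']]))
    nd = float(np.mean(diffs > threshold))
    return (d_rgb + d_dct) * (1.0 + nd)   # Eq. (5)

def nearest_neighbour_matches(lines_a, lines_b, threshold=1.0):
    # For every line in the first view, pick its closest line in the second.
    return [(i, int(np.argmin([line_distance(la, lb, threshold)
                               for lb in lines_b])))
            for i, la in enumerate(lines_a)]
```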
2) 1D Radial Trifocal Tensor and Metric Localization.:
It is well known, that to solve the structure and motion
problem from lines at least three views are necessary. Then,
after computing two-view matches between the two couples of
images explained before, we extend the matches to three views
intersecting both sets of matches and checking a consistent
behaviour.
There is a scheme in Fig. 4 showing a vertical line projected
in the three views used and the location parameters we want
to estimate (α21 , tx21 , ty21 , α31 , tx31 , ty31 ).
We apply a robust method (RANSAC in our case) to the three-view
matches to simultaneously reject outliers and estimate a
1D radial trifocal tensor. The orientation of the lines in the
image is expressed by 1D homogeneous coordinates (r =
[cos θ, sin θ]) to estimate the tensor. The trilinear constraint,
imposed by a trifocal tensor on the coordinates of the same
line v projected in three views (r1, r2, r3), is as follows,

\sum_{i=1}^{2} \sum_{j=1}^{2} \sum_{k=1}^{2} T_{ijk} \, r_1(i) \, r_2(j) \, r_3(k) = 0,   (6)
where Tijk (i, j, k = 1, 2) are the eight elements of the
2 × 2 × 2 trifocal tensor and the subindices (·) denote the components
of the vectors r.
With five matches and two additional constraints defined
for the calibrated situation (internal parameters of the camera
are known), we have enough to compute a 1D trifocal tensor
[8]. In our case with omnidirectional images we have a radial
1D trifocal tensor. From this tensor, even without camera
calibration, we can get a robust set of matches, the camera
motion and the structure of the scene [6]. In general, the
translation between views is estimated up to a scale, but with
the a priori knowledge that we have (we know the relative
position of the two images in the visual memory) we can
solve it exactly.
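For illustration, a minimal sketch of the trilinear constraint of Eq. (6) and of a simple linear (SVD-based) estimate of the tensor from three-view line orientations; the authors' calibrated five-match solution with its two additional constraints, and the surrounding RANSAC loop, are not reproduced here.

```python
import itertools
import numpy as np

# Residual of Eq. (6) for one line seen in three views with orientations
# theta1, theta2, theta3 (radians); T is the 2x2x2 tensor.
def trilinear_residual(T, theta1, theta2, theta3):
    r = [np.array([np.cos(t), np.sin(t)]) for t in (theta1, theta2, theta3)]
    return sum(T[i, j, k] * r[0][i] * r[1][j] * r[2][k]
               for i, j, k in itertools.product(range(2), repeat=3))

# Linear estimate of T (up to scale) from n >= 7 three-view correspondences,
# stacking Eq. (6) and taking the null vector of the resulting system.
def estimate_tensor_linear(triplets):
    A = []
    for t1, t2, t3 in triplets:
        r = [np.array([np.cos(t), np.sin(t)]) for t in (t1, t2, t3)]
        A.append([r[0][i] * r[1][j] * r[2][k]
                  for i, j, k in itertools.product(range(2), repeat=3)])
    _, _, vt = np.linalg.svd(np.asarray(A))
    return vt[-1].reshape(2, 2, 2)
```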
IV. EXPERIMENTS AND RESULTS
In this section we show the performance of the algorithms
from previous sections for detecting the current room (IV-A),
for line matching (IV-B) and for metric localization (IV-C).
We worked with two data-sets, Almere and our own (named
data-set 2). For this second one, ground truth data was
available, which is convenient for measuring the localization errors.
From the Almere data-set, we have extracted the frames
from the low quality videos from the rounds 1 and 4 (2000
frames extracted in the first, and 2040 in the second). We
kept just every 5 frames, selecting half for the visual memory
(every 10 frames starting from number 10) and the other half
for testing (every 10 frames starting from number 5).

Fig. 4. Motion between images and landmark parameters: the location
parameters (α21, tx21, ty21, α31, tx31, ty31) relating the current image (O1),
the most similar image in the VM (O2) and its adjacent image in the VM (O3),
and the line coordinates r1, r2, r3.
The other visual memory has 70 omnidirectional images
(640x480 pixels). 37 of them are sorted, classified in different
rooms, with between 6 and 15 images of each one (depending
on the size of the room). The rest corresponds to unclassified
ones from other rooms, buildings or outdoors.
In Fig. 5 we can see a scheme of both databases, the second
one with details of the relative displacements between them.
All images have been acquired with an omnidirectional vision
sensor with hyperbolic mirror.
Fig. 5. Grids of images in the rooms used in the experiments. Top: Almere
data-set, round 1 (living room, bedroom, kitchen, corridor). Bottom: Data-set 2.
The present implementation of our method is not optimized
for speed, so timing information would be irrelevant. However,
we have analyzed the time complexity in the three steps.
• The first step (Section III-A) has constant complexity with
the number of features in the image. Here all the images in
the VM are evaluated.
• The second step algorithm (Section III-B) is linear with
the number of features. Here only the images that passed the
previous step are evaluated, using 12-descriptors/feature.
• The third step process (Section III-C) has quadratic
complexity with the number of features, and a 22-descriptors
vector per feature is used. However here we evaluate only the
query image and two from the VM.
Indeed, in each step the computational cost is also linear in
the number of images remaining in it. It is therefore very
important to keep a minimal set of images for the last step.
A. Localizing the current room
In this experiment we ran the two first steps of the method.
We evaluated the topological localization for all possible tests
in Data-set 2 (removing 1 test-image and comparing against
all the rest). For the experiments with the Almere data-set we
used the frames left for testing from the round 1, and also the
same frame-number from the round 4.
First step: Before computing a similarity measure against the full data-set, we applied the global filter. It was expected to reject a large number of images, those very different from the query, leaving mostly the images of the same room plus a small percentage of outliers (images from other rooms). Taking into account the number of images rejected and the false negatives obtained, the best performing setting for this filter was to reject the images with more than 80% difference in the global descriptors. With Data-set 2, around 70% of the images were rejected, of which only an average of 4% were rejected although correct (false negatives). Using the Almere images the results were a little worse (55% rejected and 9% false negatives). This decrease in performance became more pronounced when classifying images from round 4, which contains many occlusions, where we sometimes got too many false negatives (around 60% rejected with 14% false negatives). In the worst cases all good solutions were already rejected in this first step, making it impossible for the next steps to find a correct classification.
Second step: The goal of this step is to choose the image most similar to the current one, which for us is correct if it belongs to the proper room.
Firstly, to build the pyramid of histograms it is necessary to perform a suitable scaling of the descriptors, as their values lie in very different ranges. We scale the descriptors considered more important (the color and geometric ones) between 0 and 100, and the rest between 0 and 50, so that the accuracy of the most important descriptors decreases more slowly than that of the others at each level of the pyramid. Here, from the Almere set chosen for the VM, we use only a fifth (every 50 frames: 50-100-150-...), as those images are still quite close together and such density is not necessary to localize the room. In Data-set 2 the images are already more separated, as they were taken at regular distances and not along a whole robot trajectory, so we kept them all for the VM. In Data-set 2 the VM images were also more evenly distributed over the rooms and without the conflicts present in the Almere set, e.g. when the robot is between two rooms. Because of this, and because the VM from the Almere trajectory was built automatically from the video sequence, without any post-processing such as clustering nodes, we observe lower performance with the Almere data-set.
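To make the descriptor scaling and the pyramid of histograms more concrete, here is a minimal Python sketch under our own assumptions (it is not the authors' code, and it uses a generic pyramid-match-style score in the spirit of [5] rather than their exact formulation): important descriptor dimensions are scaled to [0, 100] and the rest to [0, 50], and the histogram bin width doubles at each pyramid level, so the finely scaled dimensions keep their resolution longer.

import numpy as np
from collections import Counter

def scale_descriptors(desc, important_idx, lo=50.0, hi=100.0):
    # Rescale each descriptor dimension to [0, lo]; important ones to [0, hi].
    desc = np.asarray(desc, dtype=float)
    rng = desc.max(axis=0) - desc.min(axis=0) + 1e-9
    scaled = (desc - desc.min(axis=0)) / rng * lo
    scaled[:, important_idx] *= hi / lo
    return scaled

def pyramid_match(X, Y, levels=6, cell0=1.0):
    # Pyramid-match-style score: bin width doubles per level and features
    # newly matched at level l are weighted by 1/2**l.
    score, prev = 0.0, 0.0
    for lvl in range(levels):
        cell = cell0 * 2 ** lvl
        hx = Counter(map(tuple, np.floor(X / cell).astype(int)))
        hy = Counter(map(tuple, np.floor(Y / cell).astype(int)))
        inter = sum(min(c, hy[b]) for b, c in hx.items())   # histogram intersection
        score += (inter - prev) / 2 ** lvl
        prev = inter
    return score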
The results when choosing the most similar image are shown in Table I for all experiments: the one with Data-set 2 and the two using Almere data (Almere1∝1 indicates that we classify images from round 1 against a VM built from other images of round 1, while Almere4∝1 means that we classify images from round 4 against the VM made from round 1). The image chosen as most similar (1 Ok in the table) was correct in 81% of the cases in the Almere1∝1 test and 97% in Data-set 2. In the test Almere4∝1 the performance was lower, but we should take into account that this round contains many occlusions, while the reference set (Almere1) is from an occlusion-free round. This makes the first step work worse and the global
TABLE I
FINDING THE MOST SIMILAR IMAGE IN THE VM. 1 Ok: most similar found correct. 3 Ok: 3 first images sorted by the similarity score correct.

        Almere1∝1   Almere4∝1   Data-set 2
1 Ok    81%         71%         97%
3 Ok    52%         20%         74%
appearance-based descriptors less useful. In some cases, even all possible correct images were rejected by it. This confirms the importance of using local features to deal with occlusions. If we skip the first step, which is based on global appearance, the performance increases but the computational cost also grows considerably. We must find a compromise between complexity and correctness.
Pyramidal vs. Individual Matching to get a similarity score: The results of the algorithm we developed for matching lines in Section III-C could also be used for searching for the most similar image. For this purpose, we can compute a similarity score that depends on the number of matches found (n) between the pair of images, weighted by the distance (d) between the lines matched. It also has to take into account the number of features not matched in each image (F1 and F2 respectively), weighted by the probability of occlusion of the features (Po). The defined dissimilarity (DIS) measure is

DIS = n d + F1 (1 − Po) + F2 (1 − Po).     (7)
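As a purely illustrative sketch of Eq. (7) (our own, not the authors' code), assuming that d is the mean distance of the matched line pairs and taking Po = 0.5 only as a placeholder value:

def dissimilarity(match_distances, F1, F2, Po=0.5):
    # match_distances: distances of the n matched line pairs;
    # F1, F2: numbers of unmatched features in each image.
    n = len(match_distances)
    d = sum(match_distances) / n if n else 0.0   # mean distance of the matches
    return n * d + F1 * (1.0 - Po) + F2 * (1.0 - Po)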
TABLE II
METRIC LOCALIZATION FROM A Data-set 2 TEST. reference: reference data; manual: results with manual matches; automatic: mean and standard deviation for results from automatic matches (50 executions per test).

                  Rotation                      Transl. direction
Param.            α21            α31            t21            t31
reference         0°             0°             0°             -26.5°
manual            -1.34°         -0.90°         0.65°          -26.35°
automatic (std)   -1.38° (0.07)  -0.94° (0.07)  0.68° (0.13)   -26.44° (0.17)
Fig. 6. Matches with more separated images than the couples obtained in the Pyramidal matching - Room A (A01-A07) - 26 matches / 7 wrong.
We have compared this approach with the Pyramidal matching method for selecting the most similar image in the previous experiments of this section, and we noticed that the pyramidal method seems more suitable. The correctness of the results was quite similar for both methods. Using Data-set 2 the percentage of good choices was the same for both, but the percentage when checking the 3 first images chosen was worse with individual matching. However, with the Almere data the individual matching similarity measure performed a little better. As the performance seems similar but the pyramidal method has the advantage of linear complexity in the number of features (around 25% less execution time in the experiments), we decided to use it for this step.
B. Line Matching Results
Here we show the performance of the two-view line matching algorithm developed for the last step of the process. In these experiments we obtained individual matches between each image and the most similar one selected by the Pyramidal matching (using the results from the experiments of the previous Section IV-A, where a current image was compared against the rest of the VM). We got between 6 and 35 matches in the different tests, depending on the rooms, of which only an average of 10% were wrong.
The better the performance we get with wide-baseline images, the lower the density needed in the visual memory of each room. So this step of getting wide-baseline matches is the key point for being able to work with a minimal database. In Fig. 6 we can see an example of the line matches using more separated images, not with the image that was selected as most similar but with a further one.
C. Metric Localization Results
As explained before, we need three views to get the localization. Therefore, once we have two-view matches between the current image and the most similar one selected, and between this one and one of its neighbours in the VM, we get three-view matches from the intersection of those sets. We robustly estimate the 1D radial tensor, so we also get a robust set of three-view matches (those which fit the geometry of the tensor). In Fig. 7 there are some typical results obtained after applying the whole method to some examples from the Almere data-sets. In Almere 1∝1 the query image is from round 1, while in Almere 4∝1 it is from round 4. The localization parameters seem stable and consistent; however, we did not have ground truth to evaluate them. We show in Table II the metric localization results of a similar test done with images from Data-set 2, where we had ground truth. Notice the small errors, around 1°, especially if we take into account the uncertainty of our ground truth, which was obtained with a metric tape and goniometer.
V. CONCLUSIONS
In this work, we have proposed a hierarchical mobile robot
localization method. Our three-step approach uses omnidirectional images and combines topological (which room) and
metric localization information. The localization task computes the position of the robot relative to a grid of training
images in different rooms. After a rough selection of a number
of candidate images based on a global color descriptor, the best
resembling image is chosen based on a pyramidal matching
Fig. 7. Examples of triple robust matches obtained between the current position (first column image) and 2 reference images (second and third columns), together with the localization parameters obtained.
Almere 1∝1 (query Almere1LQ00005, references Almere1LQ00010 and Almere1LQ00020): 19 initial matches with 11 wrong; after robust estimation, 12 robust matches, 4 wrong. Localization parameters: α21 = 4°, α31 = 16°, t21 = 45°, t31 = 44°.
Almere 4∝1 (query Almere4LQ01125, references Almere1LQ00500 and Almere1LQ00510): 15 initial matches with 5 wrong; after robust estimation, 12 robust matches, 3 wrong. Localization parameters: α21 = 161°, α31 = 150°, t21 = 114°, t31 = 132°.
with wide baseline local line features. This enables the grid of training images to be less dense. A third step involving the computation of the omnidirectional trifocal tensor yields the metric coordinates of the unknown robot position. This allows accurate localization with a minimal reference dataset, in contrast to approaches that give as the current localization the location of one of the reference views (where the topological information is correct, but the metric estimate can have a quite high error). Our experiments, with two different data-sets of omnidirectional images, show the efficiency and good accuracy of this approach.
REFERENCES
[1] Y. Nishizawa Y. Yagi and M. Yachida. Map-based navigation for a
mobile robot with omnidirectional image sensor copis. In IEEE Trans.
Robotics and Automation, pages 634–648, October 1995.
[2] Chu-Song Chen, Wen-Teng Hsieh, and Jiun-Hung Chen. Panoramic
appearance-based recognition of video contents using matching graphs.
IEEE Trans. on Systems Man and Cybernetics, Part B, 34(1):179–199,
2004.
[3] J. Gaspar, N. Winters, and J. Santos-Victor. Vision-based navigation and
environmental representations with an omnidirectional camera. IEEE
Trans. on Robotics and Automation, 16(6):890–898, 2000.
[4] E. Menegatti, T. Maeda, and H. Ishiguro. Image-based memory for robot
navigation using properties of the omnidirectional images. In Robotics
and Autonomous Systems, Elsevier, Vol. 47, Iss. 4 , pp. 251-267, 2004.
[5] K. Grauman and T. Darrell. The pyramid match kernels: Discriminative
classification with sets of image features. In Proc. of the IEEE Int. Conf.
on Computer Vision, 2005.
[6] C. Sagüés, A.C. Murillo, J.J. Guerrero, T. Goedemé, T. Tuytelaars,
and L. Van Gool. Localization with omnidirectional images using the
radial trifocal tensor. In Proc. of the IEEE Int. Conf. on Robotics and
Automation, 2006.
[7] Olivier Faugeras, Long Quan, and Peter Sturm. Self-calibration of a
1d projective camera and its application to the self-calibration of a 2d
projective camera. In Proc. of the 5th European Conference on Computer
Vision, Freiburg, Germany, pages 36–52, June 1998.
[8] K. Åström and M. Oskarsson. Solutions and ambiguities of the structure
and motion problem for 1d retinal vision. Journal of Mathematical
Imaging and Vision, 12:121–135, 2000.
[9] S. Thirthala and M. Pollefeys. The radial trifocal tensor: A tool for
calibrating the radial distortion of wide-angle cameras. In Proc. of
Computer Vision Pattern Recognition, 2005.
[10] David G. Lowe. Distinctive image features from scale-invariant keypoints. Int. Journal of Computer Vision, 60(2):91–110, 2004.
[11] J.B. Burns, A.R. Hanson, and E.M. Riseman. Extracting straight lines.
IEEE Trans. on Pattern Analysis and Machine Intelligence, 8(4):425–
455, 1986.
[12] J.J. Guerrero and C. Sagüés. Camera motion from brightness on
lines. combination of features and normal flow. Pattern Recognition,
32(2):203–216, 1999.
[13] F. Mindru, T. Tuytelaars, L. Van Gool, and T. Moons. Moment invariants
for recognition under changing viewpoint and illumination. Computer
Vision and Image Understanding, 94(1-3):3–27, 2004.
[14] K.R. Rao and P. Yip. Discrete Cosine Transform: Algorithms, Advantages, Applications. Academic Press, Boston, 1990.
[15] H. Bay, V. Ferrari, and L. Van Gool. Wide-baseline stereo matching
with line segments. In Proc. of the IEEE Conf. on Computer Vision and
Pattern Recognition, volume I, June 2005.
...
Virtual Sensors for Human Concepts - Building Detection by an Outdoor Mobile Robot
Martin Persson1, Tom Duckett2 and Achim Lilienthal1
1 Center for Applied Autonomous Sensor Systems, Dept. of Technology, Örebro University, Sweden
{martin.persson,achim.lilienthal}@tech.oru.se
2 Department of Computing and Informatics, University of Lincoln, Lincoln, UK
tduckett@lincoln.ac.uk
Abstract— In human-robot communication it is often important to relate robot sensor readings to concepts used by humans.
We suggest to use a virtual sensor (one or several physical
sensors with a dedicated signal processing unit for recognition
of real world concepts) and a method with which the virtual
sensor can be learned from a set of generic features. The virtual
sensor robustly establishes the link between sensor data and a
particular human concept. In this work, we present a virtual
sensor for building detection that uses vision and machine
learning to classify image content in a particular direction
as buildings or non-buildings. The virtual sensor is trained
on a diverse set of image data, using features extracted from
gray level images. The features are based on edge orientation,
configurations of these edges, and on gray level clustering. To
combine these features, the AdaBoost algorithm is applied.
Our experiments with an outdoor mobile robot show that
the method is able to separate buildings from nature with a
high classification rate, and extrapolate well to images collected
under different conditions. Finally, the virtual sensor is applied
on the mobile robot, combining classifications of sub-images
from a panoramic view with spatial information (location and
orientation of the robot) in order to communicate the likely
locations of buildings to a remote human operator.
Index Terms— Automatic building detection, virtual sensor,
vision, AdaBoost, Bayes classifier
I. INTRODUCTION
The use of human spatial concepts is very important in,
e.g., robot-human communication. Skubic et al. [1] discussed
the benefits of linguistic spatial descriptions for different
types of robot control, and pointed out that this is especially
important when there are novice robot users. In those situations it is necessary for the robot to be able to relate its
sensor readings to human spatial concepts. To enable human
operators to interact with mobile robots in, e.g., task planning,
or to allow the system to combine data from external sources,
semantic information is of high value. We believe that virtual sensors can facilitate robot-human communication. We
define a virtual sensor as one or several physical sensors
with a dedicated signal processing unit for recognition of
real world concepts. As an example, this paper describes
a virtual sensor for building detection using methods for
classification of views as buildings or nature based on vision.
The purpose of this is to detect one type of very distinct object that is often used in, e.g., textual descriptions of route directions. The suggested method to obtain a virtual sensor
for building detection is based on learning a mapping from a
set of possibly generic features to a particular concept. It is
therefore expected that the same method can be extended to
virtual sensors for representation of other human concepts.
Many systems for building detection, both for aerial and
ground-level images, use line and edge related features.
Building detection from ground-level images often uses the
fact that, in many cases, buildings show mainly horizontal
and vertical edges. In nature, on the other hand, edges tend
to have more randomly distributed orientations. Inspection
of histograms based on edge orientation confirms this observation. Histograms of edge direction in different scales can
be classified by, e.g., support vector machines [2]. Another
method, developed for building detection in content-based
image retrieval uses consistent line clusters with different
properties [3]. This clustering is based on edge orientation,
edge colors, and edge positions. For more references on
ground-level building detection, see [2].
This paper presents a virtual sensor for building detection.
We use AdaBoost for learning a classifier for classification
of close range monocular gray scale images into ‘buildings’
and ‘nature’. AdaBoost has the ability to select the best so-called weak classifiers out of many features. The selected
weak classifiers are linearly combined to produce one strong
classifier. Bayes Optimal Classifier, BOC, is used as an alternative classifier for comparison. BOC uses the variance and
covariance of the features in the training data to weight the
importance of each feature. The proposed method combines
different types of features such as edge orientation, gray level
clustering, and corners into a system with high classification
rate. The method is applied on a mobile robot as a virtual
sensor for building detection in an outdoor environment and
can be extended to other classes, such as windows and doors.
The paper is organized as follows. Section II describes
the feature extraction. AdaBoost is presented in Section III
and Bayes classifier is presented in Section IV. In Section
V the used image sets, the description of the training phase,
and some properties of the weak classifiers are presented.
Section VI shows the results from the performance evaluation
and Section VII describes the virtual sensor for building
detection. Finally, conclusions are given in Section VIII.
II. FEATURE EXTRACTION
We select a large number of image features, divided into three groups, that we expect can capture the properties of man-made structures. The most obvious indication of man-made structures, especially buildings, is that they have a high content of vertical and horizontal edges. The first type of features uses this property. The second type of features combines the edges into more complex structures such as corners. The third type of features is based on the assumption that buildings often contain surfaces with constant gray level.
The features that we calculate for each image are numbered
1 to 24. All features except 9 and 13 are normalized in order
to avoid scaling problems. Here, the features were selected
with regard to a particular virtual sensor, but one could also
use a generic set of features for different virtual sensors.
A. Edge Orientation
For edge detection we use Canny’s edge detector [4]. It
includes a Gaussian filter and is less likely than others to
be fooled by noise. A drawback is that the Gaussian filter
can distort straight lines. For line extraction in the edge
image an algorithm implemented by Peter Kovesi [5] was
used. This algorithm includes a few parameters that have
been optimized empirically. The absolute values of the line’s
orientation are calculated and used in different histograms.
The features based on edge orientation are:
1) 3-bin histogram of absolute edge orientation values.
2) 8-bin histogram of absolute edge orientation values.
3) Fraction of vertical lines out of the total number.
4) Fraction of horizontal lines.
5) Fraction of non-horizontal and non-vertical lines.
6) As 1) but based on edges longer than 20% of the
longest edge.
7) As 1) but based on edges longer than 10% of the
shortest image side.
8) As 1) but weighted with the lengths of the edges.
The 3-bin histogram has limits of [0 0.2 1.37 1.57] and
the 8-bin histogram [0 0.2 . . . 1.4 1.6] radians. Values for
the vertical (3), horizontal (4) and intermediate orientation
lines (5) are taken from the 3-bin histogram and normalized
with the total number of lines. Features 6, 7, and 8 try to
eliminate the influence from short lines.
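The following minimal sketch (our own reading, not the authors' code) illustrates features 1 and 3-5: absolute line orientations are folded into [0, pi/2], histogrammed with the 3-bin limits stated above, and the vertical, horizontal and intermediate fractions are read from the normalized bins.

import numpy as np

def orientation_features(segments):
    # segments: iterable of ((x1, y1), (x2, y2)) line endpoints.
    segs = np.asarray(segments, dtype=float)
    dx = segs[:, 1, 0] - segs[:, 0, 0]
    dy = segs[:, 1, 1] - segs[:, 0, 1]
    theta = np.abs(np.arctan2(dy, dx))          # absolute orientation in [0, pi]
    theta = np.minimum(theta, np.pi - theta)    # fold to [0, pi/2]
    theta = np.minimum(theta, 1.57)             # keep exactly vertical lines in range
    hist3, _ = np.histogram(theta, bins=[0.0, 0.2, 1.37, 1.57])
    hist3 = hist3 / max(len(theta), 1)          # feature 1: normalized 3-bin histogram
    horizontal, intermediate, vertical = hist3  # features 4, 5 and 3
    return hist3, vertical, horizontal, intermediate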
B. Edge Combinations
The lines defined above can be combined to form corners
and rectangles. The features based on these combinations are:
9) Number of right-angled corners.
10) 9) divided by the number of edges.
11) Share of right-angled corners with direction angles
close to 45◦ + n · 90◦ , n ∈ 0, . . . , 3.
12) 11) divided by the number of edges.
13) The number of rectangles.
14) 13) divided by the number of corners.
We define a right-angled corner as two lines with close
end points and 90◦ ± βdev angle in between. During the
experiments βdev = 20◦ was used. Features 9 and 10 are the
number of corners. For buildings with vertical and horizontal
lines from doors and windows, the corners most often have
a direction of 45◦ , 135◦ , 225◦ and 315◦ , where the direction
is defined as the ‘mean’ value of the orientation angle for the
respective lines. This is captured in features 11 and 12. From
the lines and corners defined above rectangles representing,
e.g., windows are detected. We allow corners to be used
multiple times to form rectangles with different corners. The
number of rectangles is stored in features 13 and 14.
C. Gray Levels
Buildings are often characterized by large homogeneous
areas in the facades, while nature images often show larger
variation. Other areas in images that can also be homogeneous are, e.g., roads, lawns, water and the sky. Features 15
to 24 are based on gray levels. We use a 25-bin gray level
histogram, normalized with the image size and sum up the
largest bins. This type of feature works globally in the image.
To find local areas with homogeneous gray levels we search
for the largest connected areas within the same gray level.
Based on the gray level histogram, we calculate the largest
regions of interest that are 4-connected. The features based
on gray levels are:
15) Largest value in gray level histogram.
16) Sum of the 2 largest values in gray level histogram.
17) Sum of the 3 largest values in gray level histogram.
18) Sum of the 4 largest values in gray level histogram.
19) Sum of the 5 largest values in gray level histogram.
20) Largest 4-connected area.
21) Sum of the 2 largest 4-connected areas.
22) Sum of the 3 largest 4-connected areas.
23) Sum of the 4 largest 4-connected areas.
24) Sum of the 5 largest 4-connected areas.
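As an illustration, a minimal sketch (with assumptions of ours, e.g. 8-bit gray values and scipy's default 4-connectivity; not the authors' code) of features 15-24: the largest bins of a normalized 25-bin gray level histogram and the largest 4-connected regions of pixels falling into the same gray level bin.

import numpy as np
from scipy import ndimage

def gray_level_features(img, n_bins=25, n_top=5):
    img = np.asarray(img)
    npix = img.size
    hist, edges = np.histogram(img, bins=n_bins, range=(0, 255))
    hist = np.sort(hist / npix)[::-1]                        # normalized, descending
    feats = [hist[:k].sum() for k in range(1, n_top + 1)]    # features 15-19
    quantized = np.digitize(img, edges[1:-1])                # gray level bin per pixel
    areas = []
    for b in range(n_bins):
        labels, n = ndimage.label(quantized == b)            # 4-connected components
        if n:
            areas.extend(np.bincount(labels.ravel())[1:])
    areas = np.sort(np.asarray(areas, dtype=float) / npix)[::-1]
    feats += [areas[:k].sum() for k in range(1, n_top + 1)]  # features 20-24
    return feats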
III. ADABOOST
AdaBoost is the abbreviation for adaptive boosting. It was
developed by Freund and Schapire [6] and has been used
in diverse applications, e.g., as classifiers for image retrieval
[7]. In mobile robotics, AdaBoost has, e.g., been used in ball
tracking for soccer-robots [8] and to classify laser scans for
learning of places in indoor environments [9]. This work is
a nice demonstration of using machine learning and a set
of generic features to transform sensor readings to human
spatial concepts.
The main purpose of AdaBoost is to produce a strong classifier by a linear combination of weak classifiers, where weak
means that the classification rate has to be only slightly better
than 0.5 (better than guessing). The principle of AdaBoost
is as follows (see [10] for a formal algorithm). The input
to the algorithm is a number, N , of positive (buildings) and
negative (nature) examples. The training phase is a loop. For
each iteration t, the best weak classifier ht is calculated and
a distribution D t is recalculated. The boosting process uses
D t to increase the weights of the hard training examples in
order to focus the weak learners on the hard examples.
The general AdaBoost algorithm does not include rules
on how to choose the number of iterations T of the training
loop. The training process can be aborted if the distribution
D t does not change, otherwise the loop runs through the
manually determined T iterations. Boosting is known to be
not particularly prone to the problem of overfitting [10]. We
used T = 30 for training and did not see any indications of
overfitting when evaluating the performance of the classifier
on an independent test set.
The weak classifiers in AdaBoost use single value features.
To be able also to handle feature arrays from the histogram
data, we have chosen to use a minimum distance classifier,
MDC, to calculate a scalar weak classifier. We use Dt to bias
the hard training examples by including it in the calculation
of a weighted mean value for the MDC prototype vector:
m_{l,k,t} = ( Σ_{n=1...N, y_n = k} f(n, l) D_t(n) ) / ( Σ_{n=1...N, y_n = k} D_t(n) )
where ml,k,t is the mean value array for iteration t, class
k, and feature l and yn is the class of the nth image. The
features for each image are stored in f (n, l) where n is the
image number. For evaluation of the MDC at iteration t, a
distance value dk,l (n) for each class k (building and nature)
is calculated as
d_{k,l}(n) = || f(n, l) − m_{l,k,t} ||
and the shortest distance for each feature l indicates the
winning class for that feature.
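To make the combination of AdaBoost and the MDC weak classifiers concrete, here is a minimal sketch (not the authors' implementation; the data layout and the per-round selection of a single best feature are our assumptions): in every round the class prototypes are the Dt-weighted means of one feature, the feature with the lowest weighted error is kept, and Dt is re-weighted towards the hard examples.

import numpy as np

def mdc_predict(X, l, protos):
    # Minimum distance classifier on feature l with prototypes for +1 / -1.
    m_pos, m_neg = protos
    out = np.empty(len(X))
    for n, x in enumerate(X):
        f = np.asarray(x[l], dtype=float)
        out[n] = 1.0 if np.linalg.norm(f - m_pos) < np.linalg.norm(f - m_neg) else -1.0
    return out

def train_adaboost(X, y, n_features, T=30):
    # X[n][l]: scalar or small array (histogram) feature l of image n; y[n] in {-1, +1}.
    y = np.asarray(y, dtype=float)
    D = np.ones(len(X)) / len(X)                            # initial distribution
    strong = []
    for _ in range(T):
        best = None
        for l in range(n_features):
            feats = np.array([np.asarray(x[l], dtype=float) for x in X])
            protos = tuple((D[y == k] @ feats[y == k]) / D[y == k].sum() for k in (1, -1))
            h = mdc_predict(X, l, protos)
            err = D[h != y].sum()
            if best is None or err < best[0]:
                best = (err, l, protos, h)
        err, l, protos, h = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)               # weight of this weak classifier
        D *= np.exp(-alpha * y * h)                         # emphasize hard examples
        D /= D.sum()
        strong.append((alpha, l, protos))
    return strong

def predict(strong, x):
    s = sum(a * mdc_predict([x], l, p)[0] for a, l, p in strong)
    return 1 if s >= 0 else -1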
IV. BAYES CLASSIFIER
It is instructive to compare the result from AdaBoost with
another classifier. For this we have used Bayes Classifier, see
e.g. [11] for a derivation. Bayes Classifier, or Bayes Optimal
Classifier, BOC, as it is sometimes called, classifies normally
distributed data with a minimum misclassification rate. The
decision function is
1
1
dk (x) = ln P (wk )− ln |C k |− [(x−mk )T C −1
k (x−mk )]
2
2
where P (wk ) is the prior probability (set to 0.5), mk is the
mean vector of class k, and C k is the covariance matrix of
class k calculated on the training set, and x is the feature
value to be classified.
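A minimal sketch of this decision function, with equal priors P(wk) = 0.5 as stated above and class means and covariances estimated from the training set (a generic implementation of ours, not the authors' code):

import numpy as np

def fit_boc(X, y):
    # X: (N, d) training features, y: (N,) class labels; equal priors P(wk) = 0.5.
    model = {}
    for k in np.unique(y):
        Xk = X[y == k]
        model[k] = (np.log(0.5), Xk.mean(axis=0), np.cov(Xk, rowvar=False))
    return model

def boc_predict(model, x):
    scores = {}
    for k, (log_prior, m, C) in model.items():
        diff = x - m
        scores[k] = (log_prior
                     - 0.5 * np.log(np.linalg.det(C))
                     - 0.5 * diff @ np.linalg.solve(C, diff))
    return max(scores, key=scores.get)    # class with the largest discriminant wins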
Not all of the defined features can be used in BOC. Linear
dependencies between features give numerical problems in
the calculation of the decision function. Therefore normalized
histograms can not be used, hence features 1, 2, 6, 7, and 8
were not considered. The set of features used in BOC was
3, 4, 9-15, 17, 20, 23. This set was constructed by starting
with the best individual feature (see Figure 3, Section V-C)
and adding the second best feature etc., while observing the
condition value of the covariance matrices.
V. EXPERIMENTS
A. Image Sets
We have used three different sources for the collection of
nature and building images used in the experiments. In Set
1, we used images taken by an ordinary consumer digital
camera. These were taken over a period of several months in
our intended outdoor environment. Our goal with the system
is to classify images taken by a mobile robot. In Set 2, we
therefore stored images from manually controlled runs with
a mobile robot, performed on two different occasions. Set 1
and 2 are disjoint in the sense that they do not contain
images of the same buildings or nature.
In order to verify the system performance with an independent set of images, Set 3 contains images that have been
downloaded from the Internet using Google’s Image Search.
For buildings the search term building was used. The first
50 images with a minimum resolution of 240 × 180 pixels
containing a dominant building were downloaded. For nature
images, the search terms nature (15 images), vegetation (20
images), and tree (15 images), were used. Only images that
directly applied to the search term and were photos of reality
(no arts or computer graphics) were used. Borders and text
around some images were removed manually. Table I presents
the different sets of images and the number of images in each
set.
All images have been converted to gray scale and stored in two different resolutions (maximum side length 120 pixels and 240 pixels, referred to as size 120 and 240 respectively). In this way we can compare the performance at different resolutions, for instance the classification rate and the robustness to scale changes. A benefit of using low resolution images is faster computation. Examples of images from Set 1 and 2 are shown in Figure 1.
TABLE I
NUMBER OF COLLECTED IMAGES. THE DIGITAL CAMERA IS A 5 MPIXEL SONY (DSC-P92) AND THE MOBILE CAMERA IS AN ANALOGUE CAMERA MOUNTED ON AN IROBOT ATRV-JR.

Set  Origin           Area       Buildings  Nature
1    Digital camera   Urban      40         40
2    Mobile camera    Campus     66         24
3    Internet search  Worldwide  50         50
Total number                     156        114
B. Test Description
Four tests have been defined for evaluation of our system.
Test 1 shows whether it is possible to collect training data
with a consumer camera and use this for training of a classifier that is evaluated on the intended platform, the mobile robot. Test 2 shows the performance of the classifier in the environment that it is designed for. Test 3 shows how well the learned model, trained with local images, extrapolates to images taken around the world. Test 4 evaluates the performance on the complete collection of images. Table II summarizes the test cases. These tests have been performed with AdaBoost and BOC separately for each of the two image sizes. For Test 2 and 4, a random function is used to select the training partition and the images not selected are used for the evaluation of the classifiers. This was repeated Nrun times.

TABLE II
DESCRIPTION OF DEFINED TESTS (Nrun IS THE NUMBER OF RUNS).

No.  Nrun  Train Set        Test Set
1    1     1                2
2    100   90% of {1,2}     10% of {1,2}
3    1     {1,2}            3
4    100   90% of {1,2,3}   10% of {1,2,3}

Fig. 1. Example of images used for training. The uppermost row shows buildings in Set 1. The following rows show buildings in Set 2, nature in Set 1, and nature in Set 2, respectively.

Fig. 2. Histogram describing the feature usage by AdaBoost in Test 2 as an average of 100 runs, using image size 120 (upper) and 240 (lower).

Fig. 3. Histogram of classification rate of individual features in Test 2 as an average of 100 runs with T = 1, image size 120 (upper) and 240 (lower).

C. Analysis of the Training Results
AdaBoost can compute multiple weak classifiers from the same features by means of a different threshold, for example. Figure 2 presents statistics on the usage of different features in Test 2. The feature most often used for image size 240 is the orientation histogram (2). For image size 120, features 2, 8, 13 and 14 dominate. Figure 3 shows how each individual feature manages to classify images in Test 2. Several of the histograms based on edge orientation are in themselves close to the result achieved for the classifiers presented in the next section. Comparing Figure 2 and Figure 3 one can note that several features with high classification rates are not used by AdaBoost to the expected extent, e.g., features 1, 3, 4, and 5. This can be caused by the way in which the distribution Dt is updated. Because the importance of correctly classified examples is decreased after a particular weak classifier is added to the strong classifier, similar weak classifiers might not be selected in subsequent iterations.
As a comparison to the test results presented in Section VI, the result obtained on the training data using combinations of image sets is also presented in Table III.

VI. RESULTS
Training and evaluation have been performed for the tests
specified in Table II for features extracted both from images
of size 120 and 240. The result is presented in Table IV
and V respectively. The tables show the mean value of the
total classification rate, its standard deviation, and the mean
value of the classification rates for building images and nature
images separately. Results from both AdaBoost and BOC
using the same training and testing data are given.
Test 1 shows a classification rate of over 92% for image
size 240. This shows that it is possible to build a classifier
based on digital camera images and achieve very good results
for classification of images from our mobile robot, even
though Set 1 and 2 have structural differences, see Section
V-A.
TABLE III
RESULTS ON THE TRAINING IMAGE SETS IN TABLE I.

Sets    Size  Classifier  Build. [%]  Nat. [%]  Total [%]
1       120   AdaBoost    100.0       100.0     100.0
              BOC         100.0       100.0     100.0
1,2     120   AdaBoost    97.2        100.0     98.2
              BOC         95.3        93.8      94.7
1,2,3   120   AdaBoost    89.7        94.7      91.9
              BOC         86.5        94.7      90.0
1       240   AdaBoost    100.0       100.0     100.0
              BOC         100.0       100.0     100.0
1,2     240   AdaBoost    100.0       100.0     100.0
              BOC         98.1        100.0     98.8
1,2,3   240   AdaBoost    98.7        99.1      98.9
              BOC         95.5        98.2      96.7
TABLE IV
RESULTS FOR TEST 1-4 USING IMAGES WITH SIZE 120.

Test no.  Classifier  Build. [%]  Nat. [%]  Total [%]
1         AdaBoost    81.8        91.7      84.4
          BOC         93.9        58.3      84.4
2         AdaBoost    93.0        91.8      92.6 ± 5.8
          BOC         95.7        89.0      93.4 ± 5.5
3         AdaBoost    68.0        90.0      79.0
          BOC         72.0        74.0      73.0
4         AdaBoost    86.6        89.8      87.9 ± 6.2
          BOC         86.4        88.5      87.3 ± 6.0
Test 2 is the most interesting test for us. This uses images
that have been collected with the purpose of training and
evaluating the system in the intended environment for the
mobile robot. This test shows high (and highest) classification
rates. For both AdaBoost and BOC they are around 97%
using the image size 240.
Figure 4 shows the distribution of wrongly classified
images for AdaBoost compared to BOC. It can be noted
that for image size 120 several images give both classifiers
problems, while for image size 240 different images cause
problems.
Test 3 is the same type of test as Test 1. They both train
on one set of images and then validate on a different set.
TABLE V
RESULTS FOR TEST 1-4 USING IMAGES WITH SIZE 240.

Test no.  Classifier  Build. [%]  Nat. [%]  Total [%]
1         AdaBoost    89.4        100.0     92.2
          BOC         95.5        87.5      93.3
2         AdaBoost    96.1        98.3      96.9 ± 4.3
          BOC         98.1        95.7      97.2 ± 4.0
3         AdaBoost    88.0        94.0      91.0
          BOC         90.0        82.0      86.0
4         AdaBoost    94.1        95.5      94.6 ± 3.8
          BOC         94.8        93.4      94.2 ± 4.7
Fig. 4. Distribution of the 20 most frequently wrongly classified images from AdaBoost (gray) and BOC (white), using image size 120 (upper) and 240 (lower).
Test 3 shows lower classification rates than Test 1 with the
best result for AdaBoost using image size 240. This is not
surprising since the properties of the downloaded images
differ from the other image sets. The main difference between
the image sets is that the buildings in Set 3 often are larger
and located at a greater distance from the camera. The same
can be noted in the nature images, where Set 3 contains a
number of landscape images that do not show close range
objects. The conclusion from this test is that the classification
still works very well and that AdaBoost generalizes better
than BOC.
The result from Test 4 is compared with the result of
Test 2. We can note that the classification rate is lower for
Test 4, especially for image size 120. Investigation of the
misclassified images in Test 4 shows that the share belonging
to image Set 3 (Internet) is large. For both image sizes 60%
of the misclassified images came from Set 3.
To show scale invariance we trained two classifiers on Test
2 with images of size 120 and evaluated them with images of
size 240 and vice versa. The result is presented in Table VI
and should be compared to Test 2 in Tables IV and V. The
conclusion from this test is that the features we use have scale
invariant properties over a certain range and that AdaBoost
shows significantly better scale invariance than BOC, which
again demonstrates AdaBoost’s better extrapolation capability.
VII. VIRTUAL SENSOR FOR BUILDING DETECTION
We have used the learned building detection algorithm to
construct a virtual sensor. This sensor indicates the presence
of buildings in different directions related to a mobile robot.
In our case we let the robot perform a sweep with its camera
(±120◦ in relation to its heading) at a number of points along
its track. The images are classified into buildings and ‘nature’
TABLE VI
RESULTS FOR TEST 2 USING TRAINING WITH IMAGES SIZED 120 AND TESTING WITH IMAGES SIZED 240, AND VICE VERSA.

Train  Test  Classifier  B. [%]  N. [%]  Total [%]
120    240   AdaBoost    94.2    96.7    95.1 ± 4.2
             BOC         93.0    94.3    93.5 ± 5.3
240    120   AdaBoost    95.1    90.8    93.6 ± 6.0
             BOC         100.0   44.8    80.5 ± 6.7
Fig. 5. Classification of images used as a virtual sensor pointing at the two
classes (blue arrows indicate buildings and red lines non-buildings).
(or non-buildings) using AdaBoost trained on set 1. The
experiments were performed using a Pioneer robot equipped
with GPS and a camera on a PT-head. Figure 5 shows the
result of a tour in the Campus area. The blue arrows show
the direction towards buildings and the red lines point toward
non-buildings. Figure 6 shows an example of the captured
images and their classes from a sweep with the camera at the
first sweep point (the lowest leftmost sweep point in Figure
5). This experiment was conducted with yet another camera
and during winter, and the result was qualitatively found to be
convincing. Note that the good generalization of AdaBoost is
expressed by the fact that the classifier was trained on images
taken in a different environment and season.
Fig. 6. Example of one sweep with the camera. The blue arrows point at
images classified as buildings and the red lines point at non-buildings.
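As a purely illustrative sketch (hypothetical interface, not the authors' code), the virtual sensor output shown in Figures 5 and 6 can be thought of as the combination of a per-image classification with the robot pose and pan angle of each image in the sweep:

def building_directions(robot_x, robot_y, robot_heading_deg, sweep,
                        classify=lambda img: 1):
    # sweep: list of (pan_angle_deg, image); classify returns +1 for 'building'.
    readings = []
    for pan_deg, img in sweep:
        bearing = (robot_heading_deg + pan_deg) % 360.0     # global viewing direction
        label = "building" if classify(img) > 0 else "non-building"
        readings.append((robot_x, robot_y, bearing, label))
    return readings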
VIII. CONCLUSIONS
We have shown how a virtual sensor for pointing out
buildings along a mobile robot’s track can be designed
using image classification. A virtual sensor relates the robot
sensor readings to a human concept and is applicable, e.g.,
when semantic information is necessary for communication
between robots and humans. The suggested method using
machine learning and generic image features will make
it possible to extend virtual sensors to a range of other
important human concepts such as cars and doors. To handle
these new concepts, features that capture their characteristic
properties should be added to the present feature set, which
is expected to be reused in the future work.
Two classifiers intended for use on a mobile robot to
discriminate buildings from nature utilizing vision have been
evaluated. The results from the evaluation show that high
classification rates can be achieved, and that Bayes classifier and AdaBoost have similar classification results in the
majority of the performed tests. The number of wrongly
classified images is reduced by about 50% when the higher
resolution images are used. The features that we use have
scale invariant properties over a certain range, as shown by the cross test where we trained the classifier with one image size and tested on another size. The benefits gained from AdaBoost include the highlighting of strong features and its
improved generalization properties over the Bayes classifier.
REFERENCES
[1] Marjorie Skubic, Pascal Matsakis, George Chronis, and James Keller.
Generating multi-level linguistic spatial descriptions from range sensor
readings using the histogram of forces. Autonomous Robots, 14(1):51–
69, Jan 2003.
[2] J. Malobabic, H. Le Borgne, N. Murphy, and N. O’Connor. Detecting
the presence of large buildings in natural images. In Proc. 4th
International Workshop on Content-Based Multimedia Indexing, Riga,
Latvia, June 21-23, 2005.
[3] Yi Li and Linda G. Shapiro. Consistent line clusters for building
recognition in CBIR. In Proc. Int. Conf. on Pattern Recognition,
volume 3, pages 952–957, Quebec City, Quebec, CA, Aug 2002.
[4] John Canny. A computational approach for edge detection. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 8(2):279–
98, Nov 1986.
[5] Peter Kovesi. http://www.csse.uwa.edu.au/∼pk/Research/MatlabFns/,
University of Western Australia, Sep 2005.
[6] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of
Computer and System Sciences, 55(1):119–139, 1997.
[7] Kinh Tieu and Paul Viola. Boosting image retrieval. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition,
Hilton Head Island, South Carolina, June 2000.
[8] André Treptow, Andreas Masselli, and Andreas Zell. Real-time object
tracking for soccer-robots without color information. In European
Conf. on Mobile Robotics (ECMR 2003), Radziejowice, Poland, 2003.
[9] O. Martínez Mozos, C. Stachniss, and W. Burgard. Supervised learning
of places from range data using adaboost. In Proc. of the IEEE
International Conference on Robotics and Automation (ICRA), pages
1742–1747, Barcelona, Spain, April 2005.
[10] Robert E. Schapire. A brief introduction to boosting. In Proc. of the
Sixteenth Int. Joint Conf. on Artificial Intelligence, 1999.
[11] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley,
New York, second edition, 2001.
Learning the sensory-motor loop with neurons: recognition, association, prediction, decision
Nicolas Do Huu and Raja Chatila
LAAS-CNRS, Robotic and Artificial Intelligence Research Group
Toulouse, France
Email: {ndohuu, raja.chatila}@laas.fr
Abstract— This paper introduces a learning framework to provide an autonomous system with recognition, in-sight object localization and decision capabilities. The "Integrate and Fire" neuron model and the global network architecture are presented together with an original associative scheme. In a final part, we propose a predictive and decisional algorithm compatible with the neuron computation process.
I. INTRODUCTION
The way knowledge is represented and manipulated is a fundamental and open problem in the context of building a robotic system that must interact with its environment, take decisions, and even build new knowledge from observations and experiments, that is, an autonomous system. The architecture and algorithms we propose aim to provide a framework for representing knowledge in the uniform language of neuron activations. This paper addresses three main aspects of learning: learning representations from perceptive data, learning to associate these representations to capture their evolution with respect to the robot's actions, and learning to produce actions in order to increase a global reward value using reinforcement learning.
First, we use an artificial neural network to learn and
recognise objects in the robot’s environment. Objects are
considered to be quite small in order to fit in a single image
acquired by the robot’s cameras. The goal of this subsystem
is to build an internal representation of the objects. These
representations are made of neurons which learn to recognise patterns; a competition-based Hebbian learning rule (see [5] and [3]) has been developed to learn representations in an unsupervised way. Neurons are organised in maps, which are two-dimensional arrays of neurons sharing properties such as synaptic weights and incoming connections [4]. The neural network we use is able to pool together the representations of multiple views of the same object using temporal associations: different views of the same object tend to appear close together in time and space and can thus be associated [10][5].
Moreover, we will explain how to use this kind of neural network to learn representations of the object's position and representations of the robot's last actions using proprioceptive data. All the representations activated by the recognition subsystem at a given time are called the perceptive context: object appearance, position and the robot's last actions.
Another learning task aims to associate a reward value called "effect" with each perceptive context and then to learn to produce actions in order to reactivate the contexts which generate the best rewards. This can be done by learning a local reward value (or local effect) associated with each representation (in fact with each neuron). The global reward value is then defined as the sum of all local effects generated by the current perceptive context.
In order to choose the best action, i.e. the action that will bring the largest increase in reward, the system must learn how each action can affect the perceptive context and thus the global reward. When the global reward is affected by an action, the system learns a sensory-motor association in which the action is associated with the modification of the perceptive context it has produced. This "driven-by-reward" associative memory is the third aspect of learning proposed in this paper.
Finally, the sensory-motor associations are used to predict
perceptive contexts that are reachable within several steps.
Then, the action selection process can be done by comparing
the cumulative rewards (utility) of each action and selecting
the most rewarding action in this horizon of time.
The paper is organized as follows. In Section II, we describe the sensory-motor loop organization. In Section III, we introduce our "Integrate and Fire" neuron model and the neural connectivity we use. In Section IV the perceptive part is explained, together with the mechanism for building objective transitions. Section V is dedicated to the active part and the global reinforcement learning mechanism. The predictive decisional process is presented in Section VI. We briefly describe the robot platform and software implementation in Section VII and we finally conclude in Section VIII.
II. THE SENSORY-MOTOR LOOP
The purpose of this architecture is to autonomously extract representations and learn to produce appropriate actions in a given context in order to reach another context which increases the global reward value. A first requirement is to rate perceptive contexts, i.e. to give a score that makes it possible to discriminate them according to some "system optimum". We call this signed measure of optimality the global effect. As representations are learnt over the course of time, and because contexts are composed of sets of these representations, the representation/score associations also need to be learnt. In this learning scheme we call objective representations the "unflavored" descriptions built from sensor data, and subjective representations the objective representation/local effect associations.
Fig. 1. Sensory-motor loop.
The system must extract representations from the environment in order to recognise perceptive contexts. The environment is perceived through sensors as exteroceptive and proprioceptive stimuli. One class of proprioceptive stimuli relates to
the actions performed by the robot. We in fact consider that the
robot has no initial knowledge about the actions it can perform
and thus, they must be learnt. Concurrently, an associative
memory learns sensory-motor associations rated by a value of
the global effect contribution. The learnt associations are then
combined to predict future perceptive contexts while the global
effect contribution is used to choose the most rewarding action.
These mechanisms take place in the global sensory-motor loop
shown in Figure 1 ([1], [2]).
The system is structured in slices of connected Pulsed
Neural Networks (PNN) based on a discrete integrate and fire
model (see Section III). PNNs provide a level of description that allows developing the learning process [3], categorization and association, while avoiding combinatorial explosion.
The functions of the seven subsystems are as follows:
a) Sensors (SEN): is the input of the global system and is the frontier between the environment and the neural space. It is composed of maps of neurons whose purpose is to convert physical values into computable information.
b) Initial Representations (IR): is where the system starts from. This set of neurons is connected to SEN and permits generating the initial variations of the global effect.
c) Context Decomposition (CD): is the categorisation
and recognition engine. Its outputs are high level representations which are extracted by the use of multilayer neural
networks (see Sections III and IV).
d) Learned Associations (LA): receives inputs from CD
and itself. This sub-system has the properties of an associative
memory. It is responsible for learning of new skills and
determines the global system behaviour. Each neuron of LA
is connected to GE for criteria evaluation (see Section V).
e) Elementary Actions and Action Synthesis (AS and ACT): are the action sub-systems, but ACT is also where the prediction and decision take place (see Section VI).
f) Global Effects (GE): is a scoring system which associates a value with each representation. GE represents the criterion the system wants to maximize. When the effects of a representation are negative the system will produce actions to increase the criterion value.
III. NEURONS, CONNECTIVITY AND LOCAL ALGORITHMS
Fig. 2. Global architecture for knowledge and skill acquisition, composed of a set of algorithms hierarchically structured.
The global architecture of the system is composed of several algorithms which are performed concurrently (Figure 2). At the bottom of the hierarchy, the neuron learning mechanism is used to extract representations and learn associations. The network connectivity grants our object recognition process a shift-invariance property, while lateral inhibition avoids multiple learning of the same pattern. The associative memory structure is designed to avoid an all-to-all connection scheme, as three high-level classes of concepts are predefined: object recognition, object position and actions. Finally, a global reinforcement learning algorithm is used for prediction and decision to select the most rewarding action.
A. A Look Into the Map Architecture
A map is a set of local classifiers, or neurons, organized in a
retinotopic way. Every neuron of each map shares the same set
of synaptic weights which represent the memory component
of the learning process. These neurons are then able to detect
their learnt patterns whatever their position on the input map. This property, called shift invariance, results from our network connectivity and is inspired by Fukushima's Neocognitron [4].
B. Neural Competition for Distributed Meanings
As we want our system to autonomously extract representations from sensor data, we must provide our maps with the ability to decide whether or not an input signal needs to be learnt. To this end, we add to the inter-map connectivity a communication channel between neurons situated at the same coordinates on maps belonging to the same layer (a layer is a set of maps connected to the same inputs, see Figure 3). This communication channel is the support for a local competition that permits distributing the extracted meanings among a learning layer in an unsupervised way [5]. Each neuron computes its own detection value and compares it with distant detection
values. Then, a learning step occurs if the local detection value won the competition, which corresponds to a Winner Takes All mechanism.

Fig. 3. Pattern decomposition and shift invariance.

Fig. 4. Maps local competition and specialization.

C. Integration, Firing and Learning

Fig. 5. A map is a collection of neurons sharing the same weights. A neuron is composed as a three layer pipeline.

1) Integration: Given a neuron Ui, we denote βi(t) ∈ [0, 1] its calculated output burst at time t. Let wij(t) ∈ Wi(t) be the learning weight associated to the connection between neurons Ui and Uj, and Ωi+(t) a subset of the afferents Ωi defined as:

W_i(t) = { w_ij(t), U_j ∈ Ω_i }
Ω_i^+(t) = { U_j ∈ Ω_i, β_j(t) > 0 }

We also define two weight sums as follows:

S_i^{β+}(t) = Σ_{U_j ∈ Ω_i^+(t)} w_ij(t)
S_i^+(t) = Σ_{w_ij(t) ∈ W_i^+(t)} w_ij(t)

where W_i^+(t) = { w_ij(t), U_j ∈ Ω_i and w_ij(t) > 0 }. Then the integrated potential Pi at time t + 1, which expresses a level of similarity, is given by:

P_i(t + 1) = α_P P_i(t) + (1 − α_P) S_i^{β+}(t) / S_i^+(t)

where α_P ∈ [0, 1] is a fixed potential leak.
2) Firing: The burst level is thus computed as follows:

β_i(t) = 0 if P_i(t) < T_i, and β_i(t) = F_i(t, P_i(t)) otherwise, with F_i(t, x) = Σ_{0 < b ≤ x} f_i(t, b)

where Ti is a fixed threshold and Fi is an adaptive transfer function whose variations across time are illustrated by Figure 6.b. The transfer function Fi is in fact the associative cumulative distribution at time t and fi is a sampling distribution of burst levels updated at each step (Figure 6.a).

Fig. 6. Burst sampling distribution and cumulative distribution evolution. a) burst sampling distributions computed at steps 0, 500 and 1500. Burst values are discretized within 64 levels. b) corresponding cumulative distribution used as a transfer function for integration.
3) Learning: We use in our model a hebbian-like learning
rule in which neurons tend to learn patterns responsible for
their activations. In other words, a learning process occurs
when one neuron has both fired and won the competition in
its layer. Moreover, we compute a stochastic standard deviation
σij for each weight wij which is computed and used as
coefficient in the learning calculation as follows:
μ_ij(t + 1) = (1 − α_W) μ_ij(t) + α_W |γ_ij(t)|
σ_ij(t + 1)² = (1 − α_W) σ_ij(t)² + α_W [μ_ij(t + 1) − γ_ij(t + 1)]²
w_ij(t + 1) = (1 − α_W) w_ij(t) + α_W [1 − 2 σ_ij(t + 1)] γ_ij(t + 1)
where μij is a stochastic mean. Hence we ensure that weights corresponding to noisy inputs (with high standard deviation) will tend to 0 and then will not take part in the representation. γij is a function of the input bursts of neuron i which takes its values in {−1, 0, 1}. Maps in the same layer are in competition, as shown in Figure 4. If at time t an input burst from neuron j is received, γij(t) = 1; else, if a burst is received from any other afferent neuron, then γij(t) = −1. Finally, if no burst was present, γij(t) = 0.
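The following minimal sketch gives one possible reading of the integrate, fire and learn rules above (it is our interpretation, not the authors' code; in particular the handling of the burst-level distribution fi and the restriction of the numerator sum to positive weights are assumptions):

import numpy as np

class IFUnit:
    """One 'Integrate and Fire' unit with an adaptive transfer function."""
    def __init__(self, n_inputs, alpha_p=0.9, alpha_w=0.01, threshold=0.5, levels=64):
        self.w = np.random.rand(n_inputs) * 0.1    # learning weights w_ij
        self.mu = np.zeros(n_inputs)               # stochastic mean of |gamma|
        self.var = np.zeros(n_inputs)              # stochastic variance of gamma
        self.P = 0.0                               # integrated potential
        self.alpha_p, self.alpha_w, self.T = alpha_p, alpha_w, threshold
        self.f = np.ones(levels) / levels          # sampling distribution of burst levels

    def integrate_and_fire(self, bursts):
        active, pos = bursts > 0, self.w > 0
        s_beta = self.w[active & pos].sum()        # S_i^{beta+}: active positive weights
        s_pos = self.w[pos].sum() + 1e-9           # S_i^{+}: all positive weights
        self.P = self.alpha_p * self.P + (1 - self.alpha_p) * s_beta / s_pos
        if self.P < self.T:                        # below threshold: no burst emitted
            return 0.0
        level = min(int(self.P * len(self.f)), len(self.f) - 1)
        beta = self.f[: level + 1].sum()           # F_i: cumulative distribution of levels
        self.f *= 1 - self.alpha_w                 # update the sampling distribution f_i
        self.f[level] += self.alpha_w
        return beta

    def learn(self, bursts, won_competition):
        if not won_competition:                    # Winner Takes All: losers do not learn
            return
        gamma = np.where(bursts > 0, 1.0, -1.0 if bursts.any() else 0.0)
        self.mu = (1 - self.alpha_w) * self.mu + self.alpha_w * np.abs(gamma)
        self.var = (1 - self.alpha_w) * self.var + self.alpha_w * (self.mu - gamma) ** 2
        # noisy inputs (high deviation) have their weights pushed towards zero
        self.w = (1 - self.alpha_w) * self.w + self.alpha_w * (1 - 2 * np.sqrt(self.var)) * gamma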
IV. LEARNING PERCEPTIVE CONTEXTS AND ACTIONS
We assume here that our robot focuses on a single object only. A perceptive context is then defined by the conjunction of the position of the object, its nature, and all the proprioceptive information defining the robot's state. In this Section we will focus on the autonomous extraction and internal encoding of these contexts, in order for the system to build pertinent sets of representations and then (Section V) associate them with actions to reach its optimality through the global effect reinforcement learning process.
Fig. 7. Context decomposition and associations. Three classes of representations are learnt, from left to right: view-based representations of objects ('What'), representations of the object's position ('Where') and representations of the robot's actions ('How'). These representations are associated for prediction: 'Morphing' and 'Moving' associations. A third association is learnt to represent applicable actions in a given perceptive context: 'Acting'. Referent maps activation mode: MAX: maximum specific map activation; SEN: sensor activation; SUM: sum of specific maps activations.
A. Temporal Binding of Views for Object Learning
In the image-based or view-based model, an object is
represented as a collection of view-specific local features [5],
[6], [7], [8]. Representations are organised in trees in which
a set of view-tuned units constitutes the weighted inputs to
a higher level object-invariant unit. Each unit measures the
similarity between the input image and its stored view, and
the higher level unit computes the weighted sum from its
incoming connections. If the resulting value reaches a given
threshold, then the learned object is recognized. Riesenhuber’s
HMAX model of object recognition in the ventral visual
stream of primates also proposes a similar grouping method
where higher level cells compute the maximum response of
view-tuned cells [9]. In the view-based model framework, the
representations the system produces are not just pixel images
but, thanks to the hierarchical and distributed architecture, can
to some extent reflect the structure of the object. Moreover,
we use the stability of the object localization (temporal coherence) in order to pool together the consecutive views the
robot perceives [10], [11]. A high learning rate grants to a
special map, called referent, the ability to track the object
position during scale and angle modifications. We also give
this referent-map the ability to create view-specific maps when
it discovers an unknown view of the object (like adding a
picture in the collection).
B. Coding the egocentric ’Where’ in a View-Based Approach
How can a robot learn the concept of where objects
are from its own perceptions, since it has neither a map of
its environment nor a geometrical representation of space? To
answer this question we propose an egocentric representation
of space as a mixture of proprioceptive and exteroceptive
stimuli. Firstly, we encode at each step the positions of the pan
and tilt axis with the respective firing positions of two neurons
belonging to two distinct neuron vectors (or 1D neuron maps).
When an object is recognized, a saccade module, which is
autonomous and purely reflexive, is in charge of centering the
object in front of the two cameras by modifying the pan and tilt
positions. Then, a third neuron vector, we call depth vector,
is activated at a position which reflects the shift amount in
position of the detected pattern between left and right images
(stereo-disparity). From these three vectors, and with the use of
the competition process, our system is able to learn, as for light
patterns, the proprioceptive classes of where in sight objects
are. These specialized neurons correspond to the referent-maps
in the ’Where’ layers (see Figure 7). This coding framework
does not take into account out-of-sight positions; however, we
plan to add in future work the ability to encode further
positions with patterns of actions that are required to reach the
object. Then the hard part would be to select relevant actions
when different sets are available.
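The proprioceptive coding described above can be sketched as follows; the ranges, vector sizes and the simple discretisation are assumptions made for this illustration, not the robot's actual parameters:

```python
import numpy as np

# Illustrative sketch: the pan, tilt and stereo-disparity values are each coded
# by the position of the most active unit in a 1D neuron vector.
def encode_where(pan_deg, tilt_deg, disparity_px, n_units=32,
                 pan_range=(-90, 90), tilt_range=(-30, 30), disp_range=(0, 64)):
    def to_index(value, lo, hi):
        value = float(np.clip(value, lo, hi))
        return int(round((value - lo) / (hi - lo) * (n_units - 1)))
    return (to_index(pan_deg, *pan_range),
            to_index(tilt_deg, *tilt_range),
            to_index(disparity_px, *disp_range))

# Example: an object centred at pan = 15 deg, tilt = -5 deg with 20 px disparity
# activates one unit in each of the three 'Where' vectors.
print(encode_where(15.0, -5.0, 20.0))
```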
V. ASSOCIATIONS AND TRANSITIONS
A. ’What’ Transitions: Morphing the World
As explained in Section IV-A, our learning framework is
able to autonomously build a view-based representation of
an object. In Figure 7 we illustrate the resulting ’What’
layer containing its referent-map (object-specific, non view-specific) and a set of specific-maps (view-specific). Then we
call ’Morphing’ transitions the learnt associations of a specific-map with an action (’How’) that induces the recognition of a
specific-map right after the action completion. This resulting
specific-map must belong to the same ’What’ layer and can be
the same as the initial one. This concept is illustrated by Figure
8. On the upper right, we show an example of a memory trace
for the activations of specific-maps belonging to the ’What’
and ’How’ layers which are the input states of this learning
process. Then the initial and final values of synaptic weights
are presented with a gray color scale. And finally we show
the use of the learnt association for prediction. A learning
process occurs when the referent-map of the layer activates at
present time and the global effect variation reaches a threshold.
Then the system learns the associations responsible for a global
effect increment (actions to perform) or decrement (actions to
avoid).
Fig. 8. Learning a ’Morphing’ association for map V2. The figure shows the
’What’ specific-map activations (V1, V2, V3) and ’How’ referent-map activations
(A1, A2, A3) across time, the initial and learnt weight values, and the resulting
predictive model.
Fig. 9. A decision tree is represented as predictive timelines and generated
with ’Morphing’, ’Moving’ and ’Acting’ associations. The table of the figure
summarizes the state correspondences and possible transitions:
State   Perceptive Context   Possible Transitions
S0      (Wa1, We0)           S0.A1 → S2;  S0.A2 → S3
S1      (Wa0, We0)           S1.A0 → S2
S2      (Wa2, We1)           S2.A2 → S0;  S2.A1 → S1
S3      (Wa1, We2)           S3.A1 → S0;  S3.A0 → S2
B. ’Where’ Transitions: Moving the World
In Section IV-B, we presented the competitive building
process for each ’Where’ referent-map. Now we will focus
on one particular ’Where’ layer, its referent-map and the
construction of ’Where’ specific-maps. This process is quite
similar to the one exposed in previous Section. The only
difference is the use of ’Where’ referent-maps, which have
previously learnt a {pan, tilt, depth} pattern, replacing ’What’
specific-maps in the input state of the learning neuron. Then
a ’Moving’ transition associates two positions with an action
able to produce the relative motion from one position to the
other. With the help of the ’Morphing’ and ’Moving’ transitions
the system is then able to predict a future perceptive context
which is a first step for decision making (see Section VI).
C. ’How’ Transitions: Learning Applicable Actions
This last association is quite different from the ’Morphing’
and ’Moving’ ones. The input domain differs since we here use
the association of ’What’ (referent or specific) and ’Where’
(only referent) activation patterns. This learnt perceptive
context can then be viewed as a set of preconditions for the
’How’ referent-map (or action pattern) of the layer. We want
to emphasise that no preexisting knowledge of such a kind is
given to the system, so we need to include this extraction in
the learning process. From the subjective point of view, the
resulting ’How’ specific-map is able to learn its effect
contribution; it is thus possible to associate a local effect
with a sensory-motor association (e.g. red signal + stop =
positive effect, red signal + run = negative effect). These
preconditions can finally be applied in the decision-making
process exposed in the next Section.
VI. PREDICTION AND DECISION
In Sections IV and V we have presented the way the system
builds ’What’, ’Where’ and ’How’ associations using patterns
of neuron activations learnt in memory traces. The point here is
to build a coherent projection of the robot’s possible perceptions
and actions using the previous associations and to choose the
correct actions leading to the maximum increase of the
system’s global effect. In Figure 9, we illustrate this idea by
the use of a decision tree in which each state corresponds
to a perceptive context. State correspondences and possible
transitions are summarized in the table of Figure 9. All of the
information needed to build decision trees comes from the
learnt transitions. Moreover, our system also learns to associate
a score with every state and transition. For a state, the score
is equal to the sum of its Wai and Wej respective local
effects. For a transition, the score is the one acquired during
the learning of the action preconditions as explained in Section
V-C. As neuron activity is restricted to only one ”Integrate and
Fire” phase at each step, and as the building of a decision
tree requires a sequenced process to apply transitions and
respect coherence, we need an algorithm which is able
to build the tree over the course of time. The main idea
is to add new states and/or transitions to the tree during the
time duration needed to complete an action. Moreover, as
the tree is in fact translated as predicted activation timelines
we must annotate each prediction with information reflecting its
position in the tree in order to ”filter” incoherent states when
choosing one specific action. We associate to each state the
most rewarding path leading to it for coherence checking. At
the beginning of a time step, if the observed state is not the one
predicted we totally erase the tree. Then we add the present
observations to the timelines and create new states and/or
transitions with respect to the states and/or transitions created
during the previous step. As the size of the tree increases, we
update all predicted local effects as follows: a state adds its
own learnt effect to the maximum possible effects produced by
all its possible transitions and a transition adds its own effect
to its resulting state’s effect. Then we extract the path in the
tree leading to the maximum reachable effect and generate the
first action. Finally we filter incoherent states.
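The effect propagation described above can be sketched as a simple recursive backup over the expanded decision tree (assumed finite and acyclic); the data structures, names and effect values below are illustrative only:

```python
# Simplified sketch of the effect back-up: a transition adds its own effect to
# its resulting state's effect, and a state adds its learnt effect to the best
# effect reachable through its transitions.
def backup_effect(state, state_effect, transitions):
    """transitions[state] -> list of (action, transition_effect, next_state)."""
    outgoing = transitions.get(state, [])
    if not outgoing:
        return state_effect[state], None
    best_value, best_action = float('-inf'), None
    for action, t_effect, nxt in outgoing:
        value, _ = backup_effect(nxt, state_effect, transitions)
        value += t_effect
        if value > best_value:
            best_value, best_action = value, action
    return state_effect[state] + best_value, best_action

# Example with the states of Figure 9 (the effect values are made up):
state_effect = {'S0': 0.0, 'S1': 0.2, 'S2': 0.5, 'S3': -0.1}
transitions = {'S0': [('A1', 0.1, 'S2'), ('A2', 0.0, 'S3')], 'S2': [], 'S3': []}
print(backup_effect('S0', state_effect, transitions))  # best reachable effect, first action
```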
VII. ROBOT PLATFORM, IMPLEMENTATION AND PRELIMINARY RESULTS
Fig. 10. Athos, a SuperScout 2 robot built by Nomadic Technologies.
Athos (see Figure 10) is a SuperScout 2 robot built by
Nomadic Technologies. It is a 35 cm wide and tall cylinder
equipped with two Sony DFW-VL500 digital cameras mounted
on a Directed Perception pan-tilt unit. It also has 24 sonar
range sensors, which are not used during our experiments. The
robot motion controller and the pan-tilt unit are both connected
to an Apple PowerBook G4 867 MHz for video acquisition,
pan-tilt and motion control via USB ports.
The computational part is based on a pulsed neural network
simulator called NeuSter (neuron-cluster). An event-driven
mechanism is used to simulate thousands of neurons and
millions of synaptic connections in real time. To obtain
maximum performance, the computation can be distributed
over several POSIX machines on the local network.
During our first experiments the neural network has shown
a successful extraction of representations from the perceptive
context: view-based object recognition, position and action
were learnt in an unsupervised way. We then set up an experiment
to test the associative memory in which the system perceived
a negative effect variation when the pan angle moved away from
the central position. The resulting ’Moving’ associations and
global effect reinforcement learning managed to produce the
expected behavior: the robot turns around its central axis to
position itself in front of the object. The next step will be to
set up an experiment involving the object appearance, such as
its color or shape. More results are to come.
VIII. CONCLUSION
We have firstly presented a uniform framework for autonomous
knowledge extraction. We have then shown that an associative
memory can be built by the use of three core concepts: ’What’,
’Where’ and ’How’. Moreover, we provide a predictive and
decisional framework involving the generated associations with
respect to our neurons’ computation capabilities. One next step
would be to make the system learn sophisticated actions in order
to produce long-term predictions involving the object’s appearance.
We also plan to support multiple object localizations by the use
of an additional value we call the phase, in order to link one
recognition with the correct localization.
ACKNOWLEDGMENT
The work described in this paper was partially conducted
within the EU Integrated Project COGNIRON (The Cognitive
Robot Companion) funded by the European Commission
Division FP6-IST Future and Emerging Technologies under
Contract FP6-002020. Nicolas Do Huu is supported by the
European Social Fund.
REFERENCES
[1] W. Paquier and R. Chatila, “An architecture for robot learning,” in
Intelligent Autonomous Systems, 2002, pp. 575–578.
[2] A Uniform Model for Developmental Robotics, 2003.
[3] W. Gerstner and W. Kistler, “Mathematical formulations of hebbian
learning,” Biological Cybernetics, no. 87, pp. 404–415, 2002.
[4] K. Fukushima, “Neocognitron for handwritten digit recognition.” Neurocomputing, vol. 51, pp. 161–180, April 2003.
[5] N. Do Huu, W. Paquier, and R. Chatila, “Combining structural descriptions and image-based representations for image object and scene
recognition,” in International Joint Conference on Artificial Intelligence
(IJCAI), 2005.
[6] T. Poggio and S. Edelman, “A network that learns to recognize three
dimensional objects,” Nature, vol. 343, pp. 263–266, 1990.
[7] S. Ullman, “Three-dimensional object recognition based on the combination of views.” Cognition, vol. 67, pp. 21–44, 1998.
[8] M. Tarr and H. Bülthoff, “Image-based object recognition in man,
monkey and machine.” Cognition, no. 67, pp. 1–20, 1998.
[9] M. Riesenhuber and T. Poggio, “Hierarchical models of object recognition in cortex,” Nature Neuroscience, vol. 2, pp. 1019–1025, 1999.
[10] G. Wallis, “Using spatio-temporal correlations to learn invariant object
recognition,” Neural Networks, vol. 9, no. 9, pp. 1513–1519, 1996.
[11] S. M. Stringer and E. T. Rolls, “Invariant object recognition in the visual
system with novel views of 3d objects,” Neural Computation, vol. 14,
no. 11, pp. 2585–2596, November 2002.
Semantic Labeling of Places using Information
Extracted from Laser and Vision Sensor Data
Óscar Martı́nez Mozos∗ , Axel Rottmann∗ , Rudolph Triebel∗ , Patric Jensfelt† and Wolfram Burgard∗
∗ University of Freiburg, Department of Computer Science, Freiburg, Germany
† Royal Institute of Technology, Center for Autonomous Systems, Stockholm, Sweden
Email: {omartine|rottmann|triebel|burgard}@informatik.uni-freiburg.de, patric@nada.kth.se
Abstract— Indoor environments can typically be divided into
places with different functionalities like corridors, kitchens,
offices, or seminar rooms. The ability to learn such semantic
categories from sensor data enables a mobile robot to extend the
representation of the environment facilitating the interaction with
humans. As an example, natural language terms like “corridor”
or “room” can be used to communicate the position of the
robot in a map in a more intuitive way. In this work, we first
propose an approach based on supervised learning to classify the
pose of a mobile robot into semantic classes. Our method uses
AdaBoost to boost simple features extracted from range data and
vision into a strong classifier. We present two main applications
of this approach. Firstly, we show how our approach can be
utilized by a moving robot for an online classification of the poses
traversed along its path using a hidden Markov model. Secondly,
we introduce an approach to learn topological maps from
geometric maps by applying our semantic classification procedure
in combination with a probabilistic relaxation procedure. We
finally show how to apply associative Markov networks (AMNs)
together with AdaBoost for classifying complete geometric maps.
Experimental results obtained in simulation and with real robots
demonstrate the effectiveness of our approach in various indoor
environments.
I. INTRODUCTION
In the past, many researchers have considered the problem
of building accurate maps of the environment from the data
gathered with a mobile robot. The question of how to augment
such maps by semantic information, however, is virtually
unexplored. Whenever robots are designed to interact with
their users, semantic information about places can improve
the human-robot communication. From the point of view of
humans, terms like “corridor” or “room” give a more intuitive
idea of the position of the robot than using, for example, the
2D coordinates in a map.
In this work, we address the problem of classifying places
of the environment of a mobile robot using range finder and
vision data, as well as building topological maps based on
that knowledge. Indoor environments, like the one depicted
in Figure 1, can typically be divided into areas with different
functionalities such as laboratories, office rooms, corridors, or
kitchens. Whereas some of these places have special geometric
structures and can therefore be distinguished merely based on
laser range data, other places can only be identified according
to the objects found there like, for example, monitors in a
laboratory. To detect such objects, we use vision data acquired
by a camera system.
Fig. 1. The left image shows a map of a typical indoor environment.
The middle image depicts the classification into three semantic classes (corridor,
room, doorway) as colors/grey levels. For this purpose the robot was positioned in each free pose
of the original map and the corresponding laser observations were simulated
and classified. The right images show typical laser and image observations
together with some extracted features, namely the average distance between
two consecutive beams in the laser and the number of monitors detected in
the image.
The key idea is to classify the pose of the robot based on
the current laser and vision observations. Examples for typical
observations obtained in an office environment are shown in
the right images of Figure 1. The classification is then done
applying a sequence of classifiers learned with the AdaBoost
algorithm [18]. These classifiers are built in a supervised
fashion from simple geometric features that are extracted from
the current laser scan and from objects extracted from the
current images as shown in the right images of Figure 1. As
an example, the left image in Figure 1 shows a typical indoor
environment and the middle image depicts the classification
obtained using our method.
We furthermore present two main applications of this approach. Firstly, we show how to classify the different poses
of the robot during a trajectory and improve the final classification using a hidden Markov model. Secondly, we introduce
an approach to learn topological maps from geometric maps
by applying our semantic classification in combination with a
probabilistic relaxation procedure. In this last case we compare the results when using associative Markov networks
(AMNs) with those obtained with AdaBoost.
The rest of this work is organized as follows. Section II
presents related work. In Section III, we describe the sequential AdaBoost classifier. In Section IV, we present the
application of a hidden Markov model to the online place
classification with a moving robot. Section V contains our
approach for topological map building. In Section VI we
present some results when using a range finder with a restricted
field of view. Finally, Section VII presents experimental results
obtained using our methods.
II. RELATED WORK
In the past, several authors considered the problem of adding
semantic information to places. Buschka and Saffiotti [5]
describe a virtual sensor to identify rooms from range data.
Koenig and Simmons [9] apply a pre-programmed routine
to detect doorways. Finally, Althaus and Christensen [1]
use sonar data to detect corridors and doorways. Learning
algorithms have additionally been used to identify objects
in the environment. For example, Anguelov et al. [2], [3]
apply the EM algorithm to cluster different types of objects
from sequences of range data and to learn the state of doors.
Limketkai et al. [12] use relational Markov networks to detect
objects like doorways based on laser range data. Finally,
Torralba and colleagues [23] use hidden Markov models for
learning places from image data.
Compared to these approaches, our algorithm is able to
combine arbitrary features extracted from different sensors to
form a sequence of binary strong classifiers to label places.
Our approach is also supervised, which has the advantage that
the resulting labels correspond to user-defined classes.
On the other hand, different algorithms for creating topological maps have been proposed. Kuipers and Byun [11] extract
distinctive points in the map defined as local maxima of a
distinctiveness measure. Kortenkamp and Weymouth [10] fuse
vision and ultrasound information to determine topologically
relevant places. Shatkay and Kaelbling [19] apply an HMM
learning approach to learn topological maps. Thrun [22] uses
the Voronoi diagram to find critical points, which minimize the
clearance locally. Choset [7] encodes metric and topological
information in a generalized Voronoi graph to solve the SLAM
problem. Additionally, Beeson et al. [4] used an extension of
the Voronoi graph for detecting topological places. Zivkovic
et al. [26] use visual landmarks and geometric constraints
to create a higher level conceptual map. Finally, Tapus and
Siegwart [20] used fingerprints to create topological maps.
In contrast to these previous approaches, the technique
described in this paper applies a supervised learning method to
identify complete regions in the map like corridors, rooms or
doorways that have a direct relation with a human understanding of the environment. This knowledge about semantic labels
of places is used then to build topological maps with a mobile
robot. We also apply associative Markov networks (AMNs)
together with AdaBoost to label each point in a geometric
map.
III. SEMANTIC CLASSIFICATION OF POSES USING ADABOOST
Boosting is a general method for creating an accurate
strong classifier by combining a set of weak classifiers. The
requirement for each weak classifier is that its accuracy is
better than random guessing. In this work we will use the
boosting algorithm AdaBoost in its generalized form presented
by Schapire and Singer [18]. The input to the algorithm is a
set of labeled training examples (xn , yn ), n = 1, . . . , N , where
each xn is an example and each yn ∈ {+1, −1} is a value
indicating whether xn is positive or negative respectively.
In our case, the training examples are composed by laser
and vision observations. In several iterations the algorithm
repeatedly selects a weak classifier using a weight distribution
over the training examples. The final strong classifier is a
weighted majority vote of the best weak classifiers.
Throughout this work, we use the approach presented by
Viola and Jones [25] in which the weak classifiers depend on
single-valued features fj ∈ ℜ. For a more detailed description
see [17].
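As an illustration of the kind of classifier this produces, the sketch below shows a decision-stump weak classifier on a single-valued feature and the strong classifier as a weighted majority vote; this is the generic AdaBoost form, and the interface is an assumption made for the example rather than the exact formulation of [17] or [18]:

```python
# Sketch: a weak classifier is a decision stump on a single-valued feature f_j,
# and the strong classifier is a weighted majority vote of the selected stumps.
def weak_classifier(f_value, theta, polarity):
    """Decision stump on one feature value; polarity flips the inequality."""
    return 1 if polarity * f_value < polarity * theta else -1

def strong_classifier(x, stumps):
    """stumps: list of (feature_fn, theta, polarity, alpha) chosen by AdaBoost."""
    score = sum(alpha * weak_classifier(feature_fn(x), theta, polarity)
                for feature_fn, theta, polarity, alpha in stumps)
    return 1 if score >= 0 else -1
```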
The so far described method is able to distinguish between
two classes of examples, namely positives and negatives. In
practical applications, however, we want to distinguish between more than two classes. To create a multi-class classifier
we use the approach applied by Martı́nez Mozos et al. [14]
and create a sequential multi-class classifier using K − 1
binary classifiers, where K is the number of classes we want
to recognize. The classification output of the decision list is
then represented by a histogram z. Each bin of z stores the
probability that the classified example belongs to the k-th
class. The order of the classifiers in the decision list can be
selected according to different methods as described in [13]
and [14].
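A minimal sketch of such a decision list is given below, assuming each binary classifier returns the probability that the example belongs to "its" class (this interface is an assumption made for the example, not the authors' implementation):

```python
# Sketch of the sequential multi-class classifier: K-1 binary classifiers are
# queried in list order and the probability mass left over after each stage is
# passed on, producing a histogram z over the K classes.
def sequential_classify(x, binary_classifiers):
    """binary_classifiers: list of K-1 functions, each returning the probability
    that x belongs to the class it was trained for."""
    K = len(binary_classifiers) + 1
    z = [0.0] * K
    remaining = 1.0                      # probability mass not yet assigned
    for k, clf in enumerate(binary_classifiers):
        p_k = clf(x)                     # confidence of the k-th binary classifier
        z[k] = remaining * p_k
        remaining *= (1.0 - p_k)
    z[K - 1] = remaining                 # whatever is left goes to the last class
    return z
```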
A. Features from Laser and Vision Data
In this section, we describe the features used to create
the weak classifiers in the AdaBoost algorithm. Our robot
is equipped with a 360 degree field of view laser sensor
and a camera. Each laser observation consists of 360 beams.
Each vision observation consists of eight images which form a
panoramic view. Figure 1 shows a typical laser range reading
as well as one of the images from the panoramic view taken
in an office environment. Accordingly, each training example
for the AdaBoost algorithm consists of one laser observation,
one vision observation, and its classification.
Our method for place classification is based on single-valued
features extracted from laser and vision data. All features are
invariant with respect to rotation to make the classification
of a pose dependent only on the position of the robot and
not on its orientation. Most of our laser features are standard
geometrical features used for shape analysis, such as the one shown
in Figure 1. In the case of vision, the selection of the features is
motivated by the fact that typical objects appear with different
probabilities at different places. For example, the probability
of detecting a computer monitor is larger in an office than
in a kitchen. For each type of object, a vision feature is
defined as a function that takes as argument a panoramic vision
observation and returns the number of detected objects of this
type in it. This number represents the single-valued feature fj
as explained in Section III. As an example, Figure 1 shows
one image of a panoramic view and its detected monitors. A
more detailed list of laser and image features is contained in
our previous work [14].
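Two features of this kind are sketched below for a 360-beam scan; this is illustrative only, and the full feature set is the one described in [14]:

```python
import numpy as np

# Two simple, rotation-invariant geometrical laser features: the average
# difference between consecutive beams, and the perimeter/area ratio of the
# polygon spanned by the beam end points.
def avg_diff_consecutive_beams(scan):
    scan = np.asarray(scan, dtype=float)
    wrapped = np.append(scan, scan[0])          # close the 360 degree scan
    return float(np.abs(np.diff(wrapped)).mean())

def perimeter_over_area(scan):
    """Treat the beam end points as a closed polygon around the robot."""
    scan = np.asarray(scan, dtype=float)
    angles = np.linspace(0.0, 2.0 * np.pi, len(scan), endpoint=False)
    x, y = scan * np.cos(angles), scan * np.sin(angles)
    perimeter = float(np.hypot(np.diff(np.append(x, x[0])),
                               np.diff(np.append(y, y[0]))).sum())
    area = 0.5 * abs(float(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1))))
    return perimeter / area
```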
IV. PROBABILISTIC CLASSIFICATION OF TRAJECTORIES
The approach described so far is able to classify single
observations only but does not take into account past classifications when determining the type of place the robot is currently
at. However, whenever a mobile robot moves through an
environment, the semantic labels of nearby places are typically
identical. Furthermore, certain transitions between classes are
unlikely. For example, if the robot is currently in a kitchen
then it is rather unlikely that the robot ends up in an office
given it moved a short distance only. In many environments,
to get from the kitchen to the office, the robot has to move
through a doorway first.
To incorporate such spatial dependencies between the individual classes, we apply a hidden Markov model (HMM) and
maintain a posterior Bel (lt ) about the type of the place lt the
robot is currently at
Bel(lt) = αP(zt | lt) Σ_{lt−1} P(lt | lt−1, ut−1) Bel(lt−1).     (1)
In this equation, α is a normalizing constant ensuring that
the left-hand side sums up to one over all lt . To implement
this HMM, three components need to be known. First, we
need to specify the observation model P (zt | lt ) which is the
likelihood that the classification output is zt given the actual
class is lt . Second, we need to specify the transition model
P (lt | lt−1 , ut−1 ) which defines the probability that the robot
moves from class lt−1 to class lt by executing action ut−1 .
Finally, we need to specify how the belief Bel (l0 ) is initialized.
In our current system, we choose a uniform distribution
to initialize Bel (l0 ). The quantity P (zt |lt ) has been obtained
from statistics of the classification output of the AdaBoost
algorithm given that the robot was at a place corresponding
to lt . To realize the transition model P (lt |lt−1 , ut−1 ) we
only consider the two actions ut−1 ∈ {MOVE , STAY }.
The transition probabilities were estimated by running 1000
simulation experiments. A more complete description is given
in [17].
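Equation (1) translates directly into the following sketch, assuming the observation and transition models are available as arrays indexed as in the docstring (the data layout is an assumption made for this example):

```python
import numpy as np

# Belief update of Eq. (1): predict with the transition model for the executed
# action, weight by the observation likelihood, and normalise.
def hmm_update(bel, z, u, obs_model, trans_model):
    """bel:                 length-L prior belief over the place classes
    obs_model[z, l]:        P(z_t | l_t = l)
    trans_model[u, l, l']:  P(l_t = l' | l_{t-1} = l, u_{t-1} = u), u in {MOVE, STAY}"""
    bel = np.asarray(bel, dtype=float)
    predicted = trans_model[u].T @ bel       # sum over l_{t-1}
    posterior = obs_model[z] * predicted     # weight by the observation likelihood
    return posterior / posterior.sum()       # alpha normalises the belief to one
```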
V. TOPOLOGICAL MAP BUILDING
A second application of our semantic place classification
is the extraction of topological maps from geometric maps.
Throughout this section we assume that the robot is given a
map of the environment in the form of an occupancy grid [15].
Our approach then determines for each unoccupied cell of such
a grid its semantic class. This is achieved by simulating a range
scan of the robot given it is located in that particular cell, and
then labeling this scan into one of the semantic classes. To
remove noise and clutter from the resulting classifications,
we apply an approach denoted as probabilistic relaxation
labeling [16]. This method takes into account the labels of the
neighborhood when changing (or maintaining) the label of a
given cell. From the resulting labeling we construct a graph
whose nodes correspond to the regions of identically labeled
poses and whose edges represent the connections between
them. Additionally we apply a heuristic region correction
to the topological map to increase the classification rate. A
typical topological map obtained with our approach is shown
in Figure 7. For more details see [14].
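A compact sketch of the final graph-extraction step is given below, assuming the relaxation labeling and heuristic region correction have already been applied: connected regions of identically labelled free cells become nodes and adjacent, differently labelled regions become edges (the grid representation is an assumption made for the example):

```python
from collections import deque

# Sketch: flood-fill connected regions of equally labelled cells and connect
# neighbouring regions; occupied cells are marked with None.
def topological_graph(labels):
    """labels: 2D list with a class string per free cell and None for occupied cells."""
    rows, cols = len(labels), len(labels[0])
    region = [[-1] * cols for _ in range(rows)]
    regions, edges = [], set()
    for r in range(rows):
        for c in range(cols):
            if labels[r][c] is None or region[r][c] != -1:
                continue
            rid = len(regions)
            regions.append(labels[r][c])
            queue = deque([(r, c)])
            region[r][c] = rid
            while queue:                           # flood fill one region
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < rows and 0 <= nx < cols and labels[ny][nx] is not None:
                        if labels[ny][nx] == labels[y][x] and region[ny][nx] == -1:
                            region[ny][nx] = rid
                            queue.append((ny, nx))
                        elif labels[ny][nx] != labels[y][x] and region[ny][nx] != -1:
                            edges.add((min(rid, region[ny][nx]), max(rid, region[ny][nx])))
    return regions, edges
```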
A. Semantic Classification of Maps using Associative Markov
Networks
The improvement on the labeling of free cells given by
our AdaBoost approach can also be seen as a collective
classification problem [6]. In this approach, the labeling of
each free cell in the map is also influenced by the labeling
of other cells in the vicinity. One popular method for the
task of collective classification is relational Markov networks
(RMNs) [21]. In addition to the labels of neighboring points,
RMNs also consider the relations between different objects.
E.g., we can model the fact that two classes A and B are more
strongly related to each other than, say, classes A and C. This
modeling is done on the abstract class level by introducing
clique templates [6]. Applying these clique templates to a
given data set yields an ordinary Markov network (MN). In this
MN, the result is a higher weighting of neighboring points with
labels A and B than of points labeled A and C. Additionally,
each node in the network is associated with a set of features.
The whole process of labeling is composed of two steps.
First, a supervised learning process is used to learn the
parameters of the RMN from a training set. Second, a new
network is classified using these parameters. This last step is
also called inference. In this work, we will use a special type
of RMNs known as associative Markov networks (AMNs).
Efficient algorithms are available for learning and inference in
AMNs (for more detail see [24]).
In our case we create an AMN in which each node represents a cell in the geometric map. Each node is given a
semantic label corresponding to the place in the map (corridor,
doorway or room). We also create an 8-neighborhood for each
cell. Furthermore, a set of features is calculated for each cell.
These features correspond to the geometric ones extracted
from a simulated laser beam as explained in Section III-A. To
reduce the number of features during the training and inference
steps, we select a subset of them. This selection is done using
the AdaBoost algorithm [13].
VI. LASER OBSERVATIONS WITH RESTRICTED FIELD OF VIEW
In this section we present some practical issues when
classifying a trajectory using range data with a restricted field
of view. Specifically, we explain how to extract features when
using a laser range finder which only covers 180° in front
of the robot. This is one of the most common configurations
when using mobile robots. As an example, if a robot is looking
at the end of a corridor, then it is not able to see the rest of
the corridor, as it would with an additional rear laser. This
situation is shown in Figure 2. When classifying a trajectory
we propose to maintain a local map around the robot as shown
in the right image of Figure 2. This local map can be updated
during the movements of the robot and then used to simulate
the rear laser beams. In Section VII we show some results
when learning and classifying a place using this method.
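The rear beams can be simulated by ray casting in such a local map; the sketch below assumes a simple binary occupancy grid with a fixed resolution and maximum range (all assumptions made for this example):

```python
import numpy as np

# Sketch: simulate the missing rear beams by casting rays through a local
# occupancy grid kept around the robot.
def simulate_beam(grid, origin, angle, resolution=0.05, max_range=8.0):
    """grid: 2D numpy array (1 = occupied, 0 = free); origin: (row, col) cell of the robot.
    Returns the simulated range in metres for one beam."""
    for d in np.arange(resolution, max_range, resolution):
        r = int(round(origin[0] - d * np.sin(angle) / resolution))
        c = int(round(origin[1] + d * np.cos(angle) / resolution))
        if not (0 <= r < grid.shape[0] and 0 <= c < grid.shape[1]) or grid[r, c] == 1:
            return float(d)
    return max_range

def simulate_rear_scan(grid, origin, heading=0.0, n_beams=180):
    """Cast the 180 degrees behind the robot's heading in the local map."""
    angles = heading + np.pi / 2 + np.linspace(0.0, np.pi, n_beams)
    return [simulate_beam(grid, origin, a) for a in angles]
```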
Fig. 2. The left image shows a robot at the end of a corridor with only a
front laser (red). In the middle image the robot has an additional rear laser
(blue). The right image depicts an example local map (shaded area).
Fig. 3. The left image depicts the training data. The right image shows the
test set with a classification rate of 97.3%. The training and test data were
obtained by simulating laser range scans in the map.
VII. EXPERIMENTS
The approaches described above have been implemented
and tested on real robots as well as in simulation. The robots
used to carry out the experiments were an ActivMedia Pioneer
2-DX8 equipped with two SICK lasers, an iRobot B21r robot
equipped with a camera system and an ActivMedia PowerBot
equipped only with a front laser.
The goal of the experiments is to demonstrate that our
simple features can be boosted to a robust classifier of places.
Additionally, we analyze whether the resulting classifier can
be used to classify places in environments for which no
training data was available. Furthermore, we demonstrate the
advantages of utilizing the vision information to distinguish
between different rooms like, e.g., kitchens, offices, or seminar
rooms. Additionally, we illustrate the advantages of the HMM
filtering for classifying places with a moving mobile robot. We
also present results applying our method for building semantic
topological maps. Finally, we show experiments using a robot
with only a front laser.
A. Results with the Sequential Classifier using Laser Data
The first experiment was performed using simulated data
from our office environment in building 79 at the University
of Freiburg. The task was to distinguish between three different
types of places, namely rooms, doorways, and a corridor based
on laser range data only. In this experiment, we applied the
sequential classifier without any filtering. For the sake of
clarity, we separated the test from the training data by dividing
the overall environment into two areas. Whereas the left part of
the map contains the training examples, the right part includes
only test data (Figure 3). The optimal decision list for this
classification problem, in which the robot had to distinguish
between three classes, is room-doorway. This decision list
correctly classifies 97.3% of all test examples (right image
of Figure 3). Additionally, we performed an experiment using
a map of the entrance hall at the University of Freiburg which
contained four different classes, namely rooms, corridors,
doorways, and hallways. The optimal decision list is
corridor-hallway-doorway with a success rate of 89.5%. The worst
configurations of the decision list are those in which the
doorway classifier is in the first place. This is probably due to
the fact that doorways are hard to detect because typically
most parts of a range scan obtained in a doorway cover
the adjacent room and the corridor. The high error in the
first element of the decision list then leads to a high overall
classification error.
Fig. 4. The left map depicts the occupancy grid map of the Intel Research
Lab and the right image depicts the classification results obtained by applying
the classifier learned from the environment depicted in Figure 1 to this
environment. The fact that 83.0% of all places could be correctly classified
illustrates that the resulting classifiers can be applied to so far unknown
environments.
B. Transferring the Classifiers to New Environments
The second experiment is designed to analyze whether a
classifier learned in a particular environment can be used to
successfully classify the places of a new environment. To carry
out this experiment, we trained our sequential classifier in
the left map of Figure 1, which corresponds to the building
52 at the University of Freiburg. The resulting classifier was
then evaluated on scans simulated given the map of the Intel
Research Lab in Seattle depicted in Figure 4. Although the
classification rate decreased to 83.0%, the result indicates
that our algorithm yields good generalizations which can
also be applied to correctly label places of so far unknown
environments. Note that a success rate of 83.0% is quite
high for this environment, since even humans typically cannot
consistently classify the different places.
C. Classification of Trajectories using HMM Filtering
The third experiment was performed using real laser and
vision data obtained in an office environment, which contains
six different types of places, namely offices, doorways, a
laboratory, a kitchen, a seminar room, and a corridor. The
true classification of the different places in this environment
is shown in Figure 5. The classification performance of the
classifier along a sample trajectory taken by a real robot is
shown in the left image of Figure 6. The classification rate in
this experiment is 82.8%. If we additionally apply the HMM
for temporal filtering, the classification rate increases up to
87.9% as shown in the right image of Figure 6.
Fig. 5. Ground truth labeling of the individual areas in the environment
(L = Laboratory, S = Seminar, C = Corridor, F = Office, D = Doorway, K = Kitchen).
Fig. 6. The left image depicts a typical classification result for a test set
obtained using only the output of the sequence of classifiers. The right image
shows the resulting classification in case an HMM is additionally applied to
filter the output of the sequential classifier.
A further experiment was carried out using test data obtained in a different part of the same building. We applied
the same classifier as in the previous experiment. Whereas
the sequential classifier yields a classification rate of 86.0%,
the combination with the HMM generated the correct answer
in 94.7% of all cases. A two-sample t-test applied to the
classification results obtained along the trajectories for both
experiments showed that the improvements introduced by the
HMM are significant on the α = 0.05 level. Furthermore, we
classified the same data based solely on the laser features and
ignoring the vision information. In this case, only 67.7% could
be classified correctly without the HMM. The application of
the HMM increases the classification performance to 71.7%.
These three experiments illustrate that the HMM significantly improves the overall rate of correctly classified places.
Moreover, the third experiment shows that the laser
information alone is not sufficient to robustly distinguish between
places with similar structure (see “office” and “kitchen” in
Figure 6).
D. Building Topological Maps
The next experiment is designed to analyze our approach
to build topological maps. It was carried out in the office
environment depicted in the motivating example shown in
Figure 1. The length of the complete corridor in this environment is approx. 20 m. After applying the sequential AdaBoost
classifier (see middle image in Figure 1), we applied the
probabilistic relaxation method together with the heuristics
explained in Section V. The resulting topological map is
shown in Figure 7. The final result gives a classification rate of
98.0% for all data points. The doorway between the two rightmost rooms under the corridor is correctly detected. Therefore,
the rooms are labeled as two different regions in the final
topological map.
Fig. 7. Final topological map of building 52 at Freiburg University.
E. Learning Topological Maps of Unknown Environments
This experiment is designed to analyze whether our approach can be used to create a topological map of a new
unseen environment. To carry out the experiment we trained a
sequential AdaBoost classifier using the training examples of
the maps shown in Figure 3 and Figure 1 with different scales.
The resulting classifier was then evaluated on scans simulated
in the map denoted as “SDR site B” in Radish [8]. This map
represents an empty building in Virginia, USA. The corridor
is approx. 26 meters long. The whole process for obtaining
the topological map is depicted in Figure 8. The AdaBoost
classifier gives a first classification of 92.4%. As can be seen
in Figure 8(d), rooms number 11 and 30 are actually part of
the corridor, and thus falsely classified. Moreover, the corridor
is detected as only one region, although humans potentially
would prefer to separate it into six different corridors: four
horizontal and two vertical ones. Doorways are difficult to
detect and the majority of them disappear after the relaxation
process because they are very sparse. In the final topological
map 96.9% of the data points are correctly classified.
F. Learning Topological Maps using Associative Markov Networks (AMNs)
In this experiment, we classify the map of the building
79 at the University of Freiburg applying the learning and
inference process for AMNs as explained in Section V-A.
We divide the map in two parts and use one of them for
training (see left image in Figure 3) and the second one for
testing. In this experiment we reduce the resolution of the
maps to 20cm. The reason is that the original resolution of
5cm generates a huge network which exceeds the memory
resources of our computers during the training step of the
corresponding AMN. The left image of Figure 9 shows the
results of the classification using AMNs. The classification rate
using AMNs was 98.8%. We compare this method with the
classification obtained using our sequential AdaBoost together
with the probabilistic relaxation procedure. The right image
of Figure 9 depicts the classification results. In this case only
92.1% of the cells were correctly classified. As we can see,
one consequence of changing the resolution to 20 cm is that
the classification rate decreases (see right image of Figure 3).
We think this is due to the worse quality of the simulated
beams in such a coarse map. On the other hand, AMNs
seem to be more robust to changes in resolution and give
better classification results.
Fig. 8. This figure shows (a) the original map of the building, (b) the results
of applying the sequential AdaBoost classifier with a classification rate of
93%, (c) the resulting classification after the relaxation and region correction,
and (d) the final topological map with semantic information. The regions are
omitted in each node. The rooms are numbered left to right and top to bottom
with respect to the map in (a). For the sake of clarity, the corridor-node is
drawn maintaining part of its region structure.
Fig. 9. The left image depicts a classification of 98.8% of the building 79 at
the University of Freiburg using AMNs. The right image shows the classification
of the same building using the sequential AdaBoost classifier together with the
probabilistic labeling method. In this case the classification rate was 92.1%.
The training and test data were obtained by simulating laser range scans in
the left map of Figure 3.
G. Laser Observations with Restricted Field of View
In these experiments we show the results of applying our classification methods when the laser range scan has a restricted
field of view. No image data was used. We first steered a
PowerBot robot equipped with only a front laser along the 6th
floor of the CAS building at KTH (right to left). The trajectory
is shown in the top image of Figure 10. The data recorded in
this floor was used to train the AdaBoost classifier. We then
classified a trajectory on the 7th floor in the same building. We
started the trajectory in an opposite direction (left to right). The
resulting classification rate of 84.4% is depicted in the middle
image of Figure 10. We repeated the experiment simulating
the rear laser using a local map. The classification decreases
slightly to 81.6%. Most of the errors appear in poses where
the robot still sees a doorway due to the rear beams. This is
not the case when using only a front laser, because the robot
only sees a doorway when facing it.
Fig. 10. The top image shows the training trajectory on the 6th floor of the
CAS building at KTH. The middle image depicts the labeling of the trajectory
of the 7th floor using only a front laser with a classification rate of 84.4%.
Finally, the bottom image shows the same labelled trajectory using a complete
laser field of view together with a local map. In this case the classification
rate decreases slightly to 81.6%.
To verify that the doorways can be the reason for the lack of
improvement when using local maps, we repeated both experiments,
but in this case using only two classes, namely room and corridor. The results are shown in Figure 11. The top image depicts
the labeling using only a front laser with a classification rate
of 87.3%. The bottom image shows the result of simulating
the rear beams using a local map. The classification rate in
this case increases to 95.8%.
VIII. CONCLUSION
In this paper, we presented a novel approach to classify
different places in the environment of a mobile robot into
semantic classes, like rooms, hallways, corridors, offices,
kitchens, or doorways. Our algorithm uses simple geometric
features extracted from a single laser range scan and information extracted from camera data and applies the AdaBoost
algorithm to form a binary strong classifier. To distinguish
between more than two classes, we use a sequence of strong
binary classifiers arranged in a decision list.
We presented two applications of our approach. Firstly, we
perform an online classification of the positions along the
trajectories of a mobile robot by filtering the classification
output using a hidden Markov model. Secondly, we present
a new approach to create topological graphs from occupancy
grids by applying a probabilistic relaxation labeling to take
into account dependencies between neighboring places to
improve the classifications.
Fig. 11. In these experiments only two classes were used, namely room and
corridor. The top image depicts the classification of the trajectory of the 7th
floor using only a front laser with a classification rate of 87.3%. The bottom
image shows the same trajectory using a complete laser field of view together
with a local map. In this case the classification rate increases to 95.8%.
Experiments carried out using real robots as well as in simulation illustrate that our technique is well-suited to reliably
label places in different environments. It allows us to robustly
separate different semantic regions and in this way it is able to
learn topologies of indoor environments. Further experiments
illustrate that a learned classifier can even be applied to so far
unknown environments.
ACKNOWLEDGMENT
This work has been partially supported by the EU under
contract number FP6-004250-CoSy and by the German Research Foundation under contract number SBF/TR8.
REFERENCES
[1] P. Althaus and H. Christensen, “Behaviour coordination in structured
environments,” Advanced Robotics, vol. 17, no. 7, pp. 657–674, 2003.
[2] D. Anguelov, R. Biswas, D. Koller, B. Limketkai, S. Sanner, and
S. Thrun, “Learning hierarchical object maps of non-stationary environments with mobile robots,” in Proc. of the Conf. on Uncertainty in
Artificial Intelligence (UAI), 2002.
[3] D. Anguelov, D. Koller, P. E., and S. Thrun, “Detecting and modeling
doors with mobile robots,” in Proc. of the IEEE Int. Conf. on Robotics
& Automation (ICRA), 2004.
[4] P. Beeson, N. K. Jong, and B. Kuipers, “Towards autonomous topological place detection using the extended voronoi graph,” in Proc. of the
IEEE Int. Conf. on Robotics & Automation (ICRA), 2005.
[5] P. Buschka and A. Saffiotti, “A virtual sensor for room detection,” in
Proc. of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems
(IROS), 2002, pp. 637–642.
[6] S. Chakrabarti, B. Dom, and P. Indyk, “Enhanced hypertext categorization using hyperlinks,” in SIGMOD ’98: Proceedings of the 1998 ACM
SIGMOD international conference on Management of data. New York,
NY, USA: ACM Press, 1998, pp. 307–318.
[7] H. Choset, “Topological simultaneous localization and mapping
(SLAM): Toward exact localization without explicit localization,” IEEE
Transactions on Robotics and Automation, 2001.
[8] A. Howard and N. Roy, “Radish: The robotics data set repository.”
[Online]. Available: http://radish.sourceforge.net/
[9] S. Koenig and R. Simmons, “Xavier: A robot navigation architecture
based on partially observable markov decision process models,” in
Artificial Intelligence Based Mobile Robotics: Case Studies of Successful
Robot Systems, D. Kortenkamp, R. Bonasso, and R. Murphy, Eds. MIT Press, 1998, pp. 91–122.
[10] D. Kortenkamp and T. Weymouth, “Topological mapping for mobile
robots using a combination of sonar and vision sensing,” in Proc. of the
Twelfth National Conference on Artificial Intelligence, 1994, pp. 979–
984.
[11] B. Kuipers and Y. Byun, “A robot exploration and mapping strategy
based on a semantic hierarchy of spatial representations,” Robotics and
Autonomous Systems, vol. 8, 1991.
[12] B. Limketkai, L. Liao, and D. Fox, “Relational object maps for mobile
robots,” in Proc. of the Int. Conf. on Artificial Intelligence (IJCAI),
Edinburgh, Scotland, 2005, pp. 1471–1476.
[13] O. Martı́nez Mozos, C. Stachniss, and W. Burgard, “Supervised learning
of places from range data using adaboost,” in Proc. of the IEEE
Int. Conf. on Robotics & Automation (ICRA), Barcelona, Spain, April
2005, pp. 1742–1747.
[14] O. Martı́nez Mozos, C. Stachniss, A. Rottmann, and W. Burgard, “Using
adaboost for place labeling and topological map building,” in Proc. of
the Int. Symposium of Robotics Research (ISRR), San Francisco, CA,
USA, October 2005.
[15] H. P. Moravec, “Sensor fusion in certainty grids for mobile robots,” AI
Magazine, pp. 61–74, Summer 1988.
[16] A. Rosenfeld, R. A. Hummel, and S. W. Zucker, “Scene labeling by
relaxation operations,” IEEE Trans. Systems. Man. Cybernet, vol. 6,
no. 6, pp. 420–433, 1976.
[17] A. Rottmann, O. Martı́nez Mozos, C. Stachniss, and W. Burgard, “Place
classification of indoor environments with mobile robots using boosting,”
in Proc. of the National Conference on Artificial Intelligence (AAAI),
Pittsburgh, PA, USA, 2005, pp. 1306–1311.
[18] R. Schapire and Y. Singer, “Improved boosting algorithms using
confidence-rated predictions,” Mach. Learn., vol. 37, no. 3, pp. 297–
336, 1999.
[19] H. Shatkay and L. Kaelbling, “Learning topological maps with weak
local odometric information,” in Proc. of the Int. Conf. on Artificial
Intelligence (IJCAI), 1997.
[20] A. Tapus and R. Siegwart, “Incremental robot mapping with fingerprints
of places,” in Proc. of the IEEE/RSJ Int. Conf. on Intelligent Robots and
Systems (IROS), August 2005, pp. 2429– 2434.
[21] B. Taskar, P. Abbeel, and D. Koller, “Discriminative probabilistic models
for relational data,” in Proc. Eighteenth Conference on Uncertainty in
Artificial Intelligence (UAI), Edmonton, Canada, 2002.
[22] S. Thrun, “Learning metric-topological maps for indoor mobile robot
navigation,” Artificial Intelligence, vol. 99, no. 1, pp. 21–71, 1998.
[23] A. Torralba, K. Murphy, W. Freeman, and M. Rubin, “Context-based
vision system for place and object recognition,” in Proc. of the
Int. Conf. on Computer Vision (ICCV), 2003.
[24] R. Triebel, P. Pfaff, and W. Burgard, “Multi-level surface maps for
outdoor terrain mapping and loop closing,” in Proc. of the International
Conference on Intelligent Robots and Systems (IROS), 2006.
[25] P. Viola and M. Jones, “Robust real-time object detection,” in Proc. of
IEEE Workshop on Statistical and Computational Theories of Vision, 2001.
[26] Z. Zivkovic, B. Bakker, and B. Kröse, “Hierarchical map building using
visual landmarks and geometric constraints,” in Proc. of the IEEE/RSJ
Int. Conf. on Intelligent Robots and Systems (IROS), 2005.
...
Towards Stratified Spatial Modeling
for Communication & Navigation
Robert J. Ross, Christian Mandel, John A. Bateman, Shi Hui, Udo Frese
Abstract— In this paper we present NavSpace, a stratified
spatial representation developed for service robotics. While
NavSpace’s lower tiers, including metric and voronoi-graph
information, focus on the needs of navigation and localization
systems, the upper tiers have been explicitly developed to
support relatively natural human-robot interaction. Specifically,
NavSpace’s upper tiers include: (a) a coarse grained Topological
Level which includes Route Graph and Region Space models
of the environment; and (b) the Concept Level, a cognitively
grounded, coarse grained representation which links together the
places, regions, and paths of the topological level. Furthermore,
we describe the NavSpace model with respect to our completed
prototype on Rolland, a semi-autonomous wheelchair, and in so
doing, describe how the conceptual spatial representation relates
to natural language technology through an empirically derived
linguistic semantics bridge.
Index Terms— Spatial Modeling, HRI
I. INTRODUCTION
Empirical studies show that users employ models of
space which are schematized mental constructions within
which exact metric data is often systematically simplified
and distorted [25], [26]. For example, several studies show
that a user’s space of navigation is essentially topological,
consisting of landmarks, places, and paths [4], [20], [21].
Unsurprisingly then, the language people use to describe space
is similarly coarse grained yet complex [23]. For robotic
systems to effectively communicate with users they must be
able to communicate in terms of these coarse-grained spatial
concepts. One way to approach this problem would be to
attempt to directly map complex spatial language constructions
to standard robotic spatial models such as Voronoi
graphs and metric data. The alternative view, pursued here,
is that specific abstract spatial representations can be derived
and used in parallel with low-level representations to facilitate
the communication process. A need for such stratified
spatial representations is by no means new [9], [11]; however,
significant questions remain regarding the exact composition
of such layers, their inter-relationships, and the relationship to
natural language processing.
In this paper we present NavSpace, an implemented stratified spatial representation for service robots engaged in navigation tasks. While the lower tiers consist of sub-conceptual
metric and Voronoi graph information which provides a suitable
representation for robot navigation and localization, the upper
levels provide spatial conceptual representations for the communicative tasks. We begin in Section II with a brief review
of some of the challenging features of verbal spatial language
which we have observed between users and mobile robots.
As mentioned above, the need for stratified spatial models
has been considered from different perspectives; in Section III
we review and critique two of the more prominent of such
models, the Spatial Semantic Hierarchy and the Route Graph.
In Section IV we then detail our own developed spatial model,
i.e., NavSpace, which takes a more focused view on the issues
of cognitive conceptualization and HRI needs than had been
considered in previous approaches. In Section V we place the
NavSpace model in the context of our target application, a
semi-autonomous robotic wheelchair, by both describing the
use of the various layers by navigation, safety, and localization
systems, while also showing how the upper conceptual layers
are mapped to a linguistic semantics model which drives the
spatial communication interface.
II. STRUCTURED SPATIAL INTERACTIONS
Our development of the NavSpace tiers has been driven
partially through the analysis of two series of empirical studies
which were conducted to investigate the nature of spatial
language used in joint human-robot navigation tasks. The first
series used a number of robotic platforms, and investigated
language used to describe spatial concepts, route descriptions,
and spatial relations between objects [6], [20], [22]. The
second study which we considered was Bugmann et al.'s
Instruction Based Learning (IBL) project which looked at
navigation tasks performed by robots moving around a toy
town [3], [12].
Generalizing the corpus data for compactness, Exchange 1
below presents one prototypical example of a relatively simple
exchange between a user who wishes to be driven to a location,
and the wheelchair which does not know at first how to get
to that location, but which initiates a clarification dialogue to
determine where that location might be:
(1) a. User: drive me to the lab
b. System: Sorry, I don’t know about any lab.
c. System: Where is it?
d. User: it is beside the kitchen.
e. System: Is it before or after the kitchen?
f. User: After
g. System: OK, driving to the lab
While the user in exchange 1 attempts to localize an object
with respect to a second mutually known object, users will
often provide localization information in terms of a set of
instructions on how to reach the goal location, i.e., a route
instruction, such as that below:
(2) a. User: drive me to the lab
b. System: Sorry, I don’t know about any lab
c. System: Where is it?
d. User: Where are we now?
e. System: We are beside the kitchen
f. User: Go to the elevators
g. System: OK, to the elevators
h. User: Drive forward about 10 meters
i. System: OK
j. User: then turn
k. System: Should I turn left or right there?
l. User: right
m. System: OK
n. User: the lab is the second room to the left
o. System: OK, driving to the lab
These examples are intended to qualitatively illustrate a
number of features common to exchanges concerning spatial
concepts: (a) granularity of environment space concepts used
in verbal scene description and localization, e.g., “kitchen”,
“elevator”; (b) course grained action descriptions, “turn”,
“drive”; (c) quantitative terms, e.g., “10 meters”; (d) projective
spatial relations, e.g., “to the left”; and (e) localization terms
“we are beside the kitchen”. We argue that the variety of
constructions used mean that a number of different representational and processing approaches must be harnessed to account
for the intrinsic variability of users’ cognitive modeling of
space. In the next section we review two existing models
which have attempted to capture this stratified representation
of spatial knowledge.
III. HIERARCHICAL SPATIAL MODELS
Over the past ten years, the need for robotic systems to incorporate spatial modeling adequate to support tasks ranging from navigation to human-robot interaction has become apparent. In the following we review two
of the more prominent models which have attempted to span
the robotics and spatial cognition communities, i.e., Kuipers’s
Spatial Semantic Hierarchy [11], and Krieg-Brückner’s Route
Graph [9].
A. The Spatial Semantic Hierarchy
The Spatial Semantic Hierarchy (SSH) [11] is a stratified model of an agent’s spatial representation encompassing
sensor data, behaviors, topological representation, and metric
maps. Specifically, the SSH includes five distinct ontological
levels: (a) the sensory level, (b) the control level, (c) the causal
level, (d) the topological level, and (e) the metrical level.
The sensory and control levels are sub-symbolic interfaces to sensory and behavioral capabilities; they map the continuous numerical values of the physical platform to the discrete symbolic representations used at the causal and topological levels. The causal level uses the
situation calculus to capture causal relations among the robot’s
views (symbolic abstractions over the sensory input perceived
by the robot at some time t), actions (which bring about
changes in views), and events, the realization of a change in
view. The metrical level on the other hand consists of a global
2-D geometric map of the environment in a single frame of
reference, a so-called “Map in the Head” [11, p195].
In terms of the modeling of space for interaction with users,
one of the more interesting levels is the topological level which
defines notions of places, paths and regions, along with
their associated connectivity and containment relations. Places
are defined as zero-dimensional entities which may lie on a
path. A topological, or place, graph can then be constructed
as a map of the environment consisting of sets of places and
their connecting paths. Paths also serve as the boundaries for
regions. A region is defined as a two-dimensional subset of
the environment, i.e., a set of places. Path directedness also
allows a reference system to be determined. Each directed path
divides the world into two regions: one on the right and one on
the left. A bounded region is then defined by a directed path
with the region on this path’s right or on its left. The SSH
also uses regions to define a hierarchical view of space. There
are therefore two levels of abstraction within the topological
layer, one for place and one for region.
One of the key themes of the SSH is that information at different ontological layers can be inter-related through mappings
and abduction processes. Thus, the places, paths and regions
of the topological level are created by deducing some minimal
description that is sufficient to explain the regularities found
among the observed views and actions of the causal level. As
such, the SSH effectively provides something approaching a complete robot control architecture which directly integrates spatial representation with deliberation and sensory-motor control. However, the spatial models used within the SSH were not developed with cognitive modeling or HRI issues in mind.
B. The Route Graph
One spatial modeling approach which has been developed
with a view to cognitive plausibility and HRI issues is Krieg-Brückner et al.'s Route Graph (RG: [8], [9], [27]). The RG is an
abstract graph like representation of navigation space which
may be instantiated to different kinds, layers, and levels of
granularity. Instantiations of the RG have been made both for
large scale navigation space such as tram-networks [13], as
well as for robotic applications in medium scale space such
as office environments [18].
As defined in [9], the principal concepts within the abstract
RG specification are Places, Segments, and Routes. Places
are anywhere that an agent can ‘be’, and are defined as having
their own reference system related to an origin (position and
orientation) associated with the place. In turn, local reference
systems may or may not be rooted on a global reference system
depending on the cognitive characteristics of the instantiated
application. A Place’s origin is used to define the orientation
of connecting Segments which are directed connections from
one Place to another. A Segment is said to have a Course,
an Entry, and an Exit, the latter two of which are defined with
respect to the connecting place’s origin as described earlier,
while the Course is some description of the actual route
taken between the Entry and Exit. A Route is then intuitively
defined as a sequence of Segments without repetition (i.e.,
cycles are not permitted within route definitions). Places
and Segments may be specialized with respect to particular
applications. For example, in a voronoi-styled instantiation of
the RG, Places may have a width denoting free space, while
a Segment’s Course may be instantiated with quantitative
information such as distance or width of the segment – or
alternatively with qualitative information such as the action to
be performed to traverse the route segment.
The RG also includes a notion of abstraction similar to that
of the SSH. For example, at a particular level of abstraction
an entry and an exit ramp of a highway can be considered as
two different places (nodes in the RG), whereas, at a higher
level of abstraction, the two nodes could be considered one
place corresponding to the complex notion of a ‘road junction’.
Similarly, the complex possibilities for navigation within a
train station might quite appropriately, at a higher level of
abstraction, be collapsed to a single place: ‘the station’.
In [9], a number of relations required between the world of route graphs (places, route segments, paths, etc.) and other, more established modeling domains are specified. Two such domains are explicitly
identified: a spatial ontology that is expected to provide spatial
regions, and a ‘commonsense ontology’, providing everyday
objects such as rooms, offices, corridors, and so on. The
relationships provided are intended to allow inferences back
and forth between places in a route graph, the spatial regions
that such places occupy, and the everyday objects that those
regions ‘cover’. The relationship to commonsense concepts
was not however fully established and needs to be expanded
upon before adequate connections to natural communication or
a wide range of reasoning tasks can be established. Similarly,
questions remain as to how the RG concept should be used
alongside spatial modeling tasks which are better captured
with quantitative means. The model presented in the following
is intended to overcome some of these issues.
IV. THE NAVIGATION SPACE MODEL
In this section we describe the representational layers which
together constitute the NavSpace model.
A. The Metric & Voronoi Layers
Our lowest-level long-term spatial memory structure is an
occupancy or Evidence Grid which denotes the probability
of occupancy across a map-style 2D structure. Derived from
the EvidenceGrid, the DistanceGrid is also organized as a
rectangular array of grid cells, now containing the distances
to the closest obstacle points. A graph like derivative of the
DistanceGrid completes the lower-level representation. This
Voronoi Diagram, depicted in Figure 1(a), is a reduction of the
DistanceGrid that describes the space with maximal clearance
to the surrounding obstacles by a graph like structure. In
practice the low-level spatial representation is quite simple,
and has been computed both from local sensor-based grid maps on physical robots and from a pre-existing global grid map derived from CAD blueprints.
The EvidenceGrid and Voronoi Graphs provide a comprehensive spatial model for all navigation, localization, and
safety tasks as illustrated by our demonstrator described in
Section V. Moreover, in a recent paper [14], it was shown
that through the definition of fuzzy spatial relations interpretation functions, it was possible to use the Voronoi Graph
directly to interpret a number of coarse route instructions
given by a user. However, that interpretation algorithm relied
on the user providing relatively long route instruction chunks
such that statistical methods could be used to eliminate non-route vertices from the Voronoi graph. The interpretation of
shorter route instructions, and similarly the generation of
spatial descriptions directly against a Voronoi graph, is less
straightforward. In the next section we describe our abstracted
topological representation which we use to overcome these
issues.
B. The Topological Layer
Our principal representation of conceptual topological space
is a Route Graph (RG) instantiation that abstracts between
quantitative and qualitative structure. This conceptual topology is used not for low-level robot control, but purely for
communication purposes, i.e., for interpreting and generating
spatial descriptions.
While the original RG specification provides a good framework for topological representations, the modeling of projective and other spatial relations, e.g., left, after, remained
unspecified and were to be filled by the domain application.
In a recent work [10], the RG was combined with the Double
Cross qualitative spatial reasoning calculus [7] to provide a
qualitative route interpretation approach. The model proved
useful in integrating strictly qualitative terms when no quantitative information is available, but it cannot by definition be
used to process higher-level reasoning tasks such as the general
process of location identification.
For the specific spatial concept definition system developed
for this paper, we have used a single 2D global reference frame within which all Places are located. Although this is not the favored approach in the Route Graph specification, we feel that this is a reasonable decision since the spatial environments used in the demonstration system described in Section V are relatively small. The principal spatial characteristics of Places and RouteSegments then become Cartesian coordinates and orientations with respect to
the global reference frame.
In summary then, our RG structure applied here consists of:
• P, a set of Places, and
• S, a set of Segments that connect Places,
where Places are located within a 2D reference frame, and the
entry and exit points of each Segment are defined with respect
to a global orientation (analogous to the defining orientations
with respect to North). Furthermore, each RG Place and
Segment is said to mark a given conceptual place such as
Kitchen or Corridor. This conceptual mapping is addressed
further in Section IV-C.
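As an illustration, the following Python sketch shows one way such a Route Graph layer could be held in memory; the class and field names (Place, Segment, the concept field used for marking) are our own illustrative assumptions rather than the implementation used on Rolland.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Place:
        name: str              # identifier, e.g. "p_kitchen_door"
        x: float               # position in the single global 2D frame
        y: float
        concept: str           # conceptual place it marks, e.g. "Kitchen"

    @dataclass
    class Segment:
        source: Place
        target: Place
        entry: float           # entry orientation w.r.t. the global frame (radians)
        exit: float            # exit orientation w.r.t. the global frame (radians)
        concept: str = "Corridor"   # conceptual place the segment passes through

    @dataclass
    class RouteGraph:
        places: List[Place] = field(default_factory=list)      # the set P
        segments: List[Segment] = field(default_factory=list)  # the set S

        def neighbours(self, p: Place) -> List[Place]:
            """Places reachable from p by a single segment."""
            return [s.target for s in self.segments if s.source is p]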
With a simple Route Graph styled topology in place, the
question becomes how do we communicate about space with
the user? Core communicative tasks in the navigation domain
are the interpretation and description of object locations.
Such descriptions invariably make use of relations such as
behind, left of, and next to which are essentially dynamic in
that they cannot as such be encoded directly into a spatial
representation, but are relative to a given relatum and origin
Fig. 1. Illustration of the various levels of the NavSpace model. Figure 1(a) presents the Voronoi graph superimposed over the line segment structure resulting from sensory data. Figure 1(b) then shows an abstract topological representation of the same floor. Finally, Figure 1(c) shows a simplified fragment of the conceptual layer.
at time t1. To interpret such context-specific relations, we have
defined a number of relation operators which operate on one
or two Places in the RG to produce a region which in turn
may include one or more specific Places. In the following we
define this approach for the projective and ordering relations.
(Footnote 1: Qualitative Spatial Reasoning (QSR) models do attempt to encode this information statically into a representation, but we consider computation of such relations with respect to a quantitatively described underlying model to be more advantageous to current application needs.)
1) Projective Relations: To interpret and produce qualitative spatial relations against the underlying quantitative model,
we first require concrete definitions of projective relations like
left and right. We apply a simple model which was empirically
validated with 25 German speakers in an earlier study with a
robot perception system [24].
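A minimal sketch of such a projective relation operator is given below, reusing the hypothetical Place objects from the earlier Route Graph sketch; the simple angular-sector acceptance model and the 45-degree half-angle are assumptions of the sketch, not the empirically validated model of [24].

    import math

    def projective_region(places, relatum, origin, relation,
                          half_angle=math.radians(45)):
        """Return the Places that fall in the `relation` region (e.g. 'left',
        'right') of `relatum`, as seen from `origin`."""
        # reference direction: from the origin towards the relatum
        ref = math.atan2(relatum.y - origin.y, relatum.x - origin.x)
        # 'left' is 90 degrees counter-clockwise from the reference direction
        offsets = {"left": math.pi / 2, "right": -math.pi / 2,
                   "front": 0.0, "behind": math.pi}
        centre = ref + offsets[relation]
        result = []
        for p in places:
            if p is relatum or p is origin:
                continue
            bearing = math.atan2(p.y - relatum.y, p.x - relatum.x)
            # smallest signed angular difference to the sector centre
            diff = math.atan2(math.sin(bearing - centre), math.cos(bearing - centre))
            if abs(diff) <= half_angle:
                result.append(p)
        return result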
2) Ordering Relations: Computational definitions of spatiotemporal relations such as before and after make use of partial
orderings of Places on a given Route. In the cases of places
which lie directly on a proper route, this ordering becomes
trivial and is given directly through the topological structuring
of the graph. For places which lie “off the route”, a projection
must first be made of the place onto its connecting node on
the RG, after which standard partial orderings can be applied.
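The following sketch illustrates this ordering scheme under the same assumptions; the Euclidean nearest-neighbour projection of off-route places is a simplification of the projection step described above.

    def route_position(route, place):
        """Index of `place` along `route` (a list of Places). Off-route places
        are first projected onto the nearest route Place (plain Euclidean
        nearest neighbour, an assumption of this sketch)."""
        if place in route:
            return route.index(place)
        nearest = min(route, key=lambda r: (r.x - place.x) ** 2 + (r.y - place.y) ** 2)
        return route.index(nearest)

    def is_before(route, a, b):
        """True if Place a is encountered before Place b along the route."""
        return route_position(route, a) < route_position(route, b)

    def is_after(route, a, b):
        return route_position(route, a) > route_position(route, b)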
3) The Abstraction Process: Before progressing, it should
be noted that the abstract Topological level cannot simply be derived from the Voronoi graph structure. This is because key to the use of the abstract topological layer is knowledge of the types of rooms and junctions covered by the graph. Such classification tasks require either the use of visual room identification techniques based on heuristics of form and function, or in-advance annotations or dialogues with users.
C. The Concept Layer
The Concept Layer comprises a conceptual ontology which serves three functions. Firstly, it provides a framework in which all topological entities, such as spatial relations, places, and segments, can be placed. Secondly, it delivers an ontologically described structuring of the physical entities in the robot's environment that allows navigation space entities to be related to entities in other domain application models, e.g., user models. Finally, and perhaps most importantly, it provides a semantics for communication with other agents in the environment (artificial or human).
Rather than taking an ad-hoc approach to modeling the agent's concepts, we have leveraged existing work in formal ontology and knowledge engineering. From an engineering perspective, Formal Upper Conceptual Ontologies provide structuring principles for knowledge-based information systems. While we accept that such symbolic reasoning systems should be kept quite separate from low-level robotic control, we argue that the non-reactive nature of communication requires the presence of such symbolic structuring at higher cognitive levels. By conceptual ontologies we refer to such efforts as the Suggested Upper Merged Ontology (SUMO) [16], one of three starter documents submitted to the IEEE for a Standard Upper Ontology (SUO), or the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE) [15]. Furthermore, in recent
years there have been a number of efforts to provide concrete
treatments of space within Upper Ontologies by extending the
traditional conceptual view of space (mereology, topology)
with Geographic Information Systems, and Qualitative Spatial
Reasoning.
As upper conceptual ontology for the Navigation Space
model we have chosen a Description Logic fragment of
DOLCE [15]. While DOLCE gives a well-defined upper
ontology, issues such as the structuring of topological relations
are left to instantiated domain ontologies. We instantiated
the ontology with suitable classes to model the environment
spaces used by users in our interactions, e.g., Room, Junction,
categories for the agents themselves, e.g., Person, the spatial
relations that can hold between entities in the navigation
space, e.g., leftOf, and the relations which allow mapping
between the nodes of the RG structure and the conceptual
entities they mark. Figure 1(c) presents an extremely simplified
representation of what the resultant conceptual layer ‘looks
like’.
V. APPLYING THE NAVSPACE MODEL
In this section we place the NavSpace model in the context
of our development scenario and platform. We have developed
NavSpace as a spatial representation system for Rolland,
a semi-autonomous robotic wheelchair platform, which is
intended for use with people of limited cognitive and/or
physical abilities within a rehabilitation or care environment.
Such an application requires that the system be capable of
functioning in and communicating about spatial domains ranging from individual rooms, to complex indoor
environments such as hospitals, and large-scale environments
such as hospital grounds and metropolitan zones.
A. Experimental Platform: Rolland III
The experimental platform is Rolland III, a battery powered
Meyra Champ 1.594 wheelchair. Rolland III is equipped with
two laser scanners mounted at ground level, which allow
for scanning beneath the feet of a human operator. As an
additional sensor device the system provides two incremental
encoders which measure the rotational velocity of the two
independently actuated wheels.
For Local Navigation, we decided to employ a geometric path planner using cubic Bezier curves, since they are able to connect two given points while accounting for a desired curve progression and for directional requirements at the start and end points. The key feature of the algorithm is that, given the current pose of the wheelchair startPose = (x_s, y_s, θ_s) and the desired target goalPose = (x_g, y_g, θ_g), we search the space of cubic Bezier curves for paths that:
• connect p_0 = (x_s, y_s) with p_3 = (x_g, y_g),
• are smoothly aligned with θ_s at p_0 and with θ_g at p_3, and
• are obstacle free, in the sense that the contour of the robot shifted tangentially along the path does not intersect any obstacle point from a given occupancy grid (a sketch of this check follows below).
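The sketch below illustrates these ingredients for a single candidate curve; the control point placement, the disc approximation of the robot contour, and the occupied(x, y, r) grid query are assumptions of the sketch, not the planner's actual implementation.

    import math

    def bezier_control_points(start_pose, goal_pose, d_s, d_g):
        """Control points of a cubic Bezier curve from startPose = (x_s, y_s, th_s)
        to goalPose = (x_g, y_g, th_g).  The inner points are placed at distances
        d_s and d_g along the start and goal headings so the curve is tangent to
        both orientations; d_s and d_g are free parameters of the search."""
        x_s, y_s, th_s = start_pose
        x_g, y_g, th_g = goal_pose
        p0 = (x_s, y_s)
        p1 = (x_s + d_s * math.cos(th_s), y_s + d_s * math.sin(th_s))
        p2 = (x_g - d_g * math.cos(th_g), y_g - d_g * math.sin(th_g))
        p3 = (x_g, y_g)
        return p0, p1, p2, p3

    def bezier_point(p0, p1, p2, p3, t):
        """Point on the cubic Bezier curve at parameter t in [0, 1] (De Casteljau)."""
        lerp = lambda a, b: tuple((1 - t) * ai + t * bi for ai, bi in zip(a, b))
        q0, q1, q2 = lerp(p0, p1), lerp(p1, p2), lerp(p2, p3)
        r0, r1 = lerp(q0, q1), lerp(q1, q2)
        return lerp(r0, r1)

    def path_is_obstacle_free(control_points, occupied, robot_radius, samples=100):
        """Approximate collision check: the robot contour is reduced to a disc of
        robot_radius (a simplification of the tangentially shifted contour), and
        occupied(x, y, r) is a hypothetical occupancy grid query."""
        p0, p1, p2, p3 = control_points
        for i in range(samples + 1):
            x, y = bezier_point(p0, p1, p2, p3, i / samples)
            if occupied(x, y, robot_radius):
                return False
        return True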
For Global Localization, a Monte-Carlo-Localization
method was implemented that is based on the self-locator used
by the German RoboCup Team [17]. To establish hypotheses
of the current location of the wheelchair, particles are drawn
from a pre-computed table that indexes the global environment
by the area of the scan perceived from a certain position.
In addition, only distance measurements resulting from flat surfaces, i.e., from segmented lines, are used to determine the current position; in this way, persons standing nearby are ignored. The number of particles drawn from observations depends on how well the actual sensor measurements match the measurements expected from the particles' positions.
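A schematic of one such localization update is given below; the helper callables (motion model, scan likelihood, sampling from the pre-computed scan-area table) are hypothetical stand-ins for the components described above, and the structure of the update is a sketch rather than the implementation of [17].

    import random

    def mcl_step(particles, odometry, scan, motion_model, scan_likelihood,
                 sample_pose_from_scan, n_from_obs=50):
        """One Monte-Carlo-Localization update.  `particles` is a list of
        (x, y, theta) poses; the three callables stand in for the odometry
        model, the line-based scan matching, and the pre-computed table that
        indexes the environment by the area of the perceived scan."""
        # 1. predict: move every particle according to the odometry reading
        predicted = [motion_model(p, odometry) for p in particles]
        # 2. weight: how well does the measured scan (flat, line-like surfaces
        #    only) match the scan expected from each predicted pose?
        weights = [scan_likelihood(scan, p) for p in predicted]
        # 3. inject hypotheses drawn directly from the observation
        injected = [sample_pose_from_scan(scan) for _ in range(n_from_obs)]
        # 4. resample the remaining particles in proportion to their weights
        kept = random.choices(predicted, weights=weights,
                              k=len(particles) - n_from_obs)
        return kept + injected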
A Safety Layer (Lankenau and Röfer, 2001) analyzes any
driving command with respect to the current obstacle situation.
It then decides whether the given command is to be forwarded
to the actuators or to be replaced by a necessary deceleration
manoeuvre. The key concept in the implementation of the
Safety Layer is the so-called Virtual Sensor. For a given initial orientation (e) of the robot and a pair of translational (v) and rotational (w) speeds, it stores the indices of the cells of an EvidenceGrid that the robot's shape would occupy when initiating an immediate full-stop manoeuvre. A set of precomputed Virtual Sensors for all combinations of (e, v, w) then allows us to check, in real time, the safety of any driving command, whether instructed by the operator via joystick or by an autonomous navigation process.
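The following sketch shows the flavour of this lookup-based check; the discretization of (e, v, w), the stop_cells helper that sweeps the robot shape, and the occupancy threshold are illustrative assumptions of the sketch.

    def build_virtual_sensors(orientations, velocities, rotations, stop_cells):
        """Pre-compute the Virtual Sensor table: for every combination of initial
        orientation e, translational speed v and rotational speed w, store the set
        of EvidenceGrid cell indices swept by the robot shape during an immediate
        full-stop manoeuvre.  stop_cells(e, v, w) is a hypothetical stand-in for
        that offline geometric computation."""
        return {(e, v, w): frozenset(stop_cells(e, v, w))
                for e in orientations for v in velocities for w in rotations}

    def command_is_safe(virtual_sensors, evidence_grid, e, v, w,
                        occupied_threshold=0.5):
        """Real-time safety check: a driving command (v, w) at orientation e is
        forwarded only if none of the cells its stopping manoeuvre would sweep
        are currently occupied; otherwise a deceleration manoeuvre is issued."""
        return all(evidence_grid[cell] < occupied_threshold
                   for cell in virtual_sensors[(e, v, w)])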
B. The Linguistic Semantics Bridge
The variability of spoken language is too great, except
perhaps in the most trivial of applications, simply to map
words and phrases onto a pure conceptual level. Features
such as metaphor and polysemy require a strict separation between non-linguistic levels of knowledge, such as the NavSpace model, and a linguistic semantics, which provides the bridge to natural language
grammars and discourse reasoning. We specify our linguistic
semantics – a model of the surface form of language – with
the Generalized Upper Model (GUM) [1], [2], a so-called
linguistic ontology, which while being related to the types
of conceptual ontologies introduced earlier, models the world
from the underspecified perspective of natural language.
The current version of GUM specifically extends earlier
works to account for the range of spatial conceptual language
encountered in the studies introduced in Section II [1]. To illustrate the model, below is a simplified compact serialization of the linguistic semantics for “the bowl is to the left of the microwave”:
(SL1 / SpatialLocating
:locatum (h1 / lm-bowl)
:direction (l1 / GeneralizedLocation
:hasSpatRel (c1 / LeftProjection)
:relatum (mw / microwave) ))
Mapping back and forth between linguistic semantics representations and the conceptual spatial representation is the
responsibility of the dialogue management system [19]. Conceptual to linguistic relationships have been established to aid
this mapping [5], but in general the mapping is of course context dependent and relies upon other elements of the system's state.
VI. CONCLUSIONS & FUTURE WORK
In this paper we have attempted to give an overview of an
implemented spatial representation for mobile robots which
makes use of an ontologically grounded conceptual layer to
drive a spatial reasoning engine for dialogue based interaction.
The representation includes an abstract topological layer which
performs the job of pruning the environmental search space
of information which is too fine grained for communicative
tasks. A conceptual ontology allowed for more meaningful
representations of the objects in the navigation space, thus
moving beyond simple “annotations”. Current work includes
the formalization of the relationship between the conceptual
and linguistic ontologies, and the preparation of Rolland for
clinical trials and evaluation in a nursing home environment.
While for reasons of space we have had to gloss over many of the precise details of our approach, we believe that formal
conceptual ontology along with qualitative spatial reasoning
will play a key role in bridging the human-robot divide.
REFERENCES
[1] J. Bateman, S. Farrar, and J. Hois, “The generalized upper model 3.0,”
Technical Report – The SFB/TR8 Spatial Cognition Research Centre,
June 2006.
[2] J. A. Bateman, R. Henschel, and F. Rinaldi, “Generalized upper model
2.0: documentation,” GMD/Institut für Integrierte Publikations- und
Informationssysteme, Darmstadt, Germany, Tech. Rep., 1995. [Online].
Available: http://purl.org/net/gum2
[3] G. Bugmann, E. Klein, S. Lauria, and T. Kyriacou, “Corpus-Based
Robotics: A Route Instruction Example,” in In Proceedings of IAS-8,
2004.
[4] M. Denis, “The description of routes: A cognitive approach to the
production of spatial discourse,” Cahiers de Psychologie Cognitive,
vol. 16, pp. 409–458, 1997.
[5] S. Farrar, J. Bateman, and R. J. Ross, “On the Role of Conceptual &
Linguistic Ontologies in Spoken Dialogue Systems,” in The Symposium
on Dialogue Modelling and Generation, submitted.
[6] K. Fischer, What Computer Talk Is and Is not: Human-Computer Conversation as Intercultural Communication. Computational Linguistics,
2006, vol. 17.
[7] C. Freksa, “Using orientation information for qualitative spatial reasoning,” in Theories and Methods of Spatio-Temporal Reasoning in
Geographic Space, ser. LNCS, vol. 639. Springer-Verlag, 1992, pp.
162–178.
[8] B. Krieg-Brückner, T. Röfer, H.-O. Carmesin, and R. Müller, “A
taxonomy of spatial knowledge for navigation and its application
to the bremen autonomous wheelchair,” in Spatial Cognition I
- An interdisciplinary approach to representing and processing
spatial knowledge, C. Freksa, C. Habel, and K. Wender, Eds.
Berlin: Springer, 1998, pp. 373–397. [Online]. Available: http:
//link.springer-ny.com/link/service/series/0558/tocs/t1404.htm
[9] B. Krieg-Brückner, U. Frese, K. Lüttich, C. Mandel, T. Mossakowski,
and R. Ross, “Specification of an Ontology for Route Graphs,” in Spatial
Cognition IV: Reasoning, Action, Interaction. International Conference
Spatial Cognition 2004, Frauenchiemsee, Germany, October 2004, Proceedings, C. Freksa, M. Knauff, B. Krieg-Brückner, B. Nebel, and
T. Barkowsky, Eds. Berlin, Heidelberg: Springer, 2005, pp. 390–412.
[10] B. Krieg-Brückner and H. Shi, “Orientation calculi and route graphs:
Towards semantic representations for route descriptions,” in Proc. International Conference GIScience 2006, Münster, Germany, 2006, (to
appear).
[11] B. Kuipers, “The spatial semantic hierarchy,” Artificial Intelligence,
vol. 19, pp. 191–233, 2000.
[12] S. Lauria, G. Bugmann, T. Kyriacou, and E. Klein, “Mobile robot programming using natural language,” Robotics and Autonomous Systems,
vol. 38, no. 3–4, pp. 171–181, feb 2002.
[13] K. Lüttich, B. Krieg-Brückner, and T. Mossakowski, “Tramway networks as route graphs,” in FORMS/FORMAT 2004 – Formal Methods for
Automation and Safety in Railway and Automotive Systems, E. Schnieder
and G. Tarnai, Eds., 2004, pp. 109–119.
[14] C. Mandel, U. Frese, and T. Röfer, “Robot navigation based on the
mapping of coarse qualitative route descriptions to route graphs,” in
Proceedings of the IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS 2006).
[15] C. Masolo, S. Borgo, A. Gangemi, N. Guarino, A. Oltramari, and
L. Schneider, “The WonderWeb library of foundational ontologies:
preliminary report,” ISTC-CNR, Padova, Italy, WonderWeb Deliverable
D17, August 2002.
[16] A. Pease and I. Niles, “IEEE Standard Upper Ontology: A progress
report,” Knowledge Engineering Review, vol. 17, 2002, special Issue on
Ontologies and Agents.
[17] T. Röfer and M. Jüngel, “Vision-based fast and reactive monte-carlo
localization,” in Proceedings of the IEEE International Conference on
Robotics and Automation (ICRA-2003), 2003, pp. 856–861.
[18] T. Röfer and A. Lankenau, “Route-based robot navigation,” Künstliche
Intelligenz, 2002, Themenheft Spatial Cognition.
[19] H. Shi, R. J. Ross, and J. Bateman, “Formalising control in robust spoken
dialogue systems,” in Software Engineering & Formal Methods 2005,
Germany, Sept 2005.
[20] H. Shi and T. Tenbrink, “Telling Rolland where to go: HRI dialogues
on route navigation,” in WoSLaD Workshop on Spatial Language and
Dialogue, October 23-25, 2005, 2005.
[21] L. Talmy, “How language structures space,” in Spatial Orientation:
Theory, Research, and Application, H. Pick and L. Aredolo, Eds. New
York: Plenum Press, 1983.
[22] T. Tenbrink, “Identifying objects in english and german: Empirical
investigations of spatial contrastive reference,” in WoSLaD Workshop
on Spatial Language and Dialogue, October 23-25, 2005, 2005.
[23] ——, Localising objects and events: Discoursal applicability conditions
for spatiotemporal expressions in English and German. Dissertation.
Bremen: University of Bremen, FB10 Linguistics and Literature, 2005.
[24] T. Tenbrink and R. Moratz, “Group-based spatial reference in linguistic
human-robot interaction,” in Proceedings of EuroCogSci 2003: The
European Cognitive Science Conference, 2003, pp. 325–330.
[25] B. Tversky, “Structures of mental spaces – how people think about
space,” Environment and Behavior, vol. 35, no. 1, pp. 66–80, 2003.
[26] B. Tversky and P. Lee, “How space structures language,” in Spatial Cognition: An interdisciplinary Approach to Representation and Processing
of Spatial Knowledge, ser. Lecture Notes in Artificial Intelligence,
C. Freksa, C. Habel, and K. Wender, Eds., vol. 1404. Springer-Verlag,
1998, pp. 157–175.
[27] S. Werner, B. Krieg-Brückner, and T. Herrmann, “Modelling navigational knowledge by route graphs,” in Spatial Cognition II - Integrating
Abstract Theories, Empirical Studies, Formal Methods, and Practical
Applications, C. Freksa, W. Brauer, C. Habel, and K. Wender, Eds.
Berlin: Springer, 2000, pp. 295–316.
Learning Spatial Concepts from RatSLAM Representations
Ruth Schulz, Michael Milford, David Prasser, Gordon Wyeth and Janet Wiles
School of Information Technology and Electrical Engineering
The University of Queensland
Brisbane, Australia
{ruth, milford, prasserd, wyeth, wiles}@itee.uq.edu.au
Abstract – RatSLAM is a biologically-inspired visual SLAM
and navigation system that has been shown to be effective indoors
and outdoors on real robots. The spatial representations at the
core of RatSLAM, the pose cells, form in a distributed fashion as
the robot learns the environment. The activity in the pose cells,
while being coherent, does not possess strong geometric
properties, making it difficult to use as the basis for
communication with the robot. The pose cells’ companion
representation, the experience map, possesses stronger geometric
properties, but still does not represent the world in a human
readable form. A new system, dubbed RatChat, has been
introduced to enable meaningful communication with the robot.
The intention is to use the “language games” paradigm to build
spatial concepts that can be used as the basis for communication.
This paper describes the first step in the language game
experiments, showing the potential for meaningful categorization
of the spatial representations in RatSLAM. The categorization
performance is compared for both the pose cells and experience
map, with the results showing stronger concept formation using
the more geometrically structured experience map.
Index Terms – Spatial conceptualization, RatSLAM
I. INTRODUCTION
Recent research in mobile robotics has been dominated by
the problem of Simultaneous Localization And Mapping
(SLAM). Roboticists have investigated a wide range of
approaches to solving the problem and have created a number
of probabilistic methods that can perform SLAM under
appropriate assumptions [1-3]. However, this focus on the SLAM problem has meant that other considerations such as map usability have mostly been neglected. Traditional metrics such as
accuracy are starting to be supplanted by concepts such as
map usability and communicability.
Geometric space representations are a natural choice for
many robot mapping and localization methods, but are
dissimilar to the more abstract ways in which humans view
their environments. Humans can conceptualize their
environment in terms of concepts such as rooms: “the
bathroom”, “the kitchen”; or objects: “behind the couch”, or
“on top of the table”. If humans are to easily and naturally
interact with robots, the robots must be able to understand and
process such concepts. For instance, one goal for domestic
robots would be the ability for a human to tell a robot to
“clean the bathroom”, rather than specifying a range of
geometric co-ordinates.
Many algorithms have been developed to solve
components of the mapping and navigation problem such as
SLAM. The most successful simultaneous localization and
mapping algorithms are all probabilistic and can be separated
into three categories: Kalman Filter (KF), Expectation
Maximization (EM), and particle filter algorithms.
Methods based on these algorithms typically produce two
types of map. Landmark or feature maps store the locations of
interesting objects in the environment, such as rocks [3] or
trees [4]. Occupancy grid maps represent an environment with
a high resolution grid, with each grid cell encoding whether
the corresponding location in the environment is free or
occupied [5]. While both types of map can be accurate, in their
raw form they are very different to the abstract spatial
concepts a human uses.
A. RatSLAM
RatSLAM is a biologically inspired, vision-based
mapping and navigation system. The system uses an extended
computational model of the rodent hippocampus to solve the
SLAM problem. The world representations produced by
RatSLAM are coherent but differ in several respects from the
maps produced by probabilistic methods. RatSLAM maps are
locally metric but globally topological, and are not directly
usable for higher level tasks such as goal navigation.
RatSLAM is complemented by an algorithm known as
experience mapping, which creates spatio-temporal-behavioral
maps from the RatSLAM representations. Experience maps
are used to implement methods for exploration, goal
navigation, and adaptation to environment change.
Experiments in indoor and (to a lesser extent) outdoor
environments on two different robot platforms have
demonstrated the system’s ability to autonomously explore,
SLAM, navigate to goals, and adapt to simple environment
changes. The maps are implemented in a fashion that best suits
the robot’s autonomous operations, and are not designed for
effective robot-human communication. This paper shows the
first steps towards building human-friendly representations
based on the RatSLAM system.
B. Spatial Conceptualization
Conceptualization and communication can be investigated
in embodied agents that develop languages for labeling objects
in their environment, or coordinating signaling and motor
behaviors for the completion of a task. Embodied agents have
been implemented as both static and mobile robots, with communication based on synthetic and natural languages.
While the majority of studies investigating communication in
embodied agents involve robot-robot communication, some
involve interaction with humans. By interacting with humans,
robots can be taught to understand natural language for labels
[7] or descriptions [8], or games can be played [9], where the
human is another agent with which the robot can interact. For
a review on communication in embodied agents see [6].
In a language game, communities of agents interact to
evolve a common language. When two agents in the
community have shared attention, they attempt to
communicate about something in their environment. One
agent is the speaker, and produces an utterance for the topic,
while the other agent is the listener, and attempts to
comprehend the utterance. After each game, the agents update
their conceptualizations to improve the chance of successful
communication, eventually resulting in a shared
communication system. The conceptualizations are typically
grounded in the image domain or in the simple sensory
perceptions of the robots.
The representations used by these robots include vision, where scenes are segmented and processed for concepts of color, position, and size [9], as well as minimal proximity sensors and light
sensors [10]. In addition, mobile robots can be taught spatial
descriptions such as left, right, front, and back, and can follow
natural language instructions relating to movement, such as
“move forward” [8]. Representations can be obtained from
occupancy grids, where the robots are provided with routines
that determine how spatial concepts are formed.
Spatial conceptualization in mobile robots has been
limited to commands or descriptions about the current location
of the robot where the robot has been provided with a set of
routines for forming spatial concepts. Other forms of
conceptualization have incorporated the ability of agents to
form new concepts based on experience and interactions with
other agents.
C. RatChat
RatChat is a proposed language system that extends
RatSLAM, investigating spatial conceptualization in mobile
robots based on experiences and interactions with humans and
other robots. The concepts to be formed by RatChat agents are locations in an environment, spatial relationships between locations, and different perspectives for talking about these spatial relationships. RatChat will develop a framework to
enable the robots to form a language in a population of robots,
or with a human, by playing language games. Preliminary
studies have investigated the robot representations, and how
agents can generalize from the training sets to novel meanings
and terms [11].
II. RATSLAM MODEL AND EXPERIENCE MAPPING ALGORITHM
This section briefly describes the RatSLAM model, vision
processing system and experience mapping algorithm – a
more detailed description is given in [12] and [13].
Fig. 1 RatSLAM system structure. The conceptual map can be formed from
representations stored in the pose cell network or the experience map.
A. RatSLAM Model
Fig. 1 shows the core structure of the RatSLAM system.
The robot’s pose is represented by activity in a competitive
attractor neural network called the pose cells. Wheel encoder
information is used to perform path integration by
appropriately shifting the current pose cell activity. Vision
information is converted into a local view (LV) representation
that is associated with the currently active pose cells. If
familiar, the current visual scene also causes activity to be
injected into the particular pose cells associated with the
currently active local view cells.
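To make the description concrete, a heavily simplified sketch of one pose cell update is given below; the competitive attractor dynamics of [12] are reduced here to a bare normalization step, and the data layout (a 3D activity array and a view-to-pose association table) is an assumption of the sketch rather than the actual model.

    import numpy as np

    def pose_cell_update(pose_cells, odom_shift, lv_to_pose, active_views, inject=0.1):
        """Schematic RatSLAM pose cell update: path integration shifts the
        activity packet according to odometry, and familiar local views inject
        energy into their associated pose cells."""
        dx, dy, dth = odom_shift                      # shift in (x', y', theta') cell units
        shifted = np.roll(pose_cells, shift=(dx, dy, dth), axis=(0, 1, 2))
        for view in active_views:                     # inject energy via learnt associations
            for cell_index, strength in lv_to_pose.get(view, []):
                shifted[cell_index] += inject * strength
        return shifted / shifted.sum()                # crude stand-in for attractor dynamics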
B. Vision System
RatSLAM recognizes locations from external sensor
information provided by an appearance based view
recognition system. The role of this system is to create
patterns of activity in the local view cells using the camera
data, which are dependent upon robot location.
Camera information is matched against a growing
database of learnt images, each of which has an associated cell
in the local view. Recognition of an image activates the
appropriate cell allowing the formation of new view to pose
associations and the injection of energy into the pose cells.
Unrecognized camera images are added to the database so that
the system is able to explore a previously unseen environment.
The city block metric is used to compare low resolution
(24 × 18) normalized grayscale images. Learnt views that are
within a threshold distance of the current view have their local
view cells activated in inverse proportion to their distance
from the current view.
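A minimal sketch of this matching step might look as follows; the activation function and the handling of new templates are assumptions of the sketch rather than details taken from the system.

    import numpy as np

    def match_local_view(image, templates, threshold):
        """Appearance-based local view activation: the current 24x18 normalized
        grayscale image is compared with every learnt template using the city
        block (L1) metric; templates within `threshold` activate their local
        view cell in inverse proportion to the distance, and an unmatched image
        is added to the database as a new template."""
        activations = {}
        for idx, template in enumerate(templates):
            dist = np.abs(image - template).sum()          # city block distance
            if dist < threshold:
                activations[idx] = 1.0 / (1.0 + dist)      # inverse to distance
        if not activations:                                # previously unseen view
            templates.append(image)
            activations[len(templates) - 1] = 1.0
        return activations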
Visual ambiguity or redundancy in the environment is
accounted for by the view to pose associations which enable
one view to correspond to multiple physical locations and
vice-versa. Since robot position is filtered by the pose cells,
incorrect recognition and visual ambiguity in the local view is
not catastrophic.
C. Experience Mapping Algorithm
The experience mapping algorithm creates maps from the
representations stored in the local view and pose cell
networks. The premise of the algorithm is the creation and
maintenance of a collection of experiences and inter-experience links. The algorithm creates experiences to
represent certain states of activity in the pose cell and local
view networks. The algorithm also learns behavioral,
temporal, and spatial information in the form of inter-experience links. Fig. 2 shows the relationship between the
experience map and the core RatSLAM representations. A
more detailed discussion of the algorithm is given in [13].
Fig. 2 An experience is associated with certain pose and local view cells, but
exists within the experience map’s own coordinate space.
Experiences have an activity level that is dependent on the
activity peaks in the pose cells and the local view cells. The
most active experience is known as the peak experience.
Learning of new experiences is triggered by the peak
experience’s activity level dropping below a threshold value.
Inter-experience links store temporal, behavioral, and
odometric information about the robot's movement between
experiences. Repeated transitions between experiences result
in an averaging of the odometric information. Discrepancies
between a transition’s odometric information and the linked
experiences' (x, y, θ) coordinates are minimized through a
process of map correction.
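The following sketch gives the flavour of such a correction step; the plain averaging relaxation and the omission of orientation are simplifications of the algorithm in [13].

    def correct_experience_map(experiences, links, alpha=0.5, iterations=10):
        """Schematic map correction: each experience's (x, y) coordinates are
        repeatedly nudged towards the positions implied by the odometric
        information stored in its links, reducing the discrepancy described in
        the text.  `experiences` maps id -> [x, y]; `links` is a list of
        (i, j, dx, dy) giving the stored odometric offset from i to j."""
        for _ in range(iterations):
            for i, j, dx, dy in links:
                xi, yi = experiences[i]
                xj, yj = experiences[j]
                # error between the stored offset and the current layout
                ex = (xi + dx) - xj
                ey = (yi + dy) - yj
                # move both endpoints half-way towards agreement
                experiences[i] = [xi - alpha * 0.5 * ex, yi - alpha * 0.5 * ey]
                experiences[j] = [xj + alpha * 0.5 * ex, yj + alpha * 0.5 * ey]
        return experiences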
As well as learning experience transitions, the algorithm
also monitors transition ‘failures’ in order to adapt to
environment changes. Failures occur when the robot's current
experience switches to an experience other than the one
expected given the robot's current movement behavior. If
enough of these failures occur for a particular transition, that
link is deleted, thereby updating the experience map.
III. A METHOD FOR PRODUCING HUMAN SPATIAL CONCEPTS
In a conceptualization process, agents form concepts from their representations of the world. One form of conceptualization involves interaction with a teacher, in which the different concepts that the agent is to learn are provided. In this process, agents learn to associate input patterns with different concepts.
For RatChat agents, the inputs are pose cells or experiences. The simplest concepts that can be formed, considering the information contained in pose cell and experience representations, are labels for locations in the world. The outputs of the language agents should group input patterns into categories or concepts. The simplest output representation is one-hot encoding, with each concept associated with a single output unit. For locations in an indoor environment, each room or corridor can be associated with an output unit. The conceptualization process for RatChat agents is the association of the input patterns of pose cells and experiences with the output representations of locations. The
agents learn this association using a single layer neural
network shown in Fig. 3, with pose cells or experiences as
inputs and a set of output units referring to the different
locations in the world. The concept associated with a pattern
of pose cells or experiences is the most active output unit. If
the activation of the second most active unit is more than 2/3 of
the activation of the most active unit, the agent is considered
to be ‘uncertain’ about the concept.
Fig. 3 – The fully connected single layer neural network of the language agent
takes pose cells or experiences as inputs and has outputs associated with labels
for locations in the world. The most active output is the label associated with
the active pose cells or experiences.
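A sketch of the labeling rule described above (and in Fig. 3) is given below; the linear single-layer mapping and the 2/3 uncertainty margin follow the text, while the weight layout is an assumption of the sketch.

    import numpy as np

    def classify_location(inputs, weights, labels, margin=2.0 / 3.0):
        """Label the current pose cell / experience activity pattern: a single
        fully connected layer maps the input vector to one output unit per
        location, the most active unit gives the label, and the agent is
        'uncertain' when the runner-up exceeds 2/3 of the winner's activation.
        `weights` is assumed to have shape (n_outputs, n_inputs)."""
        activations = weights @ inputs
        order = np.argsort(activations)[::-1]
        best, second = order[0], order[1]
        if activations[second] > margin * activations[best]:
            return labels[best], "uncertain"
        return labels[best], "certain"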
IV. EXPERIMENTAL SETUP AND PROCEDURE
The experiments used a Pioneer 2 DXE mobile robot with
a forward facing camera to explore a test environment. The
resulting dataset was processed by the RatSLAM model and
experience mapping algorithm in order to provide the input for
the spatial conceptualization method.
A. Environment and Robot
The environment was one floor of a university building
consisting mostly of open-plan offices and corridors. Fig. 4
shows the environment in which the robot operated and the
approximate trajectory of the robot. The robot was manually
driven along a repeated path through the environment. The
robot visited every place on its path at least twice, providing
an opportunity for both learning and recognition.
Fig. 4 Floor plan of the area used for the experiment and the approximate
trajectory of the robot. Shaded areas were impassable by the robot.
A dataset was acquired with camera images logged at 7
Hz and on-board odometry and sonar data at 12 Hz. The data
set contains 20,350 monochrome images covering a period of
almost 40 minutes. The data set was then presented to the
RatSLAM system in a manner indistinguishable from online
operation.
B. Spatial Conceptualization Training and Testing
The conceptualization process was implemented offline
following the construction of the pose cell and experience
maps. The route of the robot was divided into two sections of
about 20 minutes duration each. Each section corresponded to
the robot exploring and then revisiting one half of the building
floor. These two sections were further divided into learning
and recognition phases. The learning phase, in which the robot
first visited an area, was used for the training set, while the
recognition phase, where the robot revisited an area, was used
to test if the concepts had been learnt. This was equivalent to
the areas being labeled on the first circuit of the environment,
and testing whether the robot had learnt these labels on later
circuits.
The language agents were implemented using fully
connected single layer neural networks with pose cells or
experiences as inputs and six output units. In the first study,
the inputs were pose cells. The 35,402 pose cells that were
active at some point in the run were used. In the second study,
the inputs were experiences. The 2384 experiences that were
active at some point in the run were used. The output units
corresponded to the concepts of four rooms and two corridors.
Targets were created with a single active output unit
corresponding to the current location of the robot. Transitions
between rooms and corridors occurred at doorways and turns.
For both studies, there were 403 time steps in the first learning
phase, 233 in the second learning phase, 398 in the first
recognition phase, and 187 in the second recognition phase.
In each of the studies, agents were initially trained on the
first learning phase and tested on the first recognition phase.
Agents were then trained on both the first and second learning
phase and tested on the second recognition phase. For each
training segment, agents were trained for 2000 epochs using
gradient descent with momentum (momentum constant = 0.9)
and an adaptive learning rate (initial learning rate = 0.01,
increasing ratio = 1.05, decreasing ratio = 0.7). The
performance of the agents was tested on the first and second
recognition phase by considering the concepts used by the
agents for each location.
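A sketch of a training loop with these settings is shown below; the squared-error objective on a linear output layer is an assumption of the sketch, as the exact transfer and error functions of the network are not specified in the text.

    import numpy as np

    def train_language_agent(X, T, epochs=2000, lr=0.01, momentum=0.9,
                             lr_inc=1.05, lr_dec=0.7):
        """Gradient descent with momentum and an adaptive learning rate on a
        single fully connected layer.  X is (n_samples, n_inputs); T is the
        one-hot target matrix (n_samples, n_outputs)."""
        W = np.zeros((T.shape[1], X.shape[1]))
        dW = np.zeros_like(W)
        prev_err = np.inf
        for _ in range(epochs):
            Y = X @ W.T                       # forward pass, linear outputs
            err = ((Y - T) ** 2).sum()
            grad = (Y - T).T @ X / len(X)     # gradient of the squared error
            dW = momentum * dW - lr * grad    # momentum update
            W += dW
            lr = lr * lr_inc if err < prev_err else lr * lr_dec  # adaptive rate
            prev_err = err
        return W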
V. RESULTS AND ANALYSIS
This section presents the results of the experiments.
Study 1 involved conceptualizing the representations stored in
the pose cell network, while Study 2 investigated the
conceptualization of the experience map.
A. Study 1 - RatSLAM Conceptualization
The pose cell representation produced by RatSLAM
contained both discontinuities and multiple representations of
the same place, as shown by Fig. 5. The discontinuities were
caused by visually driven re-localization jumps after long
periods of exploration where the robot relied only on wheel
odometry to remain localized. Odometric drift and delayed re-localization created multiple representations, where more than
one group of pose cells represented the same physical
location.
Fig. 5 – Trajectory of the most highly activated pose cell during the
experiment. Thick dashed lines show re-localization jumps driven by
visual input. Each grid square contains 4 × 4 pose cells in the (x', y')
plane.
In the first learning phase of the pose cell
conceptualization process, 96.77% of the instances were
labeled correctly, with 64.32% labeled correctly in the test set
(Fig. 6a, 6b). Errors in the training set were generally on the
borders of the categories. Errors in the test set were mainly in
Room 1, and were due to the different trajectory used in the
learning and recognition phases. The part of the room
incorrectly classified was not visited in the learning phase, and
was not included in the training. Most of these untrained areas
were classified as Corridor 1, as the robot spent most of the
first learning phase there, and the language network was
biased towards categorizing patterns as Corridor 1. For
patterns where there was only a small difference between the
most active and the second most active concepts, the robot
was uncertain which concept was most appropriate. This
generally occurred on the borders of concepts.
In the second learning phase, 98.27% were labeled
correctly, with 73.26% labeled correctly in the test set (Figs.
6c, 6d). In the test set, there were many instances where the
robot was uncertain of the label for the current location. While
most of these were on the borders between concepts, there
were also other locations of uncertainty, particularly in Rooms
3 and 4. Different pose cells were active in these locations
during the recognition and the learning phases. The RatChat
agents were generally good at classifying pose cells into
locations in the world, with errors mainly occurring when
different trajectories were taken during the learning and
recognition phases.
B. Study 2 – Experience Mapping Conceptualization
The experience mapping algorithm produced a map
containing none of the spatial discontinuities of the RatSLAM
representations, and grouped together multiple representations,
as shown in Fig. 7.
In the first learning phase of the experience
conceptualization process, 98.26% of the instances were
labeled correctly, with 90.45% labeled correctly in the test set
(Fig. 8a, 8b). Compared to the pose cell conceptualization, a
greater proportion of the instances in Room 1 were labeled
correctly. The majority of the errors in the test again occurred
in Room 1 where a different trajectory was taken. In this case,
the agent was uncertain about the label, rather than labeling
the instances incorrectly.
Fig. 6 – Conceptualization of the agent using pose cells in (a) the learning phase of section 1, (b) the recognition phase of section 1, (c) the
learning phase of section 2, and (d) the recognition phase of section 2. In the learning phases there are uncertain areas in the Room 1 /
Corridor 1, Room 2 / Corridor 1, and Room 3 / Corridor 2 borders. In the recognition phases there are uncertain areas throughout, including
in all of the rooms and borders between rooms and corridors. In the first recognition phase, part of Room 1 has been labeled as Corridor 1.
Each grid square contains 8 × 8 pose cells in the (x', y') plane.
In the second learning phase, 98.43% were labeled
correctly, with 89.84% labeled correctly in the test set (Fig.
8c, 8d). The errors in this case were due to differences in the
boundaries between rooms and corridors. Fig. 8d shows that
the RatChat agents were successful in clustering the
experiences appropriately, with minimal uncertain errors on
the borders between areas.
Fig. 7 – The experience map. The map is continuous and has a high
degree of correspondence to the spatial arrangement of the environment.
RatChat agents were better at generalizing when using
experiences rather than pose cells. At all locations, except for
those on borders between areas, and those not visited during
the learning phase, the agents were able to appropriately label
their current location.
VI. DISCUSSION
The conceptualization experiments tested both the extent
to which the RatSLAM system’s maps could be classified
using spatial concepts, and the degree to which different
representation types were suitable. The spatial
conceptualization method was able to learn and then recognize
both the RatSLAM pose cell maps and the experience maps.
During the learning phase both representation types performed
well. However, during the later test sets, higher recognition
rates were achieved when using the experience maps than
when using the pose cell maps.
These results demonstrate that phenomena in the pose
cells such as multiple representations can impede the
conceptualization process. The experience mapping algorithm,
which was specifically developed to create maps from the
pose cell representations that could be used for goal
navigation, also appears to create maps more suited to spatial
conceptualization.
The spatial conceptualization process described in this
paper is an implicit one. Abstract spatial concepts, such as
rooms, are identified during training only by entry and exit
times. The robot does not explicitly recognize specific features
that identify a room type, but rather learns the spatial
correspondence between physical place and concept through
the intermediary layer that is the pose cell or experience map.
Fig. 8 – Conceptualization of the agent using experiences in (a) the learning phase of section 1, (b) the recognition phase of section 1, (c) the
learning phase of section 2, and (d) the recognition phase of section 2. In the learning phases there are uncertain areas in the Room 1 /
Corridor 1, Room 2 / Corridor 1, and Room 3 / Corridor 2 borders. In the recognition phases there are uncertain areas in Room 1, and in the
Room 1 / Corridor 1, Room 2 / Corridor 1, and Room 3 / Corridor 2 borders.
VII. CONCLUSION
This paper has investigated a method for learning and
recognizing abstract spatial concepts using the RatSLAM
model and experience mapping algorithm. Using a simple
neural network, the conceptualization process associates pose
cells or experiences with spatial concepts. The experiments
demonstrate that it is possible to learn abstract spatial concepts
such as rooms and corridors and then generalize about these
concepts when revisiting the physical areas.
ACKNOWLEDGMENT
The authors thank the Australian Research Council for
partial funding of the RatSLAM and RatChat projects.
REFERENCES
[1] G. Dissanayake, P. M. Newman, S. Clark, H. Durrant-Whyte, and M. Csorba, "A solution to the simultaneous localisation and map building (SLAM) problem," IEEE Transactions on Robotics and Automation, vol. 17, pp. 229-241, 2001.
[2] S. Thrun, "Probabilistic Algorithms and the Interactive Museum Tour-Guide Robot Minerva," Journal of Robotics Research, vol. 19, pp. 972-999, 2000.
[3] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit, "FastSLAM: A Factored Solution to the Simultaneous Localization and Mapping Problem," presented at AAAI National Conference on Artificial Intelligence, Edmonton, Canada, 2002.
[4] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit, "FastSLAM 2.0: An improved particle filtering algorithm for simultaneous localization and mapping that provably converges," presented at International Joint Conference on Artificial Intelligence, Acapulco, Mexico, 2003.
[5] S. Thrun, "Robotic Mapping: A Survey," Carnegie Mellon University, Pittsburgh, Faculty Report, 2002.
[6] S. Nolfi, "Emergence of communication in embodied agents: co-adapting communicative and non-communicative behaviours," Connection Science, vol. 17, pp. 231-248, 2005.
[7] L. Steels and F. Kaplan, "AIBO's first words. The social learning of language and meaning," Evolution of Communication, vol. 4, pp. 3-32, 2001.
[8] M. Skubic, D. Perzanowski, S. Blisard, A. Schultz, W. Adams, M. Bugajska, and D. Brock, "Spatial language for human-robot dialogs," IEEE Transactions on Systems, Man, and Cybernetics Part C: Applications and Reviews, vol. 34, pp. 154-167, 2004.
[9] L. Steels, The Talking Heads Experiment, vol. I: Words and Meanings. Brussels: Best of Publishing, 1999.
[10] D. Marocco and S. Nolfi, "Emergence of communication in teams of embodied and situated agents," in The Evolution of Language, A. Cangelosi, A. D. M. Smith, and K. Smith, Eds. Singapore: World Scientific Publishing, 2006, pp. 198-205.
[11] R. Schulz, P. Stockwell, M. Wakabayashi, and J. Wiles, "Generalization in languages evolved for mobile robots," in ALIFE X: Proceedings of the Tenth International Conference on the Simulation and Synthesis of Living Systems. MIT Press, 2006, pp. 486-492.
[12] M. J. Milford, G. Wyeth, and D. Prasser, "RatSLAM: A Hippocampal Model for Simultaneous Localization and Mapping," presented at International Conference on Robotics and Automation, New Orleans, USA, 2004.
[13] M. J. Milford, D. Prasser, and G. Wyeth, "Experience Mapping: Producing Spatially Continuous Environment Representations using RatSLAM," presented at Australasian Conference on Robotics and Automation, Sydney, Australia, 2005.
From images to rooms
Olaf Booij, Zoran Zivkovic and Ben Kröse
Intelligent Systems Laboratory,
University of Amsterdam, The Netherlands
Figure 1. An overview of the algorithm.
Abstract— In this paper we start from a set of images obtained
by the robot while it is moving around an environment. We
present a method to automatically group the images into groups
that correspond to convex subspaces in the environment which
are related to the human concept of rooms. Pairwise similarities
between the images are computed using local features extracted
from the images and geometric constraints. The images with
the proposed similarity measure can be seen as a graph, or in a way a base-level dense topological map. From this low level representation the images are grouped using a graph-clustering technique which effectively finds convex spaces in the
environment. The method is tested and evaluated on challenging
data sets acquired in real home environments. The resulting
higher level maps are compared with the maps humans made
based on the same data1 .
I. INTRODUCTION
Mobile robots need an internal representation for localization and navigation. Most current methods for map building
are evaluated using error measures in the geometric domain,
for example covariance ellipses indicating uncertainty in feature and robot locations.
Now that robots are moving into public places and homes,
human beings have to be taken into account. This changes
the task of building a representation of the environment. Semantic information must be added to sensory data. This helps
to enable a better representation (avoid aliasing problems),
and makes it possible to communicate with humans about
its environment. Incorporating these tasks in traditional map
building methods is non-trivial. Moreover, evaluating such methods is hard: user studies are difficult and there is a lack of good evaluation criteria.
One of the more complicated issues is what sort of spatial
concepts should be chosen. For most indoor applications,
objects (and their locations) and rooms seem a natural choice.
Rooms are generally defined as convex spaces, in which
objects reside, and which are connected to other rooms with
’gateways’ [1], [2]. In [3] a hierarchical representation is
used in which at the low level the nodes indicate objects, at
1 The work described in this paper was conducted within the EU FP6-002020 COGNIRON ("The Cognitive Companion") project.
a higher level the nodes represent ’regions’ (parts of space
defined by collections of objects) and at the highest level the
nodes indicate ’locations’ (’rooms’). However, detecting and
localizing objects is not yet a trivial task.
In this paper we consider the common concept of ’rooms’.
We present our appearance based method to automatically
group images obtained by the robot into groups that correspond to convex subspaces in the environment which are
related to the human concept of rooms. The convex subspace
is defined as a part of the environment where the images from
this subspace are similar to each other and not similar to the
other subspaces. The method starts from a set of unlabelled
images. Every image is treated as a node in a graph, where
an edge between two nodes (images) is weighted according
to the similarity between the images. We propose a similarity
measure which considers two images similar if it is possible
to perform 3D reconstruction using these two images [4], [5].
This similarity measure is closely related to the navigation
task since reconstructing the relative positions between two
images also means that it is possible to move the robot from the location where one image is taken to the location where the other image is taken, given that there are no obstacles
in between. We propose a criterion for grouping the images
from convex spaces. The criterion is formalized as a graph cut
problem and we present an efficient approximate solution. In
an (optional) semi-supervised paradigm, we allow the user to
label some of the images. The graph similarity matrix is then
modified to incorporate the user-supplied labels prior to the
graph cut step.
Section II presents a short overview of the related work.
Section III describes our method of constructing a low level
appearance based map. In Section IV it is explained how
to find parts of this map belonging to convex spaces in the
environment. The method used for resampling the datasets
is described in Section V. In Section VI we report the
experiments we did in real home environments. Our approach
is also compared to other similarity measures and standard
k-means clustering. Finally we draw some conclusions and
discuss future work in Section VII.
IROS 2006 workshop: From Sensors to Human Spatial Concepts
Page: 53
II. RELATED WORK
The traditional topological maps represent the environment
as a graph where the nodes represent distinctive locations and
edges describe the transitions [1]. The distinctive locations
can be obtained from the geometric map, e.g. using Voronoi
graphs [6], [7] or from images, for example using fingerprint
representation as in [8]. However, the extracted distinctive
locations are mainly related to the robot navigation task and not to human concepts such as rooms.
Another related task is the task of place or location recognition. To distinguish between different rooms, often visual cues
are used, such as color histograms [9] or visual fingerprints
[10]. A combination of spatial cues and objects detected in
images taken from that room has been used by [11]. Instead
of explicit object detection, also implicit visual cues such as
SIFT features have been used [12]. The more general problem of recognizing scenes from images is addressed in [13]. However, all these approaches assume that human-given labels are provided.
We present here an unsupervised algorithm to group the images into groups that are related to the human concept of rooms. Our approach is similar to [5], where the images are also grouped on the basis of their similarities. A similar approach was used in [14], but for the task of finding object categories from images. In this paper we present a grouping criterion that is more appropriate for detecting convex spaces. Furthermore, in [5] the data is obtained in a highly controlled way by taking the images at uniformly spaced locations. Here we will consider the realistic situation where the data is obtained by simply moving the robot around the environment. The graph clustering will then depend on the robot movements and we propose a resampling scheme to improve the results, see Section V. Finally, we consider a semi-supervised approach where the user provides a number of labels.
III. IMAGE SIMILARITY MEASURE
We start from a set of unlabelled images. In all our
experiments we used omnidirectional images taken by a mobile robot while driving through the environment (see Figure 2 for the image positions of one of the data sets used
for testing). Every image is treated as a node in a graph, where
an edge between two nodes (images) is weighted according to
the similarity between the images. This graph can be seen as a
topological map. Various similarity measures can be used. We
will use here the similarity measure as in [5]. We define that
there is an edge between two nodes in the graph if it is possible
to perform a 3D reconstruction of the local space using visual
features from the two corresponding images. We use SIFT features [15] as the automatically detected landmarks. Therefore
an image can be summarized by the landmark positions and
descriptions of their local appearance. The 3D reconstruction
was performed using the 8 point algorithm [16] constrained
to planar camera movement [17] and the RANSAC estimator
was used to be robust to false matches [16]. A big advantage
of such a similarity measure over pure appearance-based
measures is that it also considers geometry [4]. Therefore the
chance is small that images from two different rooms are found similar even though they might be similar in appearance [5].
Fig. 2. Ground floor maps of the two home environments, Home 1 and Home 2. The circles denote the positions of the robot, according to the wheel encoders, from which an image was taken.
From the N images we thus obtain a graph that is described by a set S of N nodes and a symmetric matrix W called the 'similarity matrix'. For each pair of nodes i, j ∈ [1, ..., N], the value of the element W_ij defines the similarity of the nodes. In our case this is equal to 1 if there is a link between the nodes and 0 if there is no link.
Examples of such graphs obtained from real data sets
are given in Figure 4. If there is a non-zero edge in the graph
this also means that if the robot is at one of the connected
nodes (corresponding to one image), it can determine the
relative location of the other node (corresponding to the
other image). If there are no obstacles in between, the robot
can directly navigate from one node to the other. If there
are obstacles, one could rely, for example, on an additional
reactive algorithm for obstacle avoidance using range sensors.
In this sense the graph obtained using the proposed similarity
measure can be seen as a base level dense topological map
that can be used for navigation and localization.
This graph contains, in a natural way, the information about
how the space in an indoor environment is separated by the
walls and other barriers. Images from a convex space, for
example a room, will have many connections between them and
just a few connections to some images that are from another
space, for example a corridor, that is connected with the room
via a narrow passage, for example a door. By clustering the
graph we want to obtain groups of images that belong to a
convex space, for example a room.
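As a rough illustration of how such a similarity graph could be built, the sketch below scores an image pair by matching SIFT features and counting the inliers of a RANSAC epipolar-geometry fit with OpenCV. It is only a simplified stand-in for the pipeline described above (which uses the planar-constrained 8-point algorithm of [16], [17]); the function name, the ratio-test constant and the inlier threshold are illustrative assumptions.

    import cv2
    import numpy as np

    def images_linked(img1, img2, min_inliers=20, ratio=0.8):
        # Detect SIFT landmarks and their local descriptors in both images
        sift = cv2.SIFT_create()
        k1, d1 = sift.detectAndCompute(img1, None)
        k2, d2 = sift.detectAndCompute(img2, None)
        if d1 is None or d2 is None:
            return 0
        # Nearest-neighbour matching with Lowe's ratio test
        knn = cv2.BFMatcher().knnMatch(d1, d2, k=2)
        good = [m[0] for m in knn if len(m) == 2 and m[0].distance < ratio * m[1].distance]
        if len(good) < 8:
            return 0
        p1 = np.float32([k1[m.queryIdx].pt for m in good])
        p2 = np.float32([k2[m.trainIdx].pt for m in good])
        # Robust two-view geometry fit; enough inliers means the edge W_ij = 1
        F, mask = cv2.findFundamentalMat(p1, p2, cv2.FM_RANSAC, 3.0, 0.99)
        if F is None:
            return 0
        return int(mask.sum() >= min_inliers)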
IV. GROUPING IMAGES
Starting from the graph representation we will group the
images by cutting the graph (S, W ) , described above, into K
separate subgraphs {(S1 , W1 )..., (SK , WK )}. If the subgraphs
(clusters) correspond to convex subspaces we expect that there
IROS 2006 workshop: From Sensors to Human Spatial Concepts
Page: 54
will be many links within each cluster and a few between the
clusters. The subgraphs should also be connected graphs. This
is formalized as a graph cut criterion further in this Section.
An efficient approximate solution is also presented.
Note that we assume that the images are recorded at
positions that approximately uniformly sample the available
space. If this is not true, the images from positions close to each other, which are usually very similar, tend to group together
and the resulting clusters depend on the positions where the
images are taken.
A. Grouping criterion
We will start by introducing some graph-theoretic terms.
The degree of the i-th node of a graph (S, W) is defined as the sum of all the edges that start from that node: d_i = Σ_j W_ij. For a subset of nodes S_j ⊂ S, the volume is defined as vol(S_j) = Σ_{i ∈ S_j} d_i; vol(S_j) describes the "strength" of the interconnections within the subset S_j. A subgraph (S_j, W_j) can be "cut out" from the graph (S, W) by cutting a number of edges. The sum of the values of the edges that are cut is called a graph cut:

cut(S_j, S\S_j) = Σ_{i ∈ S_j, k ∈ S\S_j} W_ik    (1)
where S\Sj denotes the set of all nodes except the ones from
S_j. One may cut the base-level graph into q clusters by minimizing the number of cut edges:

(1/q) Σ_{j=1..q} cut(S_j, S\S_j).    (2)
This would mean that the graph is cut at the weakly connected
places, which in our case would usually correspond to natural
segmentation at doors between the rooms or other narrow
passages. However, such a segmentation criterion often leads to undesirable results. For example, if there is an isolated node connected to the rest of the graph by only one link, then (2) will be in favor of cutting only this link. To avoid such artifacts we use a normalized version:

(1/q) Σ_{j=1..q} cut(S_j, S\S_j) / vol(S_j).    (3)
Minimizing this criterion means cutting a minimal number of
connections between the subsets but also choosing larger subsets with strong connections within the subsets. This criterion
naturally groups together convex areas, like a room, and makes
cuts between areas that are weakly connected.
However, the criterion (3) can lead to solutions where the clusters are disconnected graphs. The additional requirement that the subgraphs be connected therefore also needs to be taken into account.
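For concreteness, criterion (3) can be evaluated directly from the similarity matrix for any candidate partition; a minimal numpy sketch (assuming the partition is given as one integer label per node) is:

    import numpy as np

    def normalized_cut_cost(W, labels):
        # Criterion (3): (1/q) * sum over clusters of cut(S_j, S\S_j) / vol(S_j)
        clusters = np.unique(labels)
        cost = 0.0
        for c in clusters:
            inside = labels == c
            cut = W[inside][:, ~inside].sum()   # weight of edges leaving the cluster
            vol = W[inside].sum()               # total degree of the cluster's nodes
            cost += cut / max(vol, 1e-12)
        return cost / len(clusters)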
B. Approximate solution
For completeness of the text we briefly sketch a well-behaved spectral clustering algorithm from [18] that leads to a good approximate solution of the normalized cut criterion (3):
1) Define D to be a diagonal matrix of node degrees, D_ii = d_i, and construct the normalized similarity matrix L = D^{-1/2} W D^{-1/2}.
2) Find x_1, ..., x_K, the K largest eigenvectors of L, and form the matrix X = [x_1, ..., x_K] ∈ R^{N×K}.
3) Renormalize the rows of X to have unit length: X_ij ← X_ij / (Σ_j X_ij^2)^{1/2}.
4) Treat each row of X as a point in R^K and cluster using, for example, the k-means algorithm. Instead of the k-means step, a more principled but more complex approach is used in [19]; a good initial start for the k-means clustering is proposed in [20]. We tested the mentioned algorithms and, in practice, for our type of problems, they lead to similar solutions.
5) The i-th node from S is assigned to cluster j if and only if row i of the matrix X was assigned to cluster j.
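A compact numpy/scipy version of these five steps, using plain k-means with a '++' initialization for step 4 (one of the variants reported above to give similar results), might look as follows:

    import numpy as np
    from scipy.cluster.vq import kmeans2

    def spectral_clusters(W, K):
        d = W.sum(axis=1)                                   # node degrees
        D = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
        L = D @ W @ D                                       # step 1: normalized similarity
        vals, vecs = np.linalg.eigh(L)
        X = vecs[:, np.argsort(vals)[-K:]]                  # step 2: K largest eigenvectors
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        X = X / np.maximum(norms, 1e-12)                    # step 3: unit-length rows
        _, labels = kmeans2(X, K, minit='++')               # step 4: k-means in R^K
        return labels                                       # step 5: one cluster per node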
Although it happens very rarely in practice, the normalized cut criterion (3) can lead to disconnected solutions as mentioned above. A practical split-and-merge solution to ensure that the subgraphs are connected is as follows:
1) Group the images using the normalized cut criterion (and the spectral clustering technique above).
2) Split step: if there are disconnected subgraphs in the result, generate new clusters from the disconnected subgraph components.
3) Merge step: merge the connected clusters that minimize the normalized cut criterion (3).
The final result presents a practical and efficient approximate
solution for our criterion from the previous section. The exact
solution is a NP-hard problem and usually not feasible.
C. Semi-supervised learning
This framework allows the introduction of weak semi-supervision in the form of pairwise constraints between the unlabelled images. Specifically, a user may specify cannot-group or must-group connections between any number of pairs in the data set. Following the paradigm suggested in [21], we modify the graph (S, W) to incorporate this information to assist category learning: entries in the similarity matrix W are set to the maximal (diagonal) value for pairs that ought to be reinforced in the groupings, or set to zero for pairs that ought to be divided.
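A minimal sketch of this constraint mechanism is given below; the pair lists and the use of the maximal diagonal value are assumptions that follow the description above.

    import numpy as np

    def apply_constraints(W, must_group, cannot_group):
        # Return a modified copy of the similarity matrix incorporating user labels
        W = W.copy()
        max_val = W.diagonal().max() if W.diagonal().max() > 0 else 1.0
        for i, j in must_group:        # pairs that ought to be reinforced
            W[i, j] = W[j, i] = max_val
        for i, j in cannot_group:      # pairs that ought to be divided
            W[i, j] = W[j, i] = 0.0
        return W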
V. REALISTIC (NON-UNIFORM) SAMPLED DATA
The images should be recorded at positions that approximately uniformly sample the available space. However, this
is often difficult to perform in practice. For example some
of the data sets we will consider in the experimental section
were recorded by letting the robot record the images at regular
time intervals. For such data the clustering will depend on the
robot movements. An illustration of a non-uniformly sampled
data set is given in Figure 3. The images taken close to each
other depicted in the figure near the transition from ’room2’ to
’corridor’ will usually be similar to each other and therefore
grouped together. The on-line appearance topological mapping
Fig. 3. The top image depicts an example of non-uniformly sampled data
and undesired clustering results. The clustering results can be improved by
detecting such situations and generating a new graph with approximately
uniformly sampled images as depicted below.
[8] will also suffer from the same problem. In this Section
we will use information about Euclidean geometric distances
between the images and present a simple sampling approach
aimed to approximate the uniform sampling of the space and
improve the clustering results.
A. Importance sampling
Let there be N images recorded while the robot was moving around the environment and let x^(i) denote the 2D position where the i-th image was recorded. We can consider the x^(i) as N independent samples from some distribution q. A sample-based approximation is q(x) ≈ (1/N) Σ_{i=1..N} δ(x − x^(i)). Then we can approximate the uniform distribution using importance sampling:

Uniform(x) = c = (c / q(x)) q(x) ≈ Σ_{i=1..N} w̃^(i) δ(x − x^(i))    (4)

where w̃^(i) = w^(i) / Σ_{j=1..N} w^(j) and w^(i) = c / q(x^(i)). One can interpret the w̃^(i) as correction factors that compensate for the fact that we have sampled from the "incorrect" distribution q(x). An approximately uniform sample can now be generated by sampling from the sample-based approximation above. This is equivalent to sampling from the multinomial distribution with coefficients w̃^(i). The density of the original sampling, q(x^(i)), can be estimated for example using a simple k-nearest-neighbor density estimate q(x^(i)) ∼ 1/V, where V = d_k(x^(i))^2 and d_k(x^(i)) is the distance to the k-th nearest neighbor in the Euclidean 2D space. The distances can be obtained from odometry or some SLAM procedure. Alternatively, the distances can be approximated from the images directly. For all our data we used k = 7.
B. Practical algorithm
We start with the original graph (S, W ) and an empty graph
(S resampled , W resampled ). The practical algorithm we will be
using is as follows:
1) Compute the local density estimates and the weight factors w̃^(i).
2) Construct a new graph by sampling N samples from the multinomial distribution with coefficients w̃^(i). The corresponding nodes and links from the original graph (S, W) are added to the new graph (S_resampled, W_resampled).
3) If the new graph (S_resampled, W_resampled) is not connected, continue sampling and adding nodes as in the previous step until it gets connected.
The result is the new graph (S_resampled, W_resampled) where the images come from positions that approximately uniformly sample the available space.
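A sketch of the density estimate and the multinomial resampling step is shown below (k = 7 as in the text; the check that the resampled graph is connected is omitted, and the function name is illustrative):

    import numpy as np

    def resample_indices(positions, k=7, n_samples=None, rng=None):
        # positions: (N, 2) array of image positions from odometry or SLAM
        rng = rng if rng is not None else np.random.default_rng()
        N = len(positions)
        dists = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=2)
        d_k = np.sort(dists, axis=1)[:, k]     # distance to the k-th nearest neighbour
        w = d_k ** 2                           # w(i) ~ 1 / q(x(i))
        w_tilde = w / w.sum()                  # normalized correction factors
        n_samples = n_samples if n_samples is not None else N
        # sample node indices from the multinomial distribution with coefficients w_tilde
        return rng.choice(N, size=n_samples, replace=True, p=w_tilde)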
VI. EXPERIMENTS
The method of finding the convex spaces in an environment
is tested in two real home environments and is compared to the
annotation based on the same sensor data. Our mobile robot
was driven around while taking panoramic images with an
omnidirectional camera; see Figure 2 for ground floor maps
of the environments and the positions where images were
taken. The task of building a map using these image sets
is challenging in a number of ways. First of all the lighting
conditions were not good, much worse than the conditions during previous evaluations in office environments. Also, people
were walking through the environment blocking the view of
the robot. Furthermore, the robot was driven rather randomly
through the rooms, which has the effect that some parts of the
environment are represented by many images while other parts only by a few (see www2.science.uva.nl/sites/cogniron/
for videos acquired by the robot).
The data sets were annotated by an inexperienced person, based solely on the sensor data and the maps shown in Figure 2 but without the robot positions. The person had never visited either of the two houses. For both homes, labels were provided corresponding to the rooms, from which one had to be picked per panoramic image. Between some of the rooms there was no good geometrical boundary separating them, so from most places in one room the other room was still clearly visible and vice versa. This is common in real home environments but makes conceptualization harder.
From both image sets an appearance graph is made using the methods explained in Section III. These graphs are then used as input for the clustering algorithm to find convex spaces in the environment, first with all images and then with a subset
obtained by resampling. The results are compared with the
annotation, to see how well the convex spaces found by
clustering correspond to separate rooms.
A. Results
In Figure 4 it can be seen that the appearance based methods
were quite successful in creating a low level topological map.
All links of the graphs connect nodes originating from images
that were taken close to each other in world coordinates. In
some parts of the graph the nodes are more densely connected
than others. This could be the result of bad image quality
for example caused by changing lighting conditions, but it
IROS 2006 workshop: From Sensors to Human Spatial Concepts
Page: 56
a)
b)
c)
a)
b)
c)
Fig. 4. The clustering results for the Home 1 (above) and the Home 2 (below) data sets. a) The appearance based graph. Each line indicates two matching images. b) The clusters found in the whole dataset. c) Clusters found in the resampled dataset. Note that the odometry data used to draw these figures is not used by the clustering, only for the resampling.
could also be the result of lack of features in that part of the
environment.
Clustering without resampling (see Figure 4b) results in a grouping of the images which is not perfect. As can be seen in Figure 4, some images of Home 1 that were taken from completely different positions are grouped together, and the images taken in the kitchen are split between two clusters. In
Figure 4 it can be clearly seen that some images taken in the
living room are grouped with images taken in the work room.
After the split and merge steps these images are regrouped
with the living room images.
Better clustering results are obtained after resampling the
data, as indicated by Figure 4c. Both data sets are clustered almost perfectly, often cutting the graph at nodes corresponding
to images taken at the doorpost between the rooms. The only
error left is at the bedroom of Home 1, from which images are
grouped with images from the living room. This is probably
caused by the large opening between the two rooms, as can
be seen in Figure 2.
The mismatch between the clusters found by our method and the labels provided by the annotator is made clear by the confusion matrices, see Tables I to IV. Of course the clustered data does not provide a label; each cluster is assigned the label corresponding to the true set with which it has the largest overlap, taking care that no two clusters get the same label. The percentage of correctly clustered images from Home 1 is 85% for the whole dataset and 92% for the resampled set. For Home 2 this was 73% and 83%.
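One way to compute such an accuracy figure, assigning each cluster the annotation label with which it overlaps most while keeping labels distinct (here via the Hungarian algorithm, an implementation choice not specified in the text), is:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def clustering_accuracy(true_labels, cluster_ids):
        rooms = np.unique(true_labels)
        clusters = np.unique(cluster_ids)
        overlap = np.zeros((len(clusters), len(rooms)))
        for ci, c in enumerate(clusters):
            for ri, r in enumerate(rooms):
                overlap[ci, ri] = np.sum((cluster_ids == c) & (true_labels == r))
        # one-to-one cluster-to-label assignment maximizing the total overlap
        row, col = linear_sum_assignment(-overlap)
        return overlap[row, col].sum() / len(true_labels)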
B. Comparison with other clustering methods and similarity
measures
We compare our method with the common k-means clustering and a PCA based similarity measure [22]. We used 10
PCA components and clustered the images using k-means.
We also used the Euclidean distances in the PCA space and
TABLE I
HOME 1, WHOLE DATASET (rows: true label; columns: inferred label)

                 Living r   Bedroom   Kitchen
Living room       0.9681     0.0319    0
Bedroom           0.1832     0.8168    0
Kitchen           0          0.5000    0.5000
TABLE II
HOME 1, RESAMPLED, AVERAGED OVER 10 TRIALS (rows: true label; columns: inferred label)

                 Living r   Bedroom   Kitchen
Living room       1.0000     0         0
Bedroom           0.3014     0.6915    0.0071
Kitchen           0          0.0396    0.9604
TABLE III
HOME 2, WHOLE DATASET (rows: true label; columns: inferred label)

                 Corridor   Living r   Bedroom   Kitchen   Work r
Corridor          0.6812     0.1159     0.2029    0         0
Living room       0          0.5732     0         0         0.4268
Bedroom           0          0          1.0000    0         0
Kitchen           0.0556     0          0         0.9444    0
Work room         0          0          0         0         1.0000
TABLE IV
HOME 2, RESAMPLED (rows: true label; columns: inferred label)

                 Corridor   Living r   Bedroom   Kitchen   Work r
Corridor          0.6344     0.0323     0.2473    0.0860    0
Living room       0.0291     0.8301     0         0         0.1408
Bedroom           0          0          1.0000    0         0
Kitchen           0          0          0         1.0000    0
Work room         0          0          0         0         1.0000
TABLE V
CLUSTERING ACCURACY FOR VARIOUS CLUSTERING METHODS FOR THE HOME 2 DATA SET. PCA PROJECTION WITH 10 COMPONENTS IS USED.

PCA + k-means:                  0.60
PCA + spectral clustering:      0.38
Our method:                     0.73
Our method (with resampling):   0.83
Fig. 5. The clustering accuracy for the semi-supervised case (mean, standard deviation, and max/min values over 100 trials) as a function of the number of randomly chosen ground truth labels per cluster used to simulate user input.
applied spectral clustering. The results were poor compared to
our method. The results also show that this simple appearance
based similarity is not suitable for spectral clustering methods.
C. Semi-supervised clustering
To demonstrate the semi-supervised learning we used a set of labelled points to enforce that the points with the same label
should group together and the points with different labels
should not group together. The set of the labelled points is
randomly chosen and the results for the Home 2 data set are
presented in Figure 5. The graphs show how the accuracy
increases with the amount of labelled images.
VII. CONCLUSION
The experiments show that the proposed clustering method
seems appropriate for finding the convex spaces by grouping
images obtained by the robot. The cuts made in the graphs are
at or close to the doorways dividing two rooms. The convex
spaces thus found in the real home environments correspond
to the concept of “room” as shown by comparing it with
annotated data.
For some cuts the clustering relies on a good sampling of the data, which was clearly visible in the tests in Home 1. In Table I it can be seen that the kitchen is split into two parts. After resampling (Table II), 96% of the images annotated as the kitchen fell in a single cluster.
The proposed methods are very suitable as a basis for
human robot communication about the spaces the robot travels
through. The system will be developed further in this direction,
with the goal to enable a robot to build a higher level map by
listening to and asking a human guide. Our method naturally
allows semi-supervised learning as we demonstrated, using
the input of the guide. If the guide says to the robot that they just entered the kitchen, then this information should be used to build the higher level map. Problems might occur if the user is using different labels for the same space or when the clustering obtained by the robot does not correspond to the human concept. These problems need to be addressed and resolved, for example through dialog with the user. Finally, the algorithms should work online in order to facilitate interaction between the map building process and the guide.

REFERENCES
[1] B. Kuipers, “The spatial semantic hierarchy,” Artif. Intell., vol. 119, no.
1-2, pp. 191–233, 2000.
[2] D. Kortenkamp and T. Weymouth, “Topological mapping for mobile
robots using a combination of sonar and vision sensing," in Proc. of the Twelfth National Conference on Artificial Intelligence, 1994.
[3] A. Tapus, S. Vasudevan, and R. Siegwart, “Towards a multilevel cognitive probabilistic representation of space,” In Proc. of the International
Conference on Human Vision and Electronic Imaging X, part of the
IST-SPIE Symposium on Electronic Imaging, 2005.
[4] F. Schaffalitzky and A. Zisserman, “Multi-view matching for unordered
image sets, or “How do I organize my holiday snaps?”,” in Proceedings
of the 7th European Conference on Computer Vision, Copenhagen,
Denmark, vol. 1. Springer-Verlag, 2002, pp. 414–431.
[5] Z. Zivkovic, B. Bakker, and B. Kröse, “Hierarchical map building using
visual landmarks and geometric constraints,” in Intl. Conf. on Intelligent
Robotics and Systems. Edmonton, Canada: IEEE/RSJ, August 2005.
[6] H. Choset and K. Nagatani, “Topological simultaneous localisation and
mapping: Towards exact localisation without explicit localisation,” IEEE
Transactions on Robotics and Automation, vol. 17, no. 2, pp. 125–137,
April 2001.
[7] P. Beeson, N. K. Jong, and B. Kuipers, "Towards autonomous place detection using the extended Voronoi graph," in Proceedings of the IEEE
International Conference on Robotics and Automation, 2005.
[8] A. Tapus and R. Siegwart, “Incremental robot mapping with fingerprints
of places,” in IROS, 2005.
[9] I. Ulrich and I. Nourbakhsh, “Appearance-based place recognition for
topological localization,” in Proceedings of ICRA 2000, vol. 2, April
2000, pp. 1023 – 1029.
[10] A. Tapus and R. Siegwart, “A cognitive modeling of space using
fingerprints of places for mobile robot navigation,” in ICRA, 2006.
[11] A. Rottmann, O. Martı́nez Mozos, C. Stachniss, and W. Burgard, “Place
classification of indoor environments with mobile robots using boosting,”
in AAAI, 2005.
[12] F. Li and J. Kosecka, “Probabilistic location recognition using reduced
feature set,” in IEEE International Conference on Robotics and Automation, 2006.
[13] A. Torralba, K. Murphy, W. Freeman, and M. Rubin, “Context-based
vision system for place and object recognition,” In Proc. of the Intl.
Conf. on Computer Vision, 2003.
[14] K. Grauman and T. Darrell, “Unsupervised learning of categories from
sets of partially matching image features,” cvpr, vol. 1, pp. 19–25, 2006.
[15] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,”
International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110,
2004.
[16] R. Hartley and A. Zisserman, Multiple view geometry in computer vision,
second edition. Cambridge University Press, 2003.
[17] M. Brooks, L. de Agapito, D. Huynh, and L. Baumela, “Towards
robust metric reconstruction via a dynamic uncalibrated stereo head,”
1998. [Online]. Available: citeseer.csail.mit.edu/brooks98towards.html
[18] A. Ng, M. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an
algorithm,” In Proc. Advances in Neural Information Processing Systems
14, 2001.
[19] S. X. Yu and J. Shi, “Multiclass spectral clustering,” In Proc. International Conference on Computer Vision, pp. 11–17, 2003.
[20] L. Zelnik-Manor and P. Perona, “Self-tuning spectral clustering,” In
Proc. Advances in Neural Information Processing Systems, 2004.
[21] S. Kamvar, D. Klein, and C. Manning, “Spectral learning,” In Proc.of
the International Conference on Artificial Intelligence, 2003.
[22] B. Krose, N. Vlassis, R. Bunschoten, and Y. Motomura, “A probabilistic model for appearance-based robot localization,” Image and Vision
Computing, vol. 6, no. 19, pp. 381–391, 2001.
Robust Models of Object Geometry
Jared Glover, Daniela Rus and Nicholas Roy
Geoff Gordon
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute Of Technology
Cambridge, MA 02139
Email: {jglov,rus,nickroy}@mit.edu
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15289
Email: ggordon+@cs.cmu.edu
Abstract— Precise and accurate models of the world are critical
to autonomous robot operation. Just as robot navigation typically
requires an accurate map of the world, robot manipulation
typically requires accurate models of the objects to be grasped.
However, existing shape modelling and object recognition techniques, while excellent for general purpose object recognition,
do not preserve the global geometric information necessary for
geometric tasks such as manipulation.
We describe an inference algorithm for learning statistical
models of global object geometry from image data, using a
representation that allows us to compute a distribution over
the complete geometry of different objects. Finally, we describe
how learned object models can be used to infer the geometry of
occluded parts of objects.
I. INTRODUCTION
Precise and accurate models of the world are critical to
autonomous robot operation. Just as robot navigation typically
requires an accurate map of the world, robot manipulation
typically requires accurate models of the objects to be grasped.
However, existing shape modelling and object recognition
techniques do not preserve the global geometric information
necessary for good manipulation. Most existing shape recognition techniques are focused on the object appearance, such
as a bitmap, rather than a notion of the object geometry.
One of the principal difficulties in learning object models
is finding a good representation; frequently, the object representation that is most useful for one task is not particularly
useful for another. For instance, there exist sparse feature-based object models such as SIFT features [14] that allow
for very reliable object recognition and tracking but are completely inappropriate for motion planning in that the complete
geometry of the object is not recovered. Techniques such as
shape contexts [1] or spherical harmonics [10] allow general
classes of objects to be learned over time, but again, the
geometry of individual objects is not always preserved.
For autonomous robot operation, we would like to be able
to infer the complete geometry of objects in a manner that is
robust to variances in shape from object to object, and is robust
to perceptual occlusions. First, our algorithm should learn a
description of the complete geometry of the object. In order
to allow a robot to carry out control tasks such as navigation,
manipulation, grasping, etc. it is not sufficient to describe an
object as a set of features; we will need the ability to describe
the complete boundary of the object for computing potential
collisions, grasp closures, etc. Secondly, our algorithm should
allow us to infer the object’s geometry from a series of
independent measurements. Just as in robot mapping, we
would like to integrate a series of measurements in time, in
order to learn the object model. This problem differs from the
mapping problem in that we (usually) will not have to solve
Figure 1. Contour extraction. Left: Raw image (poor white balance in the
camera results in poor colour rendering). Middle: Contours extracted using
pyramid segmentation. Right: Contours extracted using intensity thresholding.
Our goal is to recognize different objects from the same class from the
complete or partial outline of each object.
for the object description and sensor position simultaneously.
Finally, our algorithm should allow us to estimate the geometry
of any occluded parts of the object from a partial view — we
would like to have a robot that can recognize and pick up
a tool without first building a complete and accurate model
of the tool. In robot navigation, the map is usually assumed
to be complete before any autonomous motion planning is
attempted. In a populated, dynamic environment, the assumption of a complete description of the environment is clearly
brittle as new objects will appear regularly that will require
robot actions. Our representation must therefore be able to
recognize objects it may not have seen before, and be able to
make reasonable inferences about the parts of the object that
are occluded.
This paper presents a unified probabilistic approach to modelling object geometry, in particular focusing on the problem of
incomplete data. We draw upon a representation [11] for object
geometry which is invariant to changes in position, scale, and
orientation, and show how to learn object models directly from
sensor data. This representation is robust to sensor noise, as
well as to imperfect segmentation or feature extraction. It also
allows us to capture the variation between shapes in the same
object class in a meaningful way. By learning a probabilistic,
generative model of the complete geometry of an object, we
can make predictions about sections of the object geometry
that are not visible, using an approximate maximum-likelihood
estimation scheme. We demonstrate this approach on some
example images of everyday objects. The results presented in
this paper are restricted to monocular video images, and these
results depend on a particular parameterization of the data in
terms of complex numbers. We have in principle generalized
this approach to three dimensions from data such as stereo
or laser range data, where the corresponding representation is
in terms of quaternions, but there remain many details to be
worked out, such as data association and feature orderings, so
we do not include these results here.
Finally, this paper is not about object recognition; there
are many other (better) methods available for this task [6],
[1]. Rather, we are concerned with modelling the complete
contours of objects for use in navigation and grasp planning
problems. And, while it remains to be seen which representation will ultimately prove most useful for such planning
problems, we hypothesize that probabilistically modelling the
complete geometry of objects will allow a richer and more
robust approach to these problems.
II. SHAPE SPACE
Let us represent a shape z as a set of n points z1 , z2 , . . . zn
in some Euclidean space. We will restrict ourselves to twodimensional points (representing shapes in a plane) such that
zi = (xi , yi ), although extensions to three dimensions are
feasible. We will assume these points are ordered (so that z can
be defined as a vector), and represent a closed contour (such
as the letter “O”, as opposed to the letter “V”). In general, the
shape model can be made de facto invariant to point ordering
if we know correspondences between the boundary points of
any two shapes we wish to compare. Even in the case of
closed contours, however, care must still be taken to choose
the correct starting point. In order to make our model invariant to changes in position and scale, we can normalize the shape so as to have unit length with centroid at the origin; that is,

z' = {z'_i = (x_i − x̄, y_i − ȳ)}    (1)
τ = z' / |z'|    (2)

where |z'| is the L2-norm of z'. We call τ the pre-shape of z. Since τ is a unit vector, the space of all possible pre-shapes of n points is a unit hyper-sphere, S^{2n−3}_*, called pre-shape space^1.
Any pre-shape is a point on the hypersphere, and all rotations of the shape lie on an orbit, O(τ ), of this hypersphere.
If we wish to compare shapes using some distance metric
between them, the spherical geometry of the shape space
requires a geodesic distance rather than Euclidean distance.
Additionally, in order to ensure this distance is invariant to
rotation, we define the distance between two shapes τ1 and τ2
as the smallest distance between their orbits:

d_p[τ_1, τ_2] = inf[d(φ, ψ) : φ ∈ O(τ_1), ψ ∈ O(τ_2)]    (3)
d(φ, ψ) = cos^{−1}(φ · ψ)    (4)
We call dp the Procrustean metric [11] where d(φ, ψ) is the
geodesic distance between φ and ψ. Since the inverse cosine
function is monotonically decreasing over its domain, it is
sufficient to maximize φ·ψ, which is equivalent to minimizing
the sum of squared distances between corresponding points
on φ and ψ (since φ and ψ are unit vectors). For every
rotation of φ there exists a rotation of ψ which will find the
global minimum geodesic distance. Thus, to find the minimum
distance, we need only rotate one pre-shape while holding the
other one fixed. We call the rotated ψ which achieves this optimum, θ_{α*}(τ_2), the orthogonal Procrustes fit of τ_2 onto τ_1, and the angle α* is called the Procrustes fit angle.
Representing the points of τ1 and τ2 in complex coordinates,
which naturally encode rotation in the plane by scalar complex
1 Following [16], the star subscript is added to remind us that S^{2p−3}_* is embedded in R^{2p}, not the usual R^{2p−2}.
Figure 2. Distances on a hyper-sphere. r is a linear approximation to the
geodesic distance ρ.
multiplication, the Procrustes distance minimization can be
solved:

d_p[τ_1, τ_2] = cos^{−1} |τ_2^H τ_1|    (5)
α* = arg(τ_2^H τ_1),    (6)

where τ_2^H is the Hermitian, or complex conjugate, transpose of the complex vector τ_2.
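In complex coordinates, the pre-shape normalization of (1)-(2) and the distance and fit angle of (5)-(6) take only a few lines of numpy. The sketch below assumes the two contours are already resampled to the same number of corresponding points.

    import numpy as np

    def preshape(points):
        # points: (n, 2) contour -> complex pre-shape (centred, unit length)
        z = points[:, 0] + 1j * points[:, 1]
        z = z - z.mean()
        return z / np.linalg.norm(z)

    def procrustes_distance(tau1, tau2):
        # Equations (5)-(6): rotation-invariant distance and the Procrustes fit angle
        inner = np.vdot(tau2, tau1)            # tau2^H tau1
        d = np.arccos(np.clip(np.abs(inner), 0.0, 1.0))
        alpha = np.angle(inner)
        return d, alpha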
III. SHAPE INFERENCE
Given a set of measurements of an object, we would like to
infer a distribution of possible object geometry. We will first
derive an expression for the mean shape, and then derive the
distribution covariance. Finally, we will show how to infer
the full object shape using measurements of partial object
geometry.
1) The Shape Distribution Mean: Let us assume that our
data set consists of a set of measurements {m1 , m2 , . . .} of the
same object, where each measurement mi is a complete (but
noisy) description of the object geometry, a vector of length
p. We can normalize each measurement to be an independent
pre-shape and use these pre-shapes to compute the mean, that
is, the maximum likelihood object shape. If each measurement m_i is normalized to a pre-shape τ_i, then the mean is

µ* = arg inf_{||µ||=1} Σ_i [d_p(τ_i, µ)]^2    (7)
The pre-shape µ∗ is called the Fréchet mean 2 of the samples
τ1 , . . . , τn with respect to the distance measure ‘dp ’. Note
that the mean shape is not trivially the arithmetic mean of
all pre-shapes; dp is non-Euclidean, and we wish to preserve
the constraint that the mean shape has unit length.
Unfortunately, the non-linearity of the Procrustes distance
in the cos−1 (·) term leads to an intractable solution for
the Fréchet mean minimization. Thus, we approximate the
geodesic distance ρ between two pre-shape vectors, φ ∈ O(τ1 )
and ψ ∈ O(τ2 ), as the projection distance, r,
r = sin ρ = sqrt(1 − cos^2 ρ).    (8)
Figure 2 depicts this approximation to the geodesic distance
graphically 3 .
2 More precisely, µ∗ is the Fréchet mean of the random variable ξ drawn
from a distribution having uniform weight on each of the samples τ1 , . . . , τn .
3 The straight-line Euclidean distance, s, is another possible approximation
to the geodesic distance, but it will not enable us to easily solve the mean
shape minimization in closed form.
Using this linear projection distance r in the mean shape minimization yields an expression for the mean shape µ*:

µ* = arg inf_{||µ||=1} Σ_i (1 − |τ_i^H µ|^2)    (9)
   = arg sup_{||µ||=1} Σ_i (τ_i^H µ)^H (τ_i^H µ)    (10)
   = arg sup_{||µ||=1} µ^H (Σ_i τ_i τ_i^H) µ    (11)
   = arg sup_{||µ||=1} µ^H S µ,    (12)

thus, µ* is the complex eigenvector corresponding to the largest eigenvalue of S [5].
Figure 3. Full-contour shape representation. Left to right: original image, an
example extracted contour and the mean of the learned shape class. The top
of the tool is slightly rounded off due both to the small number of points (50)
sampled from the contour to form the preshape, as well as to the smoothing
effect of computing the mean shape of the tool as it opens and closes.
Figure 3 shows an example of the mean shape learned for
the tool class. On the left is an example raw image, in the
middle is an extracted measurement contour, and on the right
is the learned mean of the tool distribution.
2) The Shape Distribution Covariance: With our measurements {m1 , m2 , . . .}, we can now fit a probabilistic model of
shape. In many applications, pre-shape data will be tightly localized around a mean shape; in such cases, the tangent space to the pre-shape hypersphere located at the mean shape
will be a good approximation to the preshape space, as in
figure 4. By linearizing our distribution in this manner, we
take advantage of standard multivariate statistical analysis
techniques, representing our shape distribution as a Gaussian.
In cases where the data is very spread out, one can use a
complex Bingham distribution [5].
Figure 4.
Tangent space distribution
In order to transform our set of pre-shapes into an appropriate tangent space, we first compute a mean shape µ as above^4. We then fit the observed pre-shapes to µ, and project each fitted pre-shape into the tangent space at µ. The tangent space coordinates for pre-shape τ_i are given by

v_i = (I − µµ^H) e^{jθ*_i} τ_i,    (13)

where j^2 = −1 and θ*_i is the optimal Procrustes-matching rotation angle of τ_i onto µ. We can also now apply a dimensionality reduction technique such as PCA to the tangent space data v_1, ..., v_n to get a compact representation of the estimated shape distribution (Figure 4).

4 We use µ to refer to the µ* computed from the optimization in (7) for the remainder of the paper.
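A sketch of the tangent-space projection (13) followed by PCA on the projected data (the fit angle comes from equation (6); the split into real coordinates before the PCA is an implementation assumption):

    import numpy as np

    def tangent_coordinates(taus, mu):
        # Project Procrustes-fitted pre-shapes into the tangent space at mu, eq. (13)
        P = np.eye(len(mu)) - np.outer(mu, mu.conj())        # I - mu mu^H
        V = []
        for tau in taus:
            theta = np.angle(np.vdot(tau, mu))               # optimal rotation of tau onto mu
            V.append(P @ (np.exp(1j * theta) * tau))
        return np.array(V)

    def shape_pca(V, n_components):
        # Principal directions of the tangent-space data in real, vectorized coordinates
        X = np.hstack([V.real, V.imag])
        X = X - X.mean(axis=0)
        _, s, Vt = np.linalg.svd(X, full_matrices=False)
        return Vt[:n_components], (s ** 2) / (len(X) - 1)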
Figure 5. Shape model for a wire-cutter tool: the effects of the first three principal components (eigenvectors) on the mean shape. Top left: mean shape
as in figure 3, top right: a sample shape from the first eigenshape, bottom left:
a sample from the second eigenshape, bottom right: a sample from the third
eigenshape. In each sampled shape, the mean shape is overlaid for comparison.
Figure 5 shows samples drawn from the learned generative
model of the tool class along each eigenvector, in comparison
to the mean shape. The first eigenshape largely captures how
the shape deforms as the tool opens and closes, and the second
and third eigenshapes largely capture how the shape deforms
due to perspective geometry.
IV. SHAPE COMPLETION
We now turn to the problem of estimating the complete
geometry of an object from an observation of part of its
contour. We phrase this as a maximum likelihood estimation
problem, estimating the missing points of a shape with respect
to the Gaussian tangent space shape distribution.
Let us represent a shape as:

z = (z_1, z_2)^T    (14)
where z1 = m contains the p points of our partial observation
of the shape, and z2 contains the n − p unknown points that
complete the shape. Given a shape distribution D on n points
with mean µ and covariance matrix Σ, and given z1 containing
p measurements (p < n) of our shape, our task is to compute
the last n − p points which maximize the joint likelihood,
PD (z). In contrast to previous sections, we will now work
in real, vectorized coordinates (z = (x1 , y1 , ..., xn , yn )T )
rather than in the complex coordinates which were useful for
encoding rotation.
In order for us to transform our completed vector, z =
(z1 , z2 )T , into a pre-shape, we must first normalize translation
and scale. However, this cannot be done without knowing
the last n − p points. Furthermore, the Procrustes minimizing
rotation from z’s pre-shape to µ depends on the missing points,
so any projection into the tangent space (and corresponding
Figure 6. An example of occluded objects, where the right tool occludes the
left tool. Left to right: The original image, the measured contour, the contour
segments that must be completed.
likelihood) will depend in a highly non-linear way on the
location of the missing points. We can, however, compute the
missing points z2 given an orientation and scale. This leads to
an iterative algorithm that holds the orientation and scale fixed,
computes z2 and then computes a new orientation and scale
given the new z2 . The translation term can then be computed
from the completed contour z.
We derive z2 given a fixed orientation θ and scale α in the
following manner. For a complete contour z, we normalize for
orientation and scale using
z' = (1/α) R_θ z    (15)

where R_θ is the rotation matrix of θ. To center z', we then subtract off the centroid:

w = z' − (1/n) C z'    (16)

where C is the 2n × 2n checkerboard matrix,

C = [ 1 0 ··· 1 0
      0 1 ··· 0 1
      ⋮ ⋮  ⋱  ⋮ ⋮
      1 0 ··· 1 0
      0 1 ··· 0 1 ].    (17)
Thus w is the centered pre-shape. Now let M be the tangent space projection matrix:

M = I − µµ^T.    (18)

Then the Mahalanobis distance with respect to D from M w to the origin in the tangent space is:

d_Σ = (M w)^T Σ^{−1} M w.    (19)

Minimizing d_Σ is equivalent to maximizing P_D(·), so we continue by setting ∂d_Σ/∂z_2 equal to zero, and letting

W_1 = M_1 (I_1 − (1/n) C_1) (1/α) R_{θ,1}    (20)
W_2 = M_2 (I_2 − (1/n) C_2) (1/α) R_{θ,2}    (21)

where the subscripts "1" and "2" indicate the left and right sub-matrices of M, I, and C that match the dimensions of z_1 and z_2. This yields the following system of linear equations, which can be solved for the missing data z_2:

(W_1 z_1 + W_2 z_2)^T Σ^{−1} W_2 = 0.    (22)

As described above, equation (22) holds for a specific orientation and scale. We can then use the estimate of z_2 to re-optimize θ and α and iterate. Alternatively, we can simply sample a number of candidate orientations and scales, complete the shape for each sample, and take the completion with the highest likelihood (lowest d_Σ).
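Under the assumptions that the point correspondences are known, that the sub-matrices in (20)-(21) are taken as column blocks of the full transform acting on z, and that orientation θ and scale α are held fixed, one completion step of (22) reduces to a small linear solve. The following numpy sketch (with illustrative helper names, not the authors' implementation) shows that step:

    import numpy as np

    def rotation_block(theta, n):
        # Block-diagonal 2n x 2n rotation acting on (x1, y1, ..., xn, yn)
        c, s = np.cos(theta), np.sin(theta)
        return np.kron(np.eye(n), np.array([[c, -s], [s, c]]))

    def checkerboard(n):
        # The 2n x 2n matrix C of equation (17)
        C = np.zeros((2 * n, 2 * n))
        C[0::2, 0::2] = 1.0
        C[1::2, 1::2] = 1.0
        return C

    def complete_shape(z1, mu, Sigma, n, theta, alpha):
        # Solve (22) for the missing points z2, given a fixed orientation and scale.
        # mu and Sigma are the mean shape and tangent-space covariance in real
        # vectorized coordinates of length 2n; Sigma is assumed invertible here.
        p2 = len(z1)                                    # number of observed coordinates (2p)
        M = np.eye(2 * n) - np.outer(mu, mu)            # tangent-space projector, eq. (18)
        A = M @ (np.eye(2 * n) - checkerboard(n) / n) @ rotation_block(theta, n) / alpha
        W1, W2 = A[:, :p2], A[:, p2:]
        Sinv = np.linalg.inv(Sigma)
        lhs = W2.T @ Sinv @ W2
        rhs = -W2.T @ Sinv @ (W1 @ z1)
        z2, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)  # least squares in case lhs is singular
        return z2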
To design such a sampling algorithm, we must choose a distribution from which to sample orientations and scales. One idea is to match the partial shape, z_1, to the partial mean shape, µ_1, by computing the pre-shapes of z_1 and µ_1 and finding the Procrustes fitting rotation, θ*, from the pre-shape of z_1 onto the pre-shape of µ_1. This angle can then be used as the mean of a von Mises distribution (the circular analog of a Gaussian) from which to sample orientations. Similarly, we can sample scales from a Gaussian with mean α_0, the ratio of scales of the partial shapes z_1 and µ_1, as in

α_0 = ||z_1 − (1/p) C_1 z_1|| / ||µ_1 − (1/p) C_1 µ_1||.    (23)

Any sampling method for shape completion will have a scale bias: completed shapes with smaller scales project to a point closer to the origin in tangent space, and thus have higher likelihood. One way to fix this problem is to solve for z_2 by performing a constrained optimization on d_Σ where the scale of the centered, completed shape vector is constrained to have unit length:

||x' − (1/n) C x'|| = 1.    (24)

This constrained optimization problem can be attacked with the method of Lagrange multipliers, and reduces to the problem of finding the zeros of an (n−p)-th order polynomial in one variable, for which numerical techniques are well known. In preliminary experiments this scale bias has not appeared to produce any obvious errors in shape completion, although more rigorous testing and analysis are needed.

Figure 7. Shape classification of partial contours: (a) partial contour to be completed, (b) completed as fork, (c) completed as spoon, (d) completed as tool. In the second, third and fourth image, the mean shape of the shape class is shown in blue on the left, and the approximate maximum likelihood completion to the partial contour is shown on the right.

V. SHAPE CLASSIFICATION

Given k previously learned shape classes C_1, ..., C_k with shape means µ_1, ..., µ_k and covariance matrices Σ_1, ..., Σ_k, and given a measurement m of an unknown object shape, we can now compute a distribution over shape classes for a measured object: {P(C_i | m) : i = 1 ... k}. The shape classification problem is to find the posterior mode of this
distribution, Ĉ:

Ĉ = arg max_{C_i} P(C_i | m)    (25)
  = arg max_{C_i} P(m | C_i) P(C_i).    (26)

Assuming a uniform prior on C_i, we have the standard maximum likelihood (ML) estimator

Ĉ = Ĉ_ML = arg max_{C_i} P(m | C_i).    (27)
A. Point Correspondences
The difficulty in proceeding with this ML problem lies in corresponding the measurement m with each object model, that is, identifying which points in m correspond to which points in model C_i. One potential drawback to contour-based shape analysis (compared to other representations such as point clouds) is that we require correspondences between all points of any two shapes we wish to compare using the Procrustean metric^5. While this requirement is a major source of the power of our model, it can also be a major difficulty in real-world problems where correspondences can be difficult to identify.

5 We also require that the number of points on each shape be the same.
The correspondence problem also requires determining
whether m is a complete, or partially occluded measurement.
These problems are complicated even further when m is
allowed to be a measurement of more than one object (Figure
6). In this case, we have a segmentation problem as well as
a correspondence problem. Data association and segmentation
are both very challenging problems in and of themselves; as such
they are not problems which we will attempt to fully address
in this paper. We will constrain ourselves to the case where
m is a measurement of a single object. In particular, we will
address the following two cases:
1) m is a complete measurement of a single object.
2) m is a simply, partially occluded measurement of a
single object.
By simply occluded we mean that m consists of one contiguous segment of a full (closed) contour.
For the 2-D contour case, one method for finding point
correspondences is to identify local features of interest on each
contour (for example, based on curvature or local changes in
shape) and match the higher-level features to similar sets of
features on other contours, interpolating correspondences for
the points between features.
For our experiments, we implemented a feature correspondence algorithm for closed contours based on a hierarchical
probabilistic model incorporating local feature match likelihoods together with the global feature shape match likelihood;
this was done primarily to demonstrate that a purely geometric
feature correspondence algorithm was tenable. However, in
practice algorithms incorporating more robust features (such
as SIFT features) are more desirable than purely geometric algorithms. The details of this feature correspondence algorithm
are omitted due to space constraints.
B. Complete Shape Class Likelihood
Given that a measurement m represents a complete contour, the problem of finding the maximum likelihood estimate of each shape class, Ĉ_ML, is relatively straightforward. First,
calculate the pre-shape τ associated with m. Then, for each
class Ci : i = 1 . . . k, find correspondences between τ and the
mean shape µi of model Ci and normalize τ to have the same
number of points as µi , with the points around τ ’s contour
corresponding as closely as possible to the points around
the contour of µi . Next, re-normalize this feature-matched τ
so that it lies in pre-shape space (i.e. center at the origin,
length one), and call this re-normalized pre-shape τ ′ . Finally,
compute the orthogonal Procrustes fit of τ ′ onto µi , giving
orientation θ∗ , and project into tangent space so that
z = M R_{θ*} τ',    (28)

where again R_{θ*} is the rotation matrix of θ*. The probability of this point z in tangent space is then given by the standard Gaussian likelihood,

P(m | C_i) ≈ P(z | C_i) = (1 / sqrt((2π)^{2n} |Σ_i|)) exp(−(1/2) z^T Σ_i^{−1} z).    (29)
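A minimal sketch of this likelihood evaluation for a complete, correspondence-matched measurement, with the class mean and covariance stored in real vectorized coordinates (an assumption of this sketch), is:

    import numpy as np

    def class_log_likelihood(tau, mu, Sigma):
        # Log of (29) after the Procrustes fit and tangent projection of (28).
        # tau, mu: real vectors (x1, y1, ..., xn, yn); Sigma assumed full rank.
        n2 = len(mu)
        tau_c = tau[0::2] + 1j * tau[1::2]
        mu_c = mu[0::2] + 1j * mu[1::2]
        theta = np.angle(np.vdot(tau_c, mu_c))           # fit angle, equation (6)
        rot = np.exp(1j * theta) * tau_c
        fitted = np.empty(n2)
        fitted[0::2], fitted[1::2] = rot.real, rot.imag
        z = (np.eye(n2) - np.outer(mu, mu)) @ fitted     # tangent-space point, eq. (28)
        sign, logdet = np.linalg.slogdet(Sigma)
        quad = z @ np.linalg.solve(Sigma, z)
        return -0.5 * (n2 * np.log(2 * np.pi) + logdet + quad)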
C. Partial Shape Class Likelihood
The most obvious approach to partial shape class likelihood
is to simply complete the missing portion of the partial shape
corresponding to m with respect to each shape class, then
classify the completed shape as above (Figure 7). To make
this concrete, let z = {z1 , z2 } be the completed shape, where
z1 is the partial shape corresponding to measurement m, and
z2 is unknown. Then
P(C_i | z_1) = P(C_i, z_1) / P(z_1) ∝ ∫ P(C_i, z_1, z_2) dz_2    (30)
Rather than marginalize over the hidden data, z2 , we
can approximate with estimate ẑ2 , the output of our shape
completion algorithm, yielding:
P(C_i | z_1) ≈ η · P(z_1, ẑ_2 | C_i)    (31)
where η is a normalizing constant (and can be ignored during
classification), and P (z1 , ẑ2 |Ci ) is the complete shape class
likelihood of the completed shape.
There are several variables in the shape completion algorithm that may influence the accuracy of the completed
shape, e.g., feature correspondences, sample rotations, and
sample scales. As there is some randomness associated with
these variables, they should ideally be marginalized out of
the partial shape class likelihood. Additionally, we may want
to include a prior on the number of points to be completed, since this number must be estimated by a partial shape feature
correspondence algorithm. Such a prior would encode the
common-sense principle that shape completions with fewer
unknown points are to be preferred over shape completions
with a large number of unknown points.
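As one possible instantiation of such a prior, the following sketch adds a geometric penalty on the number of completed points to each class score; the geometric form and its rate parameter are our illustrative choices, not something specified in the paper.

import numpy as np

def penalized_score(log_lik, n_completed, rate=0.1):
    # log P(z1, z2_hat | C_i) + log P(n_completed), with an (assumed) geometric
    # prior P(n) = rate * (1 - rate)^n that prefers shorter completions.
    log_prior = np.log(rate) + n_completed * np.log(1.0 - rate)
    return log_lik + log_prior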
VI. EXPERIMENTAL RESULTS
Videos of a controlled desktop environment were processed
with routines from the Intel Open Source Computer Vision
Library (OpenCV). Contours were extracted from the raw
image frames using intensity thresholding and pyramid segmentation (as in figure 1). Shape models of three
object classes (forks, spoons, and wire-cutter tools) were then
Figure 8. Partial contour shape classification. Top row: the extracted contour is correctly classified as a spoon. Left to right: original image, extracted contour,
shape completion using the tool, fork, and spoon models. Bottom row: the extracted contour is misclassified as a spoon; however, the partial contour is uninformative,
even for humans. Left to right: original image, extracted contour, shape completion using the tool, fork, and spoon models.
generated from a manually-chosen subset of the extracted
contours. An illustration of the tool shape model is shown
in figure 5.
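For reference, a minimal sketch of the contour-extraction step is given below, assuming the modern OpenCV 4.x Python API. The legacy pyramid segmentation routine used alongside thresholding in the original experiments has no direct equivalent in current OpenCV releases, so only the thresholding path is shown, and the threshold and minimum contour length are illustrative values.

import cv2

def extract_contours(frame_gray, thresh=128, min_points=50):
    # Binarize the grayscale frame (dark objects on a light desktop).
    _, mask = cv2.threshold(frame_gray, thresh, 255, cv2.THRESH_BINARY_INV)
    # Trace the outer outlines of the connected regions.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    # Keep only outlines long enough to serve as training contours for a shape model.
    return [c.reshape(-1, 2).astype(float) for c in contours if len(c) >= min_points]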
Complete contour shape classification was then tested on
a subset of the training data, yielding a classification rate of
100% (Figure 3). Next, eight partial contours (three of tools,
three of forks, and two of spoons) were extracted from video
frames of occluded objects. Partial contours were segmented
by hand (where needed), completed with respect to each shape
model, and classified (Figure 8). This process was repeated
while varying N, the number of principal components kept
in each shape model, from one to ten. For N < 5, the
classification rate was 5/8, while for N = 5 and above,
the classification rate was 6/8. Incorporating a prior on the
completion size (as in the previous section) increased the
classification rate to 7/8 for N > 7. The one partial shape
that was consistently misclassified was the end of a fork
handle, which even a human could not be expected to classify correctly more
than one-third of the time (figure 7, lower row).
VII. RELATED WORK
There is a great deal of work on statistical shape modeling,
beginning with the work on landmark data by Kendall [11] and
Bookstein [4] in the 1980s. In recent years, more complex
statistical shape models have arisen, for example, in the
active contours literature [3]. Procrustes analysis predates
statistical shape theory by two decades; algorithms for finding
Procrustean mean shapes [13], [7], [2] were developed long
before the topology of shape spaces was well understood
[12]. In terms of shape classification, shape contexts [1] and
spin images [9] provide robust frameworks for estimating
correspondences between shape features for recognition and
modelling problems. An interesting take on shape completion
using probable object symmetries is presented in [17].
VIII. CONCLUSION
We have presented an approach to geometric object modelling that unifies the problems of modelling geometry, recognizing object shapes, and inferring occluded portions of the
object model. The algorithm depends on a technique known
as Procrustean shape analysis [11]; in particular, we have
derived an expression for the maximum likelihood object
geometry given only a partial observation of the object. We
have shown some preliminary results on everyday objects, and
we plan to extend these results to more complex scenes and
geometric inference problems.
The description of our algorithm given in this paper is
restricted to inferring geometry from two-dimensional images.
However, the technique extends to higher dimensions, and we
plan to demonstrate the same approach to object modelling
on three-dimensional object data, such as from a laser range
finder or stereo camera.
REFERENCES
[1] Serge Belongie, Jitendra Malik, and Jan Puzicha. Shape matching and
object recognition using shape contexts. IEEE Trans. Pattern Analysis
and Machine Intelligence, 24(4):509–522, April 2002.
[2] Jos M. F. Ten Berge. Orthogonal procrustes rotation for two or more
matrices. Psychometrika, 42(2):267–276, June 1977.
[3] A. Blake and M. Isard. Active Contours. Springer-Verlag, 1998.
[4] F.L. Bookstein. A statistical method for biological shape comparisons.
Journal of Theoretical Biology, 107:475–520, 1984.
[5] I. Dryden and K. Mardia. Statistical Shape Analysis. John Wiley and
Sons, 1998.
[6] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by
unsupervised scale-invariant learning. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2003.
[7] J. C. Gower. Generalized procrustes analysis. Psychometrika, 40(1):33–
51, March 1975.
[8] John R. Hurley and Raymond B. Cattell. The procrustes program: Producing direct rotation to test a hypothesized factor structure. Behavioral
Science, 7:258–261, 1962.
[9] Andrew Johnson and Martial Hebert. Using spin images for efficient
object recognition in cluttered 3-d scenes. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 21(5):433 – 449, May 1999.
[10] M. Kazhdan, T. Funkhouser, and S. Rusinkiewicz. Rotation invariant
spherical harmonic representation of 3d shape descriptors, 2003.
[11] D.G. Kendall. Shape manifolds, procrustean metrics, and complex
projective spaces. Bull. London Math Soc., 16:81–121, 1984.
[12] D.G. Kendall, D. Barden, T.K. Carne, and H. Le. Shape and Shape
Theory. John Wiley and Sons, 1999.
[13] Walter Kristof and Bary Wingersky. Generalization of the orthogonal
procrustes rotation procedure to more than two matrices. In Proceedings,
79th Annual Convention, APA, pages 89–90, 1971.
[14] David G. Lowe. Object recognition from local scale-invariant features.
In Proc. of the International Conference on Computer Vision ICCV,
pages 1150–1157, 1999.
[15] Danijel Skočaj and Aleš Leonardis. Weighted and robust incremental
method for subspace learning. In Proceedings of the International
Conference on Computer Vision (ICCV), pages 1494–1501, 2003.
[16] C. G. Small. The statistical theory of shape. Springer, 1996.
[17] S. Thrun and B. Wegbreit. Shape from symmetry. In Proceedings of the
International Conference on Computer Vision (ICCV), Beijing, China,
2005.