
From sensors to human spatial concepts

2007, Robotics and Autonomous Systems

Proceedings of the IROS 2006 workshop "From Sensors to Human Spatial Concepts", October 10, 2006, Beijing, China

Organisers:
Zoran Zivkovic, University of Amsterdam, The Netherlands
Ben Kröse, University of Amsterdam, The Netherlands
Henrik I. Christensen, KTH, Stockholm, Sweden
Roland Siegwart, EPFL, Lausanne, Switzerland
Raja Chatila, LAAS-CNRS, Toulouse, France

Introduction

The aim of the workshop is to bring together researchers who work on space representations appropriate for communicating with humans and on algorithms for relating robot sensor data to human spatial concepts. To facilitate the discussion, a dataset consisting of omnidirectional camera images, laser range readings and robot odometry is also provided.

The commonly used geometric space representations are a natural choice for robot localization and navigation. However, it is hard to communicate with a robot in terms of, for example, 2D (x, y) positions. Common human spatial concepts include "the living room" or "the corridor between the living room and the kitchen"; more general ones such as "a room" or "a corridor"; or more specific, object-related ones such as "behind the TV in the living room". An appropriate space representation is needed for natural communication with the robot. Furthermore, the robot should be able to relate its sensor readings to the human spatial concepts.

Suggested topics for the workshop include, but are not limited to, the following areas:
• space representations appropriate for communicating with humans
• papers applying and analyzing results from cognitive science about human spatial concepts
• space representations suited to cover cognitive requirements of learning, knowledge acquisition and contextual control
• methods for relating robot sensor data to human spatial concepts
• comparison and/or combination of appearance-based and geometric approaches for the task of relating robot sensor data to human spatial concepts

We also encourage the use of the provided dataset: http://staff.science.uva.nl/~zivkovic/FS2HSC/dataset.htm. Selected papers are considered for a Special Issue of the Robotics and Autonomous Systems journal.

Contents

1 Cognitive Maps for Mobile Robots – An Object based Approach
  S. Vasudevan, S. Gächter, M. Berger, R. Siegwart
2 Hierarchical localization by matching vertical lines in omnidirectional images
  A.C. Murillo, C. Sagüés, J.J. Guerrero, T. Goedemé, T. Tuytelaars, L. Van Gool
3 Virtual Sensors for Human Spatial Concepts – Building Detection by an Outdoor Mobile Robot
  M. Persson, T. Duckett, A. Lilienthal
4 Learning the sensory-motor loop with neurons – recognition, association, prediction, decision
  N. Do Huu, R. Chatila
5 Semantic Labeling of Places using Information Extracted from Laser and Vision Sensor Data
  O.M. Mozos, R. Triebel, P. Jensfelt, W. Burgard
6 Towards Stratified Spatial Modeling for Communication and Navigation
  R.J. Ross, C. Mandel, J. Bateman, S. Hui, U. Frese
7 Learning Spatial Concepts from RatSLAM Representations
  M. Milford, R. Schulz, D. Prasser, G. Wyeth, J. Wiles
8 From images to rooms
  O. Booij, Z. Zivkovic, B. Kröse
9 Robust Models of Object Geometry
  J. Glover, G. Gordon, D. Rus, N. Roy
Cognitive Maps for Mobile Robots – An Object based Approach

Shrihari Vasudevan, Stefan Gächter, Marc Berger & Roland Siegwart
Autonomous Systems Laboratory (ASL), Swiss Federal Institute of Technology Zurich (ETHZ), 8092 Zurich, Switzerland
{shrihari.vasudevan, stefan.gachter, r.siegwart}@ieee.org

Abstract - Robots are rapidly evolving from factory workhorses to robot-companions. The future of robots, as our companions, is highly dependent on their abilities to understand, interpret and represent the environment in an efficient and consistent fashion, in a way that is comprehensible to humans. This paper is oriented in this direction. It suggests a hierarchical probabilistic representation of space that is based on objects. A global topological representation of places, with object graphs serving as local maps, is suggested. Experiments on place classification and place recognition are also reported in order to demonstrate the applicability of such a representation in the context of understanding space and thereby performing spatial cognition. Further, relevant results from user studies validating the proposed representation are also reported. Thus the theme of the work is representation for spatial cognition.

Index Terms - Cognitive Spatial Representation, Robot Mapping, Conceptualization of Spaces, Spatial Cognition

I. INTRODUCTION

Robotics today is visibly and very rapidly moving beyond the realm of factory floors. Robots are working their way into our homes in an attempt to fulfill our needs for household servants, pets and other cognitive robot companions. If this "robotic revolution" is to succeed, it will require a very powerful repertoire of skills on the part of the robot. Apart from navigation and manipulation, the robot will have to understand, interpret and represent the environment in an efficient and consistent fashion. It will also have to interact and communicate in human-compatible ways. Each of these is a very hard problem. These problems are made difficult by a multitude of reasons, including the extensive amount of information, the many types of data (multi-modality) and the presence of entities in the environment that change with time, to name a few. Adding to all of these problems are the two simple facts that everything is uncertain and that, at any time, only partial knowledge of the environment is available. The underlying representation of the robot is probably the single most critical component, in that it constitutes the very foundation for everything we might expect the robot to do, including the many complex tasks mentioned above. Thus, the extent to which robots evolve from factory workhorses to robot-companions will in some ways (albeit indirectly) be decided by the way they represent their surroundings. This report is thus dedicated to finding an appropriate representation that will make today's dream tomorrow's reality.

II. RELATED WORK

Robot mapping is a relatively well researched problem, although many very interesting challenges are yet to be solved. An excellent and fairly comprehensive survey of robot mapping is presented in [1]. Robot mapping has traditionally been classified into two broad categories – metric and topological. Metric mapping [2] tries to map the environment using the geometric features present in it. A related concept in this context is that of the relative map [3] – a map state with quantities invariant to rotation and translation of the robot.
Topological mapping [4] usually involves encoding place-related data and information on how to get from one place to another. More recently, a new scheme has become quite popular – hybrid mapping [5, 6]. This kind of mapping typically uses both a metric map for precise navigation in a local space and a global topological map for moving between places. The one similarity between all these representations is that they are navigation-oriented, i.e. all of them are built around the single application of robot navigation. These maps are useful only in the navigation context and fail to encode the semantics of the environment. The focus of this work is to address this deficiency. Several other domains inspire our approach towards addressing this challenge – these include hierarchical representations of space, "high-level" feature extraction (objects, doors etc. are considered "high-level" features, in contrast with lines, corners etc., which are considered "low-level" ones), scene interpretation and the notion of a Cognitive Map.

The work presented here closely resembles those that suggest the notion of a hierarchical representation of space. Ref. [7] suggests one such hierarchy for environment modeling. In [8], Kuipers put forward a "Spatial Semantic Hierarchy" which models space in layers comprising, respectively, sensorimotor, view-based, place-related and metric information. The work [9] probably bears the most similarity to the work presented in this paper. The authors use a naive technique to perform "object recognition" and add the detected objects to an occupancy grid map. The primary difference in the work presented here is that the proposed representation uses objects as the functional basis, i.e. the map is created and grown with the objects perceived.

Typically, humans seem to perceive space in terms of high-level information such as objects, states and descriptions, relationships etc. Thus, a human-compatible representation would have to encode similar information. The work reported here attempts to create such a representation using typical household objects and doors. It also attempts to validate the proposed representation in the context of spatial cognition. For object recognition, a very promising approach, which has also been used in this work, is the one based on SIFT [10]. In our experience, it was found to be a very effective tool for recognizing textured objects. Several works have attempted to model and detect doors. The explored techniques range from modeling/estimating door parameters [11], to modeling the door opening [12], to approaches like [13] that are based on more sophisticated algorithms such as boosting. Ref. [13] also addresses the problem of scene interpretation in the context of spatial cognition. The authors use the AdaBoost algorithm with simple low-level scan features and vision, together with hidden Markov models, to classify places.

This work takes inspiration from the way we believe humans represent space. The term "Cognitive Map" was first introduced by Tolman in a widely cited work [14]. Since then, several works in cognitive psychology and AI / robotics have attempted to understand and conceptualize a cognitive map. Some of the more relevant theories are mentioned in this context. Kuipers, in [15], elicited a conceptual formulation of the cognitive map.
He suggests the existence of five different kinds of information (topological, metric, routes, fixed features and observations), each with its own representation. More recently, Yeap et al. [16] trace the theories that have been put forward to explain the phenomenon of early cognitive mapping. They classify representations as being space based or object based. The approach proposed in this work is primarily an object based one. Some of the most relevant object based approaches include MERCATOR (Davis, 1986) and, more recently, RPLAN (Kortenkamp, 1993, [17]). The former bears the closest resemblance to some of the ideas put forward in this work. It should be emphasized that, among most previously explored approaches classified as "object" based, the works either do not suggest a hierarchical representation or do not use high-level features. In summary, a single unified representation that is multi-resolution, multi-purpose, probabilistic and consistent is still a vision of the future; it is also the aspiration of this work. The approach can be understood as an engineering solution (as applicable to mobile robots) to the general Cognitive Mapping problem. Although primarily object based, the proposed approach attempts to overcome some of the believed limitations of purely object based methods (i.e. those with no notion of the space itself) by incorporating some spatial elements (in this case doors). The kinds of elements that are incorporated will be gradually extended as the work is enhanced.

III. APPROACH

A. Problem Definition

This work is aimed at developing a generic representation of space for mobile robots. Towards this aim, two scientific questions are addressed in this particular work: (1) How can a robot form a high-level probabilistic representation of space? (2) How can a robot understand and reason about a place? The first question directly addresses the problems of high-level feature extraction, mapping and place formation. The second question may be considered the problem of spatial cognition. Together, when appropriately fused, they give rise to the hierarchical representation being sought. This representation must consider and treat information uncertainty in an appropriate manner. Also, in order to understand places, the robot has to be able to conceptualize space, to classify its surroundings and to recognize them when possible.

B. Overview

Figures 1 and 2 respectively show the mapping process and the method used to demonstrate spatial cognition using the created map. In an integrated system the mapping and reasoning processes cannot be totally separated, but they are separated here to facilitate understanding of the individual processes. Subsection C details the perception system, which includes the object recognition and door detection processes. Subsection D specifies how the representation is created (the process depicted in fig. 1) – both the local probabilistic object graphs and the individual places. It also addresses the issue of learning about place categories (kitchens, offices etc.). Subsection E explains how such a representation can be used for spatial cognition (the process depicted in fig. 2) and the manner in which the representation is updated. All of these sections are presented briefly; for more details, the interested reader is referred to another recent report by the authors [18]. The remaining parts of the paper discuss the experiments conducted, the user study and the conclusions drawn thereof.
The main contribution of this paper is an enhancement of the previously reported results through relevant results from user studies that support our representation, as a cognitive validation of the theory.

Fig. 1 The mapping process. High-level feature extraction is implemented as an object recognition system. Place formation is implemented using door detection. Beliefs are represented and appropriately treated. Together, these are encoded to form a hierarchical representation comprising places, connected by doors and themselves represented by local probabilistic object graphs. Concepts about place categories are also learnt.

Fig. 2 The reasoning process for each place. The first step is place classification – the robot uses the objects it perceives to classify the place into one of its known place categories (office, kitchen etc.). The next step is recognizing specific instances of the places it is aware of – place recognition. Accordingly, the map is updated or a new place is added.

C. Perception

This work deals with representing space using high-level features. In particular, two kinds of features are used here – typical household objects and doors. Reliable and robust methods for high-level feature extraction are as yet unavailable. It must be emphasized that the perception component is not the thrust of this work; thus, established or simplified algorithms have been used. For this work, a SIFT based object recognition system was developed (fig. 3) along the lines of [10]. The objects detected are used to represent places as explained in subsection D. Doors are used in this work in the context of place formation. A method of door detection based on line extraction and the application of certain heuristics was used; the sensor of choice was the laser range finder. More details on the perception of objects and doors with regard to this work can be found in [18].

D. Representation

The representation put forward here is a hierarchical one composed of places which are connected to each other through doors and are themselves represented by local probabilistic object graphs (a probabilistic graphical representation composed of objects and the relationships between them). Objects detected in a place are used to form a relative map for that local space. Doors are incorporated into the representation when they are crossed, and they link the different places together. Object graphs were used by the authors in [19]. The problem with that work is that the information encoded in the representation was purely semantic and not "persistent", i.e. not invariant and not re-computable based on the current viewpoint. This work addresses this drawback by drawing on the relative mapping approach in robotics. It suggests the use of a probabilistic relative object graph as the local metric map representation of places. The metric information encoded between objects includes distance and angle measures in 3D space. These measures are invariant to robot translation and rotation in the local space. Such a representation not only encodes the inter-object semantics but also provides a representation that could be used in the context of robot navigation. The robot uses odometry to estimate its pose, which is in turn used towards the creation of the relative object graph. A stereo camera is used to obtain the positions of the various objects in 3D space.
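To make the structure of such a local map concrete, the following minimal sketch builds a relative object graph from a few hypothetical detections. It is an illustration only, not the authors' implementation: the object labels, 3D positions and belief values are invented, and only the translation- and rotation-invariant quantities described above (inter-object distances and the angles formed at each object between pairs of other objects) are stored.

```python
import itertools
import numpy as np

class ObjectNode:
    """One perceived object: label, 3D position (e.g. from stereo), existence belief."""
    def __init__(self, label, position, p_exists, cov=None):
        self.label = label
        self.position = np.asarray(position, dtype=float)       # (x, y, z) in the robot frame
        self.p_exists = p_exists                                  # "existential" belief
        self.cov = cov if cov is not None else np.eye(3) * 0.05   # "precision" belief (assumed)

def build_relative_object_graph(nodes):
    """Return relative (invariant) quantities between objects: pairwise 3D distances
    and, for every triple, the angle at one object between the directions to the
    other two. Both are invariant to robot translation and rotation."""
    edges = {}
    for i, j in itertools.combinations(range(len(nodes)), 2):
        edges[(i, j)] = np.linalg.norm(nodes[i].position - nodes[j].position)
    angles = {}
    for i, j, k in itertools.permutations(range(len(nodes)), 3):
        if j < k:  # angle at node i between the directions towards j and k
            u = nodes[j].position - nodes[i].position
            v = nodes[k].position - nodes[i].position
            cosang = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
            angles[(i, j, k)] = np.arccos(np.clip(cosang, -1.0, 1.0))
    return edges, angles

# Example: three hypothetical detections in one place
place = [ObjectNode("table", (1.0, 0.2, 0.0), 0.9),
         ObjectNode("mug",   (1.1, 0.3, 0.8), 0.7),
         ObjectNode("shelf", (2.5, -0.4, 0.0), 0.8)]
distances, angles = build_relative_object_graph(place)
```

Because only relative quantities are kept, the same graph is obtained regardless of the viewpoint from which the objects were observed, which is what later allows places to be compared by graph matching.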
As mentioned before, the representation is probabilistic. "Existential" beliefs (discrete probability values) are obtained from the perception system for each object that is observed. Simultaneously, "precision" beliefs are maintained in the form of covariance matrices. By representing both kinds of beliefs, the representation can serve in the context of high-level reasoning / scene interpretation and yet be useful for lower-level navigation-related tasks. As mentioned earlier, the relative spatial information encoded includes distance and angle measures in 3D space. These also have associated existence and precision beliefs. Details on the sensor models used and the mathematical formulations for belief computation are given in [18].

Concepts are learnt while creating the representation of the various places. These encode the occurrence statistics (and thus likelihood values) of different objects in different place categories (office, kitchen, etc.). Thus, in a future exploration task, a robot could actually understand its environment and thereby classify its surroundings based on the objects it perceives.

E. Spatial Cognition (Place Classification / Recognition) and Map Update

Place classification is done in an online, incremental fashion, with every perceived object contributing to one or more hypotheses of previously learnt place concepts. Place recognition is done by a graph matching procedure which matches both the nodes and their relationships to identify a node match. The aim is to find the maximal common set of identically configured objects between the places the robot knows (previously mapped) and the one it currently perceives. A map update operation (the internal graph representation is updated) is required both for handling the revisiting of places and for the re-observation of objects while mapping a place. It involves the addition/deletion of nodes and the update of their beliefs. More details can be obtained from [18].

Fig. 3 Object recognition using SIFT features. The left image shows a mug being recognized, the right image a table being recognized. Objects used in this work include cartons of different kinds, a table, a chair, a shelf and a mug.
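As an illustration of the classification step only, here is a minimal sketch that treats the learnt concepts as per-category object occurrence likelihoods and updates a hypothesis over place categories each time an object is perceived. The numbers and the naive-Bayes-style update are assumptions made for the example; the actual belief computation is the one described in [18].

```python
import numpy as np

# Hypothetical occurrence statistics learnt while mapping:
# P(object observed | place category). Values are illustrative only.
OBJECT_LIKELIHOODS = {
    "office":   {"table": 0.9, "chair": 0.9, "shelf": 0.7, "mug": 0.4, "carton": 0.3},
    "kitchen":  {"table": 0.6, "chair": 0.5, "shelf": 0.4, "mug": 0.9, "carton": 0.2},
    "corridor": {"table": 0.1, "chair": 0.2, "shelf": 0.3, "mug": 0.1, "carton": 0.4},
}
MISS_PROB = 0.05  # floor for objects never seen in a category during learning

def classify_place(observed_objects, categories=OBJECT_LIKELIHOODS):
    """Incrementally update a hypothesis over place categories as objects are
    perceived (a naive-Bayes-style stand-in for the belief computation of [18])."""
    log_belief = {c: 0.0 for c in categories}              # uniform prior
    for obj in observed_objects:
        for c, likelihoods in categories.items():
            log_belief[c] += np.log(likelihoods.get(obj, MISS_PROB))
    # normalise to a probability distribution over the categories
    b = np.exp(np.array(list(log_belief.values())) - max(log_belief.values()))
    return dict(zip(log_belief.keys(), b / b.sum()))

print(classify_place(["shelf", "carton", "table"]))   # e.g. favours "office"
```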
IV. EXPERIMENTS

A. System Overview and Scenario

The robot platform shown in fig. 4 was used for this work. The robot is equipped with several sensors, including encoders, a stereo vision system and two back-to-back laser range scanners. The robot was driven across 5 rooms, covering about 20 m in distance. The objects used for the representation (and the names given to them) comprised different cartons (carton, cartridge, xerox, logitech, elrob, tea), a chair (chair), a mug (mug), a shelf (shelf), a table (table) and a book (book). The experiments were conducted in our lab; thus, the places visited included offices and corridors.

Fig. 4 The robot platform that was used for the experiments. The encoders, stereo vision system and laser scanners were used for this work.

B. Mapping

Fig. 5 shows the path of the robot. The objects and doors recognized are shown in the object based map depicted in fig. 6. Finally, fig. 7 illustrates the complete probabilistic object-graph representation formed as a result of the process. The robot performed the mapping process as expected. Objects and doors were recognized and the representation was formed as per the methods described in the previous sections. However, the robot often observed multiple doors at the same place (due to the presence of large cupboards on either side of the door). Further, the robot created multiple occurrences of the corridor, as the topological information encoded between places was not used in the experiments of this work. Also, it did not see an identical set of objects through the corridor, so it was not able to recognize the previously visited corridor. These two issues (fusing of doors and loop closing) will be addressed in subsequent work.

Fig. 5 Map displaying the robot path. The robot traverses 4 rooms, crossing a corridor each time it moves from one room to another. Green/red circles indicate the doors detected. The red circles also serve as the place references for the place explored on crossing the door. The numbers indicate the sequence in which the places were visited.

Fig. 6 Object based map produced as a result of exploring the test environment, with a zoomed-in view. Blue squares are the place references, red circles are the objects and the green stars are the doors.

Fig. 7 Probabilistic object graph representation created as a result of exploring the path shown in fig. 5.

C. Spatial Cognition – Place Classification / Recognition

The robot was made to traverse a previously visited place – SV (office) – and the corridor (refer to fig. 6). The locations of the movable objects (all but the table, the shelf and the door) were changed, so that a significant configuration change of both places was observed. The robot was then made to interpret these places. For the first place, the robot perceived the objects in the sequence shelf – xerox – carton – table – logitech – cartridge. Fig. 8 displays the object map for the "unknown" place. On seeing the first two objects, the robot successfully classified the place as an office. Subsequently, the robot attempted to match this place with its knowledge of the offices it had previously visited. When finally crossing the door, the robot found enough objects (including the door) located in a matching spatial configuration to a place that it had visited before. Thus, at this point, the "unknown" place was recognized as the place SV (office), and the internal map representation of the robot was updated to reflect the changes to the place that the robot had perceived. Fig. 9 displays the updated internal representation of the robot. The corridor was also successfully classified. More details can be found in [18].

Fig. 8 First "unknown" place at the time of place recognition. The configuration of the objects is different from that of the same place in fig. 6. Note: the carton is above the table and the xerox is above the shelf.

Fig. 9 Updated internal representation of the robot after place recognition.

V. USER STUDIES – A COGNITIVE VALIDATION OF THE PROPOSED REPRESENTATION

A. The Study – Objectives and Methods

The broad aim of the study was to validate the proposed representation in a cognitive sense: to verify our approach and to find out what other details (kinds of features / data) the proposed representation could encode. As mentioned before, the complete representation is beyond the scope of this report. Thus, only those results of the survey that are relevant to the aspects of the representation proposed here are quoted. The complete study will be reported in a more appropriate forum. The study was performed with input from 52 people.
The people were chosen from a diverse population spanning different nationalities, backgrounds and occupations. Both genders were appropriately represented.

B. Relevant Results

In the tables that follow, most criteria correspond to their literal (dictionary) meanings. The "function" of a place refers to the typical functionality / purpose associated with a place. "Ground materials" refers to the floor material (wooden / carpeted / ...). "Boundaries" refers to walls, doors, partitions etc. The percentages indicate the number of people, of the total number surveyed, who replied with information corresponding to the particular criterion for the place in consideration.

Survey takers were asked to imagine their presence in a living room, an office and a kitchen. They were then asked to describe what they understood / represented about that place in their minds. Table 1 shows the results obtained. The most common objects identified with an office were desks, chairs, computers etc. Living rooms were better understood in terms of the presence of sofas, armchairs, tables etc., and finally kitchens were typically identified with a cooker, oven, sink, fridge, utensils etc.

TABLE 1  MEANS OF REPRESENTATION OF PLACES

Criteria / Place    Living Room (%)   Office (%)   Kitchen (%)
Objects             98                96           98
Function            13                21           13
Boundaries          71                48           38
People              23                10           8
Size                17                25           35
Ambience            19                33           27
Luminosity          37                37           13
Ground Material     8                 15           12
Smell               –                 –            4

Next, users were taken to three places in our laboratory premises – a "standard" office, a refreshment room and, lastly, a large electronics lab-office. Survey takers were asked to describe each place – what they saw – in as much detail as possible. The typical ways in which survey takers tended to describe these places are conveyed in Table 2.

TABLE 2  MEANS OF DESCRIPTION OF PLACES

Criteria / Place    Refreshment room (%)   Office (%)   Lab (%)
Objects             100                    100          100
Function            52                     90           63
Boundaries          40                     10           15

Finally, users were taken from one room to another and asked whether they believed they were in a new place and the reason for their belief. The results obtained are shown graphically in fig. 10.

Fig. 10 Criteria used to ascertain a change of place.

C. Analysis / Inference

The reason survey takers were first asked to imagine being in a place and then taken to such a place for questioning was to obtain both inputs – the accumulated (through experience) representation of the place and also the one obtained from on-site scene interpretation. It was found that objects constituted a very critical component of both a representation and a description. People seem to understand places in terms of the high-level features (objects) present in them – the underlying philosophy of this work and the direction of our future work as well. It was also found that boundaries (walls / doors / windows) constituted an important component in describing the places, and that the "function" of the place (kitchen – cooking etc.) was an important descriptive element. The last graph suggests that boundary elements (such as doors and walls) and the arrangement of objects are critical to detecting a change of place.
From an implementation perspective, this information seems to validate our choice of using objects as the functional basis of the representation and doors as the links between places. Lastly, we believe that a transition between places occurs when there is a change of "visibility", a term we can now implement in terms of the other important factors that appear in the graph shown in figure 10, including the arrangement of objects, luminosity, size, color and ground materials. Thus, these results not only validate the proposed representation but also provide ideas on future enhancements (the functionality of a place etc. needs to be incorporated) and on how the representation is formed.

VI. CONCLUSIONS & FUTURE WORK

A cognitive probabilistic representation of space based on high-level features was proposed. The representation was experimentally demonstrated. Spatial cognition using such a representation was shown through experiments on place classification and place recognition. The uncertainty in all required aspects of such a representation was appropriately represented and treated. Relevant results from a user study were also reported, validating the representation in a cognitive sense. They also suggest the next steps towards enhancing the proposed representation. Fusing of doors and merging of places are both required to obtain a more appropriate representation of space. On the conceptual front, the suggested representation needs to be made richer, yet lighter and computationally efficient in applications. A more in-depth survey focusing on specific aspects of the representation is also warranted.

ACKNOWLEDGEMENTS

The work described in this paper was conducted within the EU Integrated Project COGNIRON ("The Cognitive Companion") and was funded by the European Commission Division FP6-IST Future and Emerging Technologies under Contract FP6-002020.

REFERENCES

[1] S. Thrun, "Robotic mapping: A survey", in G. Lakemeyer and B. Nebel (eds.), Exploring Artificial Intelligence in the New Millennium, Morgan Kaufmann, 2002.
[2] R. Chatila and J.-P. Laumond, "Position referencing and consistent world modeling for mobile robots", in Proceedings of the IEEE International Conference on Robotics and Automation, 1985.
[3] A. Martinelli, A. Svensson, N. Tomatis and R. Siegwart, "SLAM Based on Quantities Invariant of the Robot's Configuration", IFAC Symposium on Intelligent Autonomous Vehicles, 2004.
[4] H. Choset and K. Nagatani, "Topological simultaneous localization and mapping (SLAM): toward exact localization without explicit localization", IEEE Transactions on Robotics and Automation, vol. 17, no. 2, Apr. 2001.
[5] S. Thrun, "Learning metric-topological maps for indoor mobile robot navigation", Artificial Intelligence, vol. 99, no. 1, pp. 21-71, February 1998.
[6] N. Tomatis, I. Nourbakhsh and R. Siegwart, "Hybrid Simultaneous Localization and Map Building: A Natural Integration of Topological and Metric", Robotics and Autonomous Systems, vol. 44, pp. 3-14, July 2003.
[7] A. Martinelli, A. Tapus, K.O. Arras and R. Siegwart, "Multi-resolution SLAM for Real World Navigation", in Proceedings of the 11th International Symposium of Robotics Research, Siena, Italy, 2003.
[8] B. Kuipers, "The Spatial Semantic Hierarchy", Artificial Intelligence, vol. 119, pp. 191-233, May 2000.
[9] C. Galindo, A. Saffiotti, S. Coradeschi, P. Buschka, J.A. Fernández-Madrigal and J. González, "Multi-Hierarchical Semantic Maps for Mobile Robotics", in Proc. of the IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), pp. 3492-3497, Edmonton, Canada, 2005.
[10] D. G. Lowe, "Distinctive image features from scale-invariant keypoints", International Journal of Computer Vision, vol. 60, no. 2, 2004.
[11] D. Anguelov, D. Koller, E. Parker and S. Thrun, "Detecting and Modeling Doors with Mobile Robots", in Proceedings of the International Conference on Robotics and Automation (ICRA), 2004.
[12] D. Kortenkamp, L. D. Baker and T. Weymouth, "Using Gateways to Build a Route Map", in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 1992.
[13] C. Stachniss, Ó. Martínez-Mozos, A. Rottmann and W. Burgard, "Semantic Labeling of Places", in Proc. of the Int. Symposium of Robotics Research (ISRR), San Francisco, CA, USA, 2005.
[14] E. C. Tolman, "Cognitive maps in rats and men", Psychological Review, vol. 55, pp. 189-208, 1948.
[15] B. J. Kuipers, "The cognitive map: Could it have been any other way?", in Spatial Orientation: Theory, Research, and Application, New York: Plenum Press, 1983, pp. 345-359.
[16] W.K. Yeap and M.E. Jefferies, "On early cognitive mapping", Spatial Cognition and Computation, vol. 2, no. 2, pp. 85-116, 2001.
[17] D. Kortenkamp, "Cognitive maps for mobile robots: A representation for mapping and navigation", PhD Thesis, University of Michigan, 1993.
[18] S. Vasudevan, V. Nguyen and R. Siegwart, "Towards a Cognitive Probabilistic Representation of Space for Mobile Robots", in Proceedings of the IEEE International Conference on Information Acquisition (ICIA), 20-23 August 2006, Shandong, China.
[19] A. Tapus, S. Vasudevan and R. Siegwart, "Towards a Multilevel Cognitive Probabilistic Representation of Space", in Proceedings of the International Conference on Human Vision and Electronic Imaging X, part of the IS&T/SPIE Symposium on Electronic Imaging 2005, 16-20 January 2005, CA, USA.

Hierarchical localization by matching vertical lines in omnidirectional images

A.C. Murillo, C. Sagüés, J.J. Guerrero (DIIS - I3A, University of Zaragoza, Spain; {acm, csagues, jguerrer}@unizar.es)
T. Goedemé, T. Tuytelaars, L. Van Gool (PSI-VISICS, University of Leuven, Belgium; {tgoedeme, tuytelaa, vangool}@esat.kuleuven.be)

Abstract — In this paper we propose a new vision based method for robot localization using an omnidirectional camera. The method has three steps, efficiently combined to deal with big reference image sets; each step evaluates fewer images than the previous one but is more complex and accurate. Given the current uncalibrated image seen by the robot, the hierarchical algorithm makes it possible to obtain both appearance-based (topological) and metric localization. Compared to other similar vision-based localization methods, the one proposed here has the advantage that it obtains accurate metric localization from a minimal reference image set, using the 1D three-view geometry. Moreover, thanks to the linear wide-baseline features used, the method is insensitive to illumination changes and occlusions, while keeping the computational load small. The simplicity of the radial line features used speeds up the process while keeping acceptable accuracy. We show experiments with two omnidirectional image data-sets to evaluate the performance of the method.

Index Terms — topological + metric localization, hierarchical methods, radial line matching, omnidirectional images

I. INTRODUCTION

Usually robots have at their disposal a map of the environment where they have to act (either given to them or automatically acquired). For some robotic tasks, such as navigation or obstacle avoidance, the robot needs to localize itself in order to correct trajectory errors due to, for instance, odometry inaccuracy or slipping. At this point it is clear that we need accurate metric localization information.
However, the same accuracy in localization is not always needed. Sometimes the robot needs only a topological localization (e.g. identifying which room it is in, because the robot is asked for some information about a certain room), and indeed this is more intuitive information for communicating with humans. When working with computer vision, these maps usually consist of a set of more or less organized reference images. In this paper we present a vision based method for hierarchical robot localization, combining topological and metric information, using a Visual Memory (VM). This VM consists of a database of sorted omnidirectional reference images (we know a priori the room they belong to and their relative positions).

Omnidirectional vision has become widespread in recent years and has many well-known advantages, as well as extra difficulties, compared to conventional images. There are many works using all kinds of omnidirectional images, e.g. map-based navigation with images from conic mirrors [1] or localization based on panoramic cylindrical images composed of mosaics of conventional ones [2]. We focus our work on hierarchical localization, an interesting subject because of the possibility of combining different kinds of localization and also because its higher efficiency makes it possible to handle bigger databases of reference images. Localization at different levels of accuracy with omnidirectional images was previously proposed by Gaspar et al. [3]. Their navigation method consists of two steps: a fast but less accurate topological step, useful in long-term navigation, is followed by a more accurate tracking-based step for short distance navigation. One of the main novelties of our proposal is that it reaches metric localization using multiple view geometry constraints. Most previous vision-based localization algorithms give as the final localization the location of one of the images from the reference set, which implies less accuracy or a very high density of the stored reference set. For example, in [4] a hierarchical localization with omnidirectional images is performed. It is based on the Fourier signature and gives an area where the robot is located around reference images of one environment. It is computationally efficient, but it seems to require a very dense database to achieve precise localization. With the metric localization we propose, the density of the stored images is only restricted by the wide-baseline matching that the local features used are able to deal with.

Our method performs a hierarchical localization in three steps, each one more accurate than the previous but also computationally more expensive. From the first two steps we get the topological localization (the room where we are), using a method inspired by the work of Grauman et al. [5] for object recognition using pyramid matching kernels. In the final, third step, we go on to obtain a metric localization relative to the VM. For that we use local feature matches in three views to robustly estimate a 1D trifocal tensor. The relative location between views can be obtained with an algorithm based on the 1D trifocal tensor for omnidirectional images [6]. The features used are scene vertical lines. They are projected in the omni-images as radial lines, supposing a vertical camera axis, and allow us to estimate the center of projection in the image. 1D projective cameras have been previously used, e.g.,
for self calibration [7], and the 1D trifocal tensor, which explains the geometry of three such camera views, was presented in [8]. Correcting radial distortion is another recent application of the 1D tensor for omnidirectional images [9]. The line features are detailed in section II and the steps of the localization method are explained in section III. In section IV we present several experiments proving the effectiveness of the different steps of our method with real images.

II. LINES AND THEIR DESCRIPTORS

The number of features proposed for matching has increased greatly over the last few years, and SIFT [10] has become very popular. Yet, in this work, we have chosen vertical lines with their support regions as features. These lines show several advantages when working with omnidirectional images (e.g. robustness against partial occlusion, easy and fast extraction, and more invariance to affine geometric deformations). Each line is described by a set of descriptors that characterize it in the most discriminant way possible, although it is necessary to find a balance between invariance and discriminative power, as the more invariant the descriptors are, the less discriminant they become. In this section, we first explain the line extraction process (section II-A) and then the kinds of descriptors used to characterize the lines (section II-B).

A. Line Extraction

In this work we use lines and their Line Support Regions (LSR) as primitives for the matching. They are extracted using our implementation of the Burns algorithm [11]. Firstly, this extraction algorithm computes the image brightness gradient and then segments the image into line support regions, which consist of adjacent pixels with similar gradient direction and a gradient magnitude higher than a threshold. Secondly, a line is fitted to each of those regions using a model of brightness variation over the LSR [12]. Some extracted LSRs can be seen in Fig. 1. As mentioned before, we use only vertical lines. We suppose that the optical axis of the camera is perpendicular to the floor and that the motion occurs on a plane parallel to the floor. Under these conditions, these lines are the only ones that always keep their straightness, being projected as radial lines in the image. Therefore, they are quite easy to find, and from them we can automatically obtain the center of projection, an important parameter for calibrating this kind of image. We estimate it with a simple RANSAC-based algorithm that checks where the radial lines are pointing. As can be seen in Fig. 1, the real center of projection does not coincide with the center of the image, and the more accurately we find its real coordinates, the better we can later estimate the multi-view geometry and therefore the robot localization.

B. Line Descriptors

After extracting the image features, the next step is to compute a set of descriptors that characterize them as well as possible. Most descriptors are computed separately over each of the sides into which the LSR is divided (Fig. 1).

1) Geometric Descriptors: We obtain two geometric parameters from the vertical lines. The first is the line direction δ, a boolean indicating whether or not the line points towards the center. This direction is established depending on which side of the line is the darkest region. The second parameter is the line orientation in the image (θ ∈ [0, 2π]).
Fig. 1. Left: detail of some LSRs, divided by their line (the darkest side, on the right, in blue and the lighter one, on the left, in yellow). A red triangle shows the line direction. Right: all the radial lines with their LSRs extracted in one image. The estimated center of projection is marked with a yellow star (∗) and the center of the image with a blue cross (+).

2) Color Descriptors: We have worked with three of the color invariants suggested in [13], based on combinations of the generalized color moments,

    M_{pq}^{abc} = \int\!\!\int_{LSR} x^p \, y^q \, [R(x,y)]^a \, [G(x,y)]^b \, [B(x,y)]^c \, dx \, dy    (1)

where M_{pq}^{abc} is a generalized color moment of order p + q and degree a + b + c, and R(x, y), G(x, y) and B(x, y) are the intensities of the pixel (x, y) in each RGB color band, centralized around its mean. These invariants are grouped in several classes, depending on the scheme chosen to model the photometric transformations. After studying the work where they are defined, and trying to keep a compromise between complexity and discriminative power, we chose as most suitable for us the invariants defined for scale photometric transformations using relations between couples of color bands. The definitions of the 3 chosen descriptors are as follows:

    DS_{RG} = \frac{M_{00}^{110} M_{00}^{000}}{M_{00}^{100} M_{00}^{010}}, \quad
    DS_{RB} = \frac{M_{00}^{101} M_{00}^{000}}{M_{00}^{100} M_{00}^{001}}, \quad
    DS_{GB} = \frac{M_{00}^{011} M_{00}^{000}}{M_{00}^{010} M_{00}^{001}}    (2)

3) Intensity Frequency Descriptors: Finally, we also use as descriptors the first seven coefficients of the Discrete Cosine Transform (DCT) over the intensity signal (I) of the LSR of each line. The DCT is a well known transform in the area of signal processing [14] and is widely used for image compression. It is possible to estimate the number of coefficients necessary to describe a certain percentage of the image content. For example, with our test images, seven coefficients are necessary on average to represent 99% of the intensity signal over an LSR.

We have chosen 22 descriptors, but to deal with big amounts of images it would be good to decrease this number. As is well known, using Principal Component Analysis (PCA) we obtain several linear combinations of the descriptors under study, each with a certain indicated discriminative level. If we look at the weights of each descriptor in the linear combinations that turn out to be most distinctive for our feature sets, we get an idea of which descriptors are more representative. We then reduced our line descriptor vector to 12 elements, which we use for the Pyramidal matching: 3 color descriptors (DS_RG, DS_RB, DS_GB) and 2 frequency descriptors (DCT1, DCT2), computed on each side of the LSR, plus 2 geometric properties (orientation θ and direction δ). However, for the individual line matching we propose to use also the 5 following coefficients of the DCT on both sides of the LSRs, giving 22 descriptors per line, because in that step the descriptor size is not as critical, as we deal with only 3 images.
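A small numpy sketch of eqs. (1)–(2), evaluated over one Line Support Region given as a boolean mask, may help make the color descriptors concrete. It is illustrative only: the moments are approximated by sums over the masked pixels, and the raw band intensities are used directly (the centralization around the band means mentioned above is omitted here, since zero-mean bands would make the first-order denominators vanish).

```python
import numpy as np

def generalized_moment(bands, mask, p, q, a, b, c):
    """M_pq^abc of eq. (1), approximated by a sum over the pixels of a Line
    Support Region (LSR) selected by a boolean mask. bands = (R, G, B) arrays."""
    ys, xs = np.nonzero(mask)
    R, G, B = (band[ys, xs].astype(float) for band in bands)
    return np.sum((xs.astype(float) ** p) * (ys.astype(float) ** q)
                  * (R ** a) * (G ** b) * (B ** c))

def ds_invariants(bands, mask):
    """The three DS colour invariants of eq. (2) for one side of an LSR."""
    M = lambda a, b, c: generalized_moment(bands, mask, 0, 0, a, b, c)
    ds_rg = (M(1, 1, 0) * M(0, 0, 0)) / (M(1, 0, 0) * M(0, 1, 0))
    ds_rb = (M(1, 0, 1) * M(0, 0, 0)) / (M(1, 0, 0) * M(0, 0, 1))
    ds_gb = (M(0, 1, 1) * M(0, 0, 0)) / (M(0, 1, 0) * M(0, 0, 1))
    return ds_rg, ds_rb, ds_gb
```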
III. HIERARCHICAL LOCALIZATION METHOD

As said before, our proposal is a hierarchical robot localization in three steps. With this kind of localization, we can deal with large databases of images in an acceptable time. This is possible because we avoid evaluating all the images in the entire data set in the most computationally expensive steps. Note also that the VM already stores the extracted image features and their descriptors. Even the matches between adjacent images can be pre-computed. The three steps of our method, schematized in Fig. 2, are as follows.

Fig. 2. Diagram of the three steps of the hierarchical localization (global filter, pyramidal matching, and three-view matching with estimation of the 1D radial trifocal tensor).

A. Global Descriptor Filter

In a first step, we compute three color invariant descriptors over all the pixels in the image. The descriptor used is DS, described previously in section II-B.2. We compare all the images in the VM with the current one and discard the images with more than an established difference in those previously normalized descriptors.

B. Pyramidal Matching

This step finds the image most similar to the current one. For this purpose, the set of descriptors of each line is used to implement a pyramid matching kernel [5]. The idea consists of building, for each image, several multi-dimensional histograms (each dimension corresponds to one descriptor), where each line feature occupies one of the histogram bins. The value of each line descriptor is rounded to the histogram resolution, which gives a set of coordinates that indicates the bin corresponding to that line. Several levels of histograms are defined. At each level, the size of the bins is increased by powers of two until all the features fall into one bin. The histograms of each image are stored in a vector (pyramid) ψ with different levels of resolution. The similarity between two images, the current one (c) and one from the visual memory (v), is obtained by finding the intersection of the two pyramids of histograms:

    S(\psi(c), \psi(v)) = \sum_{i=0}^{L} w_i \, N_i(c, v)    (3)

with N_i the number of matches (LSRs that fall in the same bin of the histograms, see Fig. 3) between images c and v at level i of the pyramid, and w_i the weight for the matches at that level, which is the inverse of the current bin size (2^i). This distance is divided by a factor determined by the self-similarity score of each image, in order to avoid giving an advantage to images with bigger sets of features, so the distance obtained is

    S_{cv} = \frac{S(\psi(c), \psi(v))}{\sqrt{S(\psi(c), \psi(c)) \, S(\psi(v), \psi(v))}}    (4)

Fig. 3. Example of Pyramidal Matching, with correspondences at levels 0, 1 and 2. For graphic simplification, a descriptor of 2 dimensions is used.

Notice that the matches found here are not always individual feature-to-feature matches, as we just count how many fall in the same bin. The more levels we check in the pyramid, the bigger the bins are, so the easier it is to get multiple coincidences in the same bin. Once the similarity measure between our current image and the VM images is obtained, we choose the one with the highest S_cv as the most similar. Based on the annotations of this chosen image, we know which room the robot is currently in.
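The following minimal sketch illustrates eqs. (3)–(4) for two sets of already scaled descriptor vectors. It is not the authors' code: the number of levels is arbitrary here, the bin growth by powers of two follows the description above, and N_i is taken literally as the histogram intersection count at level i.

```python
import numpy as np
from collections import Counter

def pyramid(descriptors, levels):
    """Multi-resolution histograms psi: at level i the bin side is 2**i.
    `descriptors` is an (n_features, n_dims) array of scaled descriptor vectors."""
    D = np.asarray(descriptors, dtype=float)
    return [Counter(map(tuple, np.floor(D / 2 ** i).astype(int))) for i in range(levels)]

def pyramid_similarity(psi_c, psi_v):
    """Eq. (3): S = sum_i w_i N_i, with N_i the number of features of the two
    images falling into the same bins at level i and w_i = 1 / 2**i."""
    return sum(
        (1.0 / 2 ** i) * sum((hc & hv).values())   # histogram intersection at level i
        for i, (hc, hv) in enumerate(zip(psi_c, psi_v))
    )

def normalized_similarity(desc_c, desc_v, levels=8):
    """Eq. (4): divide by the self-similarity scores so that images with many
    features are not favoured."""
    pc, pv = pyramid(desc_c, levels), pyramid(desc_v, levels)
    s = pyramid_similarity(pc, pv)
    return s / np.sqrt(pyramid_similarity(pc, pc) * pyramid_similarity(pv, pv))
```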
C. Line Matching and Metric Localization

For some applications, more accurate localization information than the room level is needed. In this case, we continue with the third step of the hierarchical method, where we need to obtain individual matches to compute the localization through a trifocal tensor. One could think of obtaining these matches from the pyramidal matching of the previous step. However, as several tests showed, this is not convenient, because the matches are often multiple and it is not possible to distinguish which line goes with which, due to the discrete size of the bins in the pyramid.

1) Line matching algorithm: Once we have the most similar image, we find two-view line matches between the current image and the most similar one, and between that most similar image and its adjacent image in the VM. The two-view line matching algorithm we propose works as follows:
• First, decide which lines are compatible with each other. They must have the same direction (every line has a direction assigned when extracted, depending on which side of the line the darkest region is). The relative rotation between each line and a compatible one in the other view has to be consistent: it must be similar to the global rotation between both images, obtained from the average gradient of all image pixels. This global rotation is not accurate at all, but if it is, for instance, 90 degrees, the real rotation between matches should not be zero. As explained, the rest of the descriptors are computed in regions on both sides of the line. To classify two lines as compatible, at least the descriptor distances on one side of the LSR should be under an established threshold.
• Compute a unique distance, using all the individual descriptor distances, between each pair of compatible lines. Mahalanobis distances are the most commonly used, but a lot of training is needed to compute the required covariance matrix with satisfying accuracy. Alternatively, we have tried to normalize those distances by dividing the descriptors by the maximum value of each one. Theoretically, this may not be as correct as Mahalanobis distances, but in practice it works well and is more adaptable to different queries. It is enough to normalize the values so that the distances of the different kinds of descriptors are on similar scales and can be summed. We also apply a correction or penalty that increases the distance by the ratio of descriptors (N_d) whose differences were over their corresponding thresholds in the compatibility test. So the final distance between two lines, i and j, is

    d_{ij} = \left( d_{RGB}^{Min} + d_{DCT}^{Min} \right) \left( 1 + N_d \right)    (5)

where d_{RGB}^{Min} and d_{DCT}^{Min} are the smallest distances (in color and intensity descriptors, respectively) over the two LSR sides (a simplified sketch of this distance and the nearest-neighbour step is given after this list).
• Perform a nearest neighbour matching between compatible lines.
• Apply a topological filter to the matches, which helps to reject non-consistent matches. It is an adaptation for radial lines in omnidirectional images of the proposal in [15], which is based on the probability that two lines keep their relative position in both views. It improves the robustness of this initial matching (although it can reject some good matches, these can be recovered afterwards in the next step; at this point it is more important to have robustness).
• Run a re-matching step that takes into account the fact that the neighbours of a certain matched line should rotate in the image in a similar way from one view to the other.
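The sketch below illustrates the combined distance of eq. (5) and the nearest-neighbour step for two sets of line features. It is a simplified stand-in for the algorithm above: the descriptor layout and thresholds are assumptions, the compatibility test only checks the direction flag, and the rotation-consistency check, the topological filter and the re-matching step are omitted.

```python
import numpy as np

def line_distance(la, lb, thresholds):
    """Combined distance of eq. (5) between two candidate lines. Each line is a dict
    with per-side 'rgb' and 'dct' descriptor vectors ('left'/'right'), assumed already
    normalised by their maxima. `thresholds` is a scalar or per-descriptor array."""
    d_rgb = min(np.abs(la["rgb"][s] - lb["rgb"][s]).sum() for s in ("left", "right"))
    d_dct = min(np.abs(la["dct"][s] - lb["dct"][s]).sum() for s in ("left", "right"))
    # ratio of descriptors whose differences exceed their thresholds (penalty N_d)
    diffs = np.concatenate([np.abs(la[k][s] - lb[k][s])
                            for k in ("rgb", "dct") for s in ("left", "right")])
    n_d = np.mean(diffs > thresholds)
    return (d_rgb + d_dct) * (1.0 + n_d)

def match_lines(lines1, lines2, thresholds, max_dist=2.0):
    """Nearest-neighbour matching between compatible lines of two views.
    Compatibility here only checks the direction flag delta."""
    matches = []
    for i, la in enumerate(lines1):
        cands = [(line_distance(la, lb, thresholds), j)
                 for j, lb in enumerate(lines2) if la["direction"] == lb["direction"]]
        if cands:
            d, j = min(cands)
            if d < max_dist:
                matches.append((i, j, d))
    return matches
```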
2) 1D Radial Trifocal Tensor and Metric Localization: It is well known that, to solve the structure and motion problem from lines, at least three views are necessary. Thus, after computing two-view matches between the two couples of images explained before, we extend the matches to three views by intersecting both sets of matches and checking for consistent behaviour. Fig. 4 shows a scheme of a vertical line projected in the three views used and the location parameters we want to estimate (α21, tx21, ty21, α31, tx31, ty31).

Fig. 4. Motion between the images (current image O1, most similar image O2, and the image O3 adjacent to O2 in the VM) and landmark parameters.

We apply a robust method (RANSAC in our case) to the three-view matches to simultaneously reject outliers and estimate a 1D radial trifocal tensor. The orientation of the lines in the image is expressed in 1D homogeneous coordinates (r = [cos θ, sin θ]) to estimate the tensor. The trilinear constraint, imposed by a trifocal tensor on the coordinates of the same line v projected in three views (r1, r2, r3), is as follows:

    \sum_{i=1}^{2} \sum_{j=1}^{2} \sum_{k=1}^{2} T_{ijk} \, r_1(i) \, r_2(j) \, r_3(k) = 0    (6)

where T_ijk (i, j, k = 1, 2) are the eight elements of the 2 × 2 × 2 trifocal tensor and the subindex (·) denotes the components of the vectors r. With five matches and two additional constraints defined for the calibrated situation (internal parameters of the camera known), we have enough to compute a 1D trifocal tensor [8]. In our case, with omnidirectional images, we have a radial 1D trifocal tensor. From this tensor, even without camera calibration, we can get a robust set of matches, the camera motion and the structure of the scene [6]. In general, the translation between views is estimated up to a scale, but with the a priori knowledge that we have (we know the relative position of the two images in the visual memory) we can solve it exactly.
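As an illustration of eq. (6) only, the sketch below stacks one trilinearity row per three-view line match and recovers the eight tensor elements (up to scale) as the null vector of the resulting linear system, plus the residual that a RANSAC loop could use to score candidate matches. It is a plain least-squares simplification: it needs at least seven general matches and does not enforce the two additional calibration constraints that allow the paper's five-match solution.

```python
import numpy as np

def trilinear_row(r1, r2, r3):
    """One row of the linear system implied by eq. (6) for a single line match:
    sum_ijk T_ijk r1(i) r2(j) r3(k) = 0, with r = [cos(theta), sin(theta)]."""
    return np.array([r1[i] * r2[j] * r3[k]
                     for i in range(2) for j in range(2) for k in range(2)])

def estimate_1d_trifocal_tensor(matches):
    """Estimate the eight tensor elements T_ijk (up to scale) as the null vector of
    the stacked trilinearities. `matches` is a list of (theta1, theta2, theta3) line
    orientations in the three views; at least seven matches are assumed here."""
    A = np.array([trilinear_row(np.array([np.cos(t1), np.sin(t1)]),
                                np.array([np.cos(t2), np.sin(t2)]),
                                np.array([np.cos(t3), np.sin(t3)]))
                  for t1, t2, t3 in matches])
    _, _, vt = np.linalg.svd(A)
    return vt[-1].reshape(2, 2, 2)   # T_ijk from the smallest singular vector

def trilinear_residual(T, thetas):
    """|sum_ijk T_ijk r1(i) r2(j) r3(k)| for one candidate match; small residuals
    indicate matches consistent with the estimated tensor (usable inside RANSAC)."""
    r = [np.array([np.cos(t), np.sin(t)]) for t in thetas]
    return abs(np.einsum("ijk,i,j,k", T, r[0], r[1], r[2]))
```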
IV. EXPERIMENTS AND RESULTS

In this section we show the performance of the algorithms from the previous sections for detecting the current room (IV-A), for line matching (IV-B) and for metric localization (IV-C). We worked with two data-sets, Almere and our own (named Data-set 2). For the second one, ground truth data was available, which is convenient for measuring the errors in the localization. From the Almere data-set we extracted the frames from the low quality videos of rounds 1 and 4 (2000 frames extracted from the first, and 2040 from the second). We kept just every 5th frame, selecting half for the visual memory (every 10 frames, starting from number 10) and the other half for testing (every 10 frames, starting from number 5). The other visual memory has 70 omnidirectional images (640x480 pixels). 37 of them are sorted, classified into different rooms, with between 6 and 15 images of each (depending on the size of the room). The rest correspond to unclassified images from other rooms, buildings or outdoors. In Fig. 5 we can see a scheme of both databases, the second one with details of the relative displacements between the images. All images have been acquired with an omnidirectional vision sensor with a hyperbolic mirror.

Fig. 5. Grids of images in the rooms used in the experiments. Top: Almere data-set (corridor, living room, bedroom, kitchen). Bottom: Data-set 2.

The present implementation of our method is not optimized for speed, so timing information would be irrelevant. However, we have analyzed the time complexity of the three steps:
• The first step (Section III-A) has constant complexity with the number of features in the image. Here all the images in the VM are evaluated.
• The second step (Section III-B) is linear with the number of features. Here only the images that passed the previous step are evaluated, using 12 descriptors per feature.
• The third step (Section III-C) has quadratic complexity with the number of features, and a 22-descriptor vector per feature is used. However, here we evaluate only the query image and two images from the VM.
Indeed, in each step the computational cost is linear with the number of images remaining in it. It is therefore very important to keep a minimal set of images for the last step.

A. Localizing the current room

In this experiment we ran the first two steps of the method. We evaluated the topological localization for all possible tests in Data-set 2 (removing 1 test image and comparing it against all the rest). For the experiments with the Almere data-set we used the frames left for testing from round 1, and also the same frame numbers from round 4.

First step: Before computing a similarity measure against the full data-set, we applied the global filter. It was expected to reject a large number of images, corresponding to ones very different from the query, leaving more or less the ones from the same room and a small percentage of outliers (images from other rooms). Taking into account the number of images rejected and the false negatives obtained, the best performing value for this filter was to reject the images with more than 80% difference in the global descriptors. With Data-set 2, around 70% of the images were rejected, of which only an average of 4% were rejected despite being correct (false negatives). Using the Almere images the results were a little worse (55% rejected and 9% false negatives). This decrease in performance becomes more pronounced when classifying images from round 4, which has many occlusions, where we sometimes got too many false negatives (around 60% rejected with 14% false negatives). In the worst cases, all good solutions were already rejected in this first step, making a correct classification impossible for the next steps.

Second step: The goal of this step is to choose the image most similar to the current one, which for us is correct if it belongs to the proper room. Firstly, for building the pyramid of histograms, it is necessary to perform a suitable scaling of the descriptors, as their values lie in very different ranges. We scaled the descriptors considered more important (the color and geometric ones) between 0 and 100, and the rest between 0 and 50, so that the accuracy of the most important descriptors decreases more slowly than the accuracy of the others at each level of the pyramid. Here, from the Almere set chosen for the VM, we use only a fifth (every 50 frames: 50-100-150-...), as the images are still quite close together and such accuracy is not necessary to localize the room. In Data-set 2 the images are already more separated, as they were taken at regular distances and not during a whole robot trajectory, so we kept them all for the VM. In Data-set 2, the VM images were also more equally distributed over all the rooms and did not have the conflicts of the Almere set, e.g.
In Data-set 2, the VM images were more evenly distributed across the rooms and without the conflicts present in the Almere set, e.g. when the robot is between two rooms. Due to this, and because the VM from the Almere trajectory was built automatically from the video sequence without any post-processing such as clustering of nodes, we observe lower performance with the Almere data-set. The results when choosing the most similar image are shown in Table I for all experiments: the one with Data-set 2 and the two using Almere data (Almere1∝1 indicates that we classify images from round 1 against the VM built from other images of round 1, while Almere4∝1 means that we classify images from round 4 against the VM made from round 1). The image chosen as most similar ('1 Ok' in the table) was correct in 81% of the cases in the Almere1∝1 test and in 97% of the cases in Data-set 2. In the Almere4∝1 test the performance was lower, but we should take into account that this round contains many occlusions, while the reference set (Almere 1) comes from an occlusion-free round. This makes the first step work worse and the global appearance-based descriptors less useful. In some cases, all possible correct images were even rejected by it. This confirms the importance of using local features to deal with occlusions. If we skip the first step, which is based on global appearance, the performance increases but the computation also grows considerably. We must find a compromise between complexity and correctness.

TABLE I
FINDING THE MOST SIMILAR IMAGE IN THE VM. 1 Ok: the most similar image found is correct. 3 Ok: the first 3 images ranked by the similarity score are correct.

             1 Ok   3 Ok
Almere1∝1    81%    52%
Almere4∝1    71%    20%
Data-set 2   97%    74%

Pyramidal vs. individual matching to get a similarity score: The results of the line-matching algorithm we develop in Section III-C could also be used to search for the most similar image. For this purpose, we can compute a similarity score that depends on the number of matches found (n) between the pair of images, weighted by the distance (d) between the matched lines. It also has to take into account the number of features left unmatched in each image (F1 and F2, respectively), weighted by the probability of occlusion of the features (Po). The dissimilarity measure (DIS) is defined as

$$DIS = n\,d + F_1 (1 - P_o) + F_2 (1 - P_o). \qquad (7)$$

TABLE II
METRIC LOCALIZATION FROM A Data-set 2 TEST. reference: reference data; manual: results with manual matches; automatic: mean and standard deviation of results from automatic matches (50 executions per test).

                    Rotation                        Translation direction
Location param.     α21            α31              t21            t31
reference           0°             0°               0°             -26.5°
manual              -1.34°         -0.90°           0.65°          -26.35°
automatic (std)     -1.38° (0.07)  -0.94° (0.07)    0.68° (0.13)   -26.44° (0.17)

Fig. 6. Matches with more widely separated images than the couples obtained in the pyramidal matching - Room A (A01-A07) - 26 matches / 7 wrong.

We have compared this approach and the pyramidal matching method for selecting the most similar image in the previous experiments of this section, and we noticed that the pyramidal method seems more suitable. The correctness of the results was quite similar for both methods (a sketch of the DIS computation is given below).
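A minimal sketch of the dissimilarity measure in (7), under our reading of it (d is taken here as the mean distance of the n line matches, and the occlusion probability value is hypothetical); this is an illustration, not the authors' implementation:

def dissimilarity(match_distances, unmatched_1, unmatched_2, p_occlusion=0.5):
    # DIS = n*d + F1*(1 - Po) + F2*(1 - Po), cf. Eq. (7): the n matches
    # weighted by their distance d, plus a penalty for the features left
    # unmatched in each image (F1, F2), weighted by (1 - Po).
    n = len(match_distances)
    d = sum(match_distances) / n if n else 0.0
    return n * d + (unmatched_1 + unmatched_2) * (1.0 - p_occlusion)

# Hypothetical example: 3 matches with small distances, a few unmatched lines
# in one image and many in the other give a moderate dissimilarity.
print(dissimilarity([0.2, 0.1, 0.3], unmatched_1=2, unmatched_2=10))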
Using Data-set 2 the percentage of good choices was the same with both, but the percentage obtained when checking the 3 first chosen images was worse with individual matching. However, with the Almere data the individual-matching similarity measure performed a little better. As the performance is similar but the pyramidal method has the advantage of linear complexity in the number of features (around 25% less execution time in the experiments), we decided to use it for this step.

B. Line Matching Results

Here we show the performance of the two-view line matching algorithm developed for the last step of the process. In these experiments we obtained individual matches between each image and the most similar one selected by the pyramidal matching (using the results from the experiments of the previous Section IV-A, where a current image was compared against the rest of the VM). We got between 6 and 35 matches in the different tests, depending on the room, of which only an average of 10% were wrong. The better the performance we get with wide-baseline images, the lower the density needed in the visual memory of each room. So this step of obtaining wide-baseline matches is the key point for being able to work with a minimal database. In Fig. 6 we can see an example of the line matches using more widely separated images, i.e., not with the image that was selected as most similar but with a more distant one.

C. Metric Localization Results

As explained before, we need three views to obtain the localization. Therefore, once we have two-view matches, between the current image and the most similar one selected, and between this one and one of its neighbours in the VM, we get three-view matches from the intersection of those two sets (a sketch of this intersection step is given below). We robustly estimate the 1D radial tensor, so we also get a robust set of three-view matches (those that fit the geometry of the tensor). Fig. 7 shows some typical results obtained after applying the whole method to examples from the Almere data-sets. In Almere 1∝1 the query image is from round 1, while in Almere 4∝1 it is from round 4. The localization parameters seem stable and consistent; however, we did not have ground truth to evaluate them. In Table II we show the metric localization results for a similar test done with images from Data-set 2, where we had ground truth. Notice the small errors, around 1°, especially if we take into account the uncertainty of our ground truth, which was obtained with a measuring tape and a goniometer.

Fig. 7. Examples of triple robust matches obtained between the current position (first column: query image) and two reference images (second column: most similar image found; third column: image adjacent to it in the VM), together with the localization parameters obtained. Almere 1∝1 (Almere1LQ00005 / Almere1LQ00010 / Almere1LQ00020): 19 initial matches with 11 wrong; after robust estimation, 12 robust matches, 4 wrong; localization parameters α21 = 4°, α31 = 16°, t21 = 45°, t31 = 44°. Almere 4∝1 (Almere4LQ01125 / Almere1LQ00500 / Almere1LQ00510): 15 initial matches with 5 wrong; after robust estimation, 12 robust matches, 3 wrong; localization parameters α21 = 161°, α31 = 150°, t21 = 114°, t31 = 132°.
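The intersection step mentioned above can be illustrated with a short Python sketch (our own, with hypothetical line indices; the real system works on the line matches produced in Sections III-C and IV-B):

def three_view_matches(matches_12, matches_23):
    # matches_12: (i1, i2) pairs of line indices matched between the query
    #             image (view 1) and the most similar VM image (view 2).
    # matches_23: (i2, i3) pairs matched between that VM image (view 2) and
    #             its neighbour in the VM (view 3).
    # Returns (i1, i2, i3) triples, i.e. matches consistent across all views.
    second_to_third = dict(matches_23)
    return [(i1, i2, second_to_third[i2])
            for (i1, i2) in matches_12 if i2 in second_to_third]

# Hypothetical example: line 3 of the query matches line 5 of the VM image,
# which in turn matches line 2 of the adjacent VM image.
print(three_view_matches([(3, 5), (7, 9)], [(5, 2), (1, 4)]))   # -> [(3, 5, 2)]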
V. CONCLUSIONS

In this work we have proposed a hierarchical mobile robot localization method. Our three-step approach uses omnidirectional images and combines topological (which room) and metric localization information. The localization task computes the position of the robot relative to a grid of training images in different rooms. After a rough selection of a number of candidate images based on a global color descriptor, the best-resembling image is chosen based on a pyramidal matching with wide-baseline local line features. This enables the grid of training images to be less dense. A third step, involving the computation of the omnidirectional trifocal tensor, yields the metric coordinates of the unknown robot position. This allows accurate localization with a minimal reference data-set, contrary to approaches that return as current localization the location of one of the reference views (correct as topological information, but possibly with quite a high error as metric information). Our experiments, with two different data-sets of omnidirectional images, show the efficiency and good accuracy of this approach.

REFERENCES

[1] Y. Yagi, Y. Nishizawa, and M. Yachida. Map-based navigation for a mobile robot with omnidirectional image sensor COPIS. IEEE Trans. on Robotics and Automation, pages 634–648, October 1995.
[2] Chu-Song Chen, Wen-Teng Hsieh, and Jiun-Hung Chen. Panoramic appearance-based recognition of video contents using matching graphs. IEEE Trans. on Systems, Man and Cybernetics, Part B, 34(1):179–199, 2004.
[3] J. Gaspar, N. Winters, and J. Santos-Victor. Vision-based navigation and environmental representations with an omnidirectional camera. IEEE Trans. on Robotics and Automation, 16(6):890–898, 2000.
[4] E. Menegatti, T. Maeda, and H. Ishiguro. Image-based memory for robot navigation using properties of the omnidirectional images. Robotics and Autonomous Systems, 47(4):251–267, 2004.
[5] K. Grauman and T. Darrell. The pyramid match kernel: Discriminative classification with sets of image features. In Proc. of the IEEE Int. Conf. on Computer Vision, 2005.
[6] C. Sagüés, A.C. Murillo, J.J. Guerrero, T. Goedemé, T. Tuytelaars, and L. Van Gool. Localization with omnidirectional images using the radial trifocal tensor. In Proc. of the IEEE Int. Conf. on Robotics and Automation, 2006.
[7] O. Faugeras, L. Quan, and P. Sturm. Self-calibration of a 1D projective camera and its application to the self-calibration of a 2D projective camera. In Proc. of the 5th European Conference on Computer Vision, Freiburg, Germany, pages 36–52, June 1998.
[8] K. Åström and M. Oskarsson. Solutions and ambiguities of the structure and motion problem for 1D retinal vision. Journal of Mathematical Imaging and Vision, 12:121–135, 2000.
[9] S. Thirthala and M. Pollefeys. The radial trifocal tensor: A tool for calibrating the radial distortion of wide-angle cameras. In Proc. of Computer Vision and Pattern Recognition, 2005.
[10] D.G. Lowe. Distinctive image features from scale-invariant keypoints. Int. Journal of Computer Vision, 60(2):91–110, 2004.
[11] J.B. Burns, A.R. Hanson, and E.M. Riseman. Extracting straight lines. IEEE Trans. on Pattern Analysis and Machine Intelligence, 8(4):425–455, 1986.
[12] J.J. Guerrero and C. Sagüés. Camera motion from brightness on lines. Combination of features and normal flow. Pattern Recognition, 32(2):203–216, 1999.
[13] F. Mindru, T. Tuytelaars, L. Van Gool, and T. Moons. Moment invariants for recognition under changing viewpoint and illumination. Computer Vision and Image Understanding, 94(1-3):3–27, 2004.
[14] K.R. Rao and P. Yip. Discrete Cosine Transform: Algorithms, Advantages, Applications. Academic Press, Boston, 1990.
[15] H. Bay, V. Ferrari, and L. Van Gool. Wide-baseline stereo matching with line segments. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, volume I, June 2005.

Virtual Sensors for Human Concepts – Building Detection by an Outdoor Mobile Robot

Martin Persson (1), Tom Duckett (2) and Achim Lilienthal (1)
(1) Center for Applied Autonomous Sensor Systems, Dept. of Technology, Örebro University, Sweden, {martin.persson,achim.lilienthal}@tech.oru.se
(2) Department of Computing and Informatics, University of Lincoln, Lincoln, UK, tduckett@lincoln.ac.uk

Abstract— In human-robot communication it is often important to relate robot sensor readings to concepts used by humans. We suggest the use of a virtual sensor (one or several physical sensors with a dedicated signal processing unit for the recognition of real-world concepts) and a method with which the virtual sensor can be learned from a set of generic features. The virtual sensor robustly establishes the link between sensor data and a particular human concept. In this work, we present a virtual sensor for building detection that uses vision and machine learning to classify image content in a particular direction as buildings or non-buildings. The virtual sensor is trained on a diverse set of image data, using features extracted from gray-level images. The features are based on edge orientation, configurations of these edges, and on gray-level clustering. To combine these features, the AdaBoost algorithm is applied. Our experiments with an outdoor mobile robot show that the method is able to separate buildings from nature with a high classification rate, and to extrapolate well to images collected under different conditions. Finally, the virtual sensor is applied on the mobile robot, combining classifications of sub-images from a panoramic view with spatial information (location and orientation of the robot) in order to communicate the likely locations of buildings to a remote human operator.

Index Terms— Automatic building detection, virtual sensor, vision, AdaBoost, Bayes classifier

I. INTRODUCTION

The use of human spatial concepts is very important in, e.g., robot-human communication. Skubic et al. [1] discussed the benefits of linguistic spatial descriptions for different types of robot control, and pointed out that this is especially important for novice robot users. In those situations it is necessary for the robot to be able to relate its sensor readings to human spatial concepts. To enable human operators to interact with mobile robots in, e.g., task planning, or to allow the system to combine data from external sources, semantic information is of high value. We believe that virtual sensors can facilitate robot-human communication. We define a virtual sensor as one or several physical sensors with a dedicated signal processing unit for the recognition of real-world concepts. As an example, this paper describes a virtual sensor for building detection, using vision-based methods for the classification of views as buildings or nature. Its purpose is to detect one type of very distinctive object that is often referred to in, e.g., textual descriptions of route directions.
The suggested method for obtaining a virtual sensor for building detection is based on learning a mapping from a set of possibly generic features to a particular concept. It is therefore expected that the same method can be extended to virtual sensors representing other human concepts.

Many systems for building detection, both for aerial and ground-level images, use line- and edge-related features. Building detection from ground-level images often exploits the fact that, in many cases, buildings show mainly horizontal and vertical edges. In nature, on the other hand, edges tend to have more randomly distributed orientations. Inspection of histograms based on edge orientation confirms this observation. Histograms of edge direction at different scales can be classified by, e.g., support vector machines [2]. Another method, developed for building detection in content-based image retrieval, uses consistent line clusters with different properties [3]. This clustering is based on edge orientation, edge colors, and edge positions. For more references on ground-level building detection, see [2].

This paper presents a virtual sensor for building detection. We use AdaBoost to learn a classifier that classifies close-range monocular gray-scale images into 'buildings' and 'nature'. AdaBoost has the ability to select the best so-called weak classifiers out of many features. The selected weak classifiers are linearly combined to produce one strong classifier. The Bayes Optimal Classifier, BOC, is used as an alternative classifier for comparison. BOC uses the variance and covariance of the features in the training data to weight the importance of each feature. The proposed method combines different types of features, such as edge orientation, gray-level clustering, and corners, into a system with a high classification rate. The method is applied on a mobile robot as a virtual sensor for building detection in an outdoor environment and can be extended to other classes, such as windows and doors.

The paper is organized as follows. Section II describes the feature extraction. AdaBoost is presented in Section III and the Bayes classifier in Section IV. Section V presents the image sets used, the description of the training phase, and some properties of the weak classifiers. Section VI shows the results of the performance evaluation and Section VII describes the virtual sensor for building detection. Finally, conclusions are given in Section VIII.

II. FEATURE EXTRACTION

We select a large number of image features, divided into three groups, that we expect to capture the properties of man-made structures. The most obvious indication of man-made structures, especially buildings, is a high content of vertical and horizontal edges. The first type of features uses this property. The second type of features combines the edges into more complex structures such as corners. The third type of features is based on the assumption that buildings often contain surfaces with constant gray level. The features that we calculate for each image are numbered 1 to 24. All features except 9 and 13 are normalized in order to avoid scaling problems. Here, the features were selected with regard to a particular virtual sensor, but one could also use a generic set of features for different virtual sensors.

A.
Edge Orientation For edge detection we use Canny’s edge detector [4]. It includes a Gaussian filter and is less likely than others to be fooled by noise. A drawback is that the Gaussian filter can distort straight lines. For line extraction in the edge image an algorithm implemented by Peter Kovesi [5] was used. This algorithm includes a few parameters that have been optimized empirically. The absolute values of the line’s orientation are calculated and used in different histograms. The features based on edge orientation are: 1) 3-bin histogram of absolute edge orientation values. 2) 8-bin histogram of absolute edge orientation values. 3) Fraction of vertical lines out of the total number. 4) Fraction of horizontal lines. 5) Fraction of non-horizontal and non-vertical lines. 6) As 1) but based on edges longer than 20% of the longest edge. 7) As 1) but based on edges longer than 10% of the shortest image side. 8) As 1) but weighted with the lengths of the edges. The 3-bin histogram has limits of [0 0.2 1.37 1.57] and the 8-bin histogram [0 0.2 . . . 1.4 1.6] radians. Values for the vertical (3), horizontal (4) and intermediate orientation lines (5) are taken from the 3-bin histogram and normalized with the total number of lines. Features 6, 7, and 8 try to eliminate the influence from short lines. B. Edge Combinations The lines defined above can be combined to form corners and rectangles. The features based on these combinations are: 9) Number of right-angled corners. 10) 9) divided by the number of edges. 11) Share of right-angled corners with direction angles close to 45◦ + n · 90◦ , n ∈ 0, . . . , 3. 12) 11) divided by the number of edges. 13) The number of rectangles. 14) 13) divided by the number of corners. We define a right-angled corner as two lines with close end points and 90◦ ± βdev angle in between. During the experiments βdev = 20◦ was used. Features 9 and 10 are the number of corners. For buildings with vertical and horizontal lines from doors and windows, the corners most often have a direction of 45◦ , 135◦ , 225◦ and 315◦ , where the direction is defined as the ‘mean’ value of the orientation angle for the respective lines. This is captured in features 11 and 12. From the lines and corners defined above rectangles representing, e.g., windows are detected. We allow corners to be used multiple times to form rectangles with different corners. The number of rectangles is stored in features 13 and 14. C. Gray Levels Buildings are often characterized by large homogeneous areas in the facades, while nature images often show larger variation. Other areas in images that can also be homogeneous are, e.g., roads, lawns, water and the sky. Features 15 to 24 are based on gray levels. We use a 25-bin gray level histogram, normalized with the image size and sum up the largest bins. This type of feature works globally in the image. To find local areas with homogeneous gray levels we search for the largest connected areas within the same gray level. Based on the gray level histogram, we calculate the largest regions of interest that are 4-connected. The features based on gray levels are: 15) Largest value in gray level histogram. 16) Sum of the 2 largest values in gray level histogram. 17) Sum of the 3 largest values in gray level histogram. 18) Sum of the 4 largest values in gray level histogram. 19) Sum of the 5 largest values in gray level histogram. 20) Largest 4-connected area. 21) Sum of the 2 largest 4-connected areas. 22) Sum of the 3 largest 4-connected areas. 
23) Sum of the 4 largest 4-connected areas.
24) Sum of the 5 largest 4-connected areas.

III. ADABOOST

AdaBoost is the abbreviation for adaptive boosting. It was developed by Freund and Schapire [6] and has been used in diverse applications, e.g., as classifiers for image retrieval [7]. In mobile robotics, AdaBoost has, for example, been used in ball tracking for soccer robots [8] and to classify laser scans for learning places in indoor environments [9]. The latter work is a nice demonstration of using machine learning and a set of generic features to transform sensor readings into human spatial concepts. The main purpose of AdaBoost is to produce a strong classifier by a linear combination of weak classifiers, where weak means that the classification rate only has to be slightly better than 0.5 (better than guessing). The principle of AdaBoost is as follows (see [10] for a formal algorithm). The input to the algorithm is a number, N, of positive (building) and negative (nature) examples. The training phase is a loop. In each iteration t, the best weak classifier h_t is calculated and a distribution D_t is recalculated. The boosting process uses D_t to increase the weights of the hard training examples in order to focus the weak learners on them. The general AdaBoost algorithm does not include rules on how to choose the number of iterations T of the training loop. The training process can be aborted if the distribution D_t does not change; otherwise the loop runs through the manually determined T iterations. Boosting is known not to be particularly prone to the problem of overfitting [10]. We used T = 30 for training and did not see any indications of overfitting when evaluating the performance of the classifier on an independent test set.

The weak classifiers in AdaBoost use single-valued features. To be able to also handle the feature arrays from the histogram data, we have chosen to use a minimum distance classifier, MDC, to compute a scalar weak classifier. We use D_t to bias the hard training examples by including it in the calculation of a weighted mean value for the MDC prototype vector:

$$m_{l,k,t} = \frac{\sum_{\{n=1\ldots N \,|\, y_n = k\}} f(n,l)\, D_t(n)}{\sum_{\{n=1\ldots N \,|\, y_n = k\}} D_t(n)}$$

where $m_{l,k,t}$ is the mean value array for iteration t, class k and feature l, and $y_n$ is the class of the n-th image. The features for each image are stored in f(n, l), where n is the image number. For the evaluation of the MDC at iteration t, a distance value $d_{k,l}(n)$ for each class k (building and nature) is calculated as

$$d_{k,l}(n) = \| f(n,l) - m_{l,k,t} \|$$

and the shortest distance for each feature l indicates the winning class for that feature.

IV. BAYES CLASSIFIER

It is instructive to compare the result from AdaBoost with another classifier. For this we have used the Bayes classifier, see e.g. [11] for a derivation. The Bayes classifier, or Bayes Optimal Classifier (BOC) as it is sometimes called, classifies normally distributed data with a minimum misclassification rate. The decision function is

$$d_k(x) = \ln P(w_k) - \tfrac{1}{2} \ln |C_k| - \tfrac{1}{2} (x - m_k)^T C_k^{-1} (x - m_k)$$

where $P(w_k)$ is the prior probability (set to 0.5), $m_k$ is the mean vector of class k, $C_k$ is the covariance matrix of class k calculated on the training set, and x is the feature vector to be classified. Not all of the defined features can be used in BOC. Linear dependencies between features give numerical problems in the calculation of the decision function. Therefore normalized histograms cannot be used; hence features 1, 2, 6, 7, and 8 were not considered. The set of features used in BOC was 3, 4, 9-15, 17, 20, 23. This set was constructed by starting with the best individual feature (see Figure 3, Section V-C) and adding the second best feature, etc., while observing the condition number of the covariance matrices.
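A minimal sketch of this decision rule (our own illustration, not the authors' implementation; it assumes two classes with the equal priors used above and features collected in a matrix X with labels y):

import numpy as np

def boc_train(X, y):
    # Per-class mean vector m_k and covariance matrix C_k from the training set.
    return {k: (X[y == k].mean(axis=0), np.cov(X[y == k], rowvar=False))
            for k in np.unique(y)}

def boc_classify(x, params, prior=0.5):
    # d_k(x) = ln P(w_k) - 0.5 ln|C_k| - 0.5 (x - m_k)^T C_k^{-1} (x - m_k);
    # the class with the largest decision value wins.
    scores = {}
    for k, (m, C) in params.items():
        diff = x - m
        _, logdet = np.linalg.slogdet(C)
        scores[k] = np.log(prior) - 0.5 * logdet - 0.5 * diff @ np.linalg.solve(C, diff)
    return max(scores, key=scores.get)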
V. EXPERIMENTS

A. Image Sets

We have used three different sources for the collection of the nature and building images used in the experiments. For Set 1, we used images taken by an ordinary consumer digital camera. These were taken over a period of several months in our intended outdoor environment. Our goal with the system is to classify images taken by a mobile robot. For Set 2, we therefore stored images from manually controlled runs with a mobile robot, performed on two different occasions. Sets 1 and 2 are disjoint in the sense that they do not contain images of the same buildings or nature. In order to verify the system performance on an independent set of images, Set 3 contains images that have been downloaded from the Internet using Google's Image Search. For buildings the search term building was used. The first 50 images with a minimum resolution of 240 × 180 pixels containing a dominant building were downloaded. For nature images, the search terms nature (15 images), vegetation (20 images), and tree (15 images) were used. Only images that directly applied to the search term and were photos of reality (no art or computer graphics) were used. Borders and text around some images were removed manually. Table I presents the different sets of images and the number of images in each set. All images have been converted to gray scale and stored at two different resolutions (maximum side length 120 pixels and 240 pixels, referred to as size 120 and size 240, respectively). In this way we can compare the performance at different resolutions, for instance the classification rate and robustness to scale changes. A benefit of using low-resolution images is faster computation. Examples of images from Sets 1 and 2 are shown in Figure 1.

TABLE I
NUMBER OF COLLECTED IMAGES. The digital camera is a 5 Mpixel Sony (DSC-P92) and the mobile camera is an analogue camera mounted on an iRobot ATRV-Jr.

Set    Origin            Area        Buildings   Nature
1      Digital camera    Urban       40          40
2      Mobile camera     Campus      66          24
3      Internet search   Worldwide   50          50
Total number                         156         114

Fig. 1. Example of images used for training. The uppermost row shows buildings in Set 1. The following rows show buildings in Set 2, nature in Set 1, and nature in Set 2, respectively.

Fig. 2. Histogram describing the feature usage by AdaBoost in Test 2 as an average of 100 runs, using image size 120 (upper) and 240 (lower).

B. Test Description

Four tests have been defined for the evaluation of our system.

TABLE II
DESCRIPTION OF DEFINED TESTS (Nrun IS THE NUMBER OF RUNS).

No.   Nrun   Train             Test
1     1      Set 1             Set 2
2     100    90% of {1,2}      10% of {1,2}
3     1      {1,2}             3
4     100    90% of {1,2,3}    10% of {1,2,3}

Test 1 shows whether it is possible to collect training data with a consumer camera and use it for training a classifier that is evaluated on the intended platform, the mobile robot. Test 2 shows the performance of the classifier in the environment it is designed for.
Test 3 shows how well the learned model, trained with local images, extrapolates to images taken around the world. Test 4 evaluates the performance on the complete collection of images. Table II summarizes the test cases. These tests have been performed with AdaBoost and BOC separately for each of the two image sizes. For Tests 2 and 4, a random function is used to select the training partition and the images not selected are used for the evaluation of the classifiers. This was repeated Nrun times.

Fig. 3. Histogram of the classification rate of individual features in Test 2 as an average of 100 runs with T = 1, image size 120 (upper) and 240 (lower).

C. Analysis of the Training Results

AdaBoost can compute multiple weak classifiers from the same features, for example by means of different thresholds. Figure 2 presents statistics on the usage of the different features in Test 2. The feature most often used for image size 240 is the orientation histogram (2). For image size 120, features 2, 8, 13 and 14 dominate. Figure 3 shows how well each individual feature manages to classify the images in Test 2. Several of the histograms based on edge orientation are by themselves close to the results achieved by the classifiers presented in the next section. Comparing Figure 2 and Figure 3, one can note that several features with high classification rates are not used by AdaBoost to the expected extent, e.g., features 1, 3, 4, and 5. This can be caused by the way in which the distribution D_t is updated. Because the importance of correctly classified examples is decreased after a particular weak classifier has been added to the strong classifier, similar weak classifiers might not be selected in subsequent iterations. As a comparison to the test results presented in Section VI, the results obtained on the training data using combinations of image sets are also presented in Table III.

TABLE III
RESULTS ON THE TRAINING IMAGE SETS IN TABLE I.

Train sets   Size   Classifier   Build. [%]   Nat. [%]   Total [%]
1            120    AdaBoost     100.0        100.0      100.0
                    BOC          100.0        100.0      100.0
1,2          120    AdaBoost     97.2         100.0      98.2
                    BOC          95.3         93.8       94.7
1,2,3        120    AdaBoost     89.7         94.7       91.9
                    BOC          86.5         94.7       90.0
1            240    AdaBoost     100.0        100.0      100.0
                    BOC          100.0        100.0      100.0
1,2          240    AdaBoost     100.0        100.0      100.0
                    BOC          98.1         100.0      98.8
1,2,3        240    AdaBoost     98.7         99.1       98.9
                    BOC          95.5         98.2       96.7

VI. RESULTS

Training and evaluation have been performed for the tests specified in Table II, with features extracted from images of both size 120 and size 240. The results are presented in Tables IV and V, respectively. The tables show the mean value of the total classification rate, its standard deviation, and the mean values of the classification rates for building images and nature images separately. Results from both AdaBoost and BOC using the same training and testing data are given.

TABLE IV
RESULTS FOR TESTS 1-4 USING IMAGES WITH SIZE 120.

Test no.   Classifier   Build. [%]   Nat. [%]   Total [%]
1          AdaBoost     81.8         91.7       84.4
           BOC          93.9         58.3       84.4
2          AdaBoost     93.0         91.8       92.6 ± 5.8
           BOC          95.7         89.0       93.4 ± 5.5
3          AdaBoost     68.0         90.0       79.0
           BOC          72.0         74.0       73.0
4          AdaBoost     86.6         89.8       87.9 ± 6.2
           BOC          86.4         88.5       87.3 ± 6.0

Test 1 shows a classification rate of over 92% for image size 240. This shows that it is possible to build a classifier based on digital camera images and achieve very good results for the classification of images from our mobile robot, even though Sets 1 and 2 have structural differences, see Section V-A. Test 2 is the most interesting test for us.
This uses images that have been collected with the purpose of training and evaluating the system in the intended environment for the mobile robot. This test shows high (and highest) classification rates. For both AdaBoost and BOC they are around 97% using the image size 240. Figure 4 shows the distribution of wrongly classified images for AdaBoost compared to BOC. It can be noted that for image size 120 several images give both classifiers problems, while for image size 240 different images cause problems. Test 3 is the same type of test as Test 1. They both train on one set of images and then validate on a different set. Test no. 1 2 3 4 Classifier AdaBoost BOC AdaBoost BOC AdaBoost BOC AdaBoost BOC Build. [%] 89.4 95.5 96.1 98.1 88.0 90.0 94.1 94.8 Nat. [%] 100.0 87.5 98.3 95.7 94.0 82.0 95.5 93.4 Total [%] 92.2 93.3 96.9 ± 4.3 97.2 ± 4.0 91.0 86.0 94.6 ± 3.8 94.2 ± 4.7 TABLE V R ESULTS FOR T EST 1-4 USING IMAGES WITH SIZE 240. 20 15 # Size 120 10 5 0 0 2 4 6 8 10 12 14 16 18 20 18 20 Image number (starting with the most times wrongly classified) 15 10 # Sets 1 5 0 0 2 4 6 8 10 12 14 16 Image number (starting with the most times wrongly classified) Fig. 4. Distribution of the 20 most frequently wrongly classified images from AdaBoost (gray) and BOC (white), using image size 120 (upper) and 240 (lower). Test 3 shows lower classification rates than Test 1 with the best result for AdaBoost using image size 240. This is not surprising since the properties of the downloaded images differ from the other image sets. The main difference between the image sets is that the buildings in Set 3 often are larger and located at a greater distance from the camera. The same can be noted in the nature images, where Set 3 contains a number of landscape images that do not show close range objects. The conclusion from this test is that the classification still works very well and that AdaBoost generalizes better than BOC. The result from Test 4 is compared with the result of Test 2. We can note that the classification rate is lower for Test 4, especially for image size 120. Investigation of the misclassified images in Test 4 shows that the share belonging to image Set 3 (Internet) is large. For both image sizes 60% of the misclassified images came from Set 3. To show scale invariance we trained two classifiers on Test 2 with images of size 120 and evaluated them with images of size 240 and vice versa. The result is presented in Table VI and should be compared to Test 2 in Tables IV and V. The conclusion from this test is that the features we use have scale invariant properties over a certain range and that AdaBoost shows significantly better scale invariance than BOC, which again demonstrates AdaBoost’s better extrapolation capability. VII. V IRTUAL S ENSOR F OR B UILDING D ETECTION We have used the learned building detection algorithm to construct a virtual sensor. This sensor indicates the presence of buildings in different directions related to a mobile robot. In our case we let the robot perform a sweep with its camera (±120◦ in relation to its heading) at a number of points along its track. The images are classified into buildings and ‘nature’ IROS 2006 workshop: From Sensors to Human Spatial Concepts Page: 25 Train 120 Test 240 240 120 Classifier AdaBoost BOC AdaBoost BOC B. [%] 94.2 93.0 95.1 100.0 N. [%] 96.7 94.3 90.8 44.8 Total [%] 95.1 ± 4.2 93.5 ± 5.3 93.6 ± 6.0 80.5 ± 6.7 TABLE VI R ESULTS FOR T EST 2 USING TRAINING WITH IMAGES SIZED 120 AND TESTING WITH IMAGES SIZED 240 AND VICE VERSA . Fig. 5. 
Classification of images used as a virtual sensor pointing at the two classes (blue arrows indicate buildings and red lines non-buildings). (or non-buildings) using AdaBoost trained on set 1. The experiments were performed using a Pioneer robot equipped with GPS and a camera on a PT-head. Figure 5 shows the result of a tour in the Campus area. The blue arrows show the direction towards buildings and the red lines point toward non-buildings. Figure 6 shows an example of the captured images and their classes from a sweep with the camera at the first sweep point (the lowest leftmost sweep point in Figure 5). This experiment was conducted with yet another camera and during winter, and the result was qualitatively found to be convincing. Note that the good generalization of AdaBoost is expressed by the fact that the classifier was trained on images taken in a different environment and season. Fig. 6. Example of one sweep with the camera. The blue arrows point at images classified as buildings and the red lines point at non-buildings. VIII. C ONCLUSIONS We have shown how a virtual sensor for pointing out buildings along a mobile robot’s track can be designed using image classification. A virtual sensor relates the robot sensor readings to a human concept and is applicable, e.g., when semantic information is necessary for communication between robots and humans. The suggested method using machine learning and generic image features will make it possible to extend virtual sensors to a range of other important human concepts such as cars and doors. To handle these new concepts, features that capture their characteristic properties should be added to the present feature set, which is expected to be reused in the future work. Two classifiers intended for use on a mobile robot to discriminate buildings from nature utilizing vision have been evaluated. The results from the evaluation show that high classification rates can be achieved, and that Bayes classifier and AdaBoost have similar classification results in the majority of the performed tests. The number of wrongly classified images is reduced by about 50% when the higher resolution images are used. The features that we use have scale invariant properties to a certain range, showed by the cross test where we trained the classifier with one image size and tested on another size. The benefits gained from Adaboost include the highlighting of strong features and its improved generalization properties over the Bayes classifier. R EFERENCES [1] Marjorie Skubic, Pascal Matsakis, George Chronis, and James Keller. Generating multi-level linguistic spatial descriptions from range sensor readings using the histogram of forces. Autonomous Robots, 14(1):51– 69, Jan 2003. [2] J. Malobabic, H. Le Borgne, N. Murphy, and N. O’Connor. Detecting the presence of large buildings in natural images. In Proc. 4th International Workshop on Content-Based Multimedia Indexing, Riga, Latvia, June 21-23, 2005. [3] Yi Li and Linda G. Shapiro. Consistent line clusters for building recognition in CBIR. In Proc. Int. Conf. on Pattern Recognition, volume 3, pages 952–957, Quebec City, Quebec, CA, Aug 2002. [4] John Canny. A computational approach for edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(2):279– 98, Nov 1986. [5] Peter Kovesi. http://www.csse.uwa.edu.au/∼pk/Research/MatlabFns/, University of Western Australia, Sep 2005. [6] Yoav Freund and Robert E. Schapire. 
A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[7] Kinh Tieu and Paul Viola. Boosting image retrieval. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, Hilton Head Island, South Carolina, June 2000.
[8] André Treptow, Andreas Masselli, and Andreas Zell. Real-time object tracking for soccer-robots without color information. In European Conf. on Mobile Robotics (ECMR 2003), Radziejowice, Poland, 2003.
[9] O. Martínez Mozos, C. Stachniss, and W. Burgard. Supervised learning of places from range data using AdaBoost. In Proc. of the IEEE International Conference on Robotics and Automation (ICRA), pages 1742–1747, Barcelona, Spain, April 2005.
[10] Robert E. Schapire. A brief introduction to boosting. In Proc. of the Sixteenth Int. Joint Conf. on Artificial Intelligence, 1999.
[11] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, New York, second edition, 2001.

Learning the sensory-motor loop with neurons: recognition, association, prediction, decision

Nicolas Do Huu and Raja Chatila
LAAS-CNRS, Robotic and Artificial Intelligence Research Group, Toulouse, France
Email: {ndohuu, raja.chatila}@laas.fr

Abstract— This paper introduces a learning framework that provides an autonomous system with recognition, in-sight object localization and decision capabilities. The "Integrate and Fire" neuron model and the global network architecture are presented together with an original associative scheme. In the final part, we propose a predictive and decisional algorithm compatible with the neuron computation process.

I. INTRODUCTION

The way knowledge is represented and manipulated is a fundamental and open problem in the context of building a robotic system that must interact with its environment, take decisions, and even build new knowledge from observations and experiments, i.e., an autonomous system. The architecture and algorithms we propose are intended to provide a framework for representing knowledge in the uniform language of neuron activations. This paper addresses three main aspects of learning: learning representations from perceptive data, learning to associate these representations in order to capture their evolution with respect to the robot's actions, and learning to produce actions in order to increase a global reward value using reinforcement learning.

First, we use an artificial neural network to learn and recognise objects in the robot's environment. Objects are assumed to be small enough to fit in a single image acquired by the robot's cameras. The goal of this subsystem is to build an internal representation of the objects. These representations are made of neurons which learn to recognise patterns; a competition-based Hebbian learning rule (see [5] and [3]) has been developed to learn representations in an unsupervised way. Neurons are organised in maps, which are two-dimensional arrays of neurons sharing properties such as synaptic weights and incoming connections [4]. The neural network we use is able to pool together the representations of multiple views of the same object using temporal associations: different views of the same object tend to appear close together in time and space and can thus be associated [10][5].
Moreover, we will explain how to use this kind of neural network to learn representations of the objects position and representations of the robot’s last actions using proprioceptive data. All the representations activated by the recognition subsystem at a given time are called the perceptive context : object appearance, position and the robot’s last actions. Another learning task aims to associate a reward value called ”effect” to each perceptive context and then learn to produce actions in order to reactivate context which generate the best rewards. This can be done by learning a local reward value (or local effect) associated to each representation (in fact with each neuron). The global reward value is then defined as the sum of all local effects generated by the current perceptive context. In order to choose the best action i.e. the action that will bring the most increase in reward, the system must learn how each action can affect the perceptive context and thus the global reward. When the global reward is affected by an action, the system by learns a sensori-motor association in which the action is associated with the modification of the perceptive context it has produced. This ”driven-by-reward” associative memory is the third aspect of learning propose in this paper. Finally, the sensory-motor associations are used to predict perceptive contexts that are reachable within several steps. Then, the action selection process can be done by comparing the cumulative rewards (utility) of each action and selecting the most rewarding action in this horizon of time. The paper is organized as follows. In Section II, we describe the sensory-motor loop organization. In Section III, we introduce our ”Integrate and Fire” neuron model and the neural connectivities we use. In Section IV the perceptive part is then explained with the objective transitions building mechanism. Section V is dedicated to the active part, global reinforcement learning mechanism. The predictive decisional process is presented in Section VI. We briefly describe the robot platform and software implementation in Section VII and we finally conclude in Section VIII. II. T HE SENSORY- MOTOR LOOP The purpose of this architecture is to autonomously extract representations and learn to produce appropriate actions in a given context in order to reach one other context which increases the global reward value. One first requirement is to rate perceptive contexts, to give a score that permits to discriminate them according to some ”system optimum”. We call global effect the signed measure of optimality. As representations are learnt during the course of time, and because contexts are composed by sets of these representations, the representation/score associations also needs to be learnt. In this learning scheme we call objective representations the ”unflavored” description built from sensor data and subjective representations the objective representation/local effect association. IROS 2006 workshop: From Sensors to Human Spatial Concepts Page: 27 Learnt Associations (section 5) Context Decomposition (section 4) Sensors Global Effect (sections 5,6) Initial Representations (sections 3,5,6) Action Synthesis (section 6) f) Global Effects (GE)): is a scoring system which associates a value with each representation. GE represents the criterion the system wants to maximize. When the effects of a representation are negative the system will produce actions to increase the criterion value. III. 
N EURONS , C ONNECTIVITY AND L OCAL A LGORITHMS Actuators "Macro" algorithms high-level knowledge low distribution Reinforcement learning and decision making High-level knowledge learning and association (object recognition, object position, action) Environment Neural map connectivity and competition Fig. 1. Neuron local learning algorithms Sensory-motor loop. The system must extract representations from the environment in order to recognise perceptive contexts. The environment is perceived through sensors as exteroceptive and proprioceptive stimuli. One class of proprioceptive stimuli relates to the actions performed by the robot. We in fact consider that the robot has no initial knowledge about the actions it can perform and thus, they must be learnt. Concurrently, an associative memory learns sensory-motor associations rated by a value of the global effect contribution. The learnt associations are then combined to predict future perceptive contexts while the global effect contribution is used to choose the most rewarding action. These mechanisms take place in the global sensory-motor loop shown in Figure 1 ([1], [2]). The system is structured in slices of connected Pulsed Neural Networks (PNN) based on a discrete integrate and fire model (see Section III). PNN provide a level of description that allows to develop the learning process [3], categorization and association, while avoiding combinatorial explosion. The functions of the seven subsystems are as follows: a) Sensors (SEN): is the input of the global system and is the frontier between the environment and the neural space. It is composed of maps of neurons whose purpose is to convert physical value to computable information. b) Initial Representations (IR): is where the system start from. This set of neurons is connected to SEN and permits to generate the initial variations of the global effect. c) Context Decomposition (CD): is the categorisation and recognition engine. Its outputs are high level representations which are extracted by the use of multilayer neural networks (see Sections III and IV). d) Learned Associations (LA): receives inputs from CD and itself. This sub-system has the properties of an associative memory. It is responsible for learning of new skills and determines the global system behaviour. Each neuron of LA is connected to GE for criteria evaluation (see Section V). e) Elementary Actions and Action Synthesis (AS and ACT): are the action sub-systems but ACT is also where the prediction and decision take place (see Section VI). "Micro" algorithms low-level knowledge high distribution Fig. 2. Global architecture for knowledge and skill acquisition composed by a set of algorithms hierarchically structured. The global architecture of the system is composed of several algorithm which are performed concurrently (Figure 2). At the bottom of the hierarchy, the neuron learning mechanism is used to extract representations and learn associations. The network connectivity grant our object recognition process with shift-invariance property while lateral inhibition avoid multiple learning of the same pattern. The associative memory structure is designed to avoid an all-to-all connection scheme as three high-level classes of concepts are predefined : object recognition, object position and actions. Finally, a global reinforcement learning algorithm is used for prediction and decision to select the most rewarding action. A. 
A Look Into the Map Architecture

A map is a set of local classifiers, or neurons, organized in a retinotopic way. All neurons of a map share the same set of synaptic weights, which constitute the memory component of the learning process. These neurons are therefore able to detect their learnt patterns whatever their position on the input map. This property, called shift invariance and due to our network connectivity, is inspired by Fukushima's Neocognitron [4].

Fig. 3. Pattern decomposition and shift invariance.

B. Neural Competition for Distributed Meanings

As we want our system to autonomously extract representations from sensor data, we must provide our maps with the ability to decide whether or not an input signal needs to be learnt. To this end, we add to the inter-map connectivity a communication channel between neurons situated at the same coordinates on maps belonging to the same layer (a layer is a set of maps connected to the same inputs, see Figure 3). This communication channel is the support of a local competition that allows the extracted meanings to be distributed over a learning layer in an unsupervised way [5]. Each neuron computes its own detection value and compares it with the distant detection values. A learning step then occurs if the local detection value has won the competition, which corresponds to a Winner-Takes-All mechanism.

Fig. 4. Maps local competition and specialization.

C. Integration, Firing and Learning

Fig. 5. A map is a collection of neurons sharing the same weights. A neuron is composed as a three-layer pipeline (integration, firing, learning).

Fig. 6. Burst sampling distribution and cumulative distribution evolution. a) Burst sampling distributions computed at steps 0, 500 and 1500; burst values are discretized into 64 levels. b) The corresponding cumulative distribution, used as a transfer function for integration.

1) Integration: Given a neuron $U_i$, we denote by $\beta_i(t) \in [0,1]$ its calculated output burst at time t. Let $w_{ij}(t) \in W_i(t)$ be the learning weight associated with the connection between neurons $U_i$ and $U_j$, and $\Omega_i^+(t)$ a subset of the afferents $\Omega_i$, defined as:

$$W_i(t) = \{ w_{ij}(t),\; U_j \in \Omega_i \}, \qquad \Omega_i^+(t) = \{ U_j \in \Omega_i,\; \beta_j(t) > 0 \}.$$

We also define two weight sums as follows:

$$S_i^{\beta}(t) = \sum_{U_j \in \Omega_i^+(t)} w_{ij}(t), \qquad S_i^{+}(t) = \sum_{w_{ij}(t) \in W_i^+(t)} w_{ij}(t),$$

where $W_i^+(t) = \{ w_{ij}(t),\; U_j \in \Omega_i \text{ and } w_{ij}(t) > 0 \}$. Then the integrated potential $P_i$ at time t+1, which expresses a level of similarity, is given by:

$$P_i(t+1) = \alpha_P\, P_i(t) + (1 - \alpha_P)\, \frac{S_i^{\beta}(t)}{S_i^{+}(t)},$$

where $\alpha_P \in [0,1]$ is a fixed potential leak.

2) Firing: The burst level is then computed as follows:

$$\beta_i(t) = \begin{cases} 0 & \text{if } P_i(t) < T_i, \\ F_i(t, P_i(t)) & \text{otherwise,} \end{cases} \qquad \text{with} \quad F_i(t, x) = \sum_{0 < b \le x} f_i(t, b),$$

where $T_i$ is a fixed threshold and $F_i$ is an adaptive transfer function whose variation across time is illustrated in Figure 6b. The transfer function $F_i$ is in fact the associated cumulative distribution at time t, and $f_i$ is a sampling distribution of burst levels updated at each step (Figure 6a).

3) Learning: We use in our model a Hebbian-like learning rule in which neurons tend to learn the patterns responsible for their activations. In other words, a learning process occurs when a neuron has both fired and won the competition in its layer. Moreover, we compute a stochastic standard deviation $\sigma_{ij}$ for each weight $w_{ij}$, which is used as a coefficient in the learning update as follows:

$$\mu_{ij}(t+1) = (1 - \alpha_W)\,\mu_{ij}(t) + \alpha_W\, |\gamma_{ij}(t)|$$
$$\sigma_{ij}(t+1)^2 = (1 - \alpha_W)\,\sigma_{ij}(t)^2 + \alpha_W\, \big[\mu_{ij}(t+1) - \gamma_{ij}(t+1)\big]^2$$
$$w_{ij}(t+1) = (1 - \alpha_W)\,w_{ij}(t) + \alpha_W\, \big[1 - 2\sigma_{ij}(t+1)\big]\,\gamma_{ij}(t+1)$$
In the view-based model framework, the representations the system produces are not just pixel images but, thanks to the hierarchical and distributed architecture, can reflect in some extend the structure of the object. Moreover, we use the stability of the object localization (temporal coherence) in order to pool together the consecutive views the robot perceives [10], [11]. A high learning rate grants to a special map, called referent, the ability to track the object position during scale and angle modifications. We also give this referent-map the ability to create view-specific maps when it discovers an unknown view of the object (like adding a picture in the collection). B. Coding the egocentric ’Where’ in a View-Based Approach How to make a robot learn the concept of where the objects are from its own perceptions since it has neither map of its environment nor geometrical representation of space ? To answer this question we propose an egocentric representation of space as a mixture of proprioceptive and exteroceptive stimuli. Firstly, we encode at each step the positions of the pan and tilt axis with the respective firing positions of two neurons belonging to two distinct neuron vectors (or 1D neuron maps). When an object is recognized, a saccade module, which is autonomous and purely reflexive, is in charge for centering the object in front of the two cameras by modifying the pan and tilt positions. Then, a third neuron vector, we call depth vector, is activated at a position which reflects the shift amount in position of the detected pattern between left and right images (stereo-disparity). From these three vectors, and with the use of the competition process, our system is able to learn, as for light patterns, the proprioceptive classes of where in sight objects are. These specialized neurons correspond to the referent-maps in the ’Where’ layers (see Figure 7). This coding framework doesn’t take into account out of sight positions; however we plan to add in future works the ability to encode further positions with patterns of actions that are required to reach the object. Then the hard part would be to select relevant actions when different sets are available. V. A SSOCIATIONS AND TRANSITIONS A. ’What’ Transitions: Morphing the World As explained in Section IV-A, our learning framework is able to autonomously build a view-based representation of an object. In Figure 7 we illustrate the resulting ’What’ layer containing its referent-map (object specific, non viewspecific) and a set of specific-map (view-specific). Then we call ’Morphing’ transitions the learnt associations of a specificmap with an action (’How’) that induces the recognition of a specific-map right after the action completion. This resulting specific-map must belong to the same ’What’ layer and can be the same as the initial one. This concept is illustrated by Figure 8. On the upper right, we show a example of memory trace for the activations of specific-maps belonging to the ’What’ and ’How’ layers which are the input states of this learning process. Then the initial and final values of synaptic weights are presented with a gray color scale. And finally we show the use of the learnt association for prediction. A learning IROS 2006 workshop: From Sensors to Human Spatial Concepts Page: 30 'What' : Specific-Maps Activations Across Time present past Initial Weights Values V1 V2 V3 Learning Weights V1 V2 V3 'How' : Referent-Maps Activations Across Time present A1 A2 A3 extraction in the learning process. 
B. 'Where' Transitions: Moving the World

In Section IV-B, we presented the competitive building process for each 'Where' referent-map. We now focus on one particular 'Where' layer, its referent-map and the construction of 'Where' specific-maps. This process is quite similar to the one exposed in the previous Section. The only difference is the use of 'Where' referent-maps, which have previously learnt a {pan, tilt, depth} pattern, replacing the 'What' specific-maps in the input state of the learning neuron. A 'Moving' transition then associates two positions with an action able to produce the relative motion from one position to the other. With the help of the 'Morphing' and 'Moving' transitions, the system is then able to predict a future perceptive context, which is a first step towards decision making (see Section VI).

C. 'How' Transitions: Learning Applicable Actions

This last association is quite different from the 'Morphing' and 'Moving' ones. The input domain differs, since we here use the association of 'What' (referent or specific) and 'Where' (only referent) activation patterns. This learnt perceptive context can then be viewed as the preconditions for the 'How' referent-map (or action pattern) of the layer. We want to emphasize that no preexisting knowledge of this kind is given to the system, so we need to include this extraction in the learning process. From the subjective point of view, the resulting 'How' specific-map is able to learn its effect contribution; it is thus possible to associate a local effect with a sensory-motor association (e.g., red signal + stop = positive effect, red signal + run = negative effect). These preconditions can finally be applied in the decision-making process exposed in the next Section.

Fig. 9. A decision tree is represented as predictive timelines and generated with 'Morphing', 'Moving' and 'Acting' associations. The accompanying table summarizes the state correspondences and possible transitions:

State | Perceptive Context | Possible Transitions
S0    | (Wa1, We0)         | S0.A1 → S2; S0.A2 → S3
S1    | (Wa0, We0)         | S1.A0 → S2
S2    | (Wa2, We1)         | S2.A2 → S0; S2.A1 → S1
S3    | (Wa1, We2)         | S3.A1 → S0; S3.A0 → S2
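To keep the three association types of Section V distinct, the toy sketch below stores them as plain lookup tables over discrete map identifiers. This is only an illustrative reading (the identifiers V1, A1, We0, ... mirror the figure labels; the actual system encodes these associations in synaptic weights between maps):

```python
# 'Morphing': (view, action) -> predicted next view within the same 'What' layer
morphing = {("V1", "A1"): "V2", ("V2", "A2"): "V1"}

# 'Moving': (position, action) -> predicted next position ('Where' referent-maps)
moving = {("We0", "A1"): "We1", ("We1", "A2"): "We0"}

# 'Acting': perceptive context (view, position) -> (applicable action, learnt local effect)
acting = {("V1", "We0"): ("A1", +0.3), ("V2", "We1"): ("A2", -0.1)}

def predict(view, position, action):
    """Predict the perceptive context reached after executing `action`."""
    return morphing.get((view, action), view), moving.get((position, action), position)

def applicable(view, position):
    """Look up the applicable action and its learnt local effect, if any."""
    return acting.get((view, position))
```

Such tables carry exactly the information needed to unroll the predictive timelines of the next Section.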
VI. PREDICTION AND DECISION

In Sections IV and V we have presented the way the system builds 'What', 'Where' and 'How' associations using patterns of neuron activations learnt in memory traces. The point here is to build a coherent projection of the robot's possible perceptions and actions using the previous associations, and to choose the actions leading to the maximum increase of the system's global effect. In Figure 9, we illustrate this idea by the use of a decision tree in which each state corresponds to a perceptive context. State correspondences and possible transitions are summarized in the table of Figure 9. All of the information needed to build decision trees comes from the learnt transitions. Moreover, our system also learns to associate a score with every state and transition. For a state, the score is equal to the sum of the respective local effects of its Wai and Wej. For a transition, the score is the one acquired during the learning of the action preconditions, as explained in Section V-C.

As neuron activity is restricted to only one "Integrate and Fire" phase at each step, and as the building of a decision tree requires a sequenced process to apply transitions and respect coherence, we must design an algorithm that builds the tree over the course of time. The main idea is to add new states and/or transitions to the tree during the time needed to complete an action. Moreover, as the tree is in fact translated into predicted activation timelines, we must tag each prediction with information reflecting its position in the tree, in order to "filter" incoherent states when choosing one specific action. For coherence checking, we associate with each state the most rewarding path leading to it. At the beginning of a time step, if the observed state is not the one predicted, we erase the tree entirely. Then we add the present observations to the timelines and create new states and/or transitions with respect to the states and/or transitions created during the previous step. As the size of the tree increases, we update all predicted local effects as follows: a state adds its own learnt effect to the maximum possible effect produced by all its possible transitions, and a transition adds its own effect to its resulting state's effect. We then extract the path in the tree leading to the maximum reachable effect and generate the first action. Finally we filter incoherent states.
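The effect propagation just described (a state adds its own learnt effect to the maximum effect offered by its outgoing transitions, and a transition adds its own effect to that of its resulting state) can be read as a small recursive evaluation over the predicted tree. The sketch below is our illustration of that reading, not the authors' event-driven, time-stepped implementation, and it assumes the predicted structure is a proper tree (no cycles):

```python
def best_effect(state, tree, state_effect, transition_effect):
    """Return (value, first_action) of the most rewarding predicted path from `state`.

    tree              : dict, state -> list of (action, next_state) predicted transitions
    state_effect      : dict, state -> learnt local effect of that state
    transition_effect : dict, (state, action) -> learnt effect of that transition
    """
    transitions = tree.get(state, [])
    if not transitions:                      # leaf of the predicted tree
        return state_effect.get(state, 0.0), None
    best_value, best_action = float("-inf"), None
    for action, next_state in transitions:
        value, _ = best_effect(next_state, tree, state_effect, transition_effect)
        value += transition_effect.get((state, action), 0.0)
        if value > best_value:
            best_value, best_action = value, action
    # The state adds its own effect to the best of its transitions.
    return state_effect.get(state, 0.0) + best_value, best_action
```

Calling best_effect on the currently observed state yields the first action of the path with the maximum reachable effect, which is then executed before the tree is extended or, on a mispredicted observation, rebuilt.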
VII. ROBOT PLATFORM, IMPLEMENTATION AND PRELIMINARY RESULTS

Athos (see Figure 10) is a SuperScout 2 robot built by Nomadic Technologies. It is a 35 cm wide and tall cylinder equipped with two Sony DFW-VL500 digital cameras mounted on a Directed Perception pan-tilt unit. It also has 24 sonar range sensors, which are not used in our experiments. The robot motion controller and the pan-tilt unit are both connected to an Apple PowerBook G4 867 MHz for video acquisition, pan-tilt and motion control via USB ports. The computational part is based on a pulsed neural network simulator called NeuSter (neuron-cluster). An event-driven mechanism is used to simulate thousands of neurons and millions of synaptic connections in real time. To obtain maximum performance, the computation can be distributed to several POSIX machines on the local network.

Fig. 10. Athos, a SuperScout 2 robot built by Nomadic Technologies.

During our first experiments the neural network showed a successful extraction of representations from the perceptive context: view-based object recognition, position and action were learnt in an unsupervised way. We then set up an experiment to test the associative memory, in which the system perceived a negative effect variation when the pan angle moved away from the central position. The resulting 'Moving' associations and the global effect reinforcement learning managed to produce the expected behavior: the robot turns around its central axis so as to be positioned in front of the object. The next step will be to set up an experiment involving the object's appearance, such as its color or shape. More results are to come.

VIII. CONCLUSION

We have first presented a uniform framework for autonomous knowledge extraction. We have then shown that an associative memory can be built by the use of three core concepts: 'What', 'Where' and 'How'. Moreover, we provide a predictive and decisional framework involving the generated associations, with respect to our neurons' computational capabilities. A next step would be to make the system learn sophisticated actions in order to produce long-term predictions involving the object's appearance. We also plan to support the localization of multiple objects by the use of an additional value we call the phase, in order to link each recognition with the correct localization.

ACKNOWLEDGMENT

The work described in this paper was partially conducted within the EU Integrated Project COGNIRON (The Cognitive Robot Companion) funded by the European Commission Division FP6-IST Future and Emerging Technologies under Contract FP6-002020. Nicolas Do Huu is supported by the European Social Fund.

REFERENCES

[1] W. Paquier and R. Chatila, "An architecture for robot learning," in Intelligent Autonomous Systems, 2002, pp. 575–578.
[2] A Uniform Model For Developemental Robotic, 2003.
[3] W. Gerstner and W. Kistler, "Mathematical formulations of hebbian learning," Biological Cybernetics, no. 87, pp. 404–415, 2002.
[4] K. Fukushima, "Neocognitron for handwritten digit recognition," Neurocomputing, vol. 51, pp. 161–180, April 2003.
[5] N. Do Huu, W. Paquier, and R. Chatila, "Combining structural descriptions and image-based representations for image object and scene recognition," in International Joint Conference on Artificial Intelligence (IJCAI), 2005.
[6] T. Poggio and S. Edelman, "A network that learns to recognize three dimensional objects," Letters to Nature, vol. 343, pp. 263–266, 1990.
[7] S. Ullman, "Three-dimensional object recognition based on the combination of views," Cognition, vol. 67, pp. 21–44, 1998.
[8] M. Tarr and H. Bülthoff, "Image-based object recognition in man, monkey and machine," Cognition, no. 67, pp. 1–20, 1998.
[9] M. Riesenhuber and T. Poggio, "Hierarchical models of object recognition in cortex," Nature Neuroscience, vol. 28, pp. 1019–1025, 1999.
[10] G. Wallis, "Using spatio-temporal correlations to learn invariant object recognition," Neural Networks, vol. 9, no. 9, pp. 1513–1519, 1996.
[11] S. M. Stringer and E. T. Rolls, "Invariant object recognition in the visual system with novel views of 3D objects," Neural Computation, vol. 14, no. 11, pp. 2585–2596, November 2002.

Semantic Labeling of Places using Information Extracted from Laser and Vision Sensor Data

Óscar Martínez Mozos∗, Axel Rottmann∗, Rudolph Triebel∗, Patric Jensfelt† and Wolfram Burgard∗
∗ University of Freiburg, Department of Computer Science, Freiburg, Germany
† Royal Institute of Technology, Center for Autonomous Systems, Stockholm, Sweden
Email: {omartine|rottmann|triebel|burgard}@informatik.uni-freiburg.de, patric@nada.kth.se

Abstract— Indoor environments can typically be divided into places with different functionalities like corridors, kitchens, offices, or seminar rooms. The ability to learn such semantic categories from sensor data enables a mobile robot to extend the representation of the environment, facilitating the interaction with humans. As an example, natural language terms like "corridor" or "room" can be used to communicate the position of the robot in a map in a more intuitive way. In this work, we first propose an approach based on supervised learning to classify the pose of a mobile robot into semantic classes. Our method uses AdaBoost to boost simple features extracted from range data and vision into a strong classifier. We present two main applications of this approach.
Firstly, we show how our approach can be utilized by a moving robot for an online classification of the poses traversed along its path using a hidden Markov model. Secondly, we introduce an approach to learn topological maps from geometric maps by applying our semantic classification procedure in combination with a probabilistic relaxation procedure. We finally show how to apply associative Markov networks (AMNs) together with AdaBoost for classifying complete geometric maps. Experimental results obtained in simulation and with real robots demonstrate the effectiveness of our approach in various indoor environments.

I. INTRODUCTION

In the past, many researchers have considered the problem of building accurate maps of the environment from the data gathered with a mobile robot. The question of how to augment such maps by semantic information, however, is virtually unexplored. Whenever robots are designed to interact with their users, semantic information about places can improve the human-robot communication. From the point of view of humans, terms like "corridor" or "room" give a more intuitive idea of the position of the robot than using, for example, the 2D coordinates in a map. In this work, we address the problem of classifying places of the environment of a mobile robot using range finder and vision data, as well as building topological maps based on that knowledge. Indoor environments, like the one depicted in Figure 1, can typically be divided into areas with different functionalities such as laboratories, office rooms, corridors, or kitchens. Whereas some of these places have special geometric structures and can therefore be distinguished merely based on laser range data, other places can only be identified according to the objects found there like, for example, monitors in a laboratory. To detect such objects, we use vision data acquired by a camera system.

Fig. 1. The left image shows a map of a typical indoor environment. The middle image depicts the classification into three semantic classes (corridor, room, doorway) as colors/grey levels. For this purpose the robot was positioned in each free pose of the original map and the corresponding laser observations were simulated and classified. The right images show typical laser and image observations together with some extracted features, namely the average distance between two consecutive beams in the laser and the number of monitors detected in the image.

The key idea is to classify the pose of the robot based on the current laser and vision observations. Examples for typical observations obtained in an office environment are shown in the right images of Figure 1. The classification is then done applying a sequence of classifiers learned with the AdaBoost algorithm [18]. These classifiers are built in a supervised fashion from simple geometric features that are extracted from the current laser scan and from objects extracted from the current images as shown in the right images of Figure 1. As an example, the left image in Figure 1 shows a typical indoor environment and the middle image depicts the classification obtained using our method. We furthermore present two main applications of this approach. Firstly, we show how to classify the different poses of the robot during a trajectory and improve the final classification using a hidden Markov model. Secondly, we introduce an approach to learn topological maps from geometric maps by applying our semantic classification in combination with a probabilistic relaxation procedure.
In this last case we compare the results obtained with associative Markov networks (AMNs) with those obtained with AdaBoost. The rest of this work is organized as follows. Section II presents related work. In Section III, we describe the sequential AdaBoost classifier. In Section IV, we present the application of a hidden Markov model to the online place classification with a moving robot. Section V contains our approach for topological map building. In Section VI we present some results when using a range finder with a restricted field of view. Finally, Section VII presents experimental results obtained using our methods.

II. RELATED WORK

In the past, several authors considered the problem of adding semantic information to places. Buschka and Saffiotti [5] describe a virtual sensor to identify rooms from range data. Koenig and Simmons [9] apply a pre-programmed routine to detect doorways. Finally, Althaus and Christensen [1] use sonar data to detect corridors and doorways. Learning algorithms have additionally been used to identify objects in the environment. For example, Anguelov et al. [2], [3] apply the EM algorithm to cluster different types of objects from sequences of range data and to learn the state of doors. Limketkai et al. [12] use relational Markov networks to detect objects like doorways based on laser range data. Finally, Torralba and colleagues [23] use hidden Markov models for learning places from image data. Compared to these approaches, our algorithm is able to combine arbitrary features extracted from different sensors to form a sequence of binary strong classifiers to label places. Our approach is also supervised, which has the advantage that the resulting labels correspond to user-defined classes.

On the other hand, different algorithms for creating topological maps have been proposed. Kuipers and Byun [11] extract distinctive points in the map, defined as local maxima of a distinctiveness measure. Kortenkamp and Weymouth [10] fuse vision and ultrasound information to determine topologically relevant places. Shatkey and Kaelbling [19] apply an HMM learning approach to learn topological maps. Thrun [22] uses the Voronoi diagram to find critical points, which minimize the clearance locally. Choset [7] encodes metric and topological information in a generalized Voronoi graph to solve the SLAM problem. Additionally, Beeson et al. [4] used an extension of the Voronoi graph for detecting topological places. Zivkovic et al. [26] use visual landmarks and geometric constraints to create a higher level conceptual map. Finally, Tapus and Siegwart [20] used fingerprints to create topological maps. In contrast to these previous approaches, the technique described in this paper applies a supervised learning method to identify complete regions in the map, like corridors, rooms or doorways, that have a direct relation with a human understanding of the environment. This knowledge about the semantic labels of places is then used to build topological maps with a mobile robot. We also apply associative Markov networks (AMNs) together with AdaBoost to label each point in a geometric map.

III. SEMANTIC CLASSIFICATION OF POSES USING ADABOOST

Boosting is a general method for creating an accurate strong classifier by combining a set of weak classifiers. The requirement on each weak classifier is that its accuracy is better than random guessing.
In this work we will use the boosting algorithm AdaBoost in its generalized form presented by Schapire and Singer [18]. The input to the algorithm is a set of labeled training examples (xn, yn), n = 1, . . . , N, where each xn is an example and each yn ∈ {+1, −1} is a value indicating whether xn is positive or negative respectively. In our case, the training examples are composed of laser and vision observations. In several iterations, the algorithm repeatedly selects a weak classifier using a weight distribution over the training examples. The final strong classifier is a weighted majority vote of the best weak classifiers. Throughout this work, we use the approach presented by Viola and Jones [25], in which the weak classifiers depend on single-valued features fj ∈ ℝ. For a more detailed description see [17].

The method described so far is able to distinguish between two classes of examples, namely positives and negatives. In practical applications, however, we want to distinguish between more than two classes. To create a multi-class classifier we use the approach applied by Martínez Mozos et al. [14] and create a sequential multi-class classifier using K − 1 binary classifiers, where K is the number of classes we want to recognize. The classification output of the decision list is then represented by a histogram z. Each bin of z stores the probability that the classified example belongs to the k-th class. The order of the classifiers in the decision list can be selected according to different methods, as described in [13] and [14].

A. Features from Laser and Vision Data

In this section, we describe the features used to create the weak classifiers in the AdaBoost algorithm. Our robot is equipped with a 360 degree field of view laser sensor and a camera. Each laser observation consists of 360 beams. Each vision observation consists of eight images which form a panoramic view. Figure 1 shows a typical laser range reading as well as one of the images from the panoramic view taken in an office environment. Accordingly, each training example for the AdaBoost algorithm consists of one laser observation, one vision observation, and its classification. Our method for place classification is based on single-valued features extracted from laser and vision data. All features are invariant with respect to rotation to make the classification of a pose dependent only on the position of the robot and not on its orientation. Most of our laser features are standard geometrical features used for shape analysis, such as the one shown in Figure 1. In the case of vision, the selection of the features is motivated by the fact that typical objects appear with different probabilities at different places. For example, the probability of detecting a computer monitor is larger in an office than in a kitchen. For each type of object, a vision feature is defined as a function that takes as argument a panoramic vision observation and returns the number of detected objects of this type in it. This number represents the single-valued feature fj as explained in Section III. As an example, Figure 1 shows one image of a panoramic view and its detected monitors. A more detailed list of laser and image features is contained in our previous work [14].
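As an illustration of the decision-list scheme, the sketch below fills the histogram z from the confidences of the K − 1 binary classifiers. It is one plausible reading only; the exact confidence computation and the ordering of the classifiers follow [14] and are not reproduced here:

```python
def classify_sequential(example, binary_confidences, num_classes):
    """Decision-list classification with K-1 binary classifiers.

    binary_confidences : list of K-1 callables; the k-th returns a value in
                         [0, 1] expressing how confidently `example` belongs
                         to class k (stand-ins for the AdaBoost strong classifiers).
    Returns a histogram z whose k-th bin approximates the probability that
    the example belongs to the k-th class.
    """
    z = [0.0] * num_classes
    remaining = 1.0                       # probability mass not yet assigned
    for k, confidence_fn in enumerate(binary_confidences):
        p = confidence_fn(example)
        z[k] = remaining * p              # class k takes its share
        remaining *= (1.0 - p)            # the rest is passed down the list
    z[num_classes - 1] = remaining        # leftover mass goes to the last class
    return z
```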
IV. PROBABILISTIC CLASSIFICATION OF TRAJECTORIES

The approach described so far is able to classify single observations only, but does not take into account past classifications when determining the type of place the robot is currently at. However, whenever a mobile robot moves through an environment, the semantic labels of nearby places are typically identical. Furthermore, certain transitions between classes are unlikely. For example, if the robot is currently in a kitchen then it is rather unlikely that the robot ends up in an office given that it moved only a short distance. In many environments, to get from the kitchen to the office, the robot has to move through a doorway first. To incorporate such spatial dependencies between the individual classes, we apply a hidden Markov model (HMM) and maintain a posterior Bel(lt) about the type of the place lt the robot is currently at:

$Bel(l_t) = \alpha\, P(z_t \mid l_t) \sum_{l_{t-1}} P(l_t \mid l_{t-1}, u_{t-1})\, Bel(l_{t-1}).$   (1)

In this equation, α is a normalizing constant ensuring that the left-hand side sums up to one over all lt. To implement this HMM, three components need to be known. First, we need to specify the observation model P(zt | lt), which is the likelihood that the classification output is zt given that the actual class is lt. Second, we need to specify the transition model P(lt | lt−1, ut−1), which defines the probability that the robot moves from class lt−1 to class lt by executing action ut−1. Finally, we need to specify how the belief Bel(l0) is initialized. In our current system, we choose a uniform distribution to initialize Bel(l0). The quantity P(zt | lt) has been obtained from statistics of the classification output of the AdaBoost algorithm given that the robot was at a place corresponding to lt. To realize the transition model P(lt | lt−1, ut−1) we only consider the two actions ut−1 ∈ {MOVE, STAY}. The transition probabilities were estimated by running 1000 simulation experiments. A more complete description is given in [17].
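Equation (1) is a standard discrete Bayes filter over the place labels. A minimal sketch of one update step, with the observation and transition models represented as plain nested lists/dictionaries (our choice of data layout, not the authors'):

```python
def hmm_update(belief, z_t, u_t, obs_model, trans_model):
    """One step of the place-label filter of Eq. (1).

    belief      : list, belief[l] = Bel(l) over the K place classes
    z_t         : class index output by the sequential AdaBoost classifier
    u_t         : executed action, e.g. "MOVE" or "STAY"
    obs_model   : obs_model[l][z] = P(z | l)
    trans_model : trans_model[u_t][l_prev][l] = P(l | l_prev, u_t)
    """
    K = len(belief)
    new_belief = []
    for l in range(K):
        prediction = sum(trans_model[u_t][l_prev][l] * belief[l_prev]
                         for l_prev in range(K))
        new_belief.append(obs_model[l][z_t] * prediction)
    alpha = sum(new_belief)               # normalizer of Eq. (1)
    return [b / alpha for b in new_belief]
```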
V. TOPOLOGICAL MAP BUILDING

A second application of our semantic place classification is the extraction of topological maps from geometric maps. Throughout this section we assume that the robot is given a map of the environment in the form of an occupancy grid [15]. Our approach then determines for each unoccupied cell of such a grid its semantic class. This is achieved by simulating a range scan of the robot given that it is located in that particular cell, and then labeling this scan with one of the semantic classes. To remove noise and clutter from the resulting classifications, we apply an approach denoted as probabilistic relaxation labeling [16]. This method takes into account the labels of the neighborhood when changing (or maintaining) the label of a given cell. From the resulting labeling we construct a graph whose nodes correspond to the regions of identically labeled poses and whose edges represent the connections between them. Additionally, we apply a heuristic region correction to the topological map to increase the classification rate. A typical topological map obtained with our approach is shown in Figure 7. For more detail see [14].

A. Semantic Classification of Maps using Associative Markov Networks

The improvement of the labeling of free cells given by our AdaBoost approach can also be seen as a collective classification problem [6]. In this approach, the labeling of each free cell in the map is also influenced by the labeling of other cells in the vicinity. One popular method for the task of collective classification is the relational Markov network (RMN) [21]. In addition to the labels of neighboring points, RMNs also consider the relations between different objects. E.g., we can model the fact that two classes A and B are more strongly related to each other than, say, classes A and C. This modeling is done on the abstract class level by introducing clique templates [6]. Applying these clique templates to a given data set yields an ordinary Markov network (MN). In this MN, the result is a higher weighting of neighboring points with labels A and B than of points labeled A and C. Additionally, each node in the network is associated with a set of features. The whole process of labeling is composed of two steps. First, a supervised learning process is used to learn the parameters of the RMN from a training set. Second, a new network is classified using these parameters. This last step is also called inference. In this work, we use a special type of RMN known as associative Markov networks (AMNs). Efficient algorithms are available for learning and inference in AMNs (for more detail see [24]). In our case we create an AMN in which each node represents a cell in the geometric map. Each node is given a semantic label corresponding to the place in the map (corridor, doorway or room). We also create an 8-neighborhood for each cell. Furthermore, a set of features is calculated for each cell. These features correspond to the geometric ones extracted from a simulated laser beam as explained in Section III-A. To reduce the number of features during the training and inference steps, we select a subset of them. This selection is done using the AdaBoost algorithm [13].

VI. LASER OBSERVATIONS WITH RESTRICTED FIELD OF VIEW

In this section we present some practical issues that arise when classifying a trajectory using range data with a restricted field of view. Specifically, we explain how to extract features when using a laser range finder which only covers 180° in front of the robot. This is one of the most common configurations when using mobile robots. As an example, if a robot is looking at the end of a corridor, then it is not able to see the rest of the corridor behind it, as it would with an additional rear laser. This situation is shown in Figure 2. When classifying a trajectory we propose to maintain a local map around the robot, as shown in the right image of Figure 2. This local map can be updated during the movements of the robot and then used to simulate the rear laser beams. In Section VII we show some results when learning and classifying a place using this method.
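One straightforward way to realize the simulated rear beams is to ray-cast in the local occupancy grid maintained around the robot. The sketch below is our own illustration of that idea (grid layout, resolution handling and beam count are assumptions, not the authors' implementation):

```python
import math

def simulate_beam(grid, resolution, x, y, angle, max_range=8.0):
    """Cast one virtual beam from (x, y) at `angle` (radians) through the local map.

    grid       : 2D list, True where a cell is occupied
    resolution : cell size in meters
    Returns the range in meters to the first occupied cell, or max_range.
    """
    step = resolution / 2.0
    r = 0.0
    while r < max_range:
        cx = int((x + r * math.cos(angle)) / resolution)
        cy = int((y + r * math.sin(angle)) / resolution)
        if cy < 0 or cx < 0 or cy >= len(grid) or cx >= len(grid[0]):
            break                         # beam leaves the local map
        if grid[cy][cx]:
            return r
        r += step
    return max_range

def simulate_rear_scan(grid, resolution, x, y, theta, n_beams=180):
    """Simulate the missing rear 180 degrees behind heading `theta`."""
    return [simulate_beam(grid, resolution, x, y,
                          theta + math.pi / 2 + i * math.pi / n_beams)
            for i in range(n_beams)]
```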
Fig. 2. The left image shows a robot at the end of a corridor with only a front laser (red). In the middle image the robot has an additional rear laser (blue). The right image depicts an example local map (shaded area).

Fig. 3. The left image depicts the training data. The right image shows the test set, with a classification rate of 97.3%. The training and test data were obtained by simulating laser range scans in the map.

VII. EXPERIMENTS

The approaches described above have been implemented and tested on real robots as well as in simulation. The robots used to carry out the experiments were an ActivMedia Pioneer 2-DX8 equipped with two SICK lasers, an iRobot B21r robot equipped with a camera system, and an ActivMedia PowerBot equipped only with a front laser. The goal of the experiments is to demonstrate that our simple features can be boosted to a robust classifier of places. Additionally, we analyze whether the resulting classifier can be used to classify places in environments for which no training data was available. Furthermore, we demonstrate the advantages of utilizing the vision information to distinguish between different rooms like, e.g., kitchens, offices, or seminar rooms. Additionally, we illustrate the advantages of the HMM filtering for classifying places with a moving mobile robot. We also present results applying our method for building semantic topological maps. Finally, we show experiments using a robot with only a front laser.

A. Results with the Sequential Classifier using Laser Data

The first experiment was performed using simulated data from our office environment in building 79 at the University of Freiburg. The task was to distinguish between three different types of places, namely rooms, doorways, and a corridor, based on laser range data only. In this experiment, we applied the sequential classifier without any filtering. For the sake of clarity, we separated the test from the training data by dividing the overall environment into two areas. Whereas the left part of the map contains the training examples, the right part includes only test data (Figure 3). The optimal decision list for this classification problem, in which the robot had to distinguish between three classes, is room-doorway. This decision list correctly classifies 97.3% of all test examples (right image of Figure 3). Additionally, we performed an experiment using a map of the entrance hall at the University of Freiburg, which contained four different classes, namely rooms, corridors, doorways, and hallways. The optimal decision list is corridor-hallway-doorway, with a success rate of 89.5%. The worst configurations of the decision list are those in which the doorway classifier is in the first place. This is probably due to the fact that doorways are hard to detect, because typically most parts of a range scan obtained in a doorway cover the adjacent room and the corridor. The high error in the first element of the decision list then leads to a high overall classification error.

B. Transferring the Classifiers to New Environments

The second experiment is designed to analyze whether a classifier learned in a particular environment can be used to successfully classify the places of a new environment. To carry out this experiment, we trained our sequential classifier in the left map of Figure 1, which corresponds to building 52 at the University of Freiburg. The resulting classifier was then evaluated on scans simulated given the map of the Intel Research Lab in Seattle, depicted in Figure 4. Although the classification rate decreased to 83.0%, the result indicates that our algorithm yields good generalizations which can also be applied to correctly label places of so far unknown environments. Note that a success rate of 83.0% is quite high for this environment, since even humans typically cannot consistently classify the different places.

Fig. 4. The left map depicts the occupancy grid map of the Intel Research Lab and the right image depicts the classification results obtained by applying the classifier learned from the environment depicted in Figure 1 to this environment. The fact that 83.0% of all places could be correctly classified illustrates that the resulting classifiers can be applied to so far unknown environments.
C. Classification of Trajectories using HMM Filtering

The third experiment was performed using real laser and vision data obtained in an office environment, which contains six different types of places, namely offices, doorways, a laboratory, a kitchen, a seminar room, and a corridor. The true classification of the different places in this environment is shown in Figure 5. The classification performance of the classifier along a sample trajectory taken by a real robot is shown in the left image of Figure 6. The classification rate in this experiment is 82.8%. If we additionally apply the HMM for temporal filtering, the classification rate increases to 87.9%, as shown in the right image of Figure 6.

Fig. 5. Ground truth labeling of the individual areas in the environment (L: laboratory, S: seminar room, C: corridor, F: office, D: doorway, K: kitchen).

Fig. 6. The left image depicts a typical classification result for a test set obtained using only the output of the sequence of classifiers. The right image shows the resulting classification in case an HMM is additionally applied to filter the output of the sequential classifier.

A further experiment was carried out using test data obtained in a different part of the same building. We applied the same classifier as in the previous experiment. Whereas the sequential classifier yields a classification rate of 86.0%, the combination with the HMM generated the correct answer in 94.7% of all cases. A two-sample t-test applied to the classification results obtained along the trajectories for both experiments showed that the improvements introduced by the HMM are significant at the α = 0.05 level. Furthermore, we classified the same data based solely on the laser features, ignoring the vision information. In this case, only 67.7% could be classified correctly without the HMM. The application of the HMM increases the classification performance to 71.7%. These three experiments illustrate that the HMM significantly improves the overall rate of correctly classified places. Moreover, the third experiment shows that the laser information alone is not sufficient to distinguish robustly between places with similar structure (see "office" and "kitchen" in Figure 6).

D. Building Topological Maps

The next experiment is designed to analyze our approach to build topological maps. It was carried out in the office environment depicted in the motivating example shown in Figure 1. The length of the complete corridor in this environment is approx. 20 m. After applying the sequential AdaBoost classifier (see middle image in Figure 1), we applied the probabilistic relaxation method together with the heuristics explained in Section V. The resulting topological map is shown in Figure 7. The final result gives a classification rate of 98.0% for all data points. The doorway between the two rightmost rooms under the corridor is correctly detected. Therefore, the rooms are labeled as two different regions in the final topological map.

Fig. 7. Final topological map of building 52 at Freiburg University.
E. Learning Topological Maps of Unknown Environments

This experiment is designed to analyze whether our approach can be used to create a topological map of a new, unseen environment. To carry out the experiment we trained a sequential AdaBoost classifier using the training examples of the maps shown in Figure 3 and Figure 1 with different scales. The resulting classifier was then evaluated on scans simulated in the map denoted as "SDR site B" in Radish [8]. This map represents an empty building in Virginia, USA. The corridor is approx. 26 meters long. The whole process for obtaining the topological map is depicted in Figure 8. The AdaBoost classifier gives a first classification rate of 92.4%. As can be seen in Figure 8(d), rooms number 11 and 30 are actually part of the corridor, and thus falsely classified. Moreover, the corridor is detected as only one region, although humans would potentially prefer to separate it into six different corridors: four horizontal and two vertical ones. Doorways are difficult to detect, and the majority of them disappear after the relaxation process because they are very sparse. In the final topological map, 96.9% of the data points are correctly classified.

F. Learning Topological Maps using Associative Markov Networks (AMNs)

In this experiment, we classify the map of building 79 at the University of Freiburg applying the learning and inference process for AMNs as explained in Section V-A. We divide the map into two parts and use one of them for training (see left image in Figure 3) and the second one for testing. In this experiment we reduce the resolution of the maps to 20 cm. The reason is that the original resolution of 5 cm generates a huge network which exceeds the memory resources of our computers during the training step of the corresponding AMN. The left image of Figure 9 shows the results of the classification using AMNs. The classification rate using AMNs was 98.8%. We compare this method with the classification obtained using our sequential AdaBoost classifier together with the probabilistic relaxation procedure. The right image of Figure 9 depicts the classification results. In this case only 92.1% of the cells were correctly classified. As we can see, one consequence of changing the resolution to 20 cm is that the classification rate decreases (see right image of Figure 3). We think this is due to the worse quality of the simulated beams in such a coarse-resolution map. On the other hand, AMNs seem to be more robust to changes in resolution and give better classification results.

Fig. 8. This figure shows (a) the original map of the building, (b) the results of applying the sequential AdaBoost classifier with a classification rate of 93%, (c) the resulting classification after the relaxation and region correction, and (d) the final topological map with semantic information. The regions are omitted in each node. The rooms are numbered left to right and top to bottom with respect to the map in (a). For the sake of clarity, the corridor-node is drawn maintaining part of its region structure.

Fig. 9. The left image depicts a classification of 98.8% of building 79 at the University of Freiburg using AMNs. The right image shows the classification of the same building using the sequential AdaBoost classifier together with the probabilistic labeling method. In this case the classification rate was 92.1%. The training and test data were obtained by simulating laser range scans in the left map of Figure 3.
G. Laser Observations with Restricted Field of View

In these experiments we show the results of applying our classification methods when the laser range scan has a restricted field of view. No image data was used. We first steered a PowerBot robot equipped with only a front laser along the 6th floor of the CAS building at KTH (right to left). The trajectory is shown in the top image of Figure 10. The data recorded on this floor was used to train the AdaBoost classifier. We then classified a trajectory on the 7th floor of the same building. We started the trajectory in the opposite direction (left to right). The resulting classification rate of 84.4% is depicted in the middle image of Figure 10. We repeated the experiment simulating the rear laser using a local map. The classification rate decreases slightly to 81.6%. Most of the errors appear in poses where the robot still sees a doorway due to the rear beams. This is not the case when using only a front laser, because the robot then sees a doorway only when facing it.

Fig. 10. The top image shows the training trajectory on the 6th floor of the CAS building at KTH. The middle image depicts the labeling of the trajectory on the 7th floor using only a front laser, with a classification rate of 84.4%. Finally, the bottom image shows the same labelled trajectory using a complete laser field of view together with a local map; in this case the classification rate decreases slightly to 81.6%.

To verify that the doorways can be the reason for the lack of improvement when using local maps, we repeated both experiments using only two classes, namely room and corridor. The results are shown in Figure 11. The top image depicts the labeling using only a front laser, with a classification rate of 87.3%. The bottom image shows the result of simulating the rear beams using a local map. The classification rate in this case increases to 95.8%.

Fig. 11. In these experiments only two classes were used, namely room and corridor. The top image depicts the classification of the trajectory on the 7th floor using only a front laser, with a classification rate of 87.3%. The bottom image shows the same trajectory using a complete laser field of view together with a local map.

VIII. CONCLUSION

In this paper, we presented a novel approach to classify different places in the environment of a mobile robot into semantic classes, like rooms, hallways, corridors, offices, kitchens, or doorways. Our algorithm uses simple geometric features extracted from a single laser range scan and information extracted from camera data, and applies the AdaBoost algorithm to form a binary strong classifier. To distinguish between more than two classes, we use a sequence of strong binary classifiers arranged in a decision list. We presented two applications of our approach. Firstly, we perform an online classification of the positions along the trajectories of a mobile robot by filtering the classification
In this case the classification rate increases to 95.8%. output using a hidden Markov model. Secondly, we present a new approach to create topological graphs from occupancy grids by applying a probabilistic relaxation labeling to take into account dependencies between neighboring places to improve the classifications. Experiments carried out using real robots as well as in simulation illustrate that our technique is well-suited to reliably label places in different environments. It allows us to robustly separate different semantic regions and in this way it is able to learn topologies of indoor environments. Further experiments illustrate that a learned classifier can even be applied to so far unknown environments. ACKNOWLEDGMENT This work has been partially supported by the EU under contract number FP6-004250-CoSy and by the German Research Foundation under contract number SBF/TR8. R EFERENCES [1] P. Althaus and H. Christensen, “Behaviour coordination in structured environments,” Advanced Robotics, vol. 17, no. 7, pp. 657–674, 2003. [2] D. Anguelov, R. Biswas, D. Koller, B. Limketkai, S. Sanner, and S. Thrun, “Learning hierarchical object maps of non-stationary environments with mobile robots,” in Proc. of the Conf. on Uncertainty in Artificial Intelligence (UAI), 2002. [3] D. Anguelov, D. Koller, P. E., and S. Thrun, “Detecting and modeling doors with mobile robots,” in Proc. of the IEEE Int. Conf. on Robotics & Automation (ICRA), 2004. [4] P. Beeson, N. K. Jong, and B. Kuipers, “Towards autonomous topological place detection using the extended voronoi graph,” in Proc. of the IEEE Int. Conf. on Robotics & Automation (ICRA), 2005. [5] P. Buschka and A. Saffiotti, “A virtual sensor for room detection,” in Proc. of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), 2002, pp. 637–642. [6] S. Chakrabarti, B. Dom, and P. Indyk, “Enhanced hypertext categorization using hyperlinks,” in SIGMOD ’98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data. New York, NY, USA: ACM Press, 1998, pp. 307–318. [7] H. Choset, “Topological simultaneous localization and mapping (SLAM): Toward exact localization without explicit localization,” IEEE Transactions on Robotics and Automation, 2001. [8] A. Howard and N. Roy, “Radish: The robotics data set repository.” [Online]. Available: http://radish.sourceforge.net/ [9] S. Koenig and R. Simmons, “Xavier: A robot navigation architecture based on partially observable markov decision process models,” in Artificial Intelligence Based Mobile Robotics: Case Studies of Successful Robot Systems, D. Kortenkamp, R. Bonasso, and R. Murphy, Eds. MITPress, 1998, pp. 91–122. [10] D. Kortenkamp and T. Weymouth, “Topological mapping for mobile robots using a combination of sonar and vision sensing,” in Proc. of the Twelfth National Conference on Artificial Intelligence, 1994, pp. 979– 984. [11] B. Kuipers and Y. Byun, “A robot exploration and mapping strategy based on a semantic hierarchy of spatial representations,” Robotics and Autonomous Systems, 8 1981. [12] B. Limketkai, L. Liao, and D. Fox, “Relational object maps for mobile robots,” in Proc. of the Int. Conf. on Artificial Intelligence (IJCAI), Edinburgh, Scotland, 2005, pp. 1471–1476. [13] O. Martı́nez Mozos, C. Stachniss, and W. Burgard, “Supervised learning of places from range data using adaboost,” in Proc. of the IEEE Int. Conf. on Robotics & Automation (ICRA), Barcelona, Spain, April 2005, pp. 1742–1747. [14] O. Martı́nez Mozos, C. Stachniss, A. Rottmann, and W. 
[14] O. Martínez Mozos, C. Stachniss, A. Rottmann, and W. Burgard, "Using adaboost for place labeling and topological map building," in Proc. of the Int. Symposium of Robotics Research (ISRR), San Francisco, CA, USA, October 2005.
[15] H. P. Moravec, "Sensor fusion in certainty grids for mobile robots," AI Magazine, pp. 61–74, Summer 1988.
[16] A. Rosenfeld, R. A. Hummel, and S. W. Zucker, "Scene labeling by relaxation operations," IEEE Trans. Systems, Man, Cybernetics, vol. 6, no. 6, pp. 420–433, 1976.
[17] A. Rottmann, O. Martínez Mozos, C. Stachniss, and W. Burgard, "Place classification of indoor environments with mobile robots using boosting," in Proc. of the National Conference on Artificial Intelligence (AAAI), Pittsburgh, PA, USA, 2005, pp. 1306–1311.
[18] R. Schapire and Y. Singer, "Improved boosting algorithms using confidence-rated predictions," Mach. Learn., vol. 37, no. 3, pp. 297–336, 1999.
[19] H. Shatkey and L. Kaelbling, "Learning topological maps with weak local odometric information," in Proc. of the Int. Conf. on Artificial Intelligence (IJCAI), 1997.
[20] A. Tapus and R. Siegwart, "Incremental robot mapping with fingerprints of places," in Proc. of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), August 2005, pp. 2429–2434.
[21] B. Taskar, P. Abbeel, and D. Koller, "Discriminative probabilistic models for relational data," in Proc. Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI), Edmonton, Canada, 2002.
[22] S. Thrun, "Learning metric-topological maps for indoor mobile robot navigation," Artificial Intelligence, vol. 99, no. 1, pp. 21–71, 1998.
[23] A. Torralba, K. Murphy, W. Freeman, and M. Rubin, "Context-based vision system for place and object recognition," in Proc. of the Int. Conf. on Computer Vision (ICCV), 2003.
[24] R. Triebel, P. Pfaff, and W. Burgard, "Multi-level surface maps for outdoor terrain mapping and loop closing," in Proc. of the International Conference on Intelligent Robots and Systems (IROS), 2006.
[25] P. Viola and M. Jones, "Robust real-time object detection," in Proc. of IEEE Workshop on Statistical and Theories of Computer Vision, 2001.
[26] Z. Zivkovic, B. Bakker, and B. Kröse, "Hierarchical map building using visual landmarks and geometric constraints," in Proc. of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), 2005.

Towards Stratified Spatial Modeling for Communication & Navigation

Robert J. Ross, Christian Mandel, John A. Bateman, Shi Hui, Udo Frese

Abstract— In this paper we present NavSpace, a stratified spatial representation developed for service robotics. While NavSpace's lower tiers, including metric and voronoi-graph information, focus on the needs of navigation and localization systems, the upper tiers have been explicitly developed to support relatively natural human-robot interaction. Specifically, NavSpace's upper tiers include: (a) a coarse grained Topological Level which includes Route Graph and Region Space models of the environment; and (b) the Concept Level, a cognitively grounded, coarse grained representation which links together the places, regions, and paths of the topological level. Furthermore, we describe the NavSpace model with respect to our completed prototype on Rolland, a semi-autonomous wheelchair, and in so doing, describe how the conceptual spatial representation relates to natural language technology through an empirically derived linguistic semantics bridge.
Index Terms— Spatial Modeling, HRI

I. INTRODUCTION

Empirical studies show that users employ models of space which are schematized mental constructions within which exact metric data is often systematically simplified and distorted [25], [26]. For example, several studies show that a user's space of navigation is essentially topological, consisting of landmarks, places, and paths [4], [20], [21]. Unsurprisingly then, the language people use to describe space is similarly coarse grained yet complex [23]. For robotic systems to effectively communicate with users, they must be able to communicate in terms of these coarse-grained spatial concepts. One way to approach this problem would be to attempt to map complex spatial language constructions directly to standard robotic spatial models such as voronoi graphs and metric data. The alternative view, pursued here, is that specific abstract spatial representations can be derived and used in parallel with low-level representations to facilitate the communication process. The need for such stratified spatial representations is by no means new [9], [11]; however, significant questions remain regarding the exact composition of such layers, their inter-relationships, and their relationship to natural language processing.

In this paper we present NavSpace, an implemented stratified spatial representation for service robots engaged in navigation tasks. While the lower tiers consist of sub-conceptual metric and voronoi graph information, which provide a suitable representation for robot navigation and localization, the upper levels provide spatial conceptual representations for the communicative tasks. We begin in Section II with a brief review of some of the challenging features of verbal spatial language which we have observed between users and mobile robots. As mentioned above, the need for stratified spatial models has been considered from different perspectives; in Section III we review and critique two of the more prominent such models, the Spatial Semantic Hierarchy and the Route Graph. In Section IV we then detail our own spatial model, i.e., NavSpace, which takes a more focused view of cognitive conceptualization and HRI needs than had been considered in previous approaches. In Section V we place the NavSpace model in the context of our target application, a semi-autonomous robotic wheelchair, by describing the use of the various layers by navigation, safety, and localization systems, and by showing how the upper conceptual layers are mapped to a linguistic semantics model which drives the spatial communication interface.

II. STRUCTURED SPATIAL INTERACTIONS

Our development of the NavSpace tiers has been driven partially through the analysis of two series of empirical studies conducted to investigate the nature of spatial language used in joint human-robot navigation tasks. The first series used a number of robotic platforms, and investigated the language used to describe spatial concepts, route descriptions, and spatial relations between objects [6], [20], [22]. The second study which we considered was Bugmann et al.'s Instruction Based Learning (IBL) project, which looked at navigation tasks performed by robots moving around a toy town [3], [12].
Generalizing the corpus data for compactness, Exchange 1 below presents one prototypical example of a relatively simple exchange between a user who wishes to be driven to a location, and the wheelchair, which does not know at first how to get to that location but initiates a clarification dialogue to determine where that location might be:

(1)
a. User: drive me to the lab
b. System: Sorry, I don't know about any lab.
c. System: Where is it?
d. User: it is beside the kitchen.
e. System: Is it before or after the kitchen?
f. User: After
g. System: OK, driving to the lab

While the user in Exchange 1 attempts to localize an object with respect to a second, mutually known object, users will often provide localization information in terms of a set of instructions on how to reach the goal location, i.e., a route instruction, such as that below:

(2)
a. User: drive me to the lab
b. System: Sorry, I don't know about any lab
c. System: Where is it?
d. User: Where are we now?
e. System: We are beside the kitchen
f. User: Go to the elevators
g. System: OK, to the elevators
h. User: Drive forward about 10 meters
i. System: OK
j. User: then turn
k. System: Should I turn left or right there?
l. User: right
m. System: OK
n. User: the lab is the second room to the left
o. System: OK, driving to the lab

These examples are intended to qualitatively illustrate a number of features common to exchanges concerning spatial concepts: (a) the granularity of environment space concepts used in verbal scene description and localization, e.g., "kitchen", "elevator"; (b) coarse grained action descriptions, e.g., "turn", "drive"; (c) quantitative terms, e.g., "10 meters"; (d) projective spatial relations, e.g., "to the left"; and (e) localization terms, e.g., "we are beside the kitchen". We argue that the variety of constructions used means that a number of different representational and processing approaches must be harnessed to account for the intrinsic variability of users' cognitive modeling of space. In the next section we review two existing models which have attempted to capture this stratified representation of spatial knowledge.

III. HIERARCHICAL SPATIAL MODELS

Over the past ten years, the need for robotic systems to incorporate spatial models adequate to support a range of tasks, from navigation to human-robot interaction, has become apparent. In the following we review two of the more prominent models which have attempted to span the robotics and spatial cognition communities, i.e., Kuipers's Spatial Semantic Hierarchy [11] and Krieg-Brückner's Route Graph [9].

A. The Spatial Semantic Hierarchy

The Spatial Semantic Hierarchy (SSH) [11] is a stratified model of an agent's spatial representation encompassing sensor data, behaviors, topological representation, and metric maps. Specifically, the SSH includes five distinct ontological levels: (a) the sensory level, (b) the control level, (c) the causal level, (d) the topological level, and (e) the metrical level. The sensory and control levels are sub-symbolic interfaces to sensory and behavioral capabilities; they allow the continuous numerical values of the physical platform to be abstracted to discrete symbolic representations for use at the causal and topological levels.
The causal level uses the situation calculus to capture causal relations among the robot's views (symbolic abstractions over the sensory input perceived by the robot at some time t), actions (which bring about changes in views), and events (the realization of a change in view). The metrical level, on the other hand, consists of a global 2-D geometric map of the environment in a single frame of reference, a so-called "Map in the Head" [11, p. 195]. In terms of the modeling of space for interaction with users, one of the more interesting levels is the topological level, which defines notions of places, paths and regions, along with their associated connectivity and containment relations. Places are defined as zero-dimensional entities which may lie on a path. A topological, or place, graph can then be constructed as a map of the environment consisting of sets of places and their connecting paths. Paths also serve as the boundaries for regions. A region is defined as a two-dimensional subset of the environment, i.e., a set of places. Path directedness also allows a reference system to be determined. Each directed path divides the world into two regions: one on the right and one on the left. A bounded region is then defined by a directed path with the region on this path's right or on its left. The SSH also uses regions to define a hierarchical view of space. There are therefore two levels of abstraction within the topological layer, one for places and one for regions.

One of the key themes of the SSH is that information at different ontological layers can be inter-related through mappings and abduction processes. Thus, the places, paths and regions of the topological level are created by deducing some minimal description that is sufficient to explain the regularities found among the observed views and actions of the causal level. As such, the SSH effectively provides something approaching a complete robot control architecture which directly integrates spatial representation with deliberation and sensory-motor issues. However, the spatial models used within the SSH were not developed with cognitive modeling or HRI issues in mind.

B. The Route Graph

One spatial modeling approach which has been developed with a view to cognitive plausibility and HRI issues is Krieg-Brückner et al.'s Route Graph (RG: [8], [9], [27]). The RG is an abstract graph-like representation of navigation space which may be instantiated to different kinds, layers, and levels of granularity. Instantiations of the RG have been made both for large-scale navigation space, such as tram networks [13], and for robotic applications in medium-scale space, such as office environments [18]. As defined in [9], the principal concepts within the abstract RG specification are Places, Segments, and Routes. Places are anywhere that an agent can 'be', and are defined as having their own reference system related to an origin (position and orientation) associated with the place. In turn, local reference systems may or may not be rooted on a global reference system, depending on the cognitive characteristics of the instantiated application. A Place's origin is used to define the orientation of connecting Segments, which are directed connections from one Place to another. A Segment is said to have a Course, an Entry, and an Exit, the latter two of which are defined with respect to the connecting Place's origin as described earlier, while the Course is some description of the actual route taken between the Entry and Exit.
A Route is then intuitively defined as a sequence of Segments without repetition (i.e., cycles are not permitted within route definitions). Places and Segments may be specialized with respect to particular applications. For example, in a voronoi-styled instantiation of the RG, Places may have a width denoting free space, while IROS 2006 workshop: From Sensors to Human Spatial Concepts Page: 42 a Segment’s Course may be instantiated with quantitative information such as distance or width of the segment – or alternatively with qualitative information such as the action to be performed to transverse the route segment. The RG also includes a notion of abstraction similar to that of the SSH. For example, at a particular level of abstraction an entry and an exit ramp of a highway can be considered as two different places (nodes in the RG), whereas, at a higher level of abstraction, the two nodes could be considered one place corresponding to the complex notion of a ‘road junction’. Similarly, the complex possibilities for navigation within a train station might quite appropriately, at a higher level of abstraction, be collapsed to a single place: ‘the station’. In [9] some relations required between the world of route graphs (places, route segments, paths, etc.) and other modeling domains more established. Two such domains are explicitly identified: a spatial ontology that is expected to provide spatial regions, and a ‘commonsense ontology’, providing everyday objects such as rooms, offices, corridors, and so on. The relationships provided are intended to allow inferences back and forth between places in a route graph, the spatial regions that such places occupy, and the everyday objects that those regions ‘cover’. The relationship to commonsense concepts was not however fully established and needs to be expanded upon before adequate connections to natural communication or a wide range of reasoning tasks can be established. Similarly, questions remain as to how the RG concept should be used alongside spatial modeling tasks which are better captured with quantitative means. The model presented in the following is intended to overcome some of these issues. IV. T HE NAVIGATION S PACE M ODEL In this section we describe the representational layers which together constitute the NavSpace model. A. The Metric & Voronoi Layers Our lowest level long terms spatial memory structure is an occupancy or Evidence Grid which denotes the probability of occupancy across a map-style 2D structure. Derived from the EvidenceGrid, the DistanceGrid is also organized as a rectangular array of grid cells, now containing the distances to the closest obstacle points. A graph like derivative of the DistanceGrid completes the lower-level representation. This Voronoi Diagram, depicted in figure 1(a) is a reduction of the DistanceGrid that describes the space with maximal clearance to the surrounding obstacles by a graph like structure. In practice the low-level spatial representation is quite simple, and has been computed from either local sensor-based grid maps on physical robots, as well as from a pre-existing global grid map derived from CAD blueprints. The EvidenceGrid and Voronoi Graphs provide a comprehensive spatial model for all navigation, localization, and safety tasks as illustrated by our demonstrator described in Section V. 
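To make the relationship between the EvidenceGrid and the DistanceGrid concrete, the following is a minimal sketch, not the implementation described above, of deriving a distance grid from an occupancy/evidence grid with a breadth-first distance transform. The occupancy threshold, cell size, and function name are illustrative assumptions, and the 4-connected (Manhattan) metric is a simplification; a Euclidean distance transform would typically be preferred in practice.

```python
from collections import deque
import numpy as np

def distance_grid(evidence_grid, occ_threshold=0.65, cell_size=0.05):
    """Breadth-first distance transform over a 2D occupancy grid.

    evidence_grid : 2D array of occupancy probabilities in [0, 1]
    occ_threshold : probability above which a cell counts as an obstacle
                    (illustrative value)
    cell_size     : edge length of one grid cell in meters
    Returns a 2D array holding, per cell, the 4-connected distance in
    meters to the closest obstacle cell.
    """
    grid = np.asarray(evidence_grid)
    dist = np.full(grid.shape, np.inf)
    queue = deque()
    # Seed the search with all obstacle cells at distance 0.
    for r, c in zip(*np.where(grid > occ_threshold)):
        dist[r, c] = 0.0
        queue.append((r, c))
    # Expand outwards one cell at a time (uniform edge cost).
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < grid.shape[0] and 0 <= nc < grid.shape[1]:
                if dist[nr, nc] > dist[r, c] + cell_size:
                    dist[nr, nc] = dist[r, c] + cell_size
                    queue.append((nr, nc))
    return dist
```

The ridges of local maxima in such a distance grid are the cells with maximal clearance to the surrounding obstacles, which is what the Voronoi Diagram extraction step retains as a graph-like structure.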
Moreover, in a recent paper [14], it was shown that through the definition of fuzzy spatial relations interpretation functions, it was possible to use the Voronoi Graph directly to interpret a number of course route instructions given by a user. However, that interpretation algorithm relied on the user providing relatively long route instruction chunks such that statistical methods could be used to eliminate nonroute vertices from the voronoi graph. The interpretation of shorter route instructions, and similarly the generation of spatial descriptions directly against a Voronoi graph, is less straightforward. In the next section we describe our abstracted topological representation which we use to overcome these issues. B. The Topological Layer Our principle representation of conceptual topological space is a Route Graph (RG) instantiation that abstracts between quantitative and qualitative structure. This conceptual topology is used not for low-level robot control, but purely for communication purposes, i.e., for interpreting and generating spatial descriptions. While the original RG specification provides a good framework for topological representations, the modeling of projective and other spatial relations, e.g, left, after, remained unspecified and were to be filled by the domain application. In a recent work [10], the RG was combined with the Double Cross qualitative spatial reasoning calculus [7] to provide a qualitative route interpretation approach. The model proved useful in integrating strictly qualitative terms when no quantitative information is available, but it cannot by definition be used to process higher-level reasoning tasks such as the general process of location identification. For the specific spatial concept definition system developed for this paper, we have used a single 2D global reference frame to place all Places within. Although this is not the favored approach with the Route Graph specification, we feel that this is a reasonable decision since the actual size of spatial environments, as used in the demonstration system described in Section V, are relatively small. The principle spatial characteristics of Places and RouteSegments then become cartesian coordinates and orientations with respect to the global reference frame. In summary then, our RG structure applied here consists of: • • P a set of Places S a set of Segments that connect Places where Places are located within a 2D reference frame, and the entry and exit points of each Segment are defined with respect to a global orientation (analogous to the defining orientations with respect to North). Furthermore, each RG Place and Segment is said to mark a given conceptual place such as Kitchen or Corridor. This conceptual mapping is addressed further in Section IV-C. With a simple Route Graph styled topology in place, the question becomes how do we communicate about space with the user? Core communicative tasks in the navigation domain are the interpretation and description of object locations. Such descriptions invariably make use of relations such as behind, left of, and next to which are essentially dynamic in that they cannot as such be encoded directly into a spatial representation, but are relative to a given relatum and origin IROS 2006 workshop: From Sensors to Human Spatial Concepts Page: 43 (a) (b) (c) Fig. 1. Illustration of the various levels of the NavSpace model. Figure 1(a) presents the voronoi graph superimposed over line segment structure resulting from sensory data. 
Figure 1(b) then shows an abstract topological representation of the same floor. Finally 1(c) shows a simplified fragment of the conceptual layer. at time t1 . To interpret such context specific relations, we have defined a number of relation operators which operate on one or two Places in the RG to produce a region which in turn may include one or more specific Places. In the following we define this approach for the projective and ordering relations. 1) Projective Relations: To interpret and produce qualitative spatial relations against the underlying quantitative model, we first require concrete definitions of projective relations like left and right. We apply a simple model which was empirically validated with 25 German speakers in an earlier study with a robot perception system [24]. 2) Ordering Relations: Computational definitions of spatiotemporal relations such as before and after make use of partial orderings of Places on a given Route. In the cases of places which lie directly on a proper route, this ordering becomes trivial and is given directly through the topological structuring of the graph. For places which lie “off the route”, a projection must first be made of the place onto its connecting node on the RG, after which standard partial orderings can be applied. 3) The Abstraction Process: Before progressing, it should be noted that the creation of the abstract Topological level cannot be simply abstracted from the voronoi graph structure. This is because that key to the use of the abstract topological layer is knowledge of the types of rooms and junctions being covered by the graph. Such classification tasks require either the use of visual room identification techniques based on heuristics of form and function, or through in-advance annotations or dialogues with users. C. The Concept Layer The Concept Layer comprises a conceptual ontology which facilitates three functions. Firstly, it provides a framework for 1 Qualitative Spatial Reasoning (QSR) models do attempt to encode this information statically into a representation, but we consider computation of such relations with respect to a quantitatively described underlying model to be more advantageous to current application needs. all topology entities such as the spatial relations, places and segments to be placed. Secondly, it delivers an ontologically described structuring of the physical entities in the robot’s environment such to allows navigation space entities to be related to entities in other domain application models, e.g., user models. Finally, and perhaps most importantly, it provides a semantics for communication with other agents in the environment (artificial or human). Rather than taking an ad-hoc approach to our modeling of the agent’s concepts, we have leveraged off existing work in formal ontology and knowledge engineering. From an engineering perspective, Formal Upper Conceptual Ontologies provide structuring principles for knowledge based information systems. While we accepted that such symbolic reasoning systems should be kept quite separated from lowlevel robotic control, we argue that the non-reactive nature of communication requires the presence of such symbolic structuring at higher-cognitive levels. By conceptual ontologies we refer to such efforts as the Suggested Upper Merged Ontology (SUMO) [16] one of three starter documents which were submitted to the IEEE for a Standard Upper Ontology (SUO), or the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE) [15]. 
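As an illustration of the relation operators introduced in Sections IV-B.1 and IV-B.2 above, the following is a minimal sketch of how projective (left/right) and ordering (before/after) relations might be evaluated for Places held in a single 2D global frame. The Place class, the fixed 90-degree angular sectors, and the helper names are assumptions for illustration only; they stand in for the empirically validated projective model of [24] used in the actual system.

```python
import math
from dataclasses import dataclass

@dataclass
class Place:
    name: str
    x: float      # position in the global 2D frame (meters)
    y: float
    theta: float  # orientation of the place's local axis (radians)

def projective_relation(origin: Place, relatum: Place, target: Place,
                        sector=math.pi / 2):
    """Classify target as 'front'/'back'/'left'/'right' of relatum as seen
    from origin, using simple fixed angular sectors (an assumption)."""
    view = math.atan2(relatum.y - origin.y, relatum.x - origin.x)
    ang = math.atan2(target.y - relatum.y, target.x - relatum.x)
    rel = (ang - view + math.pi) % (2 * math.pi) - math.pi  # wrap to [-pi, pi]
    if abs(rel) <= sector / 2:
        return "front"
    if abs(rel) >= math.pi - sector / 2:
        return "back"
    return "left" if rel > 0 else "right"

def ordering_relation(route, a: Place, b: Place):
    """Return 'before'/'after' for two places with respect to a route,
    given as an ordered list of Places; off-route places are assumed to
    have been projected onto their connecting route node beforehand."""
    ia, ib = route.index(a), route.index(b)
    return "before" if ia < ib else "after"
```

Both operators return a qualitative relation computed on demand from the quantitative Place coordinates, which reflects the design choice discussed above of not encoding such context-dependent relations statically in the representation.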
Furthermore, in recent years there have been a number of efforts to provide concrete treatments of space within Upper Ontologies by extending the traditional conceptual view of space (mereology, topology) with Geographic Information Systems, and Qualitative Spatial Reasoning. As upper conceptual ontology for the Navigation Space model we have chosen a Description Logic fragment of DOLCE [15]. While DOLCE gives a well-defined upper ontology, issues such as the structuring of topological relations are left to instantiated domain ontologies. We instantiated the ontology with suitable classes to model the environment spaces used by users in our interactions, e.g., Room, Junction, categories for the agent’s themselves, e.g., Person, the spatial IROS 2006 workshop: From Sensors to Human Spatial Concepts Page: 44 relations that can hold between entities in the navigation space, e.g., leftOf, and the relations which allow mapping between the nodes of the RG structure and the conceptual entities they mark. Figure 1(c) presents an extremely simplified representation of what the resultant conceptual layer ‘looks like’. V. A PPLYING THE NAV S PACE M ODEL In this section we place the NavSpace mode in the context of our development scenario and platform. We have developed NavSpace as a spatial representation system for Rolland, a semi-autonomous robotic wheelchair platform, which is intended for use with people of limited cognitive and or physical abilities within a rehabilitation or care environment. Such an application requires that the system be capable of functioning in and communicating about a range of spatial domains ranging from individual rooms, to complex indoor environments such as hospitals, and large-scale environments such as hospital grounds and metropolitan zones. A. Experimental Platform: Rolland III The experimental platform is Rolland III, a battery powered Meyra Champ 1.594 wheelchair. Rolland III is equipped with two laser scanners mounted at ground level, which allow for scanning beneath the feet of a human operator. As an additional sensor device the system provides two incremental encoders which measure the rotational velocity of the two independently actuated wheels. For Local Navigation, we decided to employ a geometric path planner using cubic Bezier curves, since they are able to connect two given points while accounting for a desired curve progression and for directional requirements in the start point and end point. The key feature of the algorithm is that given the current pose of the wheelchair startP ose = (xs , ys , θs ) and the desired target goalP ose = (xg , yg , θg ), we search the space of cubic Bezier curves for paths that: • connect p ~0 = (xs , ys ) with p~3 = (xg , yg ), • are smoothly aligned with θs in p ~0 and with θg in p~3 • are obstacle free in the sense that a contour of the robot shifted tangentially along the path does not intersect with any obstacle point from a given occupancy grid. For Global Localization, a Monte-Carlo-Localization method was implemented that is based on the self-locator used by the German RoboCup Team [17]. To establish hypotheses of the current location of the wheelchair, particles are drawn from a pre-computed table that indexes the global environment by the area of the scan perceived from a certain position. In addition, only distance measurements resulting from flat surfaces i.e. from segmented lines are used to determine the current position. Thereby, persons standing around are ignored. 
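To make the local navigation search concrete, the following is a minimal sketch of generating and checking cubic Bezier path candidates between a start and goal pose. It assumes that the searched parameters are the offsets of the two interior control points along the start and goal headings, and it replaces the shifted-contour collision test with a simple per-point free-space callback; both are illustrative assumptions rather than the implementation used on Rolland.

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, n=50):
    """Sample a cubic Bezier curve at n points."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

def find_bezier_path(start_pose, goal_pose, is_free,
                     d_values=np.linspace(0.2, 2.0, 10)):
    """Search cubic Bezier curves connecting start and goal poses.

    start_pose, goal_pose : (x, y, theta)
    is_free               : callback (x, y) -> True if collision free
                            (stand-in for the shifted-contour test
                            against the occupancy grid)
    d_values              : candidate control-point offsets (assumption)
    Returns the first sampled curve that is aligned with both headings
    and collision free, or None.
    """
    xs, ys, ts = start_pose
    xg, yg, tg = goal_pose
    p0, p3 = np.array([xs, ys]), np.array([xg, yg])
    for d1 in d_values:
        for d2 in d_values:
            # Placing p1/p2 along the start/goal headings makes the curve
            # tangents at p0 and p3 match theta_s and theta_g.
            p1 = p0 + d1 * np.array([np.cos(ts), np.sin(ts)])
            p2 = p3 - d2 * np.array([np.cos(tg), np.sin(tg)])
            curve = cubic_bezier(p0, p1, p2, p3)
            if all(is_free(x, y) for x, y in curve):
                return curve
    return None
```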
The number of particles drawn from observations depends on how good the actual sensor measurements match the expected measurements from the positions of particles. A Safety Layer (Lankenau and Rofer, 2001) analyzes any driving command with respect to the current obstacle situation. It then decides whether the given command is to be forwarded to the actuators or to be replaced by a necessary deceleration manoeuvre. The key concept in the implementation of the Safety Layer is the so-called Virtual Sensor. For a given initial orientation (~e) of the robot and a pair of translational (v) and rotational (w) speed, it stores the indices of cells of an EvidenceGrid that the robots shape would occupy when initiating an immediate full stop manoeuvre. A set of precomputed Virtual Sensors for all combinations of (~e, v, w) then allows us to check the safety of any driving command, either instructed by the operator via joystick or by an autonomous navigation process, in real time. B. The Linguistic Semantics Bridge The variability of spoken language is too great, except perhaps in the most trivial of applications, simply to map words and phrases onto a pure conceptual level. Features such as metaphor and the polysemy of meaning require that there must be a strict separation of non-linguistic levels of knowledge such as the NavSpace model, and a linguistic semantics, which provides the bridge to natural language grammars and discourse reasoning. We specify our linguistic semantics – a model of the surface form of language – with the Generalized Upper Model (GUM) [1], [2], a so-called linguistic ontology, which while being related to the types of conceptual ontologies introduced earlier, models the world from the underspecified perspective of natural language. The current version of GUM specifically extends earlier works to account for the range of spatial conceptual language encountered in the studies introduced in Section II [1]. To illustrate the model, below is a simplified compact-serialization of the linguistic semantics for “‘the bowl is to the left of the microwave”: (SL1 / SpatialLocating :locatum (h1 / lm-bowl) :direction (l1 / GeneralizedLocation :hasSpatRel (c1 / LeftProjection) :relatum (mw / microwave) )) Mapping back and forth between linguistic semantics representations and the conceptual spatial representation is the responsibility of the dialogue management system [19]. Conceptual to linguistic relationships have been established to aid this mapping [5], but in general the mapping of of course context dependent and relies upon other elements of the system’s state. VI. C ONCLUSIONS & F UTURE W ORK In this paper we have attempted to give an overview of an implemented spatial representation for mobile robots which makes use of an ontologically grounded conceptual layer to drive a spatial reasoning engine for dialogue based interaction. The representation includes an abstract topological layer which performs the job of pruning the environmental search space of information which is too fine grained for communicative tasks. A conceptual ontology allowed for more meaningful representations of the objects in the navigation space, thus moving beyond simple “annotations”. Current work includes the formalization of the relationship between the conceptual and linguistic ontologies, and the preparation of Rolland for IROS 2006 workshop: From Sensors to Human Spatial Concepts Page: 45 clinical trials and evaluation in a nursing home environment. 
While for reasons of space we have had to gloss many of the precise details of our approach, we believe that formal conceptual ontology along with qualitative spatial reasoning will play a key role in bridging the human-robot divide. R EFERENCES [1] J. Bateman, S. Farrar, and J. Hois, “The generalized upper model 3.0,” Technical Report – The SFB/TR8 Spatial Cognition Research Centre, June 2006. [2] J. A. Bateman, R. Henschel, and F. Rinaldi, “Generalized upper model 2.0: documentation,” GMD/Institut für Integrierte Publikations- und Informationssysteme, Darmstadt, Germany, Tech. Rep., 1995. [Online]. Available: http://purl.org/net/gum2 [3] G. Bugmann, E. Klein, S. Lauria, and T. Kyriacou, “Corpus-Based Robotics: A Route Instruction Example,” in In Proceedings of IAS-8, 2004. [4] M. Denis, “The description of routes: A cognitive approach to the production of spatial discourse,” Cahiers de Psychologie Cognitive, vol. 16, pp. 409–458, 1997. [5] S. Farrar, J. Bateman, and R. J. Ross, “On the Role of Conceptual & Linguistic Ontologies in Spoken Dialogue Systems,” in The Symposium on Dialogue Modelling and Generation, submitted. [6] K. Fischer, What Computer Talk Is and Is not: Human-Computer Conversation as Intercultural Communication. Computational Linguistics, 2006, vol. Vol 17. [7] C. Freksa, “Using orientation information for qualitative spatial reasoning,” in Theories and Methods of Spatio-Temporal Reasoning in Geographic Space, ser. LNCS, vol. 639. Springer-Verlag, 1992, pp. 162–178. [8] B. Krieg-Brückner, T. Röfer, H.-O. Carmesin, and R. Müller, “A taxonomy of spatial knowledge for navigation and its application to the bremen autonomous wheelchair,” in Spatial Cognition I - An interdisciplinary approach to representing and processing spatial knowledge, C. Freksa, C. Habel, and K. Wender, Eds. Berlin: Springer, 1998, pp. 373–397. [Online]. Available: http: //link.springer-ny.com/link/service/series/0558/tocs/t1404.htm [9] B. Krieg-Brückner, U. Frese, K. Lüttich, C. Mandel, T. Mossakowski, and R. Ross, “Specification of an Ontology for Route Graphs,” in Spatial Cognition IV: Reasoning, Action, Interaction. International Conference Spatial Cognition 2004, Frauenchiemsee, Germany, October 2004, Proceedings, C. Freksa, M. Knauff, B. Krieg-Brückner, B. Nebel, and T. Barkowsky, Eds. Berlin, Heidelberg: Springer, 2005, pp. 390–412. [10] B. Krieg-Brückner and H. Shi, “Orientation calculi and route graphs: Towards semantic representations for route descriptions,” in Proc. International Conference GIScience 2006, Münster, Germany, 2006, (to appear). [11] B. Kuipers, “The spatial semantic hierarchy,” Artificial Intelligence, vol. 19, pp. 191–233, 2000. [12] S. Lauria, G. Bugmann, T. Kyriacou, and E. Klein, “Mobile robot programming using natural language,” Robotics and Autonomous Systems, vol. 38, no. 3–4, pp. 171–181, feb 2002. [13] K. Lüttich, B. Krieg-Brückner, and T. Mossakowski, “Tramway networks as route graphs,” in FORMS/FORMAT 2004 – Formal Methods for Automation and Safety in Railway and Automotive Systems, E. Schnieder and G. Tarnai, Eds., 2004, pp. 109–119. [14] C. Mandel, U. Frese, and T. Röfer, “Robot navigation based on the mapping of coarse qualitative route descriptions to route graphs,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2006). [15] C. Masolo, S. Borgo, A. Gangemi, N. Guarino, A. Oltramari, and L. 
Schneider, “The WonderWeb library of foundational ontologies: preliminary report,” ISTC-CNR, Padova, Italy, WonderWeb Deliverable D17, August 2002. [16] A. Pease and I. Niles, “IEEE Standard Upper Ontology: A progress report,” Knowledge Engineering Review, vol. 17, 2002, special Issue on Ontologies and Agents. [17] T. Röfer and M. Jüngel, “Vision-based fast and reactive monte-carlo localization,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA-2003), 2003, pp. 856–861. [18] T. Röfer and A. Lankenau, “Route-based robot navigation,” Knstliche Intelligenz, 2002, Themenheft Spatial Cognition. [19] H. Shi, R. J. Ross, and J. Bateman, “Formalising control in robust spoken dialogue systems,” in Software Engineering & Formal Methods 2005, Germany, Sept 2005. [20] H. Shi and T. Tenbrink, “Telling Rolland where to go: HRI dialogues on route navigation,” in WoSLaD Workshop on Spatial Language and Dialogue, October 23-25, 2005, 2005. [21] L. Talmy, “How language structures space,” in Spatial Orientation: Theory, Research, and Application, H. Pick and L. Aredolo, Eds. New York: Plenum Press, 1983. [22] T. Tenbrink, “Identifying objects in english and german: Empirical investigations of spatial contrastive reference,” in WoSLaD Workshop on Spatial Language and Dialogue, October 23-25, 2005, 2005. [23] ——, Localising objects and events: Discoursal applicability conditions for spatiotemporal expressions in English and German. Dissertation. Bremen: University of Bremen, FB10 Linguistics and Literature, 2005. [24] T. Tenbrink and R. Moratz, “Group-based spatial reference in linguistic human-robot interaction,” in Proceedings of EuroCogSci 2003: The European Cognitive Science Conference, 2003, pp. 325–330. [25] B. Tversky, “Structures of mental spaces – how people think about space,” Environment and Behavior, vol. Vol.35, No.1, pp. 66–80, 2003. [26] B. Tversky and P. Lee, “How space structures language,” in Spatial Cognition: An interdisciplinary Approach to Representation and Processing of Spatial Knowledge, ser. Lecture Notes in Artificial Intelligence, C. Freksa, C. Habel, and K. Wender, Eds., vol. 1404. Springer-Verlag, 1998, pp. 157–175. [27] S. Werner, B. Krieg-Brückner, and T. Herrmann, “Modelling navigational knowledge by route graphs,” in Spatial Cognition II - Integrating Abstract Theories, Empirical Studies, Formal Methods, and Practical Applications, C. Freksa, W. Brauer, C. Habel, and K. Wender, Eds. Berlin: Springer, 2000, pp. 295–316. IROS 2006 workshop: From Sensors to Human Spatial Concepts Page: 46 Learning Spatial Concepts from RatSLAM Representations Ruth Schulz, Michael Milford, David Prasser, Gordon Wyeth and Janet Wiles School of Information Technology and Electrical Engineering The University of Queensland Brisbane, Australia {ruth, milford, prasserd, wyeth, wiles} @itee.uq.edu.au Abstract – RatSLAM is a biologically-inspired visual SLAM and navigation system that has been shown to be effective indoors and outdoors on real robots. The spatial representations at the core of RatSLAM, the pose cells, form in a distributed fashion as the robot learns the environment. The activity in the pose cells, while being coherent, does not possess strong geometric properties, making it difficult to use as the basis for communication with the robot. The pose cells’ companion representation, the experience map, possesses stronger geometric properties, but still does not represent the world in a human readable form. 
A new system, dubbed RatChat, has been introduced to enable meaningful communication with the robot. The intention is to use the “language games” paradigm to build spatial concepts that can be used as the basis for communication. This paper describes the first step in the language game experiments, showing the potential for meaningful categorization of the spatial representations in RatSLAM. The categorization performance is compared for both the pose cells and experience map, with the results showing stronger concept formation using the more geometrically structured experience map. Index Terms – Spatial conceptualization, RatSLAM I. INTRODUCTION Recent research in mobile robotics has been dominated by the problem of Simultaneous Localization And Mapping (SLAM). Roboticists have investigated a wide range of approaches to solving the problem and have created a number of probabilistic methods that can perform SLAM under appropriate assumptions [1-3]. However, by focusing on the SLAM problem other considerations such as map usability have mostly been neglected. Traditional metrics such as accuracy are starting to be supplanted by concepts such as map usability and communicability. Geometric space representations are a natural choice for many robot mapping and localization methods, but are dissimilar to the more abstract ways in which humans view their environments. Humans can conceptualize their environment in terms of concepts such as rooms: “the bathroom”, “the kitchen”; or objects: “behind the couch”, or “on top of the table”. If humans are to easily and naturally interact with robots, the robots must be able to understand and process such concepts. For instance, one goal for domestic robots would be the ability for a human to tell a robot to “clean the bathroom”, rather than specifying a range of geometric co-ordinates. Many algorithms have been developed to solve components of the mapping and navigation problem such as SLAM. The most successful simultaneous localization and mapping algorithms are all probabilistic and can be separated into three categories: Kalman Filter (KF), Expectation Maximization (EM), and particle filter algorithms. Methods based on these algorithms typically produce two types of map. Landmark or feature maps store the locations of interesting objects in the environment, such as rocks [3] or trees [4]. Occupancy grid maps represent an environment with a high resolution grid, with each grid cell encoding whether the corresponding location in the environment is free or occupied [5]. While both types of map can be accurate, in their raw form they are very different to the abstract spatial concepts a human uses. A. RatSLAM RatSLAM is a biologically inspired, vision-based mapping and navigation system. The system uses an extended computational model of the rodent hippocampus to solve the SLAM problem. The world representations produced by RatSLAM are coherent but differ in several respects from the maps produced by probabilistic methods. RatSLAM maps are locally metric but globally topological, and are not directly usable for higher level tasks such as goal navigation. RatSLAM is complemented by an algorithm known as experience mapping, which creates spatio-temporal-behavioral maps from the RatSLAM representations. Experience maps are used to implement methods for exploration, goal navigation, and adaptation to environment change. 
Experiments in indoor and (to a lesser extent) outdoor environments on two different robot platforms have demonstrated the system’s ability to autonomously explore, SLAM, navigate to goals, and adapt to simple environment changes. The maps are implemented in a fashion that best suits the robot’s autonomous operations, and are not designed for effective robot-human communication. This paper shows the first steps towards building human-friendly representations based on the RatSLAM system. B. Spatial Conceptualization Conceptualization and communication can be investigated in embodied agents that develop languages for labeling objects in their environment, or coordinating signaling and motor behaviors for the completion of a task. Embodied agents have been implemented as both static and mobile robots, and communication based on synthetic and natural languages. While the majority of studies investigating communication in embodied agents involve robot-robot communication, some involve interaction with humans. By interacting with humans, robots can be taught to understand natural language for labels [7] or descriptions [8], or games can be played [9], where the IROS 2006 workshop: From Sensors to Human Spatial Concepts Page: 47 human is another agent with which the robot can interact. For a review on communication in embodied agents see [6]. In a language game, communities of agents interact to evolve a common language. When two agents in the community have shared attention, they attempt to communicate about something in their environment. One agent is the speaker, and produces an utterance for the topic, while the other agent is the listener, and attempts to comprehend the utterance. After each game, the agents update their conceptualizations to improve the chance of successful communication, eventually resulting in a shared communication system. The conceptualizations are typically grounded in the image domain or in the simple sensory perceptions of the robots. The representations of these robots include vision, where scenes are segmented and processed for concepts of color, position, and size [9], minimal proximity sensors, and light sensors [10]. In addition, mobile robots can be taught spatial descriptions such as left, right, front, and back, and can follow natural language instructions relating to movement, such as “move forward” [8]. Representations can be obtained from occupancy grids, where the robots are provided with routines that determine how spatial concepts are formed. Spatial conceptualization in mobile robots has been limited to commands or descriptions about the current location of the robot where the robot has been provided with a set of routines for forming spatial concepts. Other forms of conceptualization have incorporated the ability of agents to form new concepts based on experience and interactions with other agents. C. RatChat RatChat is a proposed language system that extends RatSLAM, investigating spatial conceptualization in mobile robots based on experiences and interactions with humans and other robots. The concepts formed by RatChat agents are to be the location in an environment, spatial relationships between locations, and using different perspectives to talk about these spatial relationships. RatChat will develop a framework to enable the robots to form a language in a population of robots, or with a human, by playing language games. 
Preliminary studies have investigated the robot representations, and how agents can generalize from the training sets to novel meanings and terms [11]. II. RATSLAM MODEL AND E XPERIENCE MAPPING ALGORITHM This section briefly describes the RatSLAM model, vision processing system and experience mapping algorithm – a more detailed description is given in [12] and [13]. Fig. 1 RatSLAM system structure. The conceptual map can be formed from representations stored in the pose cell network or the experience map. A. RatSLAM Model Fig. 1 shows the core structure of the RatSLAM system. The robot’s pose is represented by activity in a competitive attractor neural network called the pose cells. Wheel encoder information is used to perform path integration by appropriately shifting the current pose cell activity. Vision information is converted into a local view (LV) representation that is associated with the currently active pose cells. If familiar, the current visual scene also causes activity to be injected into the particular pose cells associated with the currently active local view cells. B. Vision System RatSLAM recognizes locations from external sensor information provided by an appearance based view recognition system. The role of this system is to create patterns of activity in the local view cells using the camera data, which are dependent upon robot location. Camera information is matched against a growing database of learnt images, each of which has an associated cell in the local view. Recognition of an image activates the appropriate cell allowing the formation of new view to pose associations and the injection of energy into the pose cells. Unrecognized camera images are added to the database so that the system is able to explore a previously unseen environment. The city block metric is used to compare low resolution (24 × 18) normalized grayscale images. Learnt views that are within a threshold distance of the current view have their local view cells activated in inverse proportion to their distance from the current view. Visual ambiguity or redundancy in the environment is accounted for by the view to pose associations which enable one view to correspond to multiple physical locations and vice-versa. Since robot position is filtered by the pose cells, incorrect recognition and visual ambiguity in the local view is not catastrophic. C. Experience Mapping Algorithm The experience mapping algorithm creates maps from the representations stored in the local view and pose cell networks. The premise of the algorithm is the creation and maintenance of a collection of experiences and interexperience links. The algorithm creates experiences to represent certain states of activity in the pose cell and local view networks. The algorithm also learns behavioral, IROS 2006 workshop: From Sensors to Human Spatial Concepts Page: 48 temporal, and spatial information in the form of interexperience links. Fig. 2 shows the relationship between the experience map and the core RatSLAM representations. A more detailed discussion of the algorithm is given in [13]. Fig. 2 An experience is associated with certain pose and local view cells, but exists within the experience map’s own coordinate space. Experiences have an activity level that is dependent on the activity peaks in the pose cells and the local view cells. The most active experience is known as the peak experience. Learning of new experiences is triggered by the peak experience’s activity level dropping below a threshold value. 
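The appearance matching step of the vision system described above can be sketched as follows. This is a minimal illustration, assuming a concrete distance threshold, a crude brightness normalization, and an inverse-distance activation function; none of these values are taken from the RatSLAM implementation.

```python
import numpy as np

class LocalViewCells:
    """Growing database of learnt 24x18 grayscale templates.

    Matching uses the city-block (L1) metric; learnt views closer than
    `threshold` activate their local view cell in inverse proportion to
    the distance, as described in the text. The threshold value and the
    normalization are assumptions.
    """
    def __init__(self, threshold=400.0):
        self.templates = []          # list of 24x18 float arrays
        self.threshold = threshold

    def update(self, image):
        """Return local view cell activations for `image`; the image is
        added as a new template (new cell) if nothing matches."""
        img = np.asarray(image, dtype=float)
        img = img / (np.mean(img) + 1e-9)          # crude normalization
        activations = np.zeros(len(self.templates) + 1)
        matched = False
        for i, tpl in enumerate(self.templates):
            d = np.abs(img - tpl).sum()            # city-block distance
            if d < self.threshold:
                activations[i] = 1.0 / (1.0 + d)   # inverse to distance
                matched = True
        if not matched:
            self.templates.append(img)             # learn the unseen view
            activations[len(self.templates) - 1] = 1.0
        return activations[:len(self.templates)]
```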
Inter-experience links store temporal, behavioral, and odometric information about the robot's movement between experiences. Repeated transitions between experiences result in an averaging of the odometric information. Discrepancies between a transition's odometric information and the linked experiences' (x, y, θ) coordinates are minimized through a process of map correction. As well as learning experience transitions, the algorithm also monitors transition 'failures' in order to adapt to environment changes. Failures occur when the robot's current experience switches to an experience other than the one expected given the robot's current movement behavior. If enough of these failures occur for a particular transition that link is deleted, thereby updating the experience map.

III. A METHOD FOR PRODUCING HUMAN SPATIAL CONCEPTS

In a conceptualization process, agents form concepts from their representations of the world. One form of conceptualization involves interaction with a teacher, in which the different concepts that the agent is to learn are provided. In this process, agents learn to associate input patterns with different concepts. For RatChat agents, the inputs are pose cells or experiences. The simplest concepts that can be formed, considering the information contained in pose cell and experience representations, are labels for locations in the world. The outputs of the language agents should group input patterns into categories or concepts. The simplest output representation is one-hot encoding, with each concept associated with a single output unit. For locations in an indoor environment, each room or corridor can be associated with an output unit. The conceptualization process for RatChat agents is the association of the input patterns of pose cells and experiences with the output representations of locations. The agents learn this association using a single layer neural network shown in Fig. 3, with pose cells or experiences as inputs and a set of output units referring to the different locations in the world. The concept associated with a pattern of pose cells or experiences is the most active output unit. If the activation of the second most active unit is more than 2/3 the activation of the most active unit, the agent is considered to be 'uncertain' about the concept.

Fig. 3 – The fully connected single layer neural network of the language agent takes pose cells or experiences as inputs and has outputs associated with labels for locations in the world. The most active output is the label associated with the active pose cells or experiences.

IV. EXPERIMENTAL SETUP AND PROCEDURE

The experiments used a Pioneer 2 DXE mobile robot with a forward facing camera to explore a test environment. The resulting dataset was processed by the RatSLAM model and experience mapping algorithm in order to provide the input for the spatial conceptualization method.

A. Environment and Robot

The environment was one floor of a university building consisting mostly of open-plan offices and corridors. Fig. 4 shows the environment in which the robot operated and the approximate trajectory of the robot. The robot was manually driven along a repeated path through the environment. The robot visited every place on its path at least twice, providing an opportunity for both learning and recognition.

Fig. 4 – Floor plan of the area used for the experiment and the approximate trajectory of the robot. Shaded areas were impassable by the robot.
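Before the data collection is described, the concept-assignment rule of Section III can be made concrete with the following minimal sketch. The single-layer structure and the 2/3 uncertainty rule follow the text; the sigmoid output, the delta-rule update, and the learning rate are illustrative assumptions and not the authors' training setup (which used batch gradient descent with momentum and an adaptive rate, as described below).

```python
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class LanguageAgent:
    """Fully connected single-layer network: pose cell or experience
    activations as inputs, one output unit per location label."""
    def __init__(self, n_inputs, labels):
        self.labels = list(labels)
        self.W = np.zeros((len(self.labels), n_inputs))
        self.b = np.zeros(len(self.labels))

    def classify(self, pattern):
        """Label of the most active output unit, or 'uncertain' when the
        runner-up exceeds 2/3 of the winner's activation."""
        out = _sigmoid(self.W @ pattern + self.b)
        order = np.argsort(out)[::-1]
        if out[order[1]] > (2.0 / 3.0) * out[order[0]]:
            return "uncertain"
        return self.labels[order[0]]

    def train_step(self, pattern, target_index, lr=0.01):
        """One gradient step on the squared error towards a one-hot target
        (a simplification of the batch training described in the text)."""
        out = _sigmoid(self.W @ pattern + self.b)
        target = np.zeros(len(self.labels))
        target[target_index] = 1.0
        grad = (target - out) * out * (1.0 - out)   # sigmoid derivative
        self.W += lr * np.outer(grad, pattern)
        self.b += lr * grad
```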
A dataset was acquired with camera images logged at 7 Hz and on-board odometry and sonar data at 12 Hz. The data set contains 20,350 monochrome images covering a period of almost 40 minutes. The data set was then presented to the RatSLAM system in a manner indistinguishable from online operation. B. Spatial Conceptualization Training and Testing The conceptualization process was implemented offline following the construction of the pose cell and experience maps. The route of the robot was divided into two sections of about 20 minutes duration each. Each section corresponded to the robot exploring and then revisiting one half of the building floor. These two sections were further divided into learning and recognition phases. The learning phase, in which the robot IROS 2006 workshop: From Sensors to Human Spatial Concepts Page: 49 first visited an area, was used for the training set, while the recognition phase, where the robot revisited an area, was used to test if the concepts had been learnt. This was equivalent to the areas being labeled on the first circuit of the environment, and testing whether the robot had learnt these labels on later circuits. The language agents were implemented using fully connected single layer neural networks with pose cells or experiences as inputs and six output units. In the first study, the inputs were pose cells. The 35,402 pose cells that were active at some point in the run were used. In the second study, the inputs were experiences. The 2384 experiences that were active at some point in the run were used. The output units corresponded to the concepts of four rooms and two corridors. Targets were created with a single active output unit corresponding to the current location of the robot. Transitions between rooms and corridors occurred at doorways and turns. For both studies, there were 403 time steps in the first learning phase, 233 in the second learning phase, 398 in the first recognition phase, and 187 in the second recognition phase. In each of the studies, agents were initially trained on the first learning phase and tested on the first recognition phase. Agents were then trained on both the first and second learning phase and tested on the second recognition phase. For each training segment, agents were trained for 2000 epochs using gradient descent with momentum (momentum constant = 0.9) and an adaptive learning rate (initial learning rate = 0.01, increasing ratio = 1.05, decreasing ratio = 0.7). The performance of the agents was tested on the first and second recognition phase by considering the concepts used by the agents for each location. V. RESULTS AND ANALYSIS This section presents the results of the experiments. Study 1 involved conceptualizing the representations stored in the pose cell network, while Study 2 investigated the conceptualization of the experience map. A. Study 1 - RatSLAM Conceptualization The pose cell representation produced by RatSLAM contained both discontinuities and multiple representations of the same place, as shown by Fig. 5. The discontinuities were caused by visually driven re-localization jumps after long periods of exploration where the robot relied only on wheel odometry to remain localized. Odometric drift and delayed relocalization created multiple representations, where more than one group of pose cells represented the same physical location. Fig. 5 – Trajectory of the most highly activated pose cell during the experiment. Thick dashed lines show re-localization jumps driven by visual input. 
Each grid square contains 4 × 4 pose cells in the ( x ' , y ' ) plane. In the first learning phase of the pose cell conceptualization process, 96.77% of the instances were labeled correctly, with 64.32% labeled correctly in the test set (Fig. 6a, 6b). Errors in the training set were generally on the borders of the categories. Errors in the test set were mainly in Room 1, and were due to the different trajectory used in the learning and recognition phases. The part of the room incorrectly classified was not visited in the learning phase, and was not included in the training. Most of these untrained areas were classified as Corridor 1, as the robot spent most of the first learning phase there, and the language network was biased towards categorizing patterns as Corridor 1. For patterns where there was only a small difference between the most active and the second most active concepts, the robot was uncertain which concept was most appropriate. This generally occurred on the borders of concepts. In the second learning phase, 98.27% were labeled correctly, with 73.26% labeled correctly in the test set (Figs. 6c, 6d). In the test set, there were many instances where the robot was uncertain of the label for the current location. While most of these were on the borders between concepts, there were also other locations of uncertainty, particularly in Rooms 3 and 4. Different pose cells were active in these locations during the recognition and the learning phases. The RatChat agents were generally good at classifying pose cells into locations in the world, with errors mainly occurring when different trajectories were taken during the learning and recognition phases. B. Study 2 – Experience Mapping Conceptualization The experience mapping algorithm produced a map containing none of the spatial discontinuities of the RatSLAM representations, and grouped together multiple representations, as shown in Fig. 7. In the first learning phase of the experience conceptualization process, 98.26% of the instances were labeled correctly, with 90.45% labeled correctly in the test set (Fig. 8a, 8b). Compared to the pose cell conceptualization, a greater proportion of the instances in Room 1 were labeled correctly. The majority of the errors in the test again occurred in Room 1 where a different trajectory was taken. In this case, the agent was uncertain about the label, rather than labeling the instances incorrectly. IROS 2006 workshop: From Sensors to Human Spatial Concepts Page: 50 Fig. 6 – Conceptualization of the agent using pose cells in (a) the learning phase of section 1, (b) the recognition phase of section 1, (c) the learning phase of section 2, and (d) the recognition phase of section 2. In the learning phases there are uncertain areas in the Room 1 / Corridor 1, Room 2 / Corridor 1, and Room 3 / Corridor 2 borders. In the recognition phases there are uncertain areas throughout, including in all of the rooms and borders between rooms and corridors. In the first recognition phase, part of Room 1 has been labeled as Corridor 1. Each grid square contains 8 × 8 pose cells in the ( x ' , y ' ) plane. In the second learning phase, 98.43% were labeled correctly, with 89.84% labeled correctly in the test set (Fig. 8c, 8d). The errors in this case were due to differences in the boundaries between rooms and corridors. Fig. 8d shows that the RatChat agents were successful in clustering the experiences appropriately, with minimal uncertain errors on the borders between areas. Fig. 7 – The experience map. 
The map is continuous and has a high degree of correspondence to the spatial arrangement of the environment. RatChat agents were better at generalizing when using experiences rather than pose cells. At all locations, except for those on borders between areas, and those not visited during the learning phase, the agents were able to appropriately label their current location. VI. DISCUSSION The conceptualization experiments tested both the extent to which the RatSLAM system’s maps could be classified using spatial concepts, and the degree to which different representation types were suitable. The spatial conceptualization method was able to learn and then recognize both the RatSLAM pose cell maps and the experience maps. During the learning phase both representation types performed well. However, during the later test sets, higher recognition rates were achieved when using the experience maps than when using the pose cell maps. These results demonstrate that phenomena in the pose cells such as multiple representations can impede the conceptualization process. The experience mapping algorithm, which was specifically developed to create maps from the pose cell representations that could be used for goal navigation, also appears to create maps more suited to spatial conceptualization. The spatial conceptualization process described in this paper is an implicit one. Abstract spatial concepts, such as rooms, are identified during training only by entry and exit times. The robot does not explicitly recognize specific features that identify a room type, but rather learns the spatial correspondence between physical place and concept through the intermediary layer that is the pose cell or experience map. IROS 2006 workshop: From Sensors to Human Spatial Concepts Page: 51 Fig. 8 – Conceptualization of the agent using experiences in (a) the learning phase of section 1, (b) the recognition phase of section 1, (c) the learning phase of section 2, and (d) the recognition phase of section 2. In the learning phases there are uncertain areas in the Room 1 / Corridor 1, Room 2 / Corridor 1, and Room 3 / Corridor 2 borders. In the recognition phases there are uncertain areas in Room 1, and in the Room 1 / Corridor 1, Room 2 / Corridor 1, and Room 3 / Corridor 2 borders. VII. CONCLUSION This paper has investigated a method for learning and recognizing abstract spatial concepts using the RatSLAM model and experience mapping algorithm. Using a simple neural network, the conceptualization process associates pose cells or experiences with spatial concepts. The experiments demonstrate that it is possible to learn abstract spatial concepts such as rooms and corridors and then generalize about these concepts when revisiting the physical areas. ACKNOWLEDGMENT The authors thank the Australian Research Council for partial funding of the RatSLAM and RatChat projects. [5] [6] [7] [8] [9] [10] REFERENCES [1] G. Dissanayake, P. M. Newman, S. Clark, H. Durrant-Whyte, and M. Csorba, "A solution to the simultaneous localisation and map building (SLAM) problem," IEEE Transactions on Robotics and Automation, vol. 17, pp. 229-241, 2001. [2] S. Thrun, "Probabilistic Algorithms and the Interactive Museum TourGuide Robot Minerva," Journal of Robotics Research, vol. 19, pp. 972999, 2000. [3] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit, "FastSLAM: A Factored Solution to the Simultaneous Localization and Mapping Problem," presented at AAAI National Conference on Artificial Intelligence, Edmonton, Canada, 2002. [4] M. 
Montemerlo, S. Thrun, D. Koller, and B. Wegbreit, "FastSLAM 2.0: An improved particle filtering algorithm for simultaneous localization [11] [12] [13] and mapping that provably converges," presented at International Joint Conference on Artificial Intelligence, Acapulco, Mexico, 2003. S. Thrun, "Robotic Mapping: A Survey," Carnegie Mellon University, Pittsburgh, Faculty Report 2002. S. Nolfi, "Emergence of communication in embodied agents: coadapting communicative and non-communicative behaviours," Connection Science, vol. 17, pp. 231-248, 2005. L. Steels and F. Kaplan, "AIBO's first words. The social learning of language and meaning," Evolution of Communication, vol. 4, pp. 3-32, 2001. M. Skubic, D. Perzanowski, S. Blisard, A. Schultz, W. Adams, M. Bugajska, and D. Brock, "Spatial language for human-robot dialogs," IEEE Transactions on Systems, Man, and Cybernetics Part C: Applications and Reviews, vol. 34, pp. 154-167, 2004. L. Steels, The Talking Heads Experiment, vol. I. Words and Meanings. Brussels: Best of Publishing, 1999. D. Marocco and S. Nolfi, "Emergence of communication in teams of embodied and situated agents," in The Evolution of Language, A. Cangelosi, A. D. M. Smith, and K. Smith, Eds. Singapore: World Scientific Publishing, 2006, pp. 198-205. R. Schulz, P. Stockwell, M. Wakabayashi, and J. Wiles, "Generalization in languages evolved for mobile robots," in ALIFE X: Proceedings of the Tenth International Conference on the Simulation and Synthesis of Living Systems: MIT Press, 2006, pp. 486-492. M. J. Milford, G. Wyeth, and D. Prasser, "RatSLAM: A Hippocampal Model for Simultaneous Localization and Mapping," presented at International Conference on Robotics and Automation, New Orleans, USA, 2004. M. J. Milford, D. Prasser, and G. Wyeth, "Experience Mapping: Producing Spatially Continuous Environment Representations using RatSLAM," presented at Australasian Conference on Robotics and Automation, Sydney, Australia, 2005. IROS 2006 workshop: From Sensors to Human Spatial Concepts Page: 52 From images to rooms Olaf Booij, Zoran Zivkovic and Ben Kröse Intelligent Systems Laboratory, University of Amsterdam, The Netherlands Figure 1. An overview of the algorithm. Abstract— In this paper we start from a set of images obtained by the robot while it is moving around an environment. We present a method to automatically group the images into groups that correspond to convex subspaces in the environment which are related to the human concept of rooms. Pairwise similarities between the images are computed using local features extracted from the images and geometric constraints. The images with the proposed similarity measure can be seen as a graph or in a way a base level dense topological map. From this low level representation the images are groped using a graphclustering technique which effectively finds convex spaces in the environment. The method is tested and evaluated on challenging data sets acquired in real home environments. The resulting higher level maps are compared with the maps humans made based on the same data1 . I. I NTRODUCTION Mobile robots need an internal representation for localization and navigation. Most current methods for map building are evaluated using error measures in the geometric domain, for example covariance ellipsis indicating uncertainty in feature location and robot location. Now that robots are moving into public places and homes, human beings have to be taken into account. This changes the task of building a representation of the environment. 
Semantic information must be added to sensory data. This helps to enable a better representation (avoid aliasing problems), and makes it possible to communicate with humans about its environment. Incorporating these tasks in traditional map building methods is non trivial. Even more, evaluating such methods is hard while user studies are difficult and there is a lack of good evaluation criteria. One of the more complicated issues is what sort of spatial concepts should be chosen. For most indoor applications, objects (and their location) and rooms seems a natural choice. Rooms are generally defined as convex spaces, in which objects reside, and which are connected to other rooms with ’gateways’ [1], [2]. In [3] a hierarchical representation is used in which at the low level the nodes indicate objects, at 1 The work described in this paper was conducted within the EU FP6002020 COGNIRON (”The Cognitive Companion”) project. a higher level the nodes represent ’regions’ (parts of space defined by collections of objects) and at the highest level the nodes indicate ’locations’ (’rooms’). However, detecting and localizing objects is not yet a trivial task. In this paper we consider the common concept of ’rooms’. We present our appearance based method to automatically group images obtained by the robot into groups that correspond to convex subspaces in the environment which are related to the human concept of rooms. The convex subspace is defined as a part of the environment where the images from this subspace are similar to each other and not similar to the other subspaces. The method starts from a set of unlabelled images. Every image is treated as a node in a graph, where an edge between two nodes (images) is weighted according to the similarity between the images. We propose a similarity measure which considers two images similar if it is possible to perform 3D reconstruction using these two images [4], [5]. This similarity measure is closely related to the navigation task since reconstructing the relative positions between two images means also the it is possible to move the robot from the location of where one images is taken to the location where the other image is taken given that there are no obstacles in between. We propose a criterion for grouping the images from convex spaces. The criterion is formalized as a graph cut problem and we present an efficient approximate solution. In an (optional) semi-supervised paradigm, we allow the user to label some of the images. The graph similarity matrix is then modified to incorporate the user-supplied labels prior to the graph cut step. Section II presents a short overview of the related work. Section III describes our method of constructing a low level appearance based map. In Section IV it is explained how to find parts of this map belonging to convex spaces in the environment. The method used for resampling the datasets is described in Section V. In Section VI we report the experiments we did in real home environments. Our approach is also compared to other similarity measures and standard k-means clustering. Finally we draw some conclusions and discuss future work in Section VII. IROS 2006 workshop: From Sensors to Human Spatial Concepts Page: 53 II. R ELATED WORK The traditional topological maps represent the environment as a graph where the nodes present distinctive locations and edges describe the transitions [1]. The distinctive locations can be obtained from the geometric map, e.g. 
using Voronoi graphs [6], [7] or from images, for example using a fingerprint representation as in [8]. However, the extracted distinctive locations are mainly related to the robot navigation task and not to human concepts such as rooms. Another related task is that of place or location recognition. To distinguish between different rooms, visual cues are often used, such as color histograms [9] or visual fingerprints [10]. A combination of spatial cues and objects detected in images taken from a room has been used in [11]. Instead of explicit object detection, implicit visual cues such as SIFT features have also been used [12]. The more general problem of recognizing scenes from images is addressed in [13]. However, all these approaches assume that human-given labels are provided. We present here an unsupervised algorithm to group the images into groups that are related to the human concept of rooms. Our approach is similar to [5], where the images are also grouped on the basis of their similarities. A similar approach was also used in [14], but for the task of finding object categories from images. In this paper we present a grouping criterion that is more appropriate for detecting convex spaces. Furthermore, in [5] the data is obtained in a highly controlled way by taking the images at uniformly spaced locations. Here we consider the realistic situation where the data is obtained by simply moving the robot around the environment. The graph clustering will then depend on the robot movements, and we propose a resampling scheme to improve the results, see Section V. Finally, we consider a semi-supervised approach where the user provides a number of labels.

III. IMAGE SIMILARITY MEASURE

We start from a set of unlabelled images. In all our experiments we used omnidirectional images taken by a mobile robot while driving through the environment (see Figure 2 for the image positions of one of the data sets used for testing). Every image is treated as a node in a graph, where an edge between two nodes (images) is weighted according to the similarity between the images. This graph can be seen as a topological map. Various similarity measures can be used; here we use the similarity measure from [5]. We define that there is an edge between two nodes in the graph if it is possible to perform a 3D reconstruction of the local space using visual features from the two corresponding images. We use SIFT features [15] as automatically detected landmarks, so an image can be summarized by the landmark positions and descriptions of their local appearance. The 3D reconstruction was performed using the 8-point algorithm [16] constrained to planar camera movement [17], and the RANSAC estimator was used to be robust to false matches [16]. A big advantage of such a similarity measure over purely appearance based measures is that it also considers geometry [4]. Therefore the chance is small that images from two different rooms are found similar even though they might be similar in appearance [5]. As a result, for N images we obtain a graph described by a set S of N nodes and a symmetric matrix W called the 'similarity matrix'. For each pair of nodes i, j ∈ [1, ..., N], the element Wij of the matrix W defines the similarity of the nodes; in our case Wij equals 1 if there is a link between the nodes and 0 if there is no link.

Fig. 2. Ground floor maps of the two home environments (Home 1 and Home 2). The circles denote the positions of the robot, according to the wheel encoders, from which an image was taken.
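One way to realize such a geometric similarity check with off-the-shelf tools is sketched below, using OpenCV's SIFT matching and a RANSAC fit of the fundamental matrix. This is only an approximation for illustration: it does not reproduce the planar-constrained 8-point algorithm applied to omnidirectional images in this paper, and the Lowe ratio and inlier threshold are assumed values.

```python
import cv2
import numpy as np

def images_similar(img1, img2, min_inliers=20):
    """Decide whether two grayscale images plausibly allow a 3D
    reconstruction of the local space: enough SIFT correspondences must
    survive a RANSAC epipolar-geometry fit. `min_inliers` and the 0.7
    ratio are assumptions, not values from the paper."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    if des1 is None or des2 is None:
        return False
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    pairs = matcher.knnMatch(des1, des2, k=2)
    # Lowe's ratio test to discard ambiguous putative matches.
    good = [p[0] for p in pairs
            if len(p) == 2 and p[0].distance < 0.7 * p[1].distance]
    if len(good) < 8:                       # 8-point algorithm needs >= 8
        return False
    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
    # RANSAC fit of the fundamental matrix; the inlier mask marks matches
    # consistent with a single epipolar geometry.
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 3.0, 0.99)
    return F is not None and int(mask.sum()) >= min_inliers
```

Evaluating such a check for every image pair yields the binary similarity matrix W used below.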
In our case, $W_{ij}$ is simply 1 if there is a link between the nodes and 0 if there is no link. Examples of such graphs obtained from real data sets are given in Figure 4.

If there is a non-zero edge in the graph, then when the robot is at one of the connected nodes (corresponding to one image) it can determine the relative location of the other node (corresponding to the other image). If there are no obstacles in between, the robot can directly navigate from one node to the other. If there are obstacles, one could rely, for example, on an additional reactive algorithm for obstacle avoidance using range sensors. In this sense the graph obtained using the proposed similarity measure can be seen as a base level dense topological map that can be used for navigation and localization.

This graph contains, in a natural way, the information about how the space in an indoor environment is separated by walls and other barriers. Images from a convex space, for example a room, will have many connections between them and only a few connections to images from another space, for example a corridor that is connected to the room via a narrow passage such as a door. By clustering the graph we want to obtain groups of images that belong to a convex space, for example a room.

IV. GROUPING IMAGES

Starting from the graph representation we group the images by cutting the graph $(S, W)$ described above into $K$ separate subgraphs $\{(S_1, W_1), \ldots, (S_K, W_K)\}$. If the subgraphs (clusters) correspond to convex subspaces we expect that there will be many links within each cluster and few between the clusters. The subgraphs should also be connected graphs. This is formalized as a graph cut criterion further in this section, and an efficient approximate solution is also presented. Note that we assume that the images are recorded at positions that approximately uniformly sample the available space. If this is not true, images from positions close to each other, which are usually very similar, tend to group together and the resulting clusters depend on the positions where the images were taken.

A. Grouping criterion

We start by introducing some graph-theoretic terms. The degree of the $i$-th node of a graph $(S, W)$ is defined as the sum of all the edges that start from that node:

$$d_i = \sum_j W_{ij}.$$

For a subset of nodes $S_j \subseteq S$, the volume is defined as

$$\mathrm{vol}(S_j) = \sum_{i \in S_j} d_i,$$

which describes the "strength" of the interconnections within the subset $S_j$. A subgraph $(S_j, W_j)$ can be "cut out" from the graph $(S, W)$ by cutting a number of edges. The sum of the values of the edges that are cut is called a graph cut:

$$\mathrm{cut}(S_j, S \setminus S_j) = \sum_{i \in S_j,\; k \in S \setminus S_j} W_{ik} \qquad (1)$$

where $S \setminus S_j$ denotes the set of all nodes except the ones from $S_j$. One may cut the base level graph into $K$ clusters by minimizing the number of cut edges:

$$\frac{1}{K} \sum_{j=1}^{K} \mathrm{cut}(S_j, S \setminus S_j). \qquad (2)$$

This would mean that the graph is cut at weakly connected places, which in our case would usually correspond to a natural segmentation at doors between rooms or other narrow passages. However, such a segmentation criterion often leads to undesirable results. For example, if there is an isolated node connected to the rest of the graph by only one link, then (2) will favor cutting only this link. To avoid such artifacts we use a normalized version:

$$\frac{1}{K} \sum_{j=1}^{K} \frac{\mathrm{cut}(S_j, S \setminus S_j)}{\mathrm{vol}(S_j)}. \qquad (3)$$
Minimizing this criterion means cutting a minimal number of connections between the subsets, while also choosing larger subsets with strong connections within them. This criterion naturally groups together convex areas, like a room, and makes cuts between areas that are weakly connected. However, criterion (3) can lead to solutions where the clusters are disconnected graphs. The additional requirement that the subgraphs be connected therefore needs to be enforced separately.

B. Approximate solution

For completeness we briefly sketch a well-behaved spectral clustering algorithm from [18] that leads to a good approximate solution of the normalized cut criterion (3):

1) Define $D$ to be the diagonal matrix of node degrees, $D_{ii} = d_i$, and construct the normalized similarity matrix $L = D^{-1/2} W D^{-1/2}$.
2) Find $x_1, \ldots, x_K$, the $K$ largest eigenvectors of $L$, and form the matrix $X = [x_1, \ldots, x_K] \in \mathbb{R}^{N \times K}$.
3) Renormalize the rows of $X$ to have unit length: $X_{ij} \leftarrow X_{ij} / \big(\sum_j X_{ij}^2\big)^{1/2}$.
4) Treat each row of $X$ as a point in $\mathbb{R}^K$ and cluster, for example with the k-means algorithm. Instead of the k-means step, a more principled but more complex approach is used in [19], and [20] proposes a good initial start for the k-means clustering. We tested the mentioned algorithms and, in practice, for our type of problems they lead to similar solutions.
5) The $i$-th node of $S$ is assigned to cluster $j$ if and only if row $i$ of the matrix $X$ was assigned to cluster $j$.

Although it happens rarely in practice, the normalized cut criterion (3) can lead to disconnected solutions as mentioned above. A practical split-and-merge solution to ensure that the subgraphs are connected is as follows:

1) Group the images using the normalized cut criterion (and the spectral clustering technique).
2) Split step: if there are disconnected subgraphs in the result, generate new clusters from the disconnected subgraph components.
3) Merge step: merge the connected clusters that minimize the normalized cut criterion (3).

The final result is a practical and efficient approximate solution for the criterion from the previous section. Computing the exact solution is an NP-hard problem and usually not feasible.

C. Semi-supervised learning

This framework allows the introduction of weak semi-supervision in the form of pairwise constraints between the unlabelled images. Specifically, a user may specify cannot-group or must-group connections between any number of pairs in the data set. Following the paradigm suggested in [21], we modify the graph $(S, W)$ to incorporate this information and assist category learning: entries in the affinity matrix $W$ are set to the maximal (diagonal) value for pairs that ought to be reinforced in the groupings, or set to zero for pairs that ought to be divided. A sketch combining the clustering and these constraints is given below.
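The following minimal sketch pulls Section IV together. It assumes the binary similarity matrix `W` from Section III, uses NumPy and scikit-learn's k-means in place of the initialization of [20], and applies the semi-supervision of Section IV-C by editing `W` before clustering; the split-and-merge post-processing is omitted, and all function names are ours.

```python
import numpy as np
from sklearn.cluster import KMeans

def constrain(W, must_group=(), cannot_group=()):
    """Encode user-supplied pairwise constraints by editing the similarity matrix."""
    W = W.copy().astype(float)
    w_max = W.max() if W.max() > 0 else 1.0
    for i, j in must_group:       # reinforce: maximal similarity (1 for the binary W)
        W[i, j] = W[j, i] = w_max
    for i, j in cannot_group:     # divide: zero similarity
        W[i, j] = W[j, i] = 0.0
    return W

def normalized_cut_clusters(W, K):
    """Approximate normalized-cut clustering (spectral relaxation of eq. (3))."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]     # D^{-1/2} W D^{-1/2}
    _, eigvecs = np.linalg.eigh(L)                        # eigenvalues in ascending order
    X = eigvecs[:, -K:]                                   # K largest eigenvectors
    X = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=K, n_init=10).fit_predict(X)

# Example: labels = normalized_cut_clusters(constrain(W, must_group=[(0, 1)]), K=3)
```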
V. REALISTIC (NON-UNIFORM) SAMPLED DATA

The images should be recorded at positions that approximately uniformly sample the available space. However, this is often difficult to achieve in practice. For example, some of the data sets we consider in the experimental section were recorded by letting the robot capture images at regular time intervals. For such data the clustering will depend on the robot movements. An illustration of a non-uniformly sampled data set is given in Figure 3. The images taken close to each other, depicted in the figure near the transition from 'room 2' to 'corridor', will usually be similar to each other and therefore be grouped together. The on-line appearance based topological mapping of [8] will also suffer from the same problem. In this section we use information about Euclidean geometric distances between the images and present a simple sampling approach that aims to approximate a uniform sampling of the space and improve the clustering results.

[Fig. 3. The top image depicts an example of non-uniformly sampled data and the resulting undesired clustering. The clustering results can be improved by detecting such situations and generating a new graph with approximately uniformly sampled images, as depicted below it.]

A. Importance sampling

Let there be $N$ images recorded while the robot was moving around the environment and let $x^{(i)}$ denote the 2D position where the $i$-th image was recorded. We can consider the $x^{(i)}$ as $N$ independent samples from some distribution $q$. A sample based approximation is $q(x) \approx \sum_{i=1}^{N} \delta(x - x^{(i)}) / N$. We can then approximate the uniform distribution using importance sampling:

$$\mathrm{Uniform}(x) = c = \frac{c}{q(x)}\, q(x) \approx \sum_{i=1}^{N} \tilde{w}^{(i)} \delta(x - x^{(i)}) \qquad (4)$$

where $\tilde{w}^{(i)} = w^{(i)} / \sum_{j=1}^{N} w^{(j)}$ and $w^{(i)} = c / q(x^{(i)})$. One can interpret the $\tilde{w}^{(i)}$ as correction factors that compensate for the fact that we have sampled from the "incorrect" distribution $q(x)$. An approximately uniform sample can now be generated by sampling from the sample based approximation above. This is equivalent to sampling from the multinomial distribution with coefficients $\tilde{w}^{(i)}$.

The original sampling distribution $q(x^{(i)})$ can be estimated, for example, using a simple k-nearest-neighbor density estimate $q(x^{(i)}) \sim 1/V$, where $V = d_k(x^{(i)})^2$ and $d_k(x^{(i)})$ is the distance to the $k$-th nearest neighbor in the Euclidean 2D space. The distances can be obtained from odometry or some SLAM procedure. Alternatively, the distances can be approximated from the images directly. For all our data we used $k = 7$.

B. Practical algorithm

We start with the original graph $(S, W)$ and an empty graph $(S_{\mathrm{resampled}}, W_{\mathrm{resampled}})$. The practical algorithm we use is as follows (a sketch is given below):

1) Compute the local density estimates and the weight factors $\tilde{w}^{(i)}$.
2) Construct a new graph by drawing $N$ samples from the multinomial distribution with coefficients $\tilde{w}^{(i)}$. The corresponding nodes and links from the original graph $(S, W)$ are added to the new graph $(S_{\mathrm{resampled}}, W_{\mathrm{resampled}})$.
3) If the new graph $(S_{\mathrm{resampled}}, W_{\mathrm{resampled}})$ is not connected, continue sampling and adding nodes as in the previous step until it becomes connected.

The result is a new graph $(S_{\mathrm{resampled}}, W_{\mathrm{resampled}})$ in which the images come from positions that approximately uniformly sample the available space.
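A compact sketch of this resampling step, under the assumption that 2D image positions are available (e.g. from odometry) and that the original graph is connected; SciPy's connected-components routine is used to test connectivity, and the names are illustrative.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

def resample_graph(positions, W, k=7, seed=0):
    """Select a subset of nodes that approximately uniformly samples the space.
    positions: (N, 2) array of 2D image positions; W: (N, N) similarity matrix."""
    rng = np.random.default_rng(seed)
    N = len(positions)
    # Distance to the k-th nearest neighbour -> density estimate q(x_i) ~ 1/d_k^2.
    dists = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    d_k = np.sort(dists, axis=1)[:, k]            # column 0 is the point itself
    w = d_k ** 2                                   # importance weights w_i ∝ 1/q(x_i)
    w_tilde = w / w.sum()
    # Draw N nodes from the multinomial, then keep adding nodes until connected.
    chosen = set(rng.choice(N, size=N, replace=True, p=w_tilde))
    while True:
        idx = np.sort(np.array(list(chosen)))
        W_sub = W[np.ix_(idx, idx)]
        n_comp, _ = connected_components(W_sub, directed=False)
        if n_comp == 1:
            return idx, W_sub
        chosen.add(int(rng.choice(N, p=w_tilde)))
```

The clustering of Section IV is then run on `W_sub` instead of the full `W`.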
VI. EXPERIMENTS

The method for finding the convex spaces in an environment is tested in two real home environments and is compared to an annotation based on the same sensor data. Our mobile robot was driven around while taking panoramic images with an omnidirectional camera; see Figure 2 for the ground floor maps of the environments and the positions where the images were taken.

The task of building a map using these image sets is challenging in a number of ways. First of all, the lighting conditions were poor, much worse than the conditions during previous evaluations in office environments. Also, people were walking through the environment, blocking the view of the robot. Furthermore, the robot was driven rather randomly through the rooms, which has the effect that some parts of the environment are represented by a lot of images while other parts only by a few (see www2.science.uva.nl/sites/cogniron/ for videos acquired by the robot).

The data sets were annotated by an inexperienced person, based solely on the sensor data and the maps as shown in Figure 2, but without the robot positions. The person had never visited one of the two houses. For both homes labels were provided corresponding to the rooms, from which one should be picked per panoramic image. Between some of the rooms there was no clear geometric boundary separating them, so from most places in one room the other room was still clearly visible and vice versa. This is common in real home environments but makes conceptualizing them harder.

From both image sets an appearance graph is made using the method explained in Section III. These graphs are then used as input for the clustering algorithm to find convex spaces in the environment, first with all images and then with a subset obtained by resampling. The results are compared with the annotation, to see how well the convex spaces found by clustering correspond to separate rooms.

A. Results

In Figure 4 it can be seen that the appearance based method was quite successful in creating a low level topological map. All links of the graphs connect nodes originating from images that were taken close to each other in world coordinates. In some parts of the graph the nodes are more densely connected than in others. This could be the result of bad image quality, for example caused by changing lighting conditions, but it could also be the result of a lack of features in that part of the environment.

[Fig. 4. The clustering results for the Home 1 (above) and the Home 2 (below) data sets: a) the appearance based graph, where each line indicates two matching images; b) the clusters found in the whole dataset; c) the clusters found in the resampled dataset. Note that the odometry data used to draw these figures is used only for the resampling.]

Clustering without resampling (see Figure 4b) results in a grouping of the images which is not perfect. As can be seen in Figure 4, some of the images of Home 1 that were taken from completely different positions are grouped together, and the images taken in the kitchen are split between two clusters. It can also be clearly seen that some images taken in the living room are grouped with images taken in the work room. After the split and merge steps these images are regrouped with the living room images.

Better clustering results are obtained after resampling the data, as indicated by Figure 4c. Both data sets are clustered almost perfectly, often cutting the graph at nodes corresponding to images taken at the doorpost between two rooms. The only remaining error is at the bedroom of Home 1, from which images are grouped with images from the living room. This is probably caused by the large opening between the two rooms, as can be seen in Figure 2.

The mismatch between the clusters found by our method and the labels provided by the annotator is made clear by the confusion matrices, see Tables I to IV. Of course the clustered data does not provide a label. Each cluster is assigned the label corresponding to the true set with which it has the largest overlap, taking care that no two clusters get the same label; a sketch of this bookkeeping is given below.
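One way to implement this one-to-one assignment of cluster indices to room labels is as a maximum-overlap matching on the confusion counts, for instance with the Hungarian algorithm; this is a sketch of that bookkeeping, not necessarily the exact procedure used to produce the tables.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, cluster_ids):
    """Match each cluster to a distinct room label by maximizing total overlap,
    then report the fraction of correctly clustered images."""
    rooms = np.unique(true_labels)
    clusters = np.unique(cluster_ids)
    overlap = np.zeros((len(clusters), len(rooms)), dtype=int)
    for ci, c in enumerate(clusters):
        for ri, r in enumerate(rooms):
            overlap[ci, ri] = np.sum((cluster_ids == c) & (true_labels == r))
    row, col = linear_sum_assignment(-overlap)   # negate to maximize overlap
    return overlap[row, col].sum() / len(true_labels)
```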
The percentage of correctly clustered images for Home 1 is 85% for the whole dataset and 92% for the resampled set. For Home 2 the percentages are 73% and 83%.

TABLE I: Home 1, whole dataset (rows: true label; columns: inferred label)

True label | Living room | Bedroom | Kitchen
Living room | 0.9681 | 0.0319 | 0
Bedroom | 0.1832 | 0.8168 | 0
Kitchen | 0 | 0.5000 | 0.5000

TABLE II: Home 1, resampled, averaged over 10 trials

True label | Living room | Bedroom | Kitchen
Living room | 1.0000 | 0 | 0
Bedroom | 0.3014 | 0.6915 | 0.0071
Kitchen | 0 | 0.0396 | 0.9604

TABLE III: Home 2, whole dataset

True label | Corridor | Living room | Bedroom | Kitchen | Work room
Corridor | 0.6812 | 0.1159 | 0.2029 | 0 | 0
Living room | 0 | 0.5732 | 0 | 0 | 0.4268
Bedroom | 0 | 0 | 1.0000 | 0 | 0
Kitchen | 0.0556 | 0 | 0 | 0.9444 | 0
Work room | 0 | 0 | 0 | 0 | 1.0000

TABLE IV: Home 2, resampled

True label | Corridor | Living room | Bedroom | Kitchen | Work room
Corridor | 0.6344 | 0.0323 | 0.2473 | 0.0860 | 0
Living room | 0.0291 | 0.8301 | 0 | 0 | 0.1408
Bedroom | 0 | 0 | 1.0000 | 0 | 0
Kitchen | 0 | 0 | 0 | 1.0000 | 0
Work room | 0 | 0 | 0 | 0 | 1.0000

B. Comparison with other clustering methods and similarity measures

We compare our method with common k-means clustering and a PCA based similarity measure [22]. We used 10 PCA components and clustered the images using k-means. We also used the Euclidean distances in the PCA space and applied spectral clustering. The results, shown in Table V, were poor compared to our method. They also show that this simple appearance based similarity is not suitable for spectral clustering methods.

TABLE V: Clustering accuracy for various clustering methods on the Home 2 data set (PCA projection with 10 components)

PCA + k-means | 0.60
PCA + spectral clustering | 0.38
our method | 0.73
our method (with resampling) | 0.83

C. Semi-supervised clustering

To demonstrate the semi-supervised learning, we used a set of labelled points to enforce that points with the same label should group together and points with different labels should not. The set of labelled points is chosen randomly, and the results for the Home 2 data set are presented in Figure 5. The graphs show how the accuracy increases with the number of labelled images.

[Fig. 5. The clustering accuracy (mean, standard deviation, and maximum/minimum values) for the semi-supervised case, averaged over 100 trials, as a function of the number of randomly chosen ground truth labels per cluster used to simulate user input.]

VII. CONCLUSION

The experiments show that the proposed clustering method is appropriate for finding convex spaces by grouping images obtained by the robot. The cuts made in the graphs are at or close to the doorways dividing two rooms. The convex spaces found in the real home environments correspond to the concept of "room", as shown by comparison with the annotated data. For some cuts the clustering relies on a good sampling of the data, which was clearly visible in the tests in Home 1.
In Table I it can be seen that the kitchen is split into two parts. After resampling (Table II), 96% of the images annotated as the kitchen fell in a single cluster.

The proposed methods are very suitable as a basis for human-robot communication about the spaces the robot travels through. The system will be developed further in this direction, with the goal of enabling a robot to build a higher level map by listening to and asking a human guide. Our method naturally allows semi-supervised learning, as we demonstrated, using the input of the guide. If the guide says to the robot that they just entered the kitchen, then this information should be used to build the higher level map. Problems might occur if the user uses different labels for the same space, or when the clustering obtained by the robot does not correspond to the human concept. These problems need to be addressed and resolved, for example through dialog with the user. Finally, the algorithms should work online in order to facilitate interaction between the map building process and the guide.

REFERENCES

[1] B. Kuipers, "The spatial semantic hierarchy," Artif. Intell., vol. 119, no. 1-2, pp. 191–233, 2000.
[2] D. Kortenkamp and T. Weymouth, "Topological mapping for mobile robots using a combination of sonar and vision sensing," in Proc. of the Twelfth National Conference on Artificial Intelligence, 1994.
[3] A. Tapus, S. Vasudevan, and R. Siegwart, "Towards a multilevel cognitive probabilistic representation of space," in Proc. of the International Conference on Human Vision and Electronic Imaging X, part of the IS&T-SPIE Symposium on Electronic Imaging, 2005.
[4] F. Schaffalitzky and A. Zisserman, "Multi-view matching for unordered image sets, or 'How do I organize my holiday snaps?'," in Proceedings of the 7th European Conference on Computer Vision, Copenhagen, Denmark, vol. 1. Springer-Verlag, 2002, pp. 414–431.
[5] Z. Zivkovic, B. Bakker, and B. Kröse, "Hierarchical map building using visual landmarks and geometric constraints," in Intl. Conf. on Intelligent Robotics and Systems. Edmonton, Canada: IEEE/JRS, August 2005.
[6] H. Choset and K. Nagatani, "Topological simultaneous localisation and mapping: Towards exact localisation without explicit localisation," IEEE Transactions on Robotics and Automation, vol. 17, no. 2, pp. 125–137, April 2001.
[7] P. Beeson, N. K. Jong, and B. Kuipers, "Towards autonomous place detection using the extended Voronoi graph," in Proceedings of the IEEE International Conference on Robotics and Automation, 2005.
[8] A. Tapus and R. Siegwart, "Incremental robot mapping with fingerprints of places," in IROS, 2005.
[9] I. Ulrich and I. Nourbakhsh, "Appearance-based place recognition for topological localization," in Proceedings of ICRA 2000, vol. 2, April 2000, pp. 1023–1029.
[10] A. Tapus and R. Siegwart, "A cognitive modeling of space using fingerprints of places for mobile robot navigation," in ICRA, 2006.
[11] A. Rottmann, O. Martínez Mozos, C. Stachniss, and W. Burgard, "Place classification of indoor environments with mobile robots using boosting," in AAAI, 2005.
[12] F. Li and J. Kosecka, "Probabilistic location recognition using reduced feature set," in IEEE International Conference on Robotics and Automation, 2006.
[13] A. Torralba, K. Murphy, W. Freeman, and M. Rubin, "Context-based vision system for place and object recognition," in Proc. of the Intl. Conf. on Computer Vision, 2003.
[14] K. Grauman and T. Darrell, "Unsupervised learning of categories from sets of partially matching image features," in Proc. CVPR, vol. 1, pp. 19–25, 2006.
[15] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[16] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, second edition. Cambridge University Press, 2003.
[17] M. Brooks, L. de Agapito, D. Huynh, and L. Baumela, "Towards robust metric reconstruction via a dynamic uncalibrated stereo head," 1998. [Online].
Available: citeseer.csail.mit.edu/brooks98towards.html
[18] A. Ng, M. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," in Proc. Advances in Neural Information Processing Systems 14, 2001.
[19] S. X. Yu and J. Shi, "Multiclass spectral clustering," in Proc. International Conference on Computer Vision, pp. 11–17, 2003.
[20] L. Zelnik-Manor and P. Perona, "Self-tuning spectral clustering," in Proc. Advances in Neural Information Processing Systems, 2004.
[21] S. Kamvar, D. Klein, and C. Manning, "Spectral learning," in Proc. of the International Conference on Artificial Intelligence, 2003.
[22] B. Kröse, N. Vlassis, R. Bunschoten, and Y. Motomura, "A probabilistic model for appearance-based robot localization," Image and Vision Computing, vol. 19, no. 6, pp. 381–391, 2001.

Robust Models of Object Geometry

Jared Glover, Daniela Rus and Nicholas Roy
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139
Email: {jglov,rus,nickroy}@mit.edu

Geoff Gordon
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15289
Email: ggordon+@cs.cmu.edu

Abstract— Precise and accurate models of the world are critical to autonomous robot operation. Just as robot navigation typically requires an accurate map of the world, robot manipulation typically requires accurate models of the objects to be grasped. However, existing shape modelling and object recognition techniques, while excellent for general purpose object recognition, do not preserve the global geometric information necessary for geometric tasks such as manipulation. We describe an inference algorithm for learning statistical models of global object geometry from image data, using a representation that allows us to compute a distribution over the complete geometry of different objects. Finally, we describe how learned object models can be used to infer the geometry of occluded parts of objects.

I. INTRODUCTION

Precise and accurate models of the world are critical to autonomous robot operation. Just as robot navigation typically requires an accurate map of the world, robot manipulation typically requires accurate models of the objects to be grasped. However, existing shape modelling and object recognition techniques do not preserve the global geometric information necessary for good manipulation. Most existing shape recognition techniques focus on object appearance, such as a bitmap, rather than a notion of the object geometry.

One of the principal difficulties in learning object models is finding a good representation; frequently, the object representation that is most useful for one task is not particularly useful for another. For instance, there exist sparse feature-based object models such as SIFT features [14] that allow for very reliable object recognition and tracking but are completely inappropriate for motion planning, in that the complete geometry of the object is not recovered. Techniques such as shape contexts [1] or spherical harmonics [10] allow general classes of objects to be learned over time, but again, the geometry of individual objects is not always preserved. For autonomous robot operation, we would like to be able to infer the complete geometry of objects in a manner that is robust to variances in shape from object to object, and is robust to perceptual occlusions.

First, our algorithm should learn a description of the complete geometry of the object.
In order to allow a robot to carry out control tasks such as navigation, manipulation, and grasping, it is not sufficient to describe an object as a set of features; we need the ability to describe the complete boundary of the object for computing potential collisions, grasp closures, etc. Secondly, our algorithm should allow us to infer the object's geometry from a series of independent measurements. Just as in robot mapping, we would like to integrate a series of measurements in time, in order to learn the object model. This problem differs from the mapping problem in that we (usually) will not have to solve for the object description and sensor position simultaneously. Finally, our algorithm should allow us to estimate the geometry of any occluded parts of the object from a partial view — we would like to have a robot that can recognize and pick up a tool without first building a complete and accurate model of the tool.

[Figure 1. Contour extraction. Left: raw image (poor white balance in the camera results in poor colour rendering). Middle: contours extracted using pyramid segmentation. Right: contours extracted using intensity thresholding. Our goal is to recognize different objects from the same class from the complete or partial outline of each object.]

In robot navigation, the map is usually assumed to be complete before any autonomous motion planning is attempted. In a populated, dynamic environment, the assumption of a complete description of the environment is clearly brittle, as new objects that require robot actions will appear regularly. Our representation must therefore be able to recognize objects it may not have seen before, and be able to make reasonable inferences about the parts of the object that are occluded.

This paper presents a unified probabilistic approach to modelling object geometry, in particular focusing on the problem of incomplete data. We draw upon a representation [11] for object geometry which is invariant to changes in position, scale, and orientation, and show how to learn object models directly from sensor data. This representation is robust to sensor noise, as well as to imperfect segmentation or feature extraction. It also allows us to capture the variation between shapes in the same object class in a meaningful way. By learning a probabilistic, generative model of the complete geometry of an object, we can make predictions about sections of the object geometry that are not visible, using an approximate maximum-likelihood estimation scheme. We demonstrate this approach on some example images of everyday objects.

The results presented in this paper are restricted to monocular video images, and they depend on a particular parameterization of the data in terms of complex numbers. We have in principle generalized this approach to three dimensions for data such as stereo or laser range data, where the corresponding representation is in terms of quaternions, but there remain many details to be worked out, such as data association and feature orderings, so we do not include these results here. Finally, this paper is not about object recognition; there are many other (better) methods available for that task [6], [1]. Rather, we are concerned with modelling the complete contours of objects for use in navigation and grasp planning problems.
And, while it remains to be seen which representation will ultimately prove most useful for such planning problems, we hypothesize that probabilistically modelling the complete geometry of objects will allow a richer and more robust approach to these problems.

II. SHAPE SPACE

Let us represent a shape $z$ as a set of $n$ points $z_1, z_2, \ldots, z_n$ in some Euclidean space. We will restrict ourselves to two-dimensional points (representing shapes in a plane) such that $z_i = (x_i, y_i)$, although extensions to three dimensions are feasible. We will assume these points are ordered (so that $z$ can be defined as a vector) and represent a closed contour (such as the letter "O", as opposed to the letter "V"). In general, the shape model can be made de facto invariant to point ordering if we know correspondences between the boundary points of any two shapes we wish to compare. Even in the case of closed contours, however, care must still be taken to choose the correct starting point.

In order to make our model invariant to changes in position and scale, we can normalize the shape so as to have unit length with its centroid at the origin; that is,

$$z' = \{z'_i = (x_i - \bar{x},\, y_i - \bar{y})\} \qquad (1)$$
$$\tau = \frac{z'}{|z'|} \qquad (2)$$

where $|z'|$ is the $L_2$-norm of $z'$. We call $\tau$ the pre-shape of $z$. Since $\tau$ is a unit vector, the space of all possible pre-shapes of $n$ points is a unit hyper-sphere, $S_*^{2n-3}$, called pre-shape space.¹ Any pre-shape is a point on the hypersphere, and all rotations of the shape lie on an orbit, $O(\tau)$, of this hypersphere.

If we wish to compare shapes using some distance metric between them, the spherical geometry of the shape space requires a geodesic distance rather than a Euclidean distance. Additionally, in order to ensure this distance is invariant to rotation, we define the distance between two shapes $\tau_1$ and $\tau_2$ as the smallest distance between their orbits:

$$d_p[\tau_1, \tau_2] = \inf[\, d(\phi, \psi) : \phi \in O(\tau_1),\, \psi \in O(\tau_2) \,] \qquad (3)$$
$$d(\phi, \psi) = \cos^{-1}(\phi \cdot \psi) \qquad (4)$$

We call $d_p$ the Procrustean metric [11], where $d(\phi, \psi)$ is the geodesic distance between $\phi$ and $\psi$. Since the inverse cosine function is monotonically decreasing over its domain, it is sufficient to maximize $\phi \cdot \psi$, which is equivalent to minimizing the sum of squared distances between corresponding points on $\phi$ and $\psi$ (since $\phi$ and $\psi$ are unit vectors). For every rotation of $\phi$ there exists a rotation of $\psi$ which will find the global minimum geodesic distance. Thus, to find the minimum distance, we need only rotate one pre-shape while holding the other one fixed. We call the rotated $\psi$ which achieves this optimum, $\theta_{\alpha^*}(\tau_2)$, the orthogonal Procrustes fit of $\tau_2$ onto $\tau_1$, and the angle $\alpha^*$ is called the Procrustes fit angle.

¹ Following [16], the star subscript is added to remind us that $S_*^{2p-3}$ is embedded in $\mathbb{R}^{2p}$, not the usual $\mathbb{R}^{2p-2}$.

[Figure 2. Distances on a hyper-sphere. $r$ is a linear approximation to the geodesic distance $\rho$.]

Representing the points of $\tau_1$ and $\tau_2$ in complex coordinates, which naturally encode rotation in the plane by scalar complex multiplication, the Procrustes distance minimization can be solved:

$$d_p[\tau_1, \tau_2] = \cos^{-1} |\tau_2^H \tau_1| \qquad (5)$$
$$\alpha^* = \arg(\tau_2^H \tau_1), \qquad (6)$$

where $\tau_2^H$ is the Hermitian, or complex conjugate, transpose of the complex vector $\tau_2$.
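As a small illustration of equations (1)-(2) and (5)-(6), the following NumPy sketch maps an ordered 2-D contour to its complex pre-shape and computes the Procrustes distance and fit angle between two pre-shapes; the variable and function names are ours, not from the paper.

```python
import numpy as np

def preshape(points):
    """Map an ordered contour (n x 2 array) to its pre-shape: a complex vector
    centred at the origin and scaled to unit norm (eqs. 1-2)."""
    z = points[:, 0] + 1j * points[:, 1]
    z = z - z.mean()
    return z / np.linalg.norm(z)

def procrustes_distance(tau1, tau2):
    """Procrustes distance d_p = arccos(|tau2^H tau1|) and the optimal rotation
    angle alpha* = arg(tau2^H tau1) of tau2 onto tau1 (eqs. 5-6)."""
    h = np.vdot(tau2, tau1)                       # tau2^H tau1 (vdot conjugates its first argument)
    d = np.arccos(np.clip(np.abs(h), 0.0, 1.0))   # clip guards against round-off
    alpha = np.angle(h)
    return d, alpha
```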
III. SHAPE INFERENCE

Given a set of measurements of an object, we would like to infer a distribution of possible object geometry. We will first derive an expression for the mean shape, and then derive the distribution covariance. Finally, we will show how to infer the full object shape using measurements of partial object geometry.

1) The Shape Distribution Mean: Let us assume that our data set consists of a set of measurements $\{m_1, m_2, \ldots\}$ of the same object, where each measurement $m_i$ is a complete (but noisy) description of the object geometry, a vector of length $p$. We can normalize each measurement to be an independent pre-shape and use these pre-shapes to compute the mean, that is, the maximum likelihood object shape. If each measurement $m_i$ is normalized to a pre-shape $\tau_i$, then the mean is

$$\mu^* = \arg \inf_{\|\mu\|=1} \sum_i [d_p(\tau_i, \mu)]^2 \qquad (7)$$

The pre-shape $\mu^*$ is called the Fréchet mean² of the samples $\tau_1, \ldots, \tau_n$ with respect to the distance measure $d_p$. Note that the mean shape is not trivially the arithmetic mean of all pre-shapes; $d_p$ is non-Euclidean, and we wish to preserve the constraint that the mean shape has unit length.

Unfortunately, the non-linearity of the Procrustes distance in the $\cos^{-1}(\cdot)$ term leads to an intractable solution for the Fréchet mean minimization. Thus, we approximate the geodesic distance $\rho$ between two pre-shape vectors, $\phi \in O(\tau_1)$ and $\psi \in O(\tau_2)$, as the projection distance $r$,

$$r = \sin \rho = \sqrt{1 - \cos^2 \rho}. \qquad (8)$$

Figure 2 depicts this approximation to the geodesic distance graphically.³

² More precisely, $\mu^*$ is the Fréchet mean of the random variable $\xi$ drawn from a distribution having uniform weight on each of the samples $\tau_1, \ldots, \tau_n$.
³ The straight-line Euclidean distance, $s$, is another possible approximation to the geodesic distance, but it will not enable us to easily solve the mean shape minimization in closed form.

Using this linear projection distance $r$ in the mean shape minimization yields an expression for the mean shape $\mu^*$:

$$\mu^* = \arg \inf_{\|\mu\|=1} \sum_i \left(1 - |\tau_i^H \mu|^2\right) \qquad (9)$$
$$= \arg \sup_{\|\mu\|=1} \sum_i (\tau_i^H \mu)^H (\tau_i^H \mu) \qquad (10)$$
$$= \arg \sup_{\|\mu\|=1} \mu^H \Big( \sum_i \tau_i \tau_i^H \Big) \mu \qquad (11)$$
$$= \arg \sup_{\|\mu\|=1} \mu^H S \mu, \qquad (12)$$

thus $\mu^*$ is the complex eigenvector corresponding to the largest eigenvalue of $S$ [5].

[Figure 3. Full-contour shape representation. Left to right: original image, an example extracted contour, and the mean of the learned shape class. The top of the tool is slightly rounded off due both to the small number of points (50) sampled from the contour to form the pre-shape, and to the smoothing effect of computing the mean shape of the tool as it opens and closes.]

Figure 3 shows an example of the mean shape learned for the tool class. On the left is an example raw image, in the middle is an extracted measurement contour, and on the right is the learned mean of the tool distribution.
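A minimal NumPy sketch of this eigenvector computation (eqs. (9)-(12)), assuming complex pre-shapes produced as in the Section II sketch; `procrustean_mean` is our own name for it.

```python
import numpy as np

def procrustean_mean(taus):
    """Approximate Frechet mean of complex pre-shapes: the eigenvector of
    S = sum_i tau_i tau_i^H with the largest eigenvalue (eqs. 9-12)."""
    T = np.stack(taus)                     # (n_samples, n_points), complex
    S = T.T @ np.conj(T)                   # S[a, b] = sum_i tau_i[a] * conj(tau_i[b])
    _, eigvecs = np.linalg.eigh(S)         # S is Hermitian; eigenvalues ascending
    mu = eigvecs[:, -1]                    # eigenvector of the largest eigenvalue
    # The global phase of mu is arbitrary; it only fixes a reference rotation.
    return mu / np.linalg.norm(mu)
```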
2) The Shape Distribution Covariance: With our measurements $\{m_1, m_2, \ldots\}$ we can now fit a probabilistic model of shape. In many applications, pre-shape data will be tightly localized around a mean shape; in such cases, the tangent space to the pre-shape hypersphere located at the mean shape will be a good approximation to the pre-shape space, as in Figure 4. By linearizing our distribution in this manner, we can take advantage of standard multivariate statistical analysis techniques, representing our shape distribution as a Gaussian. In cases where the data is very spread out, one can use a complex Bingham distribution [5].

[Figure 4. Tangent space distribution.]

In order to transform our set of pre-shapes into an appropriate tangent space, we first compute a mean shape $\mu$ as above.⁴ We then fit the observed pre-shapes to $\mu$, and project each fitted pre-shape into the tangent space at $\mu$. The tangent space coordinates for pre-shape $\tau_i$ are given by

$$v_i = (I - \mu \mu^H)\, e^{j \theta_i^*} \tau_i, \qquad (13)$$

where $j^2 = -1$ and $\theta_i^*$ is the optimal Procrustes-matching rotation angle of $\tau_i$ onto $\mu$. We can now also apply a dimensionality reduction technique such as PCA to the tangent space data $v_1, \ldots, v_n$ to get a compact representation of the estimated shape distribution (Figure 4).

⁴ We use $\mu$ to refer to the $\mu^*$ computed from the optimization in (7) for the remainder of the paper.

[Figure 5. Shape model for a wire-cutter tool: the effects of the first three principal components (eigenvectors) on the mean shape. Top left: the mean shape as in Figure 3; top right: a sample shape from the first eigenshape; bottom left: a sample from the second eigenshape; bottom right: a sample from the third eigenshape. In each sampled shape, the mean shape is overlaid for comparison.]

Figure 5 shows samples drawn from the learned generative model of the tool class along each eigenvector, in comparison to the mean shape. The first eigenshape largely captures how the shape deforms as the tool opens and closes, and the second and third eigenshapes largely capture how the shape deforms due to perspective geometry.
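A sketch of this tangent-space projection and the subsequent PCA, again assuming the complex pre-shapes and the `procrustean_mean` of the earlier sketches; the stacking of real and imaginary parts into real coordinates is our own convention for illustration.

```python
import numpy as np

def tangent_coordinates(taus, mu):
    """Project Procrustes-fitted pre-shapes into the tangent space at mu (eq. 13)."""
    P = np.eye(len(mu)) - np.outer(mu, np.conj(mu))       # I - mu mu^H
    V = []
    for tau in taus:
        theta = np.angle(np.vdot(tau, mu))                # Procrustes fit angle of tau onto mu
        V.append(P @ (np.exp(1j * theta) * tau))
    return np.array(V)

def shape_gaussian(V, n_components=3):
    """Fit a Gaussian in the tangent space and keep the leading principal components."""
    X = np.hstack([V.real, V.imag])                       # stack into real coordinates
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)                    # ascending eigenvalues
    return mean, evecs[:, -n_components:], evals[-n_components:]
```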
IV. SHAPE COMPLETION

We now turn to the problem of estimating the complete geometry of an object from an observation of part of its contour. We phrase this as a maximum likelihood estimation problem, estimating the missing points of a shape with respect to the Gaussian tangent space shape distribution. Let us represent a shape as

$$z = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} \qquad (14)$$

where $z_1 = m$ contains the $p$ points of our partial observation of the shape, and $z_2$ contains the $n - p$ unknown points that complete the shape. Given a shape distribution $D$ on $n$ points with mean $\mu$ and covariance matrix $\Sigma$, and given $z_1$ containing $p$ measurements ($p < n$) of our shape, our task is to compute the last $n - p$ points which maximize the joint likelihood, $P_D(z)$. In contrast to previous sections, we will now work in real, vectorized coordinates ($z = (x_1, y_1, \ldots, x_n, y_n)^T$) rather than in the complex coordinates which were useful for encoding rotation.

[Figure 6. An example of occluded objects, where the right tool occludes the left tool. Left to right: the original image, the measured contour, and the contour segments that must be completed.]

In order to transform our completed vector $z = (z_1, z_2)^T$ into a pre-shape, we must first normalize translation and scale. However, this cannot be done without knowing the last $n - p$ points. Furthermore, the Procrustes minimizing rotation from $z$'s pre-shape to $\mu$ depends on the missing points, so any projection into the tangent space (and the corresponding likelihood) will depend in a highly non-linear way on the location of the missing points.

We can, however, compute the missing points $z_2$ given an orientation and scale. This leads to an iterative algorithm that holds the orientation and scale fixed, computes $z_2$, and then computes a new orientation and scale given the new $z_2$. The translation term can then be computed from the completed contour $z$.

We derive $z_2$ given a fixed orientation $\theta$ and scale $\alpha$ in the following manner. For a complete contour $z$, we normalize for orientation and scale using

$$z' = \frac{1}{\alpha} R_\theta z \qquad (15)$$

where $R_\theta$ is the rotation matrix of $\theta$. To center $z'$, we then subtract off the centroid:

$$w = z' - \frac{1}{n} C z' \qquad (16)$$

where $C$ is the $2n \times 2n$ checkerboard matrix

$$C = \begin{bmatrix} 1 & 0 & \cdots & 1 & 0 \\ 0 & 1 & \cdots & 0 & 1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 1 & 0 & \cdots & 1 & 0 \\ 0 & 1 & \cdots & 0 & 1 \end{bmatrix} \qquad (17)$$

Thus $w$ is the centered pre-shape. Now let $M$ be the tangent space projection matrix:

$$M = I - \mu \mu^T \qquad (18)$$

Then the Mahalanobis distance with respect to $D$ from $M w$ to the origin in the tangent space is

$$d_\Sigma = (M w)^T \Sigma^{-1} M w \qquad (19)$$

Minimizing $d_\Sigma$ is equivalent to maximizing $P_D(\cdot)$, so we continue by setting $\partial d_\Sigma / \partial z_2$ equal to zero, and letting

$$W_1 = M_1 \left( I_1 - \frac{1}{n} C_1 \right) \frac{1}{\alpha} R_{\theta,1} \qquad (20)$$
$$W_2 = M_2 \left( I_2 - \frac{1}{n} C_2 \right) \frac{1}{\alpha} R_{\theta,2} \qquad (21)$$

where the subscripts "1" and "2" indicate the left and right sub-matrices of $M$, $I$, and $C$ that match the dimensions of $z_1$ and $z_2$. This yields the following system of linear equations, which can be solved for the missing data $z_2$:

$$(W_1 z_1 + W_2 z_2)^T \Sigma^{-1} W_2 = 0 \qquad (22)$$

As described above, equation (22) holds for a specific orientation and scale; a sketch of this fixed-orientation solve is given below.
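A sketch of solving (22) for $z_2$ at a fixed orientation and scale. Here we form the full normalization operator $A = M (I - \frac{1}{n}C) \frac{1}{\alpha} R_\theta$ and take $W_1, W_2$ as its column blocks corresponding to $z_1$ and $z_2$; this is our reading of the sub-matrix notation above, and all names are illustrative.

```python
import numpy as np

def complete_shape(z1, mu, Sigma, theta, alpha):
    """Solve (W1 z1 + W2 z2)^T Sigma^{-1} W2 = 0 for the missing points z2,
    with orientation theta and scale alpha held fixed.
    z1: (2p,) observed points (x1, y1, ..., xp, yp); mu: (2n,) real mean shape."""
    n2 = len(mu)                       # 2n
    p2 = len(z1)                       # 2p
    n = n2 // 2
    # Block-diagonal in-plane rotation of every (x, y) pair.
    c, s = np.cos(theta), np.sin(theta)
    R = np.kron(np.eye(n), np.array([[c, -s], [s, c]]))
    # Checkerboard centering matrix C (eq. 17) and tangent projector M (eq. 18).
    C = np.tile(np.eye(2), (n, n))
    M = np.eye(n2) - np.outer(mu, mu)
    A = M @ (np.eye(n2) - C / n) @ R / alpha
    W1, W2 = A[:, :p2], A[:, p2:]
    Sinv = np.linalg.inv(Sigma)
    # Normal equations from eq. (22): W2^T Sinv W2 z2 = -W2^T Sinv W1 z1.
    z2 = np.linalg.solve(W2.T @ Sinv @ W2, -W2.T @ Sinv @ W1 @ z1)
    return np.concatenate([z1, z2])
```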
We can then use the estimate of $z_2$ to re-optimize $\theta$ and $\alpha$ and iterate. Alternatively, we can simply sample a number of candidate orientations and scales, complete the shape for each sample, and take the completion with the highest likelihood (lowest $d_\Sigma$).

To design such a sampling algorithm, we must choose a distribution from which to sample orientations and scales. One idea is to match the partial shape $z_1$ to the partial mean shape $\mu_1$ by computing the pre-shapes of $z_1$ and $\mu_1$ and finding the Procrustes fitting rotation, $\theta^*$, from the pre-shape of $z_1$ onto the pre-shape of $\mu_1$. This angle can then be used as the mean of a von Mises distribution (the circular analog of a Gaussian) from which to sample orientations. Similarly, we can sample scales from a Gaussian with mean $\alpha_0$, the ratio of scales of the partial shapes $z_1$ and $\mu_1$, as in

$$\alpha_0 = \frac{\left\| z_1 - \frac{1}{p} C_1 z_1 \right\|}{\left\| \mu_1 - \frac{1}{p} C_1 \mu_1 \right\|}. \qquad (23)$$

Any sampling method for shape completion will have a scale bias: completed shapes with smaller scales project to a point closer to the origin in tangent space, and thus have higher likelihood. One way to fix this problem is to solve for $z_2$ by performing a constrained optimization on $d_\Sigma$ where the scale of the centered, completed shape vector is constrained to have unit length:

$$\left\| x' - \frac{1}{n} C x' \right\| = 1. \qquad (24)$$

This constrained optimization problem can be attacked with the method of Lagrange multipliers, and reduces to the problem of finding the zeros of an $(n-p)$-th order polynomial in one variable, for which numerical techniques are well known. In preliminary experiments this scale bias has not appeared to cause any obvious errors in shape completion, although more rigorous testing and analysis are needed.

V. SHAPE CLASSIFICATION

Given $k$ previously learned shape classes $C_1, \ldots, C_k$ with shape means $\mu_1, \ldots, \mu_k$ and covariance matrices $\Sigma_1, \ldots, \Sigma_k$, and given a measurement $m$ of an unknown object shape, we can now compute a distribution over shape classes for a measured object: $\{P(C_i \mid m) : i = 1 \ldots k\}$. The shape classification problem is to find the posterior mode of this distribution, $\hat{C}$:

$$\hat{C} = \arg\max_{C_i} P(C_i \mid m) \qquad (25)$$
$$= \arg\max_{C_i} P(m \mid C_i) P(C_i). \qquad (26)$$

Assuming a uniform prior on $C_i$, we have the standard maximum likelihood (ML) estimator

$$\hat{C} = \hat{C}_{ML} = \arg\max_{C_i} P(m \mid C_i). \qquad (27)$$

A. Point Correspondences

The difficulty in proceeding with this ML problem lies in corresponding the measurement $m$ with each object model, that is, identifying which points in $m$ correspond to which points in model $C_i$. One potential drawback of contour-based shape analysis (compared to other representations such as point clouds) is that we require correspondences between all points of any two shapes we wish to compare using the Procrustean metric.⁵ While this caveat is clearly the major source of power of our model, it can also be a major difficulty in real-world problems where correspondences can be difficult to identify. The correspondence problem also requires determining whether $m$ is a complete or a partially occluded measurement. These problems are complicated even further when $m$ is allowed to be a measurement of more than one object (Figure 6). In this case, we have a segmentation problem as well as a correspondence problem. Data association and segmentation are both very challenging problems in and of themselves; as such, they are not problems which we will attempt to fully address in this paper.

⁵ We also require that the number of points on each shape be the same.

We constrain ourselves to the case where $m$ is a measurement of a single object. In particular, we address the following two cases: 1) $m$ is a complete measurement of a single object; 2) $m$ is a simply, partially occluded measurement of a single object. By simply occluded we mean that $m$ consists of one contiguous segment of a full (closed) contour.

For the 2-D contour case, one method for finding point correspondences is to identify local features of interest on each contour (for example, based on curvature or local changes in shape) and match these higher-level features to similar sets of features on other contours, interpolating correspondences for the points between features. For our experiments, we implemented a feature correspondence algorithm for closed contours based on a hierarchical probabilistic model incorporating local feature match likelihoods together with the global feature shape match likelihood; this was done primarily to demonstrate that a purely geometric feature correspondence algorithm is tenable. However, in practice, algorithms incorporating more robust features (such as SIFT features) are more desirable than purely geometric algorithms. The details of this feature correspondence algorithm are omitted due to space constraints.
B. Complete Shape Class Likelihood

Given that a measurement $m$ represents a complete contour, the problem of finding the maximum likelihood estimate of each shape class, $\hat{C}_{ML}$, is relatively straightforward. First, calculate the pre-shape $\tau$ associated with $m$. Then, for each class $C_i$, $i = 1 \ldots k$, find correspondences between $\tau$ and the mean shape $\mu_i$ of model $C_i$ and normalize $\tau$ to have the same number of points as $\mu_i$, with the points around $\tau$'s contour corresponding as closely as possible to the points around the contour of $\mu_i$. Next, re-normalize this feature-matched $\tau$ so that it lies in pre-shape space (i.e. centered at the origin, with length one), and call this re-normalized pre-shape $\tau'$. Finally, compute the orthogonal Procrustes fit of $\tau'$ onto $\mu_i$, giving orientation $\theta^*$, and project into tangent space so that

$$z = M R_{\theta^*} \tau', \qquad (28)$$

where again $R_{\theta^*}$ is the rotation matrix of $\theta^*$. The probability of this point $z$ in tangent space is then given by the standard Gaussian likelihood,

$$P(m \mid C_i) \approx P(z \mid C_i) = \frac{1}{\sqrt{(2\pi)^{2n} |\Sigma_i|}} \exp\left(-\frac{1}{2} z^T \Sigma_i^{-1} z\right). \qquad (29)$$

C. Partial Shape Class Likelihood

The most obvious approach to the partial shape class likelihood is to simply complete the missing portion of the partial shape corresponding to $m$ with respect to each shape class, and then classify the completed shape as above (Figure 7). To make this concrete, let $z = \{z_1, z_2\}$ be the completed shape, where $z_1$ is the partial shape corresponding to measurement $m$, and $z_2$ is unknown. Then

$$P(C_i \mid z_1) = \frac{P(C_i, z_1)}{P(z_1)} \propto \int P(C_i, z_1, z_2)\, dz_2 \qquad (30)$$

Rather than marginalize over the hidden data $z_2$, we can approximate with the estimate $\hat{z}_2$, the output of our shape completion algorithm, yielding

$$P(C_i \mid z_1) \approx \eta \cdot P(z_1, \hat{z}_2 \mid C_i) \qquad (31)$$

where $\eta$ is a normalizing constant (and can be ignored during classification), and $P(z_1, \hat{z}_2 \mid C_i)$ is the complete shape class likelihood of the completed shape.

[Figure 7. Shape classification of partial contours. In the second, third and fourth images, the mean shape of the shape class is shown in blue on the left, and the approximate maximum likelihood completion of the partial contour is shown on the right. Panels: (a) partial contour to be completed, (b) completed as fork, (c) completed as spoon, (d) completed as tool.]

There are several variables of the shape completion algorithm which may influence the accuracy of the completed shape, e.g. feature correspondences, sample rotations and sample scales. As there is some randomness associated with these variables, they should ideally be marginalized out of the partial shape class likelihood. Additionally, we may want to include a prior on the number of points to be completed, as this number must be estimated with a partial shape feature correspondence algorithm. Such a prior would encode the common-sense principle that shape completions with fewer unknown points are to be preferred over shape completions with a large number of unknown points.
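A sketch of the complete-contour classifier of equations (27)-(29), reusing the pre-shape and tangent-space conventions of the earlier sketches (following the complex form of eq. (13)) and working with log-likelihoods for numerical stability. It assumes correspondence finding and point resampling have already been done, and that each $\Sigma_i$ is full rank; in practice one would evaluate the Gaussian in the reduced PCA subspace instead.

```python
import numpy as np

def log_gaussian_tangent_likelihood(tau, mu, Sigma):
    """log P(z | C) for the tangent-space projection z of pre-shape tau (eqs. 28-29).
    mu: complex mean pre-shape; Sigma: covariance in stacked real coordinates."""
    theta = np.angle(np.vdot(tau, mu))                       # Procrustes fit angle
    v = (np.eye(len(mu)) - np.outer(mu, np.conj(mu))) @ (np.exp(1j * theta) * tau)
    z = np.concatenate([v.real, v.imag])                     # real 2n-vector
    _, logdet = np.linalg.slogdet(Sigma)
    quad = z @ np.linalg.solve(Sigma, z)
    return -0.5 * (len(z) * np.log(2 * np.pi) + logdet + quad)

def classify(tau, classes):
    """ML classification (eq. 27): classes is a list of (name, mu_i, Sigma_i) tuples."""
    scores = {name: log_gaussian_tangent_likelihood(tau, mu, S)
              for name, mu, S in classes}
    return max(scores, key=scores.get)
```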
VI. EXPERIMENTAL RESULTS

Videos of a controlled desktop environment were processed with routines from the Intel Open Source Computer Vision Library (OpenCV). Contours were extracted from the raw image frames using intensity thresholding and pyramid segmentation edges (as in Figure 1). Shape models of three object classes — forks, spoons, and wire-cutter tools — were then generated from a manually chosen subset of the extracted contours. An illustration of the tool shape model is shown in Figure 5. Complete contour shape classification was then tested on a subset of the training data, yielding a classification rate of 100% (Figure 3).

[Figure 8. Partial contour shape classification. Top row: the extracted contour is correctly classified as a spoon. Left to right: original image, extracted contour, shape completion using tool, fork, and spoon models. Bottom row: the extracted contour is misclassified as a spoon; however, the partial contour is uninformative, even for humans. Left to right: original image, extracted contour, shape completion using tool, fork, and spoon models.]

Next, eight partial contours (three of tools, three of forks, and two of spoons) were extracted from video frames of occluded objects. Partial contours were segmented by hand (where needed), completed with respect to each shape model, and classified (Figure 8). This process was repeated while varying $N$, the number of principal components kept in each shape model, from one to ten. For $N < 5$ the classification rate was 5/8, while for $N = 5$ and above the classification rate was 6/8. Incorporating a prior on the completion size (as in the previous section) increased the classification rate to 7/8 for $N > 7$. The one partial shape which was consistently misclassified was the end of a fork handle, impossible even for a human to classify correctly more than one-third of the time (Figure 7, lower row).

VII. RELATED WORK

There is a great deal of work on statistical shape modeling, beginning with the work on landmark data by Kendall [11] and Bookstein [4] in the 1980s. In recent years, more complex statistical shape models have arisen, for example in the active contours literature [3]. Procrustes analysis pre-dates statistical shape theory by two decades; algorithms for finding Procrustean mean shapes [13], [7], [2] were developed long before the topology of shape spaces was well understood [12]. In terms of shape classification, shape contexts [1] and spin images [9] provide robust frameworks for estimating correspondences between shape features for recognition and modelling problems. An interesting take on shape completion using probable object symmetries is presented in [17].

VIII. CONCLUSION

We have presented an approach to geometric object modelling that unifies the problems of modelling geometry, recognizing object shapes, and inferring occluded portions of the object model. The algorithm depends on a technique known as Procrustean shape analysis [11], and in particular we have derived an expression for the maximum likelihood object geometry given only a partial observation of the object.
We have shown some preliminary results on everyday objects, but we plan to extend these results to more complex scenes and geometric inference problems. The description of our algorithm given in this paper is restricted to inferring geometry from two-dimensional images. However, the technique extends to higher dimensions, and we plan to demonstrate the same approach to object modelling on three-dimensional object data, such as from a laser range finder or stereo camera.

REFERENCES

[1] Serge Belongie, Jitendra Malik, and Jan Puzicha. Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Analysis and Machine Intelligence, 24(4):509–522, April 2002.
[2] Jos M. F. Ten Berge. Orthogonal Procrustes rotation for two or more matrices. Psychometrika, 42(2):267–276, June 1977.
[3] A. Blake and M. Isard. Active Contours. Springer-Verlag, 1998.
[4] F. L. Bookstein. A statistical method for biological shape comparisons. Theoretical Biology, 107:475–520, 1984.
[5] I. Dryden and K. Mardia. Statistical Shape Analysis. John Wiley and Sons, 1998.
[6] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. 2003.
[7] J. C. Gower. Generalized Procrustes analysis. Psychometrika, 40(1):33–51, March 1975.
[8] John R. Hurley and Raymond B. Cattell. The Procrustes program: Producing direct rotation to test a hypothesized factor structure. Behavioral Science, 7:258–261, 1962.
[9] Andrew Johnson and Martial Hebert. Using spin images for efficient object recognition in cluttered 3-D scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(5):433–449, May 1999.
[10] M. Kazhdan, T. Funkhouser, and S. Rusinkiewicz. Rotation invariant spherical harmonic representation of 3D shape descriptors, 2003.
[11] D. G. Kendall. Shape manifolds, Procrustean metrics, and complex projective spaces. Bull. London Math. Soc., 16:81–121, 1984.
[12] D. G. Kendall, D. Barden, T. K. Carne, and H. Le. Shape and Shape Theory. John Wiley and Sons, 1999.
[13] Walter Kristof and Bary Wingersky. Generalization of the orthogonal Procrustes rotation procedure to more than two matrices. In Proceedings, 79th Annual Convention, APA, pages 89–90, 1971.
[14] David G. Lowe. Object recognition from local scale-invariant features. In Proc. of the International Conference on Computer Vision (ICCV), pages 1150–1157, 1999.
[15] Danijel Skocaj and Aleš Leonardis. Weighted and robust incremental method for subspace learning. In Proceedings of the International Conference on Computer Vision (ICCV), pages 1494–1501, 2003.
[16] C. G. Small. The Statistical Theory of Shape. Springer, 1996.
[17] S. Thrun and B. Wegbreit. Shape from symmetry. In Proceedings of the International Conference on Computer Vision (ICCV), Beijing, China, 2005.