The UPV at GeoCLEF 2007

Davide Buscaldi

The UPV at GeoCLEF 2007

2007

Abstract In this work we attempted to determine the relative importance of the geographical and WordNet-extracted terms with respect to the remainder of the query. Our system is based on Lucene and uses LingPipe for Named Entity recognition. Geographical terms are expanded with WordNet holonyms and synonyms and indexed separately. We checked the relative importance of the terms by boosting them with reduction factors (0.75, 0.5 and 0.25).

The UPV at GeoCLEF 2007 Davide Buscaldi and Paolo Rosso Dpto. de Sistemas Informticos y Computación (DSIC), Universidad Politcnica de Valencia, Spain {dbuscaldi, prosso}@dsic.upv.es Abstract In this work we attempted to determine the relative importance of the geographical and WordNet-extracted terms with respect to the remainder of the query. Our system is based on Lucene and uses LingPipe for Named Entity recognition. Geographical terms are expanded with WordNet holonyms and synonyms and indexed separately. We checked the relative importance of the terms by boosting them with reduction factors (0.75, 0.5 and 0.25). The comparison to the clean system (using only Lucene) shows that it is possible to improve the mean average precision if the importance of geographical terms is equal or less than the half with respect to the content words in the query. We also observed that WordNet holonyms may help in improving the recall but the term expansion is sensible to ambigue place names. As a further work, we will need to implement a toponym disambiguation method in order to reduce the impact of this kind of ambiguity. Categories and Subject Descriptors H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software General Terms Measurement, Performance, Experimentation Keywords Geographical Information Retrieval, Index Term Expansion 1 Introduction Since our first participation at the GeoCLEF we have been developing a method that can use the information contained in the WordNet [6] ontology for the Geographical Information Retrieval task. In our first attempt [2, 5] we simply used synonyms (alternate names) and meronyms of locations that appeared in the query in order to expand the query itself. This method performed poor, due to the noise introduced by the expansion. Subsequently, we introduced a method that exploits the inverse of the meronymy relationship - holonymy (a concept A is holonym of another concept B if A contains B). We named this method Index Term Expansion [3]. With this method we add to the geographical index terms the informations about their holonyms, such that a user looking information about Spain will find documents containing Valencia, Madrid or Barcelona even if the document itself does not contain any reference to Spain. The results obtained with this method showed that the inclusion of WordNet holonyms allowed to obtain an improvement in recall, although it was not so significant as we hoped (about 1%). Moreover, we noticed that the use of the Index Term Expansion method did not allow to obtain the same precision of the clean system. We individuated the reason of this behaviour in the fact that the geographical terms were assigned the same importance of the other terms of the query. Therefore, in this participation we attempted to determine the relative importance of geographical and WordNet-extracted terms with respect to the remainder of the terms of the query. This has been done by means of the separation of the index of geographical terms from the general index and the creation of another index that contains only WordNet-extracted terms. In the following section, we describe the system and how index term expansion works. In section 3 we describe the characteristics of our submissions and show a resume of the obtained results. 2 Our System The core of the system is constituted by the Lucene1 open source search engine, version 2.1. The engine is supported by a module that uses LingPipe2 for HMM-based Named Entity recognition (this module performs the task of recognizing geographical names in text), and another one that is based on the MIT Java WordNet Interface 3 in order access the WordNet ontology and find synonyms and holonyms of the geographical names. 2.1 Indexing During the indexing phase, the documents are examined in order to find location names (toponym) by means of LingPipe. When a toponym is found, then two actions are performed: first of all, the toponym is added to a separate index (geo index) that contains only the toponyms. In the second place, WordNet is examined in order to find holonyms (recursively) and synonyms of the toponym. The retrieved holonyms and synonyms are put in another separate index (wn index), containing only wordnet-related information. For instance, consider the following text from the document GH950630-000000 in the Glasgow Herald 95 collection: ...The British captain may be seen only once more here, at next month’s world championship trials in Birmingham, where all athletes must compete to win selection for Gothenburg... The following toponyms are added to the geo index: “Birmingham”, “Gothenburg”. Birmingham is found in WordNet both as Birmingham, Pittsburgh of the South, in the United States and Birmingham, Brummagem, an important city in England. The holonyms in the first case are Alabama, Gulf States, South, United States of America and their synonyms. In the second case, we obtain England, United Kingdom, Europe and their synonyms. All these words are added to the wn index for Birmingham, since we did not use any method in order to disambiguate the toponym. For Gothenburg we obtain Sweden and Europe again, together with the original Swedish name of Gothenburg (Goteborg). These words are also added to the wn index. 2.2 Searching For each topic, LingPipe is run again in order to find the geographical terms. In the search phase, we do not use WordNet. However, the toponyms individuated by LingPipe are searched in the geographical and/or WordNet indices. 1 http://lucene.apache.org/ 2 http://www.alias-i.com/lingpipe/ 3 http://www.mit.edu/∼markaf/projects/wordnet/ 3 Experiments We submitted a total of 12 runs at GeoCLEF 2007. Two runs were used as “benchmarks”: they were obtained by using the base Lucene system, without index term expansion, in one case considering only topic title and description, and all fields in the other case. The remaining runs used the geo index or wn index or both, with different weightings that were submitted using the Lucene “Boost” operator. This operator allows to assign relative importance to terms. This means that a term with, for instance, a boost factor of 4 will be four times more important than the other terms in the query. We used 0.75, 0.5 and 0.25 as boost factor for geographical and WordNet terms, in order to study their importance in the retrieval process. In the following tables we show the results obtained in terms of Mean Average Precision and Recall for all the submitted runs. Table 1: Mean Average Precision (MAP) and Recall obtained for all the “Title+Description only” runs. run ID rfiaUPV01 rfiaUPV03 rfiaUPV05 rfiaUPV07 rfiaUPV08 rfiaUPV09 rfiaUPV10 rfiaUPV11 rfiaUPV12 geo boost 0 0.5 0.5 0.75 0.75 0.25 0.25 0.5 0.75 wn boost 0 0.0 0.25 0.0 0.25 0.25 0.0 0.5 0.75 MAP 0.226 0.227 0.238 0.224 0.224 0.239 0.236 0.239 0.231 Recall 0.886 0.869 0.881 0.860 0.860 0.888 0.891 0.886 0.877 Table 2: Mean Average Precision and Recall obtained for the “All fields” runs. run ID rfiaUPV02 rfiaUPV04 rfiaUPV06 geo boost 0 0.5 0.5 wn boost 0 0.0 0.25 MAP 0.247 0.256 0.263 Recall 0.903 0.915 0.926 The results obtained with the topic title and description (Table 1) show that by considering geographical terms less important is possible to obtain a better MAP. The integration of WordNet terms allows to improve further the MAP, although it has almost no effect over recall. This may be due to the noise introduced by the ambiguity of some toponyms. However, if we consider all the topic fields (Table 2) we can observe that the introduction of WordNet allowed to improve also the recall. We need to carry out further study of the data in order to fully understand these results. 4 Conclusions and Further Work The obtained results show that geographical terms are less important than content words in the topics. Reducing the importance of geographical terms allowed to improve the mean average precision. The impact of WordNet is not clear. We suppose that the effects of the introduction of WordNet synonyms and holonyms are conditioned by the ambiguity of some toponyms, such as “Birmingham” that can be a city in Alabama or in England. The ambiguity of toponyms is a common problem in news text [4], and currently various approaches are being developed [7, 1]. We plan to carry out more experiments in order to understand better the impact of toponym ambiguity over geographical information retrieval. Acknowledgements We would like to thank the TIN2006-15265-C06-04 research project for partially supporting this work. References [1] D. Buscaldi and P. Rosso. A conceptual density-based approach for the disambiguation of toponyms. International Journal of Geographical Information Systems, 2008. accepted, to be published. [2] D. Buscaldi, P. Rosso, and E. Sanchis. Using the wordnet ontology in the geoclef geographical information retrieval task. In Carol Peters, Fredric C. Gey, Julio Gonzalo, Henning Mller, Gareth J.F. Jones, Michael Kluck, Bernardo Magnini, Maarten de Rijke, and Danilo Giampiccolo, editors, Accessing Multilingual Information Repositories, volume 4022 of Lecture Notes in Computer Science, pages 939–946. Springer, Berlin, 2006. [3] D. Buscaldi, P. Rosso, and E. Sanchis. A wordnet-based indexing technique for geographical information retrieval. In Carol Peters, Fredric C. Gey, Julio Gonzalo, Henning Mller, Gareth J.F. Jones, Michael Kluck, Bernardo Magnini, Maarten de Rijke, and Danilo Giampiccolo, editors, Lecture Notes in Computer Sciences, volume 4730 of Lecture Notes in Computer Science, pages 954–957. Springer, Berlin, 2007. [4] Eric Garbin and Inderjeet Mani. Disambiguating toponyms in news. In conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT05), pages 363–370, Morristown, NJ, USA, 2005. Association for Computational Linguistics. [5] Fredric Gey, Ray Larson, Mark Sanderson, Hideo Joho, and Paul Clough. Geoclef: the clef 2005 cross-language geographic information retrieval track. In Working notes for the CLEF 2005 Workshop (C.Peters Ed.), Vienna, Austria, 2005. [6] G. A. Miller. Wordnet: A lexical database for english. In Communications of the ACM, volume 38, pages 39–41, 1995. [7] Simon Overell, Joao Magalhaes, and Stefan Rüger. Place disambiguation with co-occurrence models. In Carol Peters, editor, GeoCLEF 2006 Workshop, Alicante, Spain, 2006.

Log In

The UPV at GeoCLEF 2007

Related papers

Related papers