The purpose of text geolocation is to associate geographic information contained in a document with a set (or sets) of coordinates, either implicitly by using linguistic features and/or explicitly by using geographic metadata combined with heuristics. We introduce a geocoder (location mention disambiguator) that achieves state-of-the-art (SOTA) results on three diverse datasets by exploiting implicit lexical clues. Moreover, we propose a new method for systematically encoding geographic metadata to generate two distinct views of the same text. To that end, we introduce the Map Vector (MapVec), a sparse representation obtained by plotting prior geographic probabilities, derived from population figures, on a world map. We then integrate the implicit (language) and explicit (map) features to significantly improve a range of metrics. We also introduce an open-source dataset of news events covering global disease outbreaks and epidemics to support future evaluation in geoparsing.
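As a rough illustration of the MapVec idea described above, the sketch below tiles the world into a coarse grid and spreads each toponym's population-derived prior probability over the cells containing its candidate coordinates. The grid resolution, the toy gazetteer and the normalisation scheme are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

# Sketch of a MapVec-style feature: a sparse vector over world-map cells
# where each toponym contributes prior probability mass proportional to
# the population of its candidate interpretations.
GRID_DEG = 2                              # 2-degree cells (assumed resolution)
ROWS, COLS = 180 // GRID_DEG, 360 // GRID_DEG

# Hypothetical toy gazetteer: toponym -> list of (lat, lon, population).
GAZETTEER = {
    "Cambridge": [(52.2053, 0.1218, 124_000),     # Cambridge, UK
                  (42.3736, -71.1097, 118_000)],  # Cambridge, MA, USA
}

def cell_index(lat, lon):
    """Map a coordinate to a flattened grid-cell index."""
    row = int((90 - lat) // GRID_DEG) % ROWS
    col = int((lon + 180) // GRID_DEG) % COLS
    return row * COLS + col

def map_vector(toponyms):
    """Build a sparse prior-probability vector over world-map cells."""
    vec = np.zeros(ROWS * COLS)
    for name in toponyms:
        candidates = GAZETTEER.get(name, [])
        total_pop = sum(pop for _, _, pop in candidates) or 1
        for lat, lon, pop in candidates:
            vec[cell_index(lat, lon)] += pop / total_pop
    if vec.sum() > 0:
        vec /= vec.sum()                  # normalise the whole map to sum to 1
    return vec

mv = map_vector(["Cambridge"])
print(mv.shape, mv[mv > 0])               # (16200,) plus two non-zero cell weights
```

In practice the candidate coordinates and population figures would come from a full gazetteer (e.g. GeoNames), and the resulting vector would be fed to the model alongside the lexical features.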
Named entities are frequently used in a metonymic manner: they serve as references to related entities such as people and organisations. Accurate identification and interpretation of metonymy can directly benefit various NLP applications, such as Named Entity Recognition and geographical parsing. Until now, metonymy resolution (MR) methods have mainly relied on parsers, taggers, dictionaries, external word lists and other hand-crafted lexical resources. We show how a minimalist neural approach combined with a novel predicate window method can achieve competitive results on the SemEval 2007 task on Metonymy Resolution. Additionally, we contribute a new Wikipedia-based MR dataset called RelocaR, which is tailored towards locations and addresses deficiencies in previous annotation guidelines.
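The predicate window is only named in the abstract above, so the snippet below is a speculative sketch of one way such a window could be extracted: keep a few tokens around the syntactic head (typically the governing predicate) of the target entity instead of the full sentence. The window size, the use of spaCy's dependency parse and the naive mention matching are all illustrative assumptions.

```python
import spacy

# Rough sketch of a "predicate window" for metonymy resolution: the
# classifier sees only the tokens surrounding the word that governs the
# target entity, rather than the entire sentence.
nlp = spacy.load("en_core_web_sm")

def predicate_window(sentence, entity, size=5):
    doc = nlp(sentence)
    # Locate the entity token (naive exact-string match for illustration).
    ent_tokens = [t for t in doc if t.text == entity]
    if not ent_tokens:
        return []
    head = ent_tokens[0].head             # syntactic governor of the entity
    lo = max(head.i - size, 0)
    hi = min(head.i + size + 1, len(doc))
    return [t.text for t in doc[lo:hi]]

# "Vancouver" here is used metonymically (the team, not the city).
print(predicate_window("Vancouver won the game last night.", "Vancouver"))
```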
Geographical data can be obtained by converting place names from free-format text into geographical coordinates. The ability to geolocate events in textual reports is a valuable source of information in many real-world applications such as emergency response, real-time social media geographical event analysis, understanding location instructions in auto-response systems and more. However, geoparsing is still widely regarded as a challenge because of domain language diversity, place name ambiguity, metonymic language and limited leveraging of context, as we show in our analysis. Results to date, whilst promising, have been obtained on laboratory data and, unlike in wider NLP, are often not cross-compared. In this study, we evaluate and analyse the performance of several leading geoparsers on a number of corpora and highlight the challenges in detail. We also publish an automatically geotagged Wikipedia corpus to alleviate the dearth of (open-source) corpora in this domain.
Empirical methods in geoparsing have thus far lacked a standard evaluation framework describing the task, data and metrics used to establish state-of-the-art systems. Evaluation is further made inconsistent, even unrepresentative of real-world usage, by the lack of distinction between the different types of toponyms, which necessitates new guidelines, a consolidation of metrics and a detailed toponym taxonomy with implications for Named Entity Recognition (NER). To address these deficiencies, our manuscript introduces such a framework in three parts. Part 1) Task Definition: clarified via a corpus-linguistic analysis proposing a fine-grained Pragmatic Taxonomy of Toponyms with new guidelines. Part 2) Evaluation Data: shared via a dataset called GeoWebNews that provides test/train data to enable immediate use of our contributions. In addition to fine-grained Geotagging and Toponym Resolution (Geocoding), this dataset is also suitable for prototyping machine learning NLP models. Part 3) Metrics: discussed and reviewed for a rigorous evaluation with appropriate recommendations for NER/Geoparsing practitioners.
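To make the metrics discussion concrete, the sketch below computes two measures commonly reported in toponym resolution work, accuracy within 161 km (roughly 100 miles) and mean/median error distance, using the haversine great-circle distance. It is a minimal illustration under those assumptions, not the evaluation code recommended by the paper.

```python
import math
import statistics

def haversine_km(p, q):
    """Great-circle distance in kilometres between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371 * math.asin(math.sqrt(a))

def geocoding_metrics(predicted, gold, threshold_km=161):
    """Accuracy@161km plus mean and median error over predicted/gold coordinates."""
    errors = [haversine_km(p, g) for p, g in zip(predicted, gold)]
    return {
        "accuracy@161km": sum(e <= threshold_km for e in errors) / len(errors),
        "mean_error_km": statistics.mean(errors),
        "median_error_km": statistics.median(errors),
    }

# Toy example: two resolved toponyms against gold coordinates.
pred = [(51.5074, -0.1278), (40.7128, -74.0060)]   # London, New York
gold = [(51.5072, -0.1276), (41.8781, -87.6298)]   # London, Chicago
print(geocoding_metrics(pred, gold))
```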