RGCL-WLV at SemEval-2019 Task 12: Toponym Detection

Alistair Plum1 ∗, Tharindu Ranasinghe1 ∗, Pablo Calleja2, Constantin Orăsan1, Ruslan Mitkov1
1 Research Group in Computational Linguistics, University of Wolverhampton, UK
2 Ontology Engineering Group, Universidad Politécnica de Madrid, ES
{a.j.plum, t.d.ranasinghehettiarachchige}@wlv.ac.uk
{c.orasan, r.mitkov}@wlv.ac.uk
pcalleja@fi.upm.es

Abstract

This article describes the system submitted by the RGCL-WLV team to the SemEval 2019 Task 12: Toponym resolution in scientific papers. The system detects toponyms using a bootstrapped machine learning (ML) approach which classifies names identified using gazetteers extracted from the GeoNames geographical database. The paper evaluates the performance of several ML classifiers, as well as how the gazetteers influence the accuracy of the system. Several runs were submitted. The highest precision achieved for one of the submissions was 89%, albeit at a relatively low recall of 49%.

1 Introduction

Resolving a toponym, a proper name that refers to a real, existing location, is a non-trivial task closely related to named entity recognition (NER) (Piskorski and Yangarber, 2013). For this reason, using an NER system to detect and assign location tags might seem a good way forward. However, NER systems may not be able to detect whether a name refers to an actual location or not (e.g., London in London Bus Company). In addition, location names are usually ambiguous, which means it is crucial that they are disambiguated in order to assign the correct coordinates.

While in the past the focus in toponym resolution has been on rule- and gazetteer-driven methods (Speriosu and Baldridge, 2013), more recent approaches also consider ML-based techniques. DeLozier et al. (2015) describe their ML-based approach, which does not require a gazetteer.
The approach calculates the geographical profile of each word, which is refined using Wikipedia statistics and then fed into an ML classifier. Speriosu and Baldridge (2013) also make use of an ML classifier which is text-driven. Geotags of documents are used to automatically generate a training set.

∗ The first two authors contributed equally to the paper.

Although the two previous approaches used two standard corpora for toponym resolution, consisting of news articles and 19th-century civil war texts, toponym resolution has a wide range of applications. For instance, Ireson and Ciravegna (2010) explore its use in social media, while Lieberman and Samet (2012) attempt to analyse news streams. Spitz et al. (2016) have also used an encyclopaedic dataset, compiled from Wikipedia, WordNet and GeoNames.

The focus of the SemEval 2019 Task 12 was toponym resolution in journal articles from the biomedical domain (Weissenbacher et al., 2019). The articles that had to be processed were case studies on the epidemiology of viruses, meaning that the developed systems can potentially be used to track viruses. The task was composed of three sub-tasks: (1) toponym detection, followed by (2) disambiguation and the assignment of the appropriate coordinates, as well as (3) the development of an end-to-end system.

This paper presents our participation in the first sub-task. Our system first performs a gazetteer look-up for candidate locations, and then uses machine learning (ML) to classify whether or not each candidate represents an actual location. The gazetteers are extracted from the online geographical database GeoNames, whilst the classification is carried out by feeding the context of potential locations into an ML classifier.

The rest of the paper presents the system developed (Section 2), followed by its evaluation (Section 3). The paper finishes with conclusions.
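The look-up step outlined in this introduction can be illustrated with a minimal sketch. It is not the submitted system, which relies on GATE: the toy gazetteer, the function name `find_candidates` and the single-token matching are assumptions made for illustration only.

```python
import re

# Hypothetical toy gazetteer; the paper's gazetteers were extracted from GeoNames.
GAZETTEER = {"london", "kentucky", "madrid"}


def find_candidates(text, gazetteer):
    """Return (start, end, surface) for every single token found in the gazetteer."""
    candidates = []
    for match in re.finditer(r"[A-Za-z]+", text):
        if match.group().lower() in gazetteer:
            candidates.append((match.start(), match.end(), match.group()))
    return candidates


text = "The London Bus Company sampled birds in Kentucky."
candidates = find_candidates(text, GAZETTEER)
for start, end, surface in candidates:
    print(start, end, surface)
```

Both London and Kentucky are returned as candidates, although London is part of a company name here. Filtering out such false candidates is the job of the classification step. Multi-word names would need an n-gram look-up, which this sketch leaves out.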
Proceedings of the 13th International Workshop on Semantic Evaluation (SemEval-2019), pages 1297–1301, Minneapolis, Minnesota, USA, June 6–7, 2019. ©2019 Association for Computational Linguistics

2 System Description

The system developed for this task was designed as a pipeline consisting of three stages: text cleaning, text processing and identification of locations. The rest of this section presents each of these stages. The system has been made available online.1

1 https://github.com/TharinduDR/SemEval-2019-Task12-Toponym-Resolution-in-Scientific-Papers

2.1 Text Cleaning

The first processing stage identifies parts of the text which do not contain any locations that have to be identified according to the task guidelines. These parts include the references section of each text and the information about the authors of the journal articles. In addition, the texts also contain genome sequences and abbreviations of chemicals, which resemble abbreviations for locations. Regular expressions were used to replace these text sequences with spaces. We chose to replace the sequences rather than remove them in order to keep the correct offsets of entities, which are crucial in the evaluation process.

Not all the genome sequences were correctly identified, due to the variability in how they are represented. As a result, some of these sequences were not replaced by spaces, which introduced noise into our processing pipeline. In addition, in some cases the regular expressions for excluding the references section failed to identify the boundaries of this section correctly. Because the faulty match would have blanked out large parts of these texts, the references sections of three texts were left in place.

2.2 Text Processing

Once cleaned, the texts were processed using components from the ANNIE pipeline within GATE (Cunningham et al., 2002, 2011). The ANNIE pipeline was designed for named entity recognition tasks, but for our purpose we used only the tokeniser and gazetteer look-up components.

We produced three different gazetteers. The first one (GN all) contained all locations from the GeoNames geographical database. The second gazetteer (GN 5000+) contained a list of cities from GeoNames with a population of over 5,000. The third gazetteer (GN custom) contained a list of countries, capitals, and cities with a population larger than 15,000 people, extracted from GeoNames. The default list of regions included with ANNIE was also used. A list of US regions as well as their abbreviated forms was added manually.

The output of this module was a list of annotations, including token boundaries and tokens matching the gazetteer entries. This information is then used by the ML classifier in the next step.

2.3 Identification of Locations

Once the texts are processed, the next task is to detect whether a candidate location really refers to a location. In addition to cases of common nouns which may also be used as a location, there are also cases where location names are used as adjectives. For example, in the sentence "Other mutations observed in the HA gene of the Kentucky isolates have also been reported by others", the gazetteer identifies Kentucky as a location even though it actually refers to a virus entity. According to the guidelines, this should not be annotated as a location, making the task quite difficult.

Analysis of examples from the training data indicated that the context of candidate locations can be used to assess whether the detected word is an actual location or not. For this reason, we trained a machine learning model which uses the context of candidates to distinguish between real locations and locations falsely identified by the gazetteer look-up component. For the experiments presented here, we used a window of two words before and two words after the candidate location to obtain its context. More precisely, if the word detected by the gazetteer is $\omega_i$, the context $c_i$ was defined as

$$c_i = \omega_{i-2} + \omega_{i-1} + \omega_i + \omega_{i+1} + \omega_{i+2} \qquad (1)$$

The annotated gold standard provided by the task organisers was used to create a training set which contained both positive and negative instances. Two machine learning approaches were considered for this word window classification task. The first approach was to use traditional machine learning models, while the other approach was to use neural network models.

2.3.1 Traditional ML Approach

There are multiple ways in which words can be translated into a numerical representation before they are used as features for a machine learning model. The commonly used representations convert sequences of words to a bag of words or to tf-idf vectors. However, since their introduction, word embedding models (Mikolov et al., 2013) have been widely used as features for text classification tasks and have proven successful. In addition, they have the capability to represent the context better than tf-idf vectors. For this reason, we used the 300-dimensional word2vec embedding model trained on the Google News corpus.

To be input into a traditional machine learning model, each word window had to be represented by a feature vector with the same fixed length across all training and testing examples. There are many ways to represent a text window with word embeddings. Simply averaging the word embeddings of all words excluding stop words in a text has proven to be a strong baseline or feature across a multitude of tasks, such as short text similarity tasks (Kenter et al., 2016). Following that, the mean of the word vectors in each word window was calculated to represent the whole window with a single vector, which is 300-dimensional in this scenario.
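As a minimal sketch of the window extraction and averaging described above, the following plain-Python example builds the two-word context around a candidate and averages the available vectors. The helper names, the 3-dimensional toy embeddings, the zero-vector fallback and the choice to skip out-of-vocabulary words (with no stop-word filtering) are assumptions for illustration; the submitted system used 300-dimensional word2vec vectors.

```python
def context_window(tokens, i, size=2):
    """Tokens from position i - size to i + size, clipped at the text boundaries."""
    return tokens[max(0, i - size): i + size + 1]


def mean_vector(words, embeddings, dim):
    """Average the vectors of the words that have an embedding."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return [0.0] * dim  # assumption: fall back to a zero vector
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical 3-dimensional embeddings standing in for 300-d word2vec vectors.
toy_embeddings = {
    "of": [1.0, 0.0, 0.0],
    "the": [0.0, 1.0, 0.0],
    "Kentucky": [0.0, 0.0, 1.0],
    "isolates": [1.0, 1.0, 1.0],
}

tokens = "HA gene of the Kentucky isolates have also".split()
i = tokens.index("Kentucky")
window = context_window(tokens, i)
features = mean_vector(window, toy_embeddings, dim=3)
print(window)
print(features)
```

Whatever the number of words in the window that have an embedding, the result has the same length as a single word vector, which is what allows it to be passed to a standard classifier.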
The resulting vector was used as the feature vector for several machine learning classifiers, such as Support Vector Machines (Cortes and Vapnik, 1995), Random Forest (Breiman, 2001) and XGBoost (Chen and Guestrin, 2015). The parameters were tuned using 10-fold cross validation. The implementation used scikit-learn with Python 3.6.

2.3.2 Neural Network Architectures

The representation described above performs poorly on classification tasks such as sentiment analysis, because it loses the word order, just as the standard bag-of-words model does, and fails to recognise many sophisticated linguistic phenomena (Le and Mikolov, 2014). For this reason, the second approach relies on neural networks which receive as input the embedding vectors corresponding to the context, without modifying them. Keras was used to implement these neural architectures.

Two neural architectures were developed. The first one was adopted from text classification research (Coates and Bollegala, 2018). As depicted in Figure 1, it contains variants of Long Short-Term Memories (LSTMs) with self attention, followed by average pooling and max pooling layers. It also has a dropout layer (Srivastava et al., 2014) between two dense layers after the concatenation layer. The model was trained with a cyclical learning rate (Smith, 2017).

Figure 1: LSTM variant with self attention

The pooling layers in the first architecture can be regarded as a very primitive type of routing mechanism. Capsule networks (Sabour et al., 2017) have been proposed as a solution, so a capsule network with a bi-directional GRU was also tried on this data set. The complete architecture is shown in Figure 2. There is a spatial dropout layer (Tompson et al., 2015) between the embedding layer and the bi-directional GRU layer, and a dropout layer (Srivastava et al., 2014) between two dense layers after the capsule layer.

Figure 2: Capsule net architecture
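The capsule layer itself is not spelled out here. As a hedged illustration of one of its building blocks, the sketch below implements the squashing non-linearity defined by Sabour et al. (2017), which shrinks a capsule's output vector to a length between 0 and 1 while keeping its direction. It is written in plain Python rather than Keras; the function name `squash` and the small constant `eps` are assumptions, and the layer used in the submitted system may differ in detail.

```python
import math


def squash(s, eps=1e-9):
    """Squashing function: scale s by |s|^2 / (1 + |s|^2) along its unit vector."""
    squared_norm = sum(x * x for x in s)
    norm = math.sqrt(squared_norm)
    # eps avoids a division by zero for the all-zero vector (an assumption here).
    scale = squared_norm / (1.0 + squared_norm) / (norm + eps)
    return [scale * x for x in s]


v = squash([3.0, 4.0])
length = math.sqrt(sum(x * x for x in v))
print(v, length)
```

For the input [3, 4], whose length is 5, the output keeps the same direction and has a length of roughly 25/26. In a capsule network this length is read as the probability that the entity represented by the capsule is present.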
The results and evaluation criteria of both the traditional approaches and the neural network approaches are reported in Section 3.

3 Results

3.1 Gazetteers

As described in the previous section, three different gazetteers were tested using the development and training sets. As the machine learning component of the system would make the final prediction, it was important to ensure the maximum number of candidate locations. Therefore, it was vital to ensure the highest possible recall, while achieving acceptable precision results.

Gazetteer    Precision  Recall  F-Score
GN all       0.2359     0.7699  0.3612
GN 5000+     0.3584     0.7563  0.4863
GN custom    0.3546     0.7678  0.4851
Table 1: Gazetteer evaluation results

Table 1 shows the precision, recall and F-score values for each of the gazetteers described in Section 2, run on the training set. Row one (GN all) has the highest recall but low precision, whereas row two (GN 5000+) has a higher precision but lower recall. Row three shows the results for the final gazetteer (GN custom). It combines a recall close to that of GN all with a precision close to that of GN 5000+, giving the best balance between precision and recall, and was selected for use in the final system.

3.2 Identification of Locations

Locations in the training set were matched using the gazetteers and then extracted together with their respective word window, in order to compile a separate data set. This data was split into a training set and an evaluation set for the machine learning classifiers. The training set consisted of 80% of this data set; the evaluation set, which carries the gold standard annotations from the original training set, consisted of the remaining 20%.

The accuracy of each machine learning model on the evaluation set is shown in Table 2. A prediction was counted as accurate if the label assigned by the machine learning model matched the gold standard, whether the candidate was a real location or a falsely identified one. All other cases were counted as inaccurate.

ML Model                       Accuracy
Zero-R                         71.95%
Random Forest                  84.21%
SVM                            83.44%
XGBoost                        85.80%
Bi-LSTM/Bi-GRU + Max Pooling   87.75%
Bi-GRU + Capsule               88.73%
Table 2: ML models evaluation results

Our baseline, a zero-R classifier predicting every instance as a falsely identified location, had an accuracy of 71.95%. All of our machine learning models outperformed the baseline by more than 11 percentage points, even though the data set is imbalanced. The capsule net architecture, which provided the best performance at an accuracy of 88.73%, was selected for use in the final system.

3.3 Submission Results

After we had determined the best components for the system, the GN custom gazetteer and the Bi-GRU + Capsule architecture, the whole system was evaluated on the test set. The submission results are presented in four categories, determined by the organisers. Table 3 shows the results for each.

Test            Precision  Recall  F-Score
Strict macro    0.8280     0.4746  0.6034
Strict micro    0.8168     0.3396  0.4798
Overlap macro   0.8980     0.4969  0.6398
Overlap micro   0.8936     0.3654  0.5187
Table 3: Final submission results

Overall, our system achieves the highest values in the overlap macro class, with the lowest in the strict micro class. The system tends to achieve acceptable precision scores, but at low recall values. This trend can most probably be explained by the fact that many candidate locations are not detected by the gazetteers. Together with the machine learning part discarding some proper locations, this has a dramatic effect on the recall.

4 Conclusion and Future Work

This paper presented the system we submitted to the SemEval 2019 Task 12: Toponym resolution in scientific papers. Evaluation of the system has shown that a pipeline that combines traditional string matching and advanced machine learning can offer promising results. It has demonstrated that a larger gazetteer does not necessarily have a positive effect on performance.
It has also made clear that a higher recall value for the gazetteer look-up component could provide a much better basis on which to train machine learning approaches. On the machine learning side, we have demonstrated that employing word embeddings together with state-of-the-art algorithms can be a viable way of classifying toponyms.

Due to time constraints, a large number of parameter settings were not tested, and the look-up algorithm and underlying gazetteers were not optimised. In future research we hope to address these problems, so that a better basis for training machine learning architectures can be achieved, and to explore more deep learning architectures.

References

Leo Breiman. 2001. Random forests. Machine Learning, 45:5–32.
Tianqi Chen and Carlos Guestrin. 2015. XGBoost: Reliable large-scale tree boosting system.
Joshua Coates and Danushka Bollegala. 2018. Frustratingly easy meta-embedding - computing meta-embeddings by averaging source word embeddings. In NAACL-HLT.
Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning, 20:273–297.
Hamish Cunningham, Diana Maynard, Kalina Bontcheva, and Valentin Tablan. 2002. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02).
Hamish Cunningham, Diana Maynard, Kalina Bontcheva, Valentin Tablan, Niraj Aswani, Ian Roberts, Genevieve Gorrell, Adam Funk, Angus Roberts, Danica Damljanovic, Thomas Heitz, Mark A. Greenwood, Horacio Saggion, Johann Petrak, Yaoyong Li, and Wim Peters. 2011. Text Processing with GATE (Version 6).
Grant DeLozier, Jason Baldridge, and Loretta London. 2015. Gazetteer-independent toponym resolution using geographic word profiles. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
Neil Ireson and Fabio Ciravegna. 2010. Toponym resolution in social media. In The Semantic Web – ISWC 2010, pages 370–385, Berlin, Heidelberg. Springer Berlin Heidelberg.
Tom Kenter, Alexey Borisov, and Maarten de Rijke. 2016. Siamese CBOW: Optimizing word embeddings for sentence representations. CoRR, abs/1606.04640.
Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In ICML.
Michael D. Lieberman and Hanan Samet. 2012. Adaptive context features for toponym resolution in streaming news. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '12, pages 731–740, New York, NY, USA. ACM.
Tomas Mikolov, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.
Jakub Piskorski and Roman Yangarber. 2013. Information extraction: Past, present and future. In Multi-source, multilingual information extraction and summarization, pages 23–49. Springer.
Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. 2017. Dynamic routing between capsules. CoRR, abs/1710.09829.
Leslie N. Smith. 2017. Cyclical learning rates for training neural networks. 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 464–472.
Michael Speriosu and Jason Baldridge. 2013. Text-driven toponym resolution using indirect supervision. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1466–1476.
Andreas Spitz, Johanna Geiß, and Michael Gertz. 2016. So far away and yet so close: Augmenting toponym disambiguation and similarity with text-based networks. In Proceedings of the Third International ACM SIGMOD Workshop on Managing and Mining Enriched Geo-Spatial Data, GeoRich '16, pages 2:1–2:6, New York, NY, USA. ACM.
Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958.
Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler. 2015. Efficient object localization using convolutional networks. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 648–656.
Davy Weissenbacher, Arjun Magge, Karen O'Connor, Matthew Scotch, and Graciela Gonzalez. 2019. SemEval-2019 Task 12: Toponym resolution in scientific papers. In Proceedings of The 13th International Workshop on Semantic Evaluation. Association for Computational Linguistics.
