1 Introduction
Location matters, and not just for real estate [
179]. With the rapid development of the Global Navigation Satellite System (GNSS), sensor-rich smart devices (e.g., with inertial sensors, Wi-Fi modules, and cameras), and ubiquitous communication infrastructure (e.g., cellular and 4G networks and Wi-Fi access points), our ability to obtain location information about moving objects and events in both indoor and outdoor spaces has improved dramatically [
160]. This exponential growth in location-based capabilities has significantly enhanced our understanding of geospatial processes [
179] and fueled the development of location-based services (LBS) with wide-ranging applications in various domains, such as business, entertainment, and crisis management [
87]. Apart from sensor equipment, natural language texts, such as social media posts, web pages, and news stories, serve as a significant source of geospatial information through location references. These location references encompass both simple place names, also known as toponyms, as well as more complex location descriptions that incorporate additional spatial modifiers such as direction, distance, and spatial relationships [
177]. Geoparsing, an ongoing research problem studied extensively over the past two decades [
9,
15,
82,
92,
163], refers to the process of extracting location information from texts. It involves two crucial steps: (1) recognizing location references from texts, also known as toponym recognition or location reference recognition, and (2) identifying the geospatial representations of the recognized location references, commonly referred to as toponym resolution or geocoding. Figure
1 illustrates the workflow of geoparsing.
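The two steps above can be sketched in a few lines of Python. This is only a minimal illustration under strong assumptions, not any of the approaches reviewed in this article: the gazetteer `TOY_GAZETTEER`, its three entries, and the function names are hypothetical, recognition is reduced to case-insensitive exact matching, and resolution ignores ambiguity (e.g., several places sharing one name), which real systems must handle.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

# Hypothetical toy gazetteer (an assumption for illustration only):
# lower-cased place name -> (latitude, longitude).
TOY_GAZETTEER = {
    "new york": (40.7128, -74.0060),
    "houston": (29.7604, -95.3698),
    "paris": (48.8566, 2.3522),
}


class LocationReference(NamedTuple):
    text: str                                  # mention as it appears in the input
    start: int                                 # character offset (inclusive)
    end: int                                   # character offset (exclusive)
    coordinates: Optional[Tuple[float, float]]  # None if unresolved


def recognize_location_references(text: str) -> List[Tuple[int, int]]:
    """Step 1: return character spans that match a gazetteer entry."""
    spans = []
    for name in TOY_GAZETTEER:
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.append((match.start(), match.end()))
    return sorted(spans)


def resolve_location_reference(mention: str) -> Optional[Tuple[float, float]]:
    """Step 2: map a recognized mention to a geospatial representation."""
    return TOY_GAZETTEER.get(mention.lower())


def geoparse(text: str) -> List[LocationReference]:
    """Run recognition, then resolution, on each recognized span."""
    results = []
    for start, end in recognize_location_references(text):
        mention = text[start:end]
        results.append(
            LocationReference(mention, start, end, resolve_location_reference(mention))
        )
    return results


if __name__ == "__main__":
    for ref in geoparse("Flooding reported in Houston; volunteers arriving from New York."):
        print(ref)
```

Keeping the two steps as separate functions mirrors the structure of the workflow: the recognition step can be swapped for a different technique without changing the resolution step.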
Geoparsing has traditionally been applied to formal texts, such as web pages, news, scientific articles, travel blogs, and historical archives [
15,
179]. However, the drastically increased importance of social media data (SMD) in various domains such as social science, political science, policy-making, and humanitarian relief [
18,
22,
38,
76,
171] has motivated efforts to extend geoparsing to informal texts [
179]. According to Statista
1, the number of worldwide social network users will reach 4.4 billion by 2025. On average, 500 million tweets
2 and 4.75 billion Facebook items
3 are shared each day. Formal texts normally do not have location-related metadata, whereas informal texts, such as tweets, can be geotagged, i.e., a user of X (the platform formerly known as Twitter) can select a location and attach that location to the posted message. However, geotagged tweets are rare and, according to Cheng et al. [
33], Morstatter et al. [
135], and Kumar et al. [
101], only 0.42%, 3.17%, and 7.90% of the total number of tweets contain geotags, respectively. In addition, Twitter removed the precise geotagging feature in 2019, showing only a rough location, e.g., the bounding box of a tagged place rather than a pair of latitude and longitude coordinates. This change could further decrease the number of geotagged tweets [
86]. In a nutshell, extracting location information from unstructured texts is often necessary. Notably, informal texts, such as tweets, are short, have few or no formatting or grammatical requirements, and can contain uncommon abbreviations, slang, and misspellings, which pose additional challenges for geoparsing [
180].
While there exist many studies on geoparsing [
68,
146], we identify two gaps in the literature that motivate this current review article. First, the many possible applications of geoparsing are scattered in individual papers [
1,
15,
58,
63] or are only partially reviewed [
67,
83], and a systematic and comprehensive summary of these applications is lacking. Consequently, it is difficult for researchers who are new to geoparsing to get a quick overview of these applications. Second, existing review papers on geoparsing, such as [
68,
125,
134,
181], focused on the entire workflow of geoparsing (i.e., both of the two steps) rather than location reference recognition alone (i.e., the first step only). While providing more comprehensive coverage on the topic of geoparsing, existing efforts reviewed only some approaches for the step of location reference recognition. In recent years, many new approaches for location reference recognition have been developed, such as Flair NER [
4], NeuroTPR [
182], nLORE [
53], and GazPNE2 [
82]. Given the high importance of location reference recognition in geoparsing (i.e., only those references that are correctly recognized can be geo-located), it is necessary to have a review that specifically focuses on the possible and recent approaches for location reference recognition.
This work aims at filling the two research gaps discussed above. First, we summarize seven typical application domains of geoparsing, which are geographical information retrieval (GIR) [
57,
146], disaster management [
111,
162], disease surveillance [
64,
159,
172], traffic management [
77,
115,
129], spatial humanities [
63,
154], tourism management [
27,
36,
37], and crime management [
17,
42,
178]. Second, we review existing approaches for location reference recognition, grouping them into four categories: rule-based, gazetteer matching–based, statistical learning–based, and hybrid approaches. Because many existing approaches have not been compared on the same datasets, we also conduct experiments that evaluate the 27 reviewed approaches on 26 public datasets. We analyze their performance on both formal and informal texts, their effectiveness across different types of places (e.g., administrative units and traffic ways), and their computational efficiency.
The remainder of this article is structured as follows. In Section
2, we summarize seven typical application domains of geoparsing. In Section
3, we review existing approaches for location reference recognition. We evaluate existing approaches on the same public datasets in Section
4. Finally, we conclude the article in Section
5 and discuss some potential future directions.
2 Seven Application Domains of Geoparsing
Geoparsing offers numerous potential applications. In this section, we provide a concise overview of seven prominent application domains frequently explored in the literature. Figure
2 provides an illustration of these domains.
GIR: One of the primary applications of geoparsing is geographic information retrieval. Historically, documents have been indexed by subject, author, title, and type. However, a large and diverse group of information system users (e.g., readers, natural resources managers, scientists, historians, journalists, and tourists) desire geographically oriented access to document collections, such as retrieving content of interest about specific geographic locations [
26,
57,
108,
130,
145,
173,
191]. For instance, resources in digital libraries can be indexed by locations contained in descriptive metadata records associated with the resources, thereby improving users’ experience in searching for their needed resources [
57]. Similarly, people search for web pages containing useful information about everyday tasks, such as local merchants, services, and news [
26]. The public can consume up-to-date information related to COVID-19 (e.g., disease prevention, disease transmission, and death reports) on Twitter by location [
130].
Disaster management: News stories and SMD contain an enormous amount of historical and real-time disaster information. Location-enabled SMD can help map situational information in a timely manner, such as rescue requests [
164,
198], resource needs (e.g., food, clothing, water, medical treatment, and shelter) and availability [
21,
50], and facility status (e.g., building collapse, road closure, pipe breakage, and power outage) [
23,
52,
121,
157] in the aftermath of disasters. With a crisis map, first responders can track the unfolding situation and identify stricken locations that require prioritized intervention [
19] and optimize resource allocation in real time [
164], government agencies can assess disaster damage more quickly [
192], and the public can search for the locations where they can obtain needed resources. By extracting spatiotemporal, environmental, and other information about disaster events from news stories, flood-prone areas can be identified [
194], the responsibility of atmospheric phenomena for floods can be understood [
20], the spatial and temporal distributions of natural disasters during a long period can be analyzed [
114], and the evolution of disasters (e.g., the phases of preparedness, impact, response, and recovery) can be tracked [
88,
183,
184].
Disease surveillance: Scientific articles, historical archives, news reports, and social media contain detailed information about disease events, such as where the disease was first reported and how it spread spatiotemporally. Mining geographic locations and other related information of disease events can help track diseases [
34,
64,
136,
140,
159,
172], perform early warning and quick response [
97], and understand the mechanisms underlying the emergence of diseases [
12,
93]. For example, geoparsing historical archives (e.g., the annual US Patent Office Reports 1840–1850 and Registrar General’s Reports) can help track the spread of the potato disease ‘late blight’ in the 19th-century United States [
172] and understand the relationship between cholera-related disease and place names during Victorian times [
136]. Scientific articles were geoparsed to analyze the demographic, environmental, and biological correlation of the occurrence of emerging infectious diseases at a global scale [
12,
93]. Social media can also reflect the movement of the public and their feelings during pandemics through geotags or mentioned locations in texts. Location-enabled tweets were applied to analyze the mental health status of the public after the occurrence of COVID-19 [
80,
197], to track and visualize the spread and diffusion of COVID-19 [
16], and to reveal human mobility patterns [
89,
91].
Traffic management: Twitter users report near-real-time information about traffic events (e.g., crashes and congestion). Detecting traffic events, their precise locations, and other related information from tweets is important for an effective transportation management system [
3,
13,
61,
71,
161,
168]. The detected traffic events can also support urban policy-making [
40], such as helping drivers avoid risk zones and choose the fastest and safest routes [
10], helping the transportation management sector reduce fatalities and restore traffic flow as quickly as possible [
10], predicting future traffic jams [
11], and improving road safety by recognizing high-risk areas [
129]. In this way, Twitter users acting as social sensors can cost-effectively complement existing physical transport infrastructure (e.g., video cameras and loop detectors), which is especially important for developing countries where resources are limited.
Spatial humanities: The term ‘spatial turn’ describes a general movement, observed since the end of the 1990s, that emphasizes the reinsertion of place and space in the humanities [
185]. Digitizing and geoparsing large historical textual collections, such as books, reports, and novels, opens new avenues for research in the humanities (e.g., archaeology, history, and literature) [
49,
62,
63,
69,
78,
131,
136,
172], such as to understand the historical geographies of 19th-century Britain and its relationships with the wider world [
62], to identify the significance of specific commodities in relation to particular places and time [
78], to analyze a correspondence between 18th-century aesthetic theory and the use of the terms ‘beautiful’, ‘picturesque’, ‘sublime’, and ‘majestic’ in contemporaneous and later accounts of the Lakes region [
49], and to reveal the spatial structure of a narrative in fictional novels [
131].
Tourism management: According to Statista, among all active blogs, travel ranks among the top five topics shared by bloggers
4. Travel blogs contain a wealth of information about visited places organized as bloggers’ experiences and insights as well as their perceptions of these places [
75]. These narratives reflect the blogger’s behavior and interaction with places as well as the relationships among the places. Geoparsing travel blogs is helpful for understanding places [
74], for example by identifying their features and related activities, and can help describe a place with tourism attributes to support tour planning [
74,
75,
99,
196]. Applications include helping travelers choose preferred places and visit them in an appropriate order at a proper time and supporting wayfinding given the spatial relation of places [
75].
Crime management: Many countries do not make crime data available to their citizens [
17] or provide only coarse-grained details
5, such as the total number of thefts in a district or a province. According to the Crime Information Need Survey [
17], around 78.3% of respondents in Indonesia agreed that crime information should be available to the public. The needed information includes crime type, perpetrator, victim, time, and, very importantly, location. Meanwhile, crime-related information is often scattered across news and social media. Mining and gathering crime-related information from these text-based sources can be useful for informing the public and may even help predict and prevent some crimes [
14,
41,
42,
150,
156,
166]. In particular, geoparsing can help extract location information of crimes, which can help residents to choose places to live and help travelers to avoid certain unsafe places [
17].
Different applications place distinct requirements on approaches for location reference recognition. For example, emergency response applications primarily rely on analyzing informal texts such as tweets, whereas scientific articles serve as the main source for understanding the mechanisms underlying disease emergence. GIR needs only coarse-grained geospatial information, such as a city, whereas traffic management requires the fine-grained location (e.g., a street) of traffic events. Geoparsing historical documents that contain billions of words requires a fast processing workflow. Therefore, to guide the selection of suitable approaches based on application needs, it is necessary to examine the characteristics of existing approaches, which we do in Section
4.