research-article

Open access

On the Opportunities and Challenges of Foundation Models for GeoAI (Vision Paper)

Authors:

Jin Sun,

Rui Zhu,

Ni LaoAuthors Info & Claims

ACM Transactions on Spatial Algorithms and Systems, Volume 10, Issue 2

Article No.: 11, Pages 1 - 46

https://doi.org/10.1145/3653070

Published: 01 July 2024 Publication History

PDF eReader

Abstract

Large pre-trained models, also known as foundation models (FMs), are trained in a task-agnostic manner on large-scale data and can be adapted to a wide range of downstream tasks by fine-tuning, few-shot, or even zero-shot learning. Despite their successes in language and vision tasks, we have not yet seen an attempt to develop foundation models for geospatial artificial intelligence (GeoAI). In this work, we explore the promises and challenges of developing multimodal foundation models for GeoAI. We first investigate the potential of many existing FMs by testing their performances on seven tasks across multiple geospatial domains, including Geospatial Semantics, Health Geography, Urban Geography, and Remote Sensing. Our results indicate that on several geospatial tasks that only involve text modality, such as toponym recognition, location description recognition, and US state-level/county-level dementia time series forecasting, the task-agnostic large learning models (LLMs) can outperform task-specific fully supervised models in a zero-shot or few-shot learning setting. However, on other geospatial tasks, especially tasks that involve multiple data modalities (e.g., POI-based urban function classification, street view image–based urban noise intensity classification, and remote sensing image scene classification), existing FMs still underperform task-specific models. Based on these observations, we propose that one of the major challenges of developing an FM for GeoAI is to address the multimodal nature of geospatial tasks. After discussing the distinct challenges of each geospatial data modality, we suggest the possibility of a multimodal FM that can reason over various types of geospatial data through geospatial alignments. We conclude this article by discussing the unique risks and challenges to developing such a model for GeoAI.

1 Introduction

Recent trends in machine learning (ML) and artificial intelligence (AI) speak to the unbridled powers of data and computing. Extremely large models trained on Internet-scale datasets have achieved state-of-the-art (SOTA) performance on a diverse range of learning tasks. Their unprecedented success has spurred a paradigm shift in the way that modern-day ML models are trained. Rather than learning task-specific models from scratch [45, 95, 183], such pre-trained models (termed foundation models (FMs) [14]) are adapted via fine-tuning or few-shot/zero-shot learning strategies and subsequently deployed on a wide range of domains [16, 150]. Such FMs allow for the transfer and sharing of knowledge across domains and mitigate the need for task-specific training data. Examples of foundation models are (1) large language models ( \(\boldsymbol {LLM}\) s) such as PaLM [188], LLAMA [178], GPT-3 [16], InstrucGPT [143], and ChatGPT [141]; (2) large vision foundation models such as Imagen [167], Stable Diffusion [164], DALL \(\cdot\) E2 [154], and SAM [88]; (3) large multimodal foundation models¹ such as CLIP [150], OpenCLIP [68], BLIP [102], OpenFlamingo [11], KOSMOS-1 [64], and GPT-4 [142]; and (4) large reinforcement learning foundation models such as Gato [161].

Despite their successes, there exists very little work exploring the development of an analogous foundational model for geospatial artificial intelligence (GeoAI), which lies at the intersection of geospatial scientific discoveries and AI technologies [43, 69, 119]. The key technical challenge here is the inherently multimodal nature of GeoAI. The core data modalities in GeoAI include text, images (e.g., remote sensing or street view images), trajectory data, knowledge graphs, and geospatial vector data (e.g., map layers from OpenStreetMap), all of which contain important geospatial information (e.g., geometric and semantic information). Each modality exhibits special structures that require their own unique representation. While existing foundation models contain modules that can readily process some of these data modalities, such as text and images, there are currently no foundation models capable of effectively managing many other ‘distinctive’ data modalities essential for GeoAI tasks, such as movement trajectory data and other geospatial vector data. Moreover, effectively combining all these representations from different data modalities with appropriate inductive biases in a single model requires careful design. The multimodal nature of GeoAI hinders a straightforward application of existing pre-trained FMs across all GeoAI tasks.

In this article, we lay the groundwork for developing FMs for GeoAI [117, 118, 194]. We begin by providing a brief overview of existing FMs in Section 2. In Section 3, we investigate the potential of existing FMs for GeoAI by systematically comparing the performances of several popular FMs with many state-of-the-art fully supervised task-specific machine learning (ML) or deep learning (DL) models on various tasks from different geospatial domains: (1) Geospatial Semantics: toponym recognition and location description recognition task; (2) Health Geography: US state-level and county-level dementia death count time series forecasting task; (3) Urban Geography: Point-of-interest (POI)–based urban function classification task and street-level image-based noise intensity classification task; (4) Remote Sensing: Remote sensing (RS) image scene classification task. The advantages and problems of FMs on different geospatial tasks are discussed accordingly. In Section 4, we detail the challenges involved in developing FMs for GeoAI. Creating one single FM for all GeoAI data modalities can be a daunting task. To address this, we start this discussion by examining each data modality used in GeoAI tasks. Then, we propose our vision for a novel multimodal FM framework for GeoAI that tackles the aforementioned challenges. We highlight some potential risks and challenges that should be considered when developing such general-purpose models for GeoAI in Section 5 and conclude this article in Section 6.

Our contributions can be summarized as follows:

To the best of our knowledge, this is the first work that systematically examines the effectiveness and problems of various existing cutting-edge FMs on different geospatial tasks across multiple geoscience domains.² We establish various FM baselines on seven geospatial tasks for future GeoAGI research.

We discuss the challenges of developing a multimodal FM for GeoAI and provide a promising framework to achieve this goal.

We discuss the risks and challenges that need to be taken into account during the development and evaluation process of the multimodal geo-foundation model.

2 Related Work

2.1 Language Foundation Model

In less than a decade, computational natural language capabilities have been completely revolutionized [16, 84, 146, 153] by LLMs. Language modeling [77] is the simple task of predicting the next token in a sequence given previous tokens,³ and it corresponds to a self-supervised objective in the sense that no human labeling is needed besides a natural text corpus. When applied to vast corpora such as documents of diverse topics from the Internet, LLMs gain significant language understanding and generation capabilities. Various transfer-learning and scaling studies [54, 57, 81] have demonstrated an almost linear relationship between downstream task performance and the log sizes of self-supervised models and data. Combined with the ever-increasing availability of data and computing, language modeling has become a reliable approach for developing increasingly powerful models.

Representative examples of these LLMs are the OpenAI GPTs [16, 142, 143, 151, 152]. By pretraining from vast amounts of Web data, the GPT models gain knowledge of almost all domains on the Web, which can be leveraged to solve problems of diverse verticals [16]. The interfaces to access such knowledge have become increasingly simple and intuitive – ranging from supervised fine-tuning with labeled data [151, 152], to few-shot learning [16] and instructions [143], to conversation [141] and multimodality [142]. In this study, we provide a comprehensive analysis of the potentials and limitations of GPT and other LLMs when applied to different geospatial domains.

2.2 Vision Foundation Model

Computer vision has long been dominated by task-specific models: for example, YOLO [160] for object detection, Detectron [191] for instance segmentation, and SRGAN [97] for image super-resolution. ResNet [51] trained on ImageNet [33] has been used as the backbone feature extractor for many such tasks. It can be seen as the early form of a vision FM.

Inspired by the great success of language FMs, the computer vision community builds large-scale vision FMs that can be adapted to any vision task. The most direct adoption of the idea from language models in computer vision is the image generation models. Since the dominance of Generative Adversarial Networks (GANs) [44, 82], the quality of image generation models has seen a major breakthrough via the development of diffusion-based models [55]. Imagen [167] builds on large transformer-based language models to understand text prompts and generates high-fidelity images using diffusion models. DALL-E \(\cdot\) 2 [154] trains a diffusion decoder to invert an image encoder from visual-language models such as CLIP. After pre-training, it is able to generate images of various styles and characteristics. Stable Diffusion [164] uses a Variational Autoencoder (VAE) [87] to convert raw images from pixel space to latent space where the diffusion processes are more manageable and stable. It has shown great flexibility in conditioning over text, pose, edge maps, semantic maps, and scene depths [211]. GigaGAN [79], on the other hand, is a recent attempt of scaling up GAN models.

Vision-Transformer (ViT) [34] is a widely used architecture in vision FMs. Large-scale ViT has been developed to scale up the model [206]. The Swin Transformer [111] model is designed to handle the unique challenges of adapting regular transformer models with various spatial resolutions in images. Other large-scale non-transformer models are also developed to reach the same level of performance: ConvNext [112] is the “modernized” version of convolutional neural networks (CNNs) that has a large number of parameters and shows a similar level of performance as Swin Transformers. MLP-mixer [177] is an architecture that utilizes only multi-layer perceptrons (MLPs) on image data. It shows competitive scores on image classification datasets.

Recently, the Segment Anything Model (SAM) [88] was proposed by Meta AI as a visual FM which was pre-trained on a large segmentation dataset with over 1 billion segmentation masks and can be transferred to new image distributions and tasks in a zero-shot setting. That is, SAM can be adapted to new tasks without any new labeled examples.

2.3 Multimodal Foundation Model

Developing AI models that are capable of performing multimodal reasoning and understanding on complex data is a promising idea. Humans naturally perform multimodal reasoning in daily life [145]. For example, when someone is thinking about the concept of ‘dog’, the person will not only think about the English word and its meaning but also a visual image and a sound associated with it. In the context of geospatial tasks, multimodal data are ubiquitous. For example, different geospatial tasks related to the Forbidden City (FC) in Beijing, China usually require different data modalities. A tourism question about the history and construction time of the FC requires a text description and knowledge graph triples about the FC. A question about the spatial structure of the FC and its geographic context requires map information and remote sensing images of the FC. In general, data from different modalities provide different ‘views’ that complement each other and provide more information to facilitate a holistic understanding of the data.

Recently, much progress has been made in building large-scale multimodal FMs for joint reasoning from various domains, in particular, vision and language. CLIP [68, 150] is one of the first widely adopted vision–language joint training frameworks. It uses self-supervised contrastive learning to learn a joint embedding of visual and text features. BLIP [102] improves over CLIP by training on synthetically generated captions from images collected from the Internet. It is designed to handle both visual-language understanding and generation tasks. BEiT-3 [186] is a general-purpose multimodal FM that achieves state-of-the-art performance on both vision and vision-language tasks. It combines features from multi-modality expert networks. Florence [204] is a vision-language FM that learns universal visual-language representations for objects, scenes, images, videos, and captions. Similarly, KOSMOS-1 [64] learns from web-scale multimodal data, including text and image pairs. It can transfer knowledge from one modality to another. Flamingo [6] is a family of visual language models that can be adapted to novel tasks using only a few annotated examples, i.e., few-shot learning. It encodes images or videos as inputs along with textual tokens to jointly reason about vision tasks. The newest version of the GPT model, GPT-4 [142], also can perform multimodal analysis, including text, audio, images, and videos.

3 Exploration of the Effectiveness of Existing FMs on Various Geospatial Domains

The first question we would like to ask is how the existing cutting-edge FMs perform when compared with the state-of-the-art fully supervised task-specific models on various geospatial tasks. Geography is a very broad discipline that includes various subdomains, such as Geospatial Semantics [59, 61, 72, 75, 92, 125], Health Geography [21, 31, 83, 165], Urban Geography, [19, 66, 80, 208, 223], Remote Sensing [18, 37, 98, 127, 128, 130, 135, 163], and so on. To address the aforementioned question, in the following, we conduct experiments using various FMs on different tasks in the four geospatial subdomains mentioned earlier. The advantages and weaknesses of existing FMs will be discussed in detail.

3.1 Geospatial Semantics

As a starting point for our discussion, we first demonstrate empirically the promise of leveraging LLMs for solving geospatial semantics tasks. We hope that our results not only demonstrate the effectiveness of such general-purpose, few-shot learners in the geospatial semantics domain but also challenge the current paradigm of training task-specific models as a common practice in GeoAI research.

We compare the performance of 4 pre-trained GPT-2 [152] models of varying sizes provided by Huggingface as well as the most recent GPT-3 [16] (i.e., text-davinci-002), InstructGPT [143] (i.e., text-davinci-003), and ChatGPT [141] (i.e., gpt-3.5-turbo) models developed by OpenAI with multiple supervised, task-specific baselines on two representative geospatial semantics tasks: (1) toponym recognition [45, 182] and (2) location description recognition [62].

Both tasks aim at recognizing parts of the input sentence as named places or location descriptions. We have adapted all seven pre-trained GPT models to these tasks by treating them as question-answering challenges through the use of prompt instructions. As depicted in Lists 1 and 2, we first embed 8 few-shot examples in the prompt by using keywords: “Paragraph”, “Q”, and “A”. “Paragraph:” precedes an input sentence. “Q:” is followed by a question that instructs LLMs what we expect them to do, i.e., “What words in this paragraph represent named places/location descriptions?”. “A:” indicates the expected answers, i.e., a list of named places or a list of location descriptions recognized from the input sentence that are separated by semicolons. Upon presenting these eight few-shot examples in the Paragraph-Q-A structure, we provide a new paragraph highlighted in yellow in Lists 1 and 2. This indicates the place for sentences from the evaluation dataset. Both prompts stop at the last “A:”. All 7 GPT models we used will take this prompt in and generate the subsequent tokens, which will be treated as the recognized place names or location descriptions. The generated outputs from GPT-3, marked in orange, serve as illustrative examples. In the following, we will delve into the specifics of each task and present a comprehensive evaluation of all models.

3.1.1 Toponym Recognition.

Toponym recognition can be considered a subtask of named entity recognition (NER), with the goal of identifying named places from a given text snippet. We use the Hu2014 [60] and Ju2016 [76] datasets as benchmarks for this task. The Hu2014 dataset is constructed by Hu et al. [60] based on Wikipedia. It encompasses 134 entries of sentences containing two commonly used place names, Washington and Greenville. Ju2016 is a larger dataset, with 5,441 entries of sentences constructed by Ju et al. [76]. The dataset was collected based on a list of ambiguous place names provided by Wikipedia. The complete names of these places were subsequently utilized as queries in Bing Search and the sentences about these places were extracted from the search results. More details about the two datasets are available in Hu et al. [60] and Ju et al. [76]. We utilize 7 pre-trained GPT models to perform toponym recognition tasks on both datasets by using appropriate prompts containing 8 few-shot training examples. As we described above, in the prompt, we provide several training samples as few-shot learning samples in the form of natural language instructions. One example of such a prompt is illustrated in Listing 1, while the full prompts can be found in List 7 in Appendix A.1. Note that for our experiments on both Hu2014 and Ju2016, these few-shot examples used in prompts are separately collected and are not from the corresponding evaluation datasets. It is worth noting that ChatGPT, as an FM, is optimized for chatbot purposes and expects conversational inputs rather than a single big prompt. In order to conduct a controlled experiment, we first use the same prompt shown in Listing 1 to instruct all 7 pre-trained GPT models to perform toponym recognition. We also convert the few-shot examples into a list of conversations and use them as the inputs for ChatGPT, which is denoted as ChatGPT (Con.) whereas the ChatGPT using the original prompt is indicated as ChatGPT (Raw.).

Listing 1.

Table 1 compares all 8 GPT models with 15 baselines on two datasets – Hu2014 [60] and Ju2016 [76]. The same test sets have been used to evaluate the performances of all models. In terms of model evaluation for 7 GPT models, we parse the generated tokens into a list of identified place names by splitting them at each semicolon (“;”) and compare them with the ground truth. To make the evaluation comparable to the prior studies [45, 182, 183], we adopt the same evaluation metric, Accuracy – the recognized place names are considered correct only if there is an exact match between the generated token and the ground truth. It is important to note that the chosen evaluation metric sets a stringent standard for all GPT models involved in our study. Unlike all 15 baselines we use, which are limited to selecting text spans directly from the input sentence due to the prompt-based nature of GPT models, we cannot inherently adhere to this constraint. Instead, we only incorporate this requirement as natural language instruction in the “Instruction:” part of the prompt, which does not enforce the same level of restriction. This means that sometimes the generated sentences from GPT models might not be a text span from the input sentence. This discrepancy has the potential to adversely impact the performance metrics of the GPT models when compared with the baselines. Nevertheless, we proceed to juxtapose the performance of eight GPT models against that of 15 baseline models. Those 15 baselines are classified into three groups as shown in Table 1: (A) general NER models; (B) no neural network (NN)–based geoparsers; and (C) fully supervised task-specific NN-based geoparsers. All models in Group C are trained in a supervised manner on the same separated training datasets. Observing the results, it is noteworthy that the GPT models, which operate solely based on a concise set of natural language instructions without necessitating any further training or stringent restrictions on the generated tokens, consistently surpass the performance of the fully supervised baselines on the Hu2014 dataset. This holds true for all variations of the LLMs, with the exception of the smallest GPT-2 model. GPT-3 in particular demonstrated an 8.7% performance improvement over the previous SOTA (TopoCluster [32]). Interestingly, new GPT models such as InstructGPT and ChatGPT do not show higher performances on the Hu2014 dataset. While InstructGPT shows a smaller performance drop, which is acceptable, two ChatGPT models show more significant performance decreases. One reasonable hypothesis is that ChatGPT is further optimized based on InstructGPT for chatbot applications that may not be “flexible” enough to be adapted to new tasks such as toponym recognition.

Table 1.

			Hu2014	Ju2016	HaveyTweet2017
	Model	#Param	Toponym Recognition		Location Description Recognition
			Accuracy \(\downarrow\)	Accuracy \(\downarrow\)	Precision \(\downarrow\)	Recall \(\downarrow\)	F-Score \(\downarrow\)
(A)	Stanford NER (nar. loc.) [40]	-	0.787	0.010	0.828	0.399	0.539
	Stanford NER (bro. loc.) [40]	-	-	0.012	0.729	0.44	0.548
	Retrained Stanford NER [40]	-	-	0.078	0.604	0.410	0.489
	Caseless Stanford NER (nar. loc.) [40]	-	-	0.460	0.803	0.320	0.458
	Caseless Stanford NER (bro. loc.) [40]	-	-	0.514	0.721	0.336	0.460
	spaCy NER (nar. loc.) [58]	-	0.681	0.000	0.575	0.024	0.046
	spaCy NER (bro. loc.) [58]	-	-	0.006	0.461	0.304	0.366
	DBpedia Spotlight[134]	-	0.688	0.447	-	-	-
(B)	Edinburgh [7]	-	0.656	0.000	-	-	-
	CLAVIN [182]	-	0.650	0.000	-	-	-
	TopoCluster [32]	-	0.794	0.158	-	-	-
(C)	CamCoder [45]	-	0.637	0.004	-	-	-
	Basic BiLSTM+CRF [96]	-	-	0.595	0.703	0.600	0.649
	DM NLP (top. rec.) [187]	-	-	0.723	0.729	0.680	0.703
	NeuroTPR [183]	-	0.675 \(^{\dagger }\)	0.821	0.787	0.678	0.728
(D)	GPT2 [152]	117M	0.556	0.650	0.540	0.413	0.468
	GPT2-Medium [152]	345M	0.806	0.802	0.529	0.503	0.515
	GPT2-Large [152]	774M	0.813	0.779	0.598	0.458	0.518
	GPT2-XL [152]	1558M	0.869	0.846	0.492	0.470	0.481
	GPT-3 [16]	175B	0.881	0.811 \(^*\)	0.603	0.724	0.658
	InstructGPT [143]	175B	0.863	0.817 \(^*\)	0.567	0.688	0.622
	ChatGPT (Raw.) [141]	176B	0.800	0.696 \(^*\)	0.516	0.654	0.577
	ChatGPT (Con.) [141]	176B	0.806	0.656 \(^*\)	0.548	0.665	0.601

Table 1. Evaluation Results of Various GPT Models and Baselines on Two Geospatial Semantics Tasks: (1) Toponym Recognition (Hu2014 [60] and Ju2016 [76]) and (2) Location Description Recognition (HaveyTweet2017 [62])

We classify all models into four groups: (A) General NER; (B) No Neural Network (NN) based geoparsers; (C) Fully-supervised NN-based geoparsers; (D) Few-show learning with LLMs. “(#Param)” indicates the number of learnable parameters of LLMs. “(nar. loc.)” and “(bor. loc.)” indicate narrow location models and broad location models defined in [183]. The results of all baselines (i.e., models in Group A, B, and C) are obtained from [182] and [183] except “0.675 \(^{\dag }\) ”, which is obtained by rerunning the official code of [183]. The evaluation results of different GPT models (Group D) are done by using pre-trained GPT2/GPT-3/InstructGPT/ChatGPT models with appropriate prompts. The results of four GPT2 models are obtained by using Huggingface pre-trained GPT2models with various model sizes. The last four models are obtained by using various OpenAI GPT models – text-davinci-002, text-davinci-003, and gpt-3.5-turbo – which are denoted as GPT-3, InstructGPT, and ChatGPT respectively. Since ChatGPT expects conversational inputs rather than a single big prompt, we experiment with two versions of ChatGPT. ChatGPT (Raw.) indicates we use the same prompt as other GPT models while ChatGPT (Con.) indicates we convert the few-shot examples in the prompt into a list of conversations. \(^*\) Due to OpenAI API limitations, we evaluate GPT-3, InstructGPT, and ChatGPT on randomly sampled 544 Ju2016 examples (10% of the dataset).

Based on previous studies [182, 183], utilizing the Ju2016 dataset is a very difficult task. On this dataset, we found that GPT2-XL outperforms the previous state-of-the-art, i.e., NeuroTPR [183], by over 2.5% using only 8 few-shot examples in the prompt. In contrast, a task-specific model, such as NeuroTPR, requires supervised training on 599 labeled tweets and labeled sentences generated from 3,000 Wikipedia articles. GPT-3 and InstructGPT does not show performance improvement on the Ju2016 dataset over GPT2-XL. Similar to the finding on the Hu2014 dataset, ChatGPT shows a significant performance decrease on the Ju2016 dataset. In accordance with existing empirical findings [16, 152], we also found that the performance of these LLMs tended to scale with the number of learnable parameters.

3.1.2 Location Description Recognition.

The location description recognition task is slightly more challenging. Given a text snippet (e.g., a tweet), the goal is to recognize more fine-grained location descriptions such as door number addresses, highway exits, and road intersections instead of large-scale geographic entities such as cities, states, and countries. HaveyTweet2017 [61, 62] is used as one representative benchmark dataset for this task. This dataset contains 1,000 tweets posted during Hurricane Harvey. Location descriptions in these tweets were manually annotated and are in different forms, such as door number addresses, road intersections, road segments, and highway exits. More details about this dataset and its annotation process are available in [61, 62]. The same set of pre-trained GPT models and 15 baselines are used for this task. By following Hu [59], we use three evaluation metrics: precision, recall, and F-score. Listing 2 shows one example prompt used in this task. The full prompt can be seen in Listing 8 in Appendix A.1.

Listing 2.

Table 1 summarizes the evaluation results of different models on the HaveyTweet2017 dataset. The same test set of HaveyTweet2017 is used to evaluate all GPT models as well as 15 baseline models. On the HaveyTweet2017 dataset, GPT-3 achieves the best recall score across all methods. However, all LLMs have rather low precision (and, therefore, low F1-scores). This is because LLMs implicitly convert the location description recognition problem into a natural language generation problem (see Listing 2), meaning that they are not guaranteed to generate tokens that appear in the input text as we discussed above. Based on the experimental results in Table 1, we can clearly see that by using just a small number of few-shot samples, LLMs can outperform the fully supervised, task-specific models on well-defined geospatial semantics tasks. This showcases the potential of LLMs to dramatically reduce the need for customized architectures or large labeled datasets for geospatial tasks. However, how to develop appropriate prompts to instruct LLMs for a given geospatial semantics task requires further investigation.

3.2 Health Geography

The next set of experiments focuses on an important health geography problem – dementia death counts time series forecasting for a given geographic region, such as cities, counties, and states. With a growing share of older adults in the population, it is estimated that more than 7 million US adults aged 65 or older were living with dementia in 2020, and the number could increase to over 9 million by 2030 and nearly 12 million by 2040 [225]. Alzheimer’s disease, the most common type of dementia, has been reported to be one of the top leading causes of death in the United States, with 1 in 3 seniors dying with Alzheimer’s or another dementia by 2019 [9]. Notably, there are substantial and long-standing geographical disparities in mortality due to dementia [4, 8]. Subnational planning and prioritizing dementia prevention strategies require local mortality data. Prediction of dementia deaths at the subnational level will assist in informing future tailored health policies to eliminate geographical disparities in dementia and to achieve national health goals.

In this work, we conduct time series forecasting on the number of deaths due to dementia in two geographic region levels —state level and county level. The dementia data are obtained from the US Centers for Disease Control and Prevention Wide-ranging Online Data for Epidemiologic Research (CDC WONDER⁴), which is a publicly available dataset. The mortality due to dementia is based on information from all death certificates filed in the 50 states and the District of Columbia. The data from the death certificates are coded by the states and provided to the National Center for Health Statistics (NCHS) through the Vital Statistics Cooperative Program or coded by the NCHS from copies of the original death certificates provided to it by the State registration offices. Dementia deaths are classified according to the International Classification of Diseases, 10th Revision (ICD-10), including unspecified dementia (F03), Alzheimer’s disease (G30), vascular dementia (F01), and other degenerative diseases of the nervous system, not elsewhere classified (G31) [90].

3.2.1 US State-Level Dementia Time Series Forecasting.

We collect annual time series of dementia death counts for all 50 US states and District of Columbia between 1999 and 2020. The time series from 1999 to 2019 are used as training data, and the numbers in 2020 are used as ground truth labels. The same set of pre-trained GPT models used in Section 3.1 are utilized in this task. In contrast to the geospatial semantics experiments, we utilize all GPT models in a zero-shot setting since we think the historical time series data is enough for an LLM to perform the forecasting. For all GPT models, we also treat the task as a natural language generation problem. Listing 3 shows one example prompt we use in this experiment, with California as an example. We notice that even when we ask GPT models to only generate one single number as the prediction, in many cases GPT models will generate a long sentence as the answer instead of a single number. In order to perform a fair comparison, for all the GPT models, we will use the first “number token” in the generated sentence as the prediction of this model.

Listing 3.

With only 51 time series, each consisting of 22 data points, many sequential DL models such as recurrent neural networks (RNNs) and Transformers [180] are at risk of overfitting on this dataset. Thus, we pick the state-of-the-art ML–based time series forecasting model, ARIMA (Autoregressive integrated moving average) as the fully supervised task-specific baseline model. We train individual ARIMA models on each state’s time series using data from 1999 to 2019 and perform forecasting on data in 2020. Hyperparameter tuning is performed on all ARIMA hyperparameter combinations to obtain the best results. Additionally, we use a persistence model [140, 144] as a reference. A persistence model assumes that the future value of a time series remains the same between the current time and the forecast time. In our case, we use the dementia death count of each state in 2019 as the prediction for the value in 2020.

Table 2 presents a comparison of model performances among different GPT models and two baselines. We select four commonly used evaluation metrics: mean square error (MSE), mean absolution error (MAE), mean absolute percentage error (MAPE), and \(R^2\) . Interestingly, all GPT2 models perform poorly on all evaluation metrics. Their performances are even worse than the simple persistence model. This suggests that GPT2 may struggle with zero-shot time series forecasting. On the other hand, GPT-3, InstructGPT, and two ChatGPT models demonstrate reasonable performances. Of particular interest is that InstructGPT outperforms the best ARIMA model on all evaluation metrics even though InstructGPT is not fine-tuned on this specific task. We propose two hypothetical reasons for the strong performance of InstructGPT in the time series forecasting task: (1) After training on a large-scale text corpus, InstructGPT may have developed the intelligence necessary to perform zero-shot time series forecasting, which is fundamentally an autoregressive problem. (2) It is possible that InstructGPT and GPT-3 may be exposed to US state-level dementia time series data during their training on the large-scale text corpus.

Table 2.

	Model	#Param	MSE \(\downarrow\)	MAE \(\downarrow\)	MAPE \(\downarrow\)	R \(^2\) \(\uparrow\)
(A) Simple	Persistence [140, 144]	-	985,179	630	0.096	0.971
(B) Supervised ML	ARIMA [73]	-	562,768	462	0.067	0.984
(C) Zero-shot LM	GPT2 [152]	117M	44,635,055	4,898	0.955	\(-\) 0.271
	GPT2-Medium [152]	345M	42,315,630	4,616	0.745	\(-\) 0.209
	GPT2-Large [152]	774M	39,039,733	4,250	0.779	\(-\) 0.132
	GPT2-XL [152]	1558M	35,355,840	3,912	0.709	\(-\) 0.026
	GPT-3 [16]	175B	587,263	474	0.070	0.983
	InstructGPT [143]	175B	387,413	365	0.055	0.989
	ChatGPT (Raw.) [141]	176B	1,143,675	623	0.121	0.967
	ChatGPT (Con.) [141]	176B	4,224,811	1,131	0.240	0.890

Table 2. Evaluation Results of Various GPT Models and Baselines on the US State-Level Dementia Time Series Forecasting Task

We classify all models into four groups: (A) Simple persistent model; (B) Fully supervised machine learning models such as ARIMA; (C) Zero-shot learning with LLMs. “(#Param)” indicates the number of learnable parameters of LLMs. The denotations of different GPT models are the same as Table 1. Four evaluation metrics are used: MSE (mean square error), MAE (mean absolute error), MAPE (mean absolute percentage error), and R \(^2\) . \(\uparrow\) and \(\downarrow\) indicate the direction of better models for each metric. For all GPT models, we encode time series information between 1999 and 2019 in the prompt and let it generate data in 2020.

While we cannot determine which of these reasons is the primary factor behind InstructGPT’s success, these results are very encouraging. Similar to the results in Table 1, two ChatGPT models underperform InstructGPT. More experiment analysis can be seen in the county-level experiments.

3.2.2 US County-Level Dementia Time Series Forecasting.

In terms of county-level data, we utilized the dementia death count time series of all US counties with available data, resulting in a total of 2,447 US counties selected for analysis. We only considered counties with dementia annual death records spanning more than 4 years between 1999 and 2020. Similar to Section 3.2.1, we utilize all available data up to the given year for training ARIMA models and generating GPT prompts, and then make predictions for the following year. We employ the same set of GPT models and baselines as in the state-level experiment to conduct the county-level experiment. Listing 4 shows one example prompt we use in this experiment involving Santa Barbara County, CA as an example. The same setting and evaluation metrics as Table 2 are utilized in this task.

Listing 4.

Table 3 compares the results of different models. Similar findings can be seen from these results. All GPT2 models perform poorly. However, both GPT-3 and InstructGPT outperform the best ARIMA models on all evaluation metrics, whereas two ChatGPT models underperform them. Among the two ChatGPT models, ChatGPT (Con.) are slightly better than ChatGPT (Raw.) on all metrics except MAPE.

Table 3.

(A) Simple	Persistence [140, 144]	-	1,648	16.9	0.189	0.979
	Model	#Param	MSE \(\downarrow\)	MAE \(\downarrow\)	MAPE \(\downarrow\)	R \(^2\) \(\uparrow\)
(B) Supervised ML	ARIMA [73]	-	1,133	15.1	0.193	0.986
(C) Zero-shot LLMs	GPT2 [152]	117M	77,529	92.0	0.587	\(-\) 0.018
	GPT2-Medium [152]	345M	226,259	108.1	0.611	\(-\) 2.824
	GPT2-Large [152]	774M	211,881	94.3	0.581	\(-\) 1.706
	GPT2-XL [152]	1,558M	162,778	99.8	0.627	\(-\) 1.082
	GPT-3 [16]	175B	1,105	14.5	0.180	0.986
	InstructGPT [143]	175B	831	13.3	0.179	0.989
	ChatGPT (Raw.) [141]	176B	4,115	23.2	0.217	0.955
	ChatGPT (Con.) [141]	176B	3,402	20.7	0.231	0.944

Table 3. Evaluation Results of Various GPT Models and Baselines on the US County-Level Dementia Time Series Forecasting Task

We use same model set and evaluation metrics as Table 2.

To further understand the geographical distributions of prediction errors for each model, we visualize the prediction errors of each model on each US county in Figure 1. In the figure, red represents overestimations of the corresponding model whereas blue indicates underestimations. The intensity of the color indicates the magnitude of the prediction error, with darker colors representing larger errors. We can see that Persistence, ARIMA, GPT-3, and InstructGPT generally demonstrate better forecasting performance. However, the prediction percentage errors are not uniformly distributed across different US counties. As Persistence uses the previous year’s data as the prediction, Figure 1(a) indicates that the growth rates of dementia death counts are uneven for different counties. The southwest of the United States shows a recent increase in dementia death counts, which leads the Persistence model to underestimate the true data. The current maps of prediction errors show that the distribution of errors of GPT-3 and InstructGPT are not uniform across the US counties; it is unclear whether the uneven distribution is due to the geographic bias encoded in the models or the spatial heterogeneity of the growth rate of dementia death counts. Further analysis is needed to determine the cause of these uneven distributions.

Fig. 1.

One obvious observation from Figure 1 is that all GPT2 models turn to significantly underestimate the dementia data. To understand the cause of this behavior and the superiority of GPT-3 and InstructGPT, we showcase the generated answers of different GPT models for four US counties in Table 4. From Table 4, it is evident that GPT2 will often repeat the information provided in our prompt rather than generating novel predictions. For example, in the Clarke County, GA and Santa Barbara County, CA cases, all three GPT2 models (i.e., GPT2-Medium, GPT2-Large, and GPT2-XL) predict the same numbers as the data in 1999. This suggests that these models rely heavily on the prompt information instead of learning from the time series data, which could explain their inferior performance compared with other models such as GPT-3 and InstructGPT. In the other two counties’ cases, the predictions of the GPT2 models vary significantly. In most cases, both InstructGPT and ChatGPT (Raw.) generate a single number as the prediction, indicating that they understand the task they are expected to perform. The only exception is the Santa Barbara County case, in which ChatGPT (Raw.) generates a short sentence containing a reasonable prediction. However, based on our evaluation, the predictions of ChatGPT (Raw.) are not as good as those of GPT-3. Interestingly, when using ChatGPT in a conversational context, i.e., ChatGPT (Con.), ChatGPT usually returns a very long sentence. In the New York County case, ChatGPT (Con.) is unable to give a prediction, suggesting that ChatGPT is useful in a chatbot context but may not be the best choice for other tasks such as time series forecasting.

Table 4.

County	Label	Model	Generated Answer
Clarke County, GA	79	GPT2	The number of deaths from Alzheimer’s disease is 21 in 1999, 37 in 2000, 32 in 2001, 25 in 2002, 29 in 2003, 44 in 2004, 38 in 2005, 41 in 2006, 52 in 2007, 46 in 2008.
		GPT2-Medium	The number of deaths from Alzheimer’s disease in Clarke County, GA is 21 in 2020.
		GPT2-Large	The number of deaths from Alzheimer’s disease in 2020 is 21 in 2020, 37 in 2021, 32 in 2022, 25 in 2023, 29 in 2024, 44 in 2025, 38 in 2026, 42 in 2027, 42
		GPT2-XL	The number in 2020 is 21.
		GPT-3	The number of deaths from Alzheimer’s disease in 2020 at Clarke County, GA is forecast to be 81.
		InstructGPT	83
		ChatGPT (Raw.)	78
		ChatGPT (Con.)	Based on the available historical data, the forecast for the number of deaths from Alzheimer’s disease in Clarke County, GA in 2020 would be 81.
New York County, NY	835	GPT2	The number of deaths from Alzheimer’s disease is 91 in 1999, 121 in 2000, 118 in 2001, 155 in 2002, 214 in 2003, 243 in 2004, 219 in 2005, 294 in 2006, 296 in 2007, 364 in 2008.
		GPT2-Medium	The number of deaths from Alzheimer’s disease in New York County, NY is 9,965.
		GPT2-Large	The number of deaths from Alzheimer’s disease in New York County, NY is 91 in 2020.
		GPT2-XL	The number in 2020 is 901.
		GPT-3	The number of deaths from Alzheimer’s disease in 2020 at New York County, NY is 932.
		InstructGPT	902
		ChatGPT (Raw.)	855
		ChatGPT (Con.)	Based on the historical data provided, a common approach for time series forecasting is to use methods such as exponential smoothing or ARIMA. However, without further information on the specific pattern or trend in the data, it is difficult to provide an accurate prediction for the number of deaths from Alzheimer’s disease in New York County, NY for 2020. Therefore, I am unable to provide a single number as the prediction.
San Francisco County, CA	592	GPT2	The number of deaths from Alzheimer’s disease is 1,000. The number of deaths from Alzheimer’s disease is 2,000. The number of deaths from Alzheimer’s disease is 3,000. The number of deaths from
		GPT2-Medium	The number of deaths from Alzheimer’s disease is 201 in 2020.
		GPT2-Large	The number of deaths from Alzheimer’s disease in 2020 is 574.
		GPT2-XL	The number in 2020 is 201.
		GPT-3	571
		InstructGPT	590
		ChatGPT (Raw.)	579
		ChatGPT (Con.)	Based on the historical data provided, a possible forecasting method is to use a time series model, such as ARIMA or exponential smoothing. Without fitting such a model, a simple approach would be to assume a trend of similar magnitude to recent years and a similar level of annual variation. Using this approach, a possible prediction for the number of deaths from Alzheimer’s disease in San Francisco County, CA in 2020 is 583.
Santa Barbara County, CA	373	GPT2	The number of deaths from Alzheimer’s disease is 126 in 1999, 114 in 2000, 124 in 2001, 127 in 2002, 156 in 2003, 154 in 2004, 175 in 2005, 172 in 2006, 171 in 2007, 248 in 2008.
		GPT2-Medium	The number of deaths from Alzheimer’s disease in Santa Barbara County, CA is 126 in 2020.
		GPT2-Large	The number of deaths from Alzheimer’s disease in Santa Barbara County, CA is 126 in 2020.
		GPT2-XL	The number in 2020 is 126.
		GPT-3	The number of deaths from Alzheimer’s disease in 2020 at Santa Barbara County, CA is expected to be about 350.
		InstructGPT	345
		ChatGPT (Raw.)	I predict the number of deaths from Alzheimer’s disease in 2020 at Santa Barbara County, CA to be 356.
		ChatGPT (Con.)	Based on the historical data provided, the prediction for the number of deaths from Alzheimer’s disease in 2020 at Santa Barbara County, CA is 327.

Table 4. Generated Answers by Various GPT Models for Four US Counties in US County-Level Dementia Time Series Forecasting Problem

“County” and “Label” columns indicate the US county and its true dementia death count in 2020. “Generated Answer” indicates the generated answer for different GPT models in which the numbers in bold are the predictions we extract from these answers.

3.3 Urban Geography

The third set of FM experiments focuses on research problems in the Urban Geography domain. Two representative tasks are selected: (1) an urban function task that aims at predicting the urban functions of a geographic region based on the Points of Interest (POIs) within it [65, 66, 139, 199, 205] and (2) an urban perception task that focuses on predicting the urban neighborhood characteristics (e.g., housing price, safety, noise intensity level) based on street view imagery (SVI) [80, 208, 218]. Since these tasks involve different data modalities such as point data, text, and images, we use different FMs to handle each task.

3.3.1 POI-Based Urban Function Classification.

The first experiment focuses on predicting the urban functions of a geographic region based on the POIs within it. This is a common Urban Geography task aimed at understanding the structure of the urban space [65, 66, 139, 199, 205].

To quantitively evaluate the performance of LLMs on this urban function prediction task, we utilize a POI dataset from Shenzhen, China that consists of 303,428 POIs and 5,461 urban neighborhoods with POIs [35, 36, 215, 216]. We denote this dataset as \({UrbanPOI5K}\) . Figure 2 shows the geographic distributions of the POIs and regions. The ground truth data is from the Urbanscape Essential Dataset of Peking University. The dataset provides detailed spatial distributions of 10 urban function types in the study area: forest, water, unutilized, transportation, green space, industrial, educational and governmental, commercial, residential, and agricultural. To simplify the task, we merge the uncommon urban function types forest, water, unutilized, green space, and agricultural into the function type outdoors and natural. This results in six urban function types: (1) residential; (2) commercial; (3) industrial; (4) education, health care, civic, governmental, and cultural; (5) transportation facilities; and (6) outdoors and natural. In total, 5,344 of the regions have ground truth labels. We randomly split this dataset into training, validation, and test sets with the ratio 60%:20%:20%. The test dataset is used to evaluate the performance of different models, whereas the validation set is only used for supervised baselines.

Fig. 2.

In order to enable an LLM to handle such a task, we convert the set of POIs inside an urban region into a textual paragraph that describes the frequencies of POIs with different place types. Then, we ask the LLM to predict the urban function of the region based on the paragraph (here, we ask for the most dominating function in spite of the common presence of mixed-used urban regions). Listing 5 shows one example prompt for this task, which includes a paragraph-question-answer tuple as a demonstration. LLMs adapted by this kind of prompt are conducting prediction under a one-shot setting. The paragraph highlighted in yellow in Listing 5 indicates the POI types and frequency information of a new neighborhood we would like to classify. The text highlighted in orange is the generated answers from GPT-3, which are treated as the prediction results. For the zero-shot setting, we simply remove this paragraph-question-answer tuple from the prompt. We use GPT2 with various sizes, GPT-3, and two ChatGPT models to perform this task under both zero-shot and one-shot settings. For comparison, we use two supervised learning neural network baselines:

Listing 5.

Place2Vec: We first learn POI category embeddings following the Place2Vec method [195]. Then, given an urban region with K POIs, we convert each POI into its corresponding Place2Vec embedding and perform mean pooling to obtain region embeddings as Zhai et al. [205] did. The resulting neighborhood embeddings are fed into a one-hidden-layer MLP to supervise learning its urban function over the \({UrbanPOI5K}\) training dataset.

HGI: HGI is an unsupervised method for learning region representations based on POIs. It takes into account the categorical semantics of POIs as well as POI-level and region-level spatial adjacency, and the multi-faceted influence from POIs to regions [66]. The HGI region embeddings are fed into an MLP with the same setup to predict the primary urban function. HGI is currently considered a state-of-the-art method that generates effective region embeddings for the urban function task.

Table 5 shows the evaluation results of all models on the test dataset of \({UrbanPOI5K}\) . Additionally, we visualize the confusion matrics of two baseline models, 7 zero-shot GPT models, and 7 one-shot GPT models in Figures 3 to 5. We can see the following.

Table 5.

	Model	Accuracy	Precision	Recall
(A) Supervised NN	Place2Vec [195, 205]	0.540	0.512	0.516
(A) Supervised NN	HGI [66]	0.584	0.568	0.563
(B) Zero-shot LLMs	GPT2 [152]	0.318	0.105	0.158
	GPT2-Medium [152]	0.025	0.102	0.040
	GPT2-Large [152]	0.005	0.001	0.002
	GPT2-XL [152]	0.001	0.108	0.002
	GPT-3 [16]	0.144	0.448	0.141
	ChatGPT (Raw.) [141]	0.075	0.376	0.106
	ChatGPT (Con.) [141]	0.051	0.232	0.046
(C) One-shot LLMs	GPT2 [152]	0.149	0.079	0.085
	GPT2-Medium [152]	0.317	0.104	0.156
	GPT2-Large [152]	0.057	0.083	0.021
	GPT2-XL [152]	0.324	0.105	0.159
	GPT-3 [16]	0.176	0.486	0.190
	ChatGPT (Raw.) [141]	0.195	0.524	0.245
	ChatGPT (Con.) [141]	0.093	0.451	0.085

Table 5. Evaluation Results of Various GPT Models and Supervised Baseline on the \({UrbanPOI5K}\) Dataset for the POI-Based Urban Function Classification Task

We divide the models into three groups: (A) supervised learning-based neural network models; (B) Zero-shot learning with LLMs. (C) One-shot learning with LLMs. We use accuracy, weighted precision, and weighted recall as evaluation metrics. We do not include weighted F1 scores since it is the same as the accuracy score. The best model of each group is highlighted.

Fig. 3.

Fig. 4.

Fig. 5.

In the zero-shot setting, GPT-3 achieves the best precision scores among all GPT models but still underperforms HGI models.

Interestingly, in the zero-shot setting, the smallest GPT2 achieves the best accuracy and recall scores, which is counterintuitive. The reason can be seen in Figure 4(a). GPT2 predicts almost all neighborhoods as “Residential”, which accounts for 30+% of the ground truth data.

In the one-shot setting, ChatGPT (Raw.) becomes the best model among all GPT models in terms of both precision and recall. It achieves 52.4% precision, which is only 4.4% less than HGI. Its confusion matrix in Figure 5(f) also demonstrates that ChatGPT (Raw.) has reasonably good performance on all urban function classes.

In the one-shot setting, GPT2-XL has the highest accuracy. However, Figure 5(d) shows that GPT2-XL is highly biased towards the “Residential” class.

These experimental results highlight the challenges of using LLMs for urban function classification. Two main reasons contribute to their inadequate performance:

POIs are initially used for search in online map services. By nature, they contain rich information about commercial venues such as restaurants and hotels. In contrast, the venues that are not closely related to our daily life, e.g., factories, are often missing. In this regard, Shenzhen is a heavily industrial-oriented city, and the ground truth data indicates that there are many more industrial regions than commercial ones. However, LLMs tend to predict that a large number of regions are commercial in view of the commercial-related POIs fed into it.

In addition, LLMs are unable to access the spatial distributions of POIs, which highly influence POI-based urban function prediction since different spatial distributions of POIs yield different spatial interaction patterns and, thus, different urban functions. Although both supervised methods Place2Vecand HGI are learned from POI spatial distributions during their place type embedding unsupervised training stage, it is not possible to inform LLMs of the spatial distributions of POIs. Converting a POI set into an image will also not work. This is because different POI types usually have spatial distributions with very different characteristics [124]. POIs with types of nightclubs or bars are usually clustered together whereas other POI types such as post offices, fire stations, and elementary schools are rather evenly distributed. A large pixel size will make a large number of POIs with the former types fall into one single pixel. On the other hand, a finer pixel size will make the image of an urban space too large and cannot be handled by other deep image encoders. Moreover, an urban space image with a finer pixel size will have very sparse information, which is hard for image encoders to learn. In other words, we need to use specialized neural architectures to directly handle point data (also polyline data and polygon data). This necessitates incorporating encoding architectures of various geospatial vector data, such as location encoding [122, 124], polyline encoding [155, 202], and polygon encoding techniques[126] into the GeoAI FM development. We will discuss this in detail in Section 4.6.

3.3.2 Street-View Image-Based Urban Noise Intensity Classification.

Street-view images (SVIs) are widely used in many Urban Geography studies to understand different characteristics of an urban neighborhood, such as safety [208], beauty, affluence [99], depressing atmosphere [208], housing prices [80], noise intensity levels [218], and accessibility [50]. It becomes an important data source that complements remote sensing images.

In this work, we use a recently developed street-view image noise intensity dataset developed by Zhao et al. [218] as a representative urban perception task. This dataset consists of 579 street-view images collected from Singapore. The noise intensity scores (between 0 and 1) were collected based on a human survey. Refer to their Github⁵ site for a detailed description of this dataset. Since the sound-intensity score is not a commonly agreed metric but rather an indicator defined by Zhao et al. [218], it would be challenging for visual FMs trained on general web data such as OpenCLIP [68] and BLIP [102] to directly predict such a score. Therefore, we discretize the original noise-intensity score of each street-view image into four classes: very quiet (0–0.25), quiet (0.25–0.50), noisy (0.50–0.75), and very noisy (0.75–1.00). We denote this dataset as \({SingaporeSVI579}\) . Figure 6 illustrates some street-view image examples from each noise-intensity class. We randomly split \({SingaporeSVI579}\) into 50% training and 50% testing sets. The testing dataset is used to evaluate different CNN and foundation models.

Fig. 6.

All GPT models (except GPT-4) used in previous experiments are pure language models that cannot handle data modalities such as images. Thus, for the street-view image-based noise intensity prediction task, we select the latest high-performance open visual-language foundation models (VLFMs), including OpenCLIP [68], BLIP [102], and OpenFlamingo-9B [11]. Although there exist more powerful visual-language foundation models such as DeepMind’s Flamingo-9B [6], KOSMOS-1 [64], and GPT-4 [142], they are not openly accessible nor do they provide application programming interface (API) access yet.⁶ We describe the setting of each VLFM as follows.

OpenCLIP-L: We use an OpenCLIP [68] ViT L/14 model pre-trained with the LAION-2B English subset of LAION-5B⁷ as a small-sized OpenCLIP model. We download the pre-trained model from Huggingface.⁸

OpenCLIP-B: We use the OpenCLIP [68] ViT-bigG/14 model trained with the LAION-2B English subset of LAION-5B as a larger-sized OpenCLIP model. The pre-trained model is from Huggingface.⁹

BLIP: We use the pre-trained BLIP-2 model [101] provided by Huggingface¹⁰ that consists of a CLIP-like image encoder, a Querying Transformer (Q-Former), and a large language model (Flan T5-xl).

OpenFlamingo-9B: We use the pre-trained OpenFlamingo-9B model [11] provided by Huggingface¹¹ that consists of an image encoder (CLIP ViT-L/14 [68]) and an LLM (LLaMA-7B [178]).

All VLFMs are evaluated on the testing set of \({SingaporeSVI579}\) in a zero-shot setting. Since different VLFMs require different image input formats and expect different styles of text prompts, we describe the zero-shot pipeline for each VLFM below.

OpenCLIP-L and OpenCLIP-B: We first encode four noise-intensity class names into four text embeddings by using a text template of the form “a city area with the noise intensity of [NOISE_INTENSITY_CLASS]”. Then, given a street view image, we use an OpenCLIP ViT image encoder to encode them into an image embedding. The cosine similarity between this image embedding and all four class text embeddings are computed and the class with the highest similarity will be picked as the prediction.

BLIP: Given a street-view image, we use a prompt of the form “What is the noise intensity of this area, is it 1. very quiet, 2. quiet, 3. noisy, or 4. very noisy?” to instruct the language encoder of BLIP to predict its noise-intensity class.

OpenFlamingo-9B: We use a prompt of the form “There are four noise intensity levels: 1. very quiet, 2. quiet, 3. noisy, or 4. very noisy. <image>The noise intensity of this area is” to instruct OpenFlamingo-9B to predict the noise intensity of the given image. Here “<image>” denotes an image token and CLIP ViT-L/14 is used as the encoder.

We select four CNN models as the alternative baselines to compare against these VLFMs: AlexNet [91], ResNet18 [51], ResNet50 [51], and DenseNet161 [63]. The weights of all CNN models are first initialized by the Place365 pre-trained weights [220], and only their final softmax layers are fine-tuned with full supervision on the \({SingaporeSVI579}\) training dataset. We choose this linear probing method instead of fully fine-tuning the whole CNN architecture due to the very limited training data size.

Table 6 compares the performances of different fine-tuned CNN models with four zero-shot VLFMs. The results show that BLIP achieves the best accuracy and weighted F1-score among all VLFMs in the zero-shot learning setting. The performance of BLIP is comparable to those of AlexNet but is still slightly worse than the best models, ResNet18 and ResNet50. To further understand the classification accuracy of different models on each noise-intensity class, we visualize the confusion matrices of all models in Figure 7. We can see that the predictions of OpenCLIP-L, OpenCLIP-B, and OpenFlamingo-9B are highly biased. OpenCLIP-L and OpenCLIP-B tend to classify most street-view images as ‘very quiet’ whereas OpenFlamingo-9B classifies most images as ‘very noisy’. On the other hand, only BLIP shows balanced and reasonable predictions on all four noise-intensity classes, similar to those fine-tuned CNN models.

Table 6.

	Model	#Param	Accuracy	F1
(A) Supervised Fine-tuned CNNs	AlexNet [91]	58M	0.452	0.405
	ResNet18 [51]	11M	0.493	0.442
	ResNet50 [51]	24M	0.500	0.436
	DenseNet161 [63]	27M	0.486	0.382
(B) Zero-shot FMs	OpenCLIP-L [68, 150, 169]	427M	0.128	0.089
	OpenCLIP-B [68, 150, 169]	2.5B	0.169	0.178
	BLIP [101, 102]	3.9B	0.452	0.405
	OpenFlamingo-9B [11]	8.3B	0.262	0.127

Table 6. Evaluation Results of Various Vision-Language Foundation Models and Baselines on the Urban Street-View Image-based Noise Intensity Classification Dataset, SingaporeSVI579 [218]

We classify models into two groups: (A) Supervised finetuned convolutional neural networks (CNNs); (B) Zero-shot learning with visual-language foundation models (VLFMs). We use accuracy and weighted F1 scores as evaluation metrics. The best scores for each group are highlighted.

Fig. 7.

These results are very encouraging, with zero-shot BLIP achieving comparable performance with fine-tuned models. We can observe from Figure 7(g) that BLIP has a general sense of the noise-intensity level of the target urban area, e.g., it misclassifies most “very noisy” areas as simply “noisy”. This implies that BLIP understands noise-intensity levels on a different scale. For example, a “very noisy” place annotated by a human interviewee in Singapore might not qualify as “very” for BLIP, which might have seen many much noisier urban areas. To this end, BLIP is generally competent for this urban perception task. At the same time, we recognize that most of the open VLFMs are still not powerful enough to connect visual features to their important yet nuanced semantics and concepts in urban studies. For example, when presented with a construction site in Figure 6(d), we expect a VLFM to predict that this is a very noisy neighborhood. When seeing a large vegetation coverage in Figure 6(d), a VLFM should associate this visual feature with the concept of ‘quiet’ in the language space. This study highlights the fact that the current VLFMs have certain capabilities in understanding the characteristics of urban neighborhoods given visual inputs. However, their ability is still generally not as strong as the current LLMs on language-only tasks. Furthermore, we think the urban perception task, as a classic task in urban geography, is more challenging than current visual question-answering tasks commonly used in VLFM research [64, 150] partly due to their partially subjective nature and the rarity of annotated datasets. This further emphasizes the unique challenges faced by foundation model research in GeoAI.

3.4 Remote Sensing

Our final experiment focuses on a typical RS task: RS image scene classification. We choose a widely used aerial image scene classification dataset, \({AID}\) [192], which consists of 10K scenes and 30 aerial scene types. These data were collected from Google Earth imagery. Refer to Xia et al. [192] for a detailed description of this dataset. \({AID}\) does not provide an official dataset split; thus, we split the dataset into training and testing sets using stratified sampling with a ratio of 80% for training and 20% for testing, ensuring that both sets have similar scene type label distributions.

Similar to the street-view image classification task from Section 3.3.1, we use four CNN models (i.e., AlexNet, ResNet18, ResNet50, and DenseNet161) and four VLFMs (i.e., OpenCLIP-L, OpenCLIP-B, BLIP, and OpenFlamingo-9B). For all CNN models, their weights are first initialized by the ImageNet-V1 pre-trained weights, and their final softmax layers are fine-tuned with full supervision on the \({AID}\) training dataset. For the VLFMs, their model performances are highly dependent on whether their language model component can correctly comprehend the semantics of each RS image scene type. However, many RS image scene types of \({AID}\) are vague, such as “center” and “commercial”. We find that if keeping their original scene type names, models like OpenCLIP would assign no RS image to those two types. Therefore, we modify the names of “center” to “theater” (although this only partially covers the semantics of this class), and “commercial” to “commercial area” and use them in the prompt. Models with such prompts are denoted as “ \((Updated)\) ” while “ \((Origin)\) ” denotes the original RS image scene type names from \({AID}\) being used in the prompt. We evaluate all VLFMs in a zero-shot learning setting. Following the street-view image classification task in Section 3.3.1, similar prompt formats are used on the \({AID}\) dataset.

Table 7 summarizes the experiment results of four fine-tuned CNN models and zero-shot VLFMs. We can see that AlexNet achieves the best accuracy and F1-score among all CNN models. Surprisingly, OpenCLIP-L \((Updated)\) obtains the best accuracy and F1-score among all VLFMs. We observe that bigger models do not necessarily lead to better results in this task. For example, the largest model, OpenFlamingo-9B only achieves a 0.206 accuracy. One possible reason is that these larger VLFMs might not see RS images in their training data, which usually contain general web-crawled images and texts. OpenCLIP, on the other hand, explicitly includes satellite images in its pre-training data [68]. However, both BLIP and OpenFlamingo-9B did not mention whether they utilized RS images during the pre-training stage. Note that street-view images are quite similar to Internet images, which are widely used for VLFM pre-training. RS images, on the other hand, such as satellite images and unmanned aerial vehicles (UAVs), are visually distinguished from Internet photos, the majority of which are captured using consumers’ digital cameras at the ground level. If the visual encoders of BLIP and OpenFlamingo-9B are not pre-trained on RS images, the features they extracted will not align well with text features that share similar semantics, which leads to poor performance on the \({AID}\) dataset. Our study highlights the importance of pre-training VLFMs on a diverse set of visual inputs, including RS images, to improve their performance on RS tasks.

Table 7.

	Model	#Param	Accuracy	F1
Supervised Fine-tuned CNNs	AlexNet [91]	58M	0.831	0.827
	ResNet18 [51]	11M	0.752	0.730
	ResNet50 [51]	24M	0.757	0.738
	DenseNet161 [63]	27M	0.818	0.807
Zero-shot FMs	OpenCLIP-L \((Origin)\) [68, 150, 169]	427M	0.708	0.688
	OpenCLIP-L \((Updated)\) [68, 150, 169]	427M	0.710	0.698
	OpenCLIP-B \((Origin)\) [68, 150, 169]	2.5B	0.699	0.668
	OpenCLIP-B \((Updated)\) [68, 150, 169]	2.5B	0.705	0.686
	BLIP \((Origin)\) [102]	2.5B	0.500	0.473
	BLIP \((Updated)\) [102]	2.5B	0.520	0.494
	OpenFlamingo-9B [11]	8.3B	0.206	0.154

Table 7. Evaluation Results of Various Vision-Language Foundation Models and Baselines on the Remote Sensing Image Scene Classification Dataset, \({AID}\) [192]

We use the same model set as Table 6. “ \((Origin)\) ” denotes we use the original remote sensing image scene class name from \({AID}\) to populate the prompt while “ \((Updated)\) ”indicates we update some class names to improve its semantic interpretation for FMs. We use accuracy and F1 score as evaluation metrics.

Another important observation is that the semantics embedded in the prompts play a pivotal role in determining the model’s performance. For example, when using the original scene type name “center”, generally none of the models is able to understand the underlying ambiguous meaning. However, simply changing “center” to “theater” could help OpenCLIP correctly find relevant RS scenes, although this is not a perfect name to describe this class. Nevertheless, this simple change demonstrates the importance of choosing expressive prompts while using FMs for geospatial tasks.

Compared with the results in Table 5, the experimental results in Table 7 highlight the unique challenges of RS images. We will discuss the improvement of FMs for remote sensing in detail in Section 4.4.

4 A Multimodal Foundation Model for GeoAI

Section 3 explores the effectiveness of applying existing FMs on different tasks from various geospatial domains. We can see that many LLMs can outperform fully supervised task-specific ML/DL models and achieve surprisingly good performances on several geospatial tasks, such as toponym recognition, location description recognition, and time series forecasting of dementia. However, on other geospatial tasks (i.e., the two tested Urban Geography tasks and one RS task), especially those that involve multiple data modalities (e.g., point data, street-view images, and RS images), existing FMs still underperform task-specific models. In fact, one unique characteristic of many geospatial tasks is that they involve many data modalities such as text data, knowledge graphs, RS images, street-view images, trajectories, and other geospatial vector data. This will put a significant challenge on GeoAI FM development. Thus, in this section, we discuss the challenges unique to each data modality, then propose a potential framework for future GeoAI that leverages a multimodal FM.

4.1 Geo-Text Data

Despite the promising results in Table 1, LLMs still struggle with more complex geospatial semantics tasks such as toponym resolution/geoparsing [7, 45, 105, 182] and geographic question answering (GeoQA) [22, 29, 48, 93, 116, 120, 121, 125, 131, 148, 168], since LLMs are unable to perform (implicit) spatial reasoning in a way that is grounded in the real world. As a concrete example, we illustrate the shortcomings of GPT-3 on a geoparsing task. Using two examples from the Ju2016 dataset, we ask GPT-3 to both (1) recognize toponyms and (2) predict their geo-coordinates. The prompt is shown in Listing 6 whereas the geoparsing results are visualized in Figure 8. We see that, in both cases, GPT-3 can correctly recognize the toponyms but the predicted coordinates are 500+ miles away from the ground truth. Moreover, we notice that with a small spatial displacement of the generated geo-coordinates, GPT-3’s log probability for this new pair of coordinates decreases significantly. In other words, the probability of coordinates generated by the LLM does not follow Tobler’s First Law of Geography [176]. GPT-3 also generates invalid latitudinal/longitudinal coordinates, indicating that existing LLMs are still far from gracefully handling complex numerical and spatial reasoning tasks.

Fig. 8.

Listing 6.

Figure 9 provides another example of unsatisfactory results of LLMs in answering geographic questions related to spatial relations. In this example, Monore in the ChatGPT-generated answer is not in the north of Athens, Georgia but rather in the southwest of Athens. This example indicates that LLMs do not fully understand the semantics of spatial relation. The reason for this error could be that ChatGPT generates answers to this spatial relation question based on searching through its internal memory of text-based knowledge rather than performing spatial reasoning. One potential solution to this problem could be the use of geospatial knowledge graphs [20, 224], which can guide the LLMs to perform explicit spatial relation computations. We will discuss this further in the next section.

Fig. 9.

4.2 Geospatial Knowledge Graph

Despite the superior end-to-end prediction and generation capability, LLMs may produce content that lacks sufficient coverage of factual knowledge or even contains non-factual information. To address this problem, knowledge graphs (KGs) can serve as effective sources of information that complement LLMs. KGs are factual in nature because the information is usually extracted from reliable sources, with post-processing conducted by human editors to further ensure that incorrect content is removed. As an important type of domain KGs, geospatial knowledge graphs (GeoKGs) such as GeoNames [2], LinkedGeoData [10], YAGO2 [56], GNIS-LD [162], KnowWhereGraph [70], and EVKG [149] are usually generated from authoritative data sources and spatial databases. For example, GNIS-LD was constructed based on the United States Geological Survey’s Geographic Names Information System (GNIS). This ensures the authenticity of these geospatial data.

Developing multimodal FMs for GeoAI that jointly considers text data and GeoKGs can lead to several advantages. First, from the model perspective, (geospatial) KGs could be integrated into pre-training or fine-tuning LLMs through strategies such as retrieving embeddings of knowledge entities for contextual representation learning [147], fusing knowledge entities and text information [52, 214], and designing learning objectives that focus on reconstructing knowledge entities [217] and triples [171, 200]. Second, from the data perspective, GeoKGs could provide contextualized semantic and spatiotemporal knowledge to facilitate prompt engineering or data generation, such as enriching prompts with contextual information from KGs [15, 190] and converting KG triples into natural text corpora for specific domains [1]. Third, from the application perspective, it is possible to convert facts in GeoKGs into natural language to enhance text generation [203] to be used in scenarios such as (geographic) question answering [39, 123] and dialogue systems [189]. Last, from a reasoning perspective, GeoKGs usually provide spatial footprints of geographic entities that enable LLMs to perform explicit spatial reasoning as Neural Symbolic Machines did [106]. This can help avoid the errors we see in Figure 9.

4.3 Street-View Image

Section 3.3.1 has demonstrated the effectiveness of existing VLFMs on a street view–based geospatial task. However, the performance gaps between the task-specific models and VLFMs shown in Table 6 inform us that there are some unique characteristics of urban perception tasks we need to consider if we want to develop an FM for GeoAI.

Although street-view images are like the natural images used in common vision-language tasks, one major difference is that common vision-language tasks usually focus on factual knowledge in images (e.g., “how many cars in this image”) whereas urban perception tasks are usually related to high-level human perception of the images, such as the safety, poverty, beauty, and sound intensity of a neighborhood given a street-view image [207, 208]. Compared with factual knowledge, this kind of high-level perception knowledge is rather hard to estimate and the labels are rather rare. Moreover, many perception concepts are vague and subjective, which increases the difficulties of those tasks. Thus, in order to develop a GeoAI FM that can achieve state-of-the-art performances on various urban perception tasks, we need to conduct some domain studies to provide a concrete definition of each urban perception concept and develop some annotated datasets for GeoAI FM pre-training.

4.4 Remote Sensing

With the advancement of computer vision technology, deep vision models have been successfully applied to different kinds of RS tasks, including image classification/regression [12, 130, 163], land cover classification [12, 28, 86], semantic segmentation [210], and object detection [95]. Unlike the usual vision tasks, which usually work on RGB images, RS tasks are based on multispectral/hyperspectral images from different sensors. Most existing RS works focus on training one model for a specific RS task using data from a specific sensor [95]. Researchers often compare performances of different models using the same training datasets and decide on model implementation based on accuracy statistics. However, we see the trend of FMs in the computer vision (CV) field such as CLIP [150], Flamingo-9B [6] to be further developed to meet the unique challenges of RS tasks. RS experiments in Section 3.4 demonstrate that there is still a performance gap between current visual-language FMs and task-specific deep models. To fill this gap and develop a GeoAI FM that can achieve state-of-the-art performances on various RS tasks, we need to consider the uniqueness of RS images and tasks.

Aside from being task agnostic, the desiderata for an RS FM include being (1) sensor agnostic: it can seamlessly reason among RS images from different sensors with different spatial or spectral resolutions [128]; (2) spatiotemporally aware: it can handle the spatiotemporal metadata of RS images and perform geospatial reasoning for tasks such as image geolocalization and object tracking; and (3) environmentally invariant: it can decompose and isolate the spectral characteristics of the objects of interest across a variety of background environmental conditions and landscape structure. Recent developments here include geography-aware RS models [12] or self-supervised/unsupervised RS models [12, 127, 163], all of which are task agnostic. However, we have yet to develop an FM for RS tasks that can satisfy all such properties.

In summary, efforts should be focused on developing GeoAI FMs using RS to address pressing environmental challenges due to climate change. It would require complex models that look beyond image classification toward modeling ecosystem functions such as forest structure, carbon sequestration, urban heat, coastal flooding, and wetland health. Traditionally, RS is widely used to study these phenomena but in a site-specific and sensor-specific manner. Sensor-agnostic, spatiotemporally aware, and environmentally invariant FMs have the potential to transform our understanding of the trends and behavior of these complex environmental phenomena.

4.5 Trajectory and Human Mobility

Trajectory, which is a sequence of time-ordered location tuples, is another important data type in GeoAI. The proliferation of digital trajectory data generated from various sensors (e.g., smartphones, wearable devices, and vehicle on-board devices) together with the advancement of deep learning approaches has enabled novel GeoAI models for modeling human mobility patterns, which are crucial for city management, transportation services, and more. There are four typical tasks in modeling human dynamics with deep learning [113], including trajectory generation [26, 155, 158], origin–destination (OD) flow generation [114, 173, 198], in/out population flow prediction [74, 103], and next-location/place prediction [108, 156].

In order to develop GeoAI FMs for supporting human mobility analysis, we need to consider the following perspectives: (1) pre-training and generation of task-agnostic trajectory embedding [136, 184], which represent high-level movement semantics (e.g., spatiotemporal awareness, routes, and location sequence) from various kinds of trajectories [108]; (2) context-aware contrastive learning of trajectory: human movements are constrained from their job type, surrounding built environment, and transportation infrastructure as well as many other spatiotemporal and environmental factors [113, 172, 185]; GeoAI FMs should be able to link trajectories to various contextual representations such as road networks (e.g., Road2Vec [109], [24]), POI composition or land use types [209], urban morphology [23], and population distribution [67]; (3) user geoprivacy [85] should be protected when training such GeoAI FMs since trajectory data can reveal individuals’ sensitive locations, such as home and personal trips. The privacy-preserving techniques by utilizing cryptography or differential privacy [5] and federated learning framework may be incorporated in the GeoAI FMs training process for trajectories [156].

4.6 Geospatial Vector Data

Another critical challenge in developing FMs for GeoAI is the complexity of geospatial vector data, which are commonly used in almost all geographic information system (GIS) and mapping platforms. Examples include the US state-level and county-level dementia data (polygon data) discussed in Section 3.2, urban POI data (point and polygon data) introduced in Section 3.3.1, cartographic polyline data [202], building footprints data [196], spatial footprints of geographic entities in a geographic knowledge graph [126], road networks (composed by points and polylines), and many others. In contrast with natural language processing (NLP) and CV, in which text (one-dimensional (1-D)) or images (two-dimensional (2-D)) are well structured and more suitable to common neural network architectures, vector data exhibits more complex data structures in the form of points, polylines, polygons, and networks [122]. Thus, it is particularly challenging to develop an FM that can seamlessly encode or decode different kinds of vector data.

Noticeably, recently developed location encoding [122, 124, 130], polyline encoding [155, 202], polygon encoding [126], and spatial scene encoding [47] techniques can be seen as a fundamental building block for such a model [129]. Moreover, since encoding (e.g., geo-aware image classification[124]) or decoding (e.g., geoparsing [182]) geospatial vector data, or conducting spatial reasoning (e.g., GeoQA [125]) is an indispensable component for most GeoAI tasks, developing FMs for vector data is the key step towards a multimodal FM for GeoAI. This point also differentiates GeoAI FMs from existing FMs in other domains.

4.7 A Multimodal FM for GeoAI

Except for those data modalities, there are also other datasets frequently studied in GeoAI, such as geo-tagged videos, spatial social networks, and sensor networks. Given all these diverse data modalities, the question is how to develop a multimodal FM for GeoAI that best integrates all of them.

When we take a look at the existing multimodal FMs such as CLIP [150], DALL \(\cdot\) E2 [154], MDETR [78], VATT [3], BLIP [102], DeepMind Flamingo [6], and KOSMOS-1 [64], we can see the following general architecture: (1) starting with separate embedding modules to encode different modalities of data (e.g., a Transformer for texts and ViT for images [150]); (2) (optionally) mixing the representations of different modalities by concatenation; (3) (optionally) more Transformer layers for across modality reasoning, which can achieve a certain degree of alignment based on semantics, e.g., the word “hospital” attached to a picture of a hospital; and (4) generative or discriminative prediction modules for different modalities to achieve self-supervised training.

One weak point of these architectures is the lack of integration with geospatial vector data, which is the backbone of spatial reasoning and helps alignment among multimodalities in GeoAI. This is considered central and critical for GeoAI tasks. Therefore, we propose to replace step 2 with aligning the representations of different modalities (e.g., geo-tagged texts and RS images) by augmenting their representations with location encoding [124, 130] before mixing them as Mai et al. did [127]. Figure 10 illustrates this idea. Geo-tagged text data, street-view images, RS images, trajectories, and GeoKGs can be easily aligned via their geographic footprints (vector data). The key advantages of such a model are to enable spatial reasoning and knowledge transfer across modalities.

Fig. 10.

5 Risks and Challenges

Despite the recent progress, several challenges are emerging as more advanced FMs have been released [219]. First, as FMs continue to increase in size, there is a need to improve the computational efficiency for training and fine-tuning these models. Second, as an increasing number of LLMs are not open sourced, it becomes challenging to incorporate knowledge into these models without accessing their internal parameters. Third, as LLMs are increasingly deployed in remote third-party settings, protecting user privacy becomes increasingly important [157]. Beyond these challenges for FMs in general, there are also many unique challenges and risks during the process of GeoAI FM development.

5.1 Geographic Hallucination

Many LLMs have faced criticism for their tendency to produce “hallucinations”, generating content that is nonsensical, inaccurate given the context, or untruthful according to world knowledge [64, 142, 159, 178]. Therefore, recent works have reported truthfulness evaluations with publicly available benchmarks such as TruthfulQA [107] prior to their launch of FMs. For example, ChatGPT and GPT-4 have undergone OpenAI internal adversarially designed factuality evaluations [142]. Similarly, in a geographic context, generating geographic faithful results is particularly important for almost all GeoAI tasks. In addition to Figure 9 in Section 4.1, Figure 11 illustrates two geographically inaccurate results generated from ChatGPT and Stable Diffusion. In Figure 11(a), the expected answer should be “Washington, North Carolina”.¹² However, ChatGPT indicates that there is no Washington in North Carolina. Moreover, the largest city in Washington State should be Seattle and there is no city in this state named Washington.¹³ Figure 11(b) visualizes 4 generated RS images generated by Stable Diffusion.¹⁴ Although those images appear similar to satellite images, it is rather easy to tell that they are fake RS images since the layouts of geographic features in these images are clearly not from any city in the world. In fact, generating faithful RS images is a popular and important RS task [49, 53] in which geometric accuracy is very important for the downstream tasks.

Fig. 11.

The first step to addressing such a problem is to develop geographic truthfulness evaluation datasets for various LLMs based on their generated results formats. For example, we can construct an adversarially designed geographic question-answering dataset to evaluate the geographic truthfulness of various LLMs. In the case of image editing and generation models such as Stable Diffusion, a collection of prompt-geospatial image pairs could be gathered to evaluate the geographic accuracy of the generated content.

5.2 Geographic Bias

It is well known that FMs have the potential to amplify existing societal inequalities and biases present in the data [14, 178, 213]. A key consideration for GeoAI in particular is geographic bias [38, 110, 132, 133], which is often overlooked by AI research. For example, Liu et al. [110] showed that all current geoparsers are highly geographically biased towards data-rich regions. The same issue can be observed in current LLMs. Faisal and Anastasopoulos [38] investigated the geographic and geopolitical bias presented in pre-training language models (PLMs). They show that the knowledge learned by PLMs is unequally shared across languages and countries and many PLMs exhibit so-called geopolitical favoritism, which is defined as an over-amplification of certain countries’ knowledge in the learned representations (e.g., countries with higher GDP, geopolitical stability, military strength, etc.). Figure 12 shows two examples in which both ChatGPT and GPT-4 generate inaccurate results due to the geographic bias inherited in these models. Compared with “San Jose, California, USA”, “San Jose, Batangas, Philippines”¹⁵ is a less popular place name in many text corpora. Similarly, compared with “Washington State, USA” and “Washington, D.C., USA”, “Washington, New York”¹⁶ is also a less popular place name. That is why both ChatGPT and GPT-4 interpret those place names incorrectly. Compared with task-specific models, FMs suffer more from geographic bias since (1) the training data is collected in large scale, which is likely to be dominated by overrepresented communities or regions; (2) the huge number of learnable parameters and complex model structures make model interpretation and debiasing much more difficult; and (3) the geographic bias of the FMs can be easily inherited by all adapted models downstream [14] and, thus, bring much more harm to the society. This indicates a pressing need for designing proper (geographic) debiasing frameworks.

Fig. 12.

To solve the geographic bias problem, the key is to understand the causes of geographic bias and design bespoke solutions. Liu et al. [110] classified geographic bias into four categories: (1) representation bias: whether the distribution of training/testing data is geographically biased; (2) aggregation bias: whether the discretization of the space can lead to different prediction results, thus, different conclusions;¹⁷ (3) algorithmic bias: whether the used model will amplify or bring additional geographic bias; and (4) evaluation bias: whether the evaluation metric can reflect fairness across geographic space.

Representation bias concerning geography is widely acknowledged. Numerous commonly used labeled geospatial datasets exhibit geographic data imbalance, including the fine-grained species recognition datasets (e.g., BirdSnap [13], iNatlist 2018 [27, 115, 130], iNatlist 2021 [197], etc.), satellite image classification and object segmentation datasets (e.g., BigEarthNet [174], SpaceNet [179], xView [95], Agriculture-Vision [25], etc.), and geoparsing datasets (e.g., WikTOK [46], GeoCorporal [181], etc.). In addition, many general-purpose corpora such as Wikipedia and the DBpedia KG have also been found to be geographically biased [71]. To solve this issue, except for collecting more data in the data-sparse area, we can also leverage the massive amount of unlabelled geospatial datasets (which are usually less geographically imbalanced) to perform geographic self-supervised pre-training [127] to make FMs become more robust to the geographic bias in the labeled training datasets.

Aggregation bias is mainly caused by the common practice of performing spatial partition/discretization before AI model training [175, 201, 221]. One possible way to avoid this is to treat the geographic space as a continuous space and learn a location-aware neural network as [89, 120, 124, 222] did.

One example of the algorithm bias is the utilization of population bias for geoparsing [94] – the model tends to favor ranking places with larger populations more prominently. This heuristic might negatively impact the model performance on geoparsing datasets containing many less-used place names, such as Ju2016 [76]. Since FMs are expected to provide a generalized solution for various tasks and datasets, adding such algorithm bias may benefit some tasks but hurt others. This reminds us to systematically check for possible algorithm bias during FM design and training.

Evaluation bias is a crucial concern often overlooked in the assessment process. Many geospatial datasets (e.g., iNatlist 2018) have much less testing data on underdeveloped regions. Consequently, even if the model’s performance is subpar in these regions, it may not substantially affect the overall evaluation of the model’s performance on such a dataset. A comprehensive framework is needed to solve such bias, which includes a set of geographic bias metrics and evaluation datasets that can be used to quantify such bias. In fact, many language FMs undergo bias evaluation in terms of gender, religion, race/color, sexual orientation, age, profession, and socioeconomic status prior to their release [142, 178, 213]. Many bias evaluation datasets are constructed for this purpose, such as CrowS-Pairs [138], WinoGender [166], and StereoSet [137]. However, as far as we know, there is no such work on quantifying geographic bias in FMs. This will be an exciting future research direction.

5.3 Temporal Bias

Similar to geographic bias, FMs also suffer from temporal bias, which also can be attributed to four causes: temporal representation bias, temporal aggregation bias, algorithm bias, and evaluation bias. Among them, temporal representation bias is understood to be the main driver of temporal bias since there is much more training data available for current geographic entities than for historical ones. Temporal bias can also lead to inaccurate results. Two examples are shown in Figure 13. In both cases, the names of historical places are used for other places nearby. GPT-4 fails to answer both questions due to its heavy reliance on pre-training data biased towards current geographic knowledge. Temporal bias and geographic bias are critical challenges that need to be solved for the development of GeoAI FMs.

Fig. 13.

One concrete step is to develop an evaluation framework and a dataset to quantify the temporal bias presented in various FMs. In addressing the issue of temporal bias, one potential solution entails the development of a temporal debiasing framework. Nevertheless, it’s worth noting that such a debiasing framework may have adverse effects on model performance for tasks requiring the most up-to-date information. Consequently, an alternative solution to consider is the formulation of a model fine-tuning strategy tailored to downstream tasks that involve historical events.

5.4 Low Refreshment Rate

Another temporal-related challenge is the slow refresh rate of FMs. The significant efforts, resources, and costs required to train large-scale FMs make it impractical to update them frequently. For example, ChatGPT was trained on data up to September 2021. Consequently, it cannot provide answers to questions about recent events, which is crucial in many domains, such as communication, journalism, medicine, and even AI, given the rapid pace of technological advancements, for example, chatbot applications (e.g., ChatGPT) without using external knowledge (e.g., search engines). The freshness problem can be significantly reduced when geospatial FMs are used in combination with external knowledge (e.g., maps [104], search engines [31, 41], or KGs) so that FMs can focus more on spatial understanding and reasoning capabilities, which need less updating over time. Nevertheless, we believe that there is a pressing need for a sustainable FM ecosystem [170] capable of achieving efficient model training and cost-effective updates in line with the latest information. We believe this will be the next major focus in FM research.

5.5 Spatial Scale

Geographic information can be represented in different spatial scales, which means that the same geographic phenomenon/object can have completely different spatial representations (points vs. polygons) across GeoAI tasks. For example, an urban traffic forecasting model must represent San Francisco (SF) as a complex polygon, whereas a geoparser usually represents SF as a single point. Since FMs are developed for a diverse set of downstream tasks, they need to be able to handle geospatial information with different spatial scales and infer the right spatial scale to use given a downstream task. Developing such a module is a critical component for an effective GeoAI FM.

One possible way to make geospatial FMs spatial-scale aware is leveraging the instruction tuning stage to teach the FMs which spatial representations and spatial operations are available for different spatial scales and showcase which spatial scales should be selected for a given geospatial task.

5.6 Generalizability versus Spatial Heterogeneity

Spatial heterogeneity refers to the phenomenon that the expectation of a random variable (or a confounding of the process of discovery) varies across the Earth’s surface [43, 100] whereas geographic generalizability refers to the ability of a GeoAI model to replicate or generalize the model’s prediction ability across space. An open problem for GeoAI is how to achieve model generalizability (“replicability” [43]) across space while still allowing the model to capture spatial heterogeneity. Given geospatial data with different spatial scales, we desire an FM that can learn general spatial trends while still memorizing location-specific details. Will this generalizability introduce unavoidable intrinsic model bias in downstream GeoAI tasks? Will this memorized localized information lead to an overly complicated prediction surface for a global prediction problem? With large-scale training data, this problem can be amplified and requires care.

Many spatial statistic models have been developed to capture the spatial heterogeneity while still being able to learn the general trends, such as geographic weighted regression [17] and multiscale geographic weighted regression [42]. However, as far as we know, all current FMs cannot capture spatial heterogeneity, thus leading to poor geographic generalizability. One possible solution is to take spatial heterogeneity into account during model pre-training and/or fine-tuning. Possible methods are a spatial heterogeneity–aware deep learning framework [193], which automatically learns the spatial partitions and trains different deep neural networks in different partitions. Another way to increase geographic generalizability is to conduct zero-shot or few-shot learning on geographic regions with lower model performance [100]. Another promising direction is adding location encoding [122, 124, 127, 130] as part of the foundation model input, which can help the model adapt to different locations in a data-efficient way. How to develop a geographically generalizable (or so-called spatial replicable [43]) deep neural net, e.g., language foundation models, is a promising research direction.

6 Conclusion

In this article, we explore the promises and challenges for developing multimodal FMs for GeoAI. The potential of FMs is demonstrated by comparing the performance of existing LLMs and visual-language FMs as zero-shot or few-shot learners with fully supervised task-specific SOTA models on seven tasks across multiple geospatial subdomains, such as Geospatial Semantics, Health Geography, Urban Geography, and RS. While in some language-only geospatial tasks, LLMs, as zero-shot or few-shot learners, can outperform task-specific fully supervised models, existing FMs still underperform the task-specific fully supervised models on other geospatial tasks, especially tasks involving multiple data modalities (e.g., POI-based urban function classification, street-view image-based urban noise intensity classification, and RS image scene classification). We realize that the major challenge for developing an FM for GeoAI is the multimodality nature of geospatial tasks. After discussing the unique challenges of each geospatial data modality, we propose our vision for a novel multimodal FM for GeoAI that should be pre-trained based on the alignment among different data modalities via their geospatial relations. We conclude this work by discussing some unique challenges and risks for such a model.

At this very exciting moment of FM development, there are numerous interesting future research directions for spatial data scientists and GeoAI researchers. An intriguing and distinctive avenue for geo-foundation models involves incorporating geospatial vector data, such as points, polylines, and polygons, as an additional data modality. Given that location serves as the linchpin for aligning diverse geospatial data modalities, this approach will establish the groundwork for the creation of multimodal foundation models for GeoAI, as discussed in Section 4.7. Another research avenue involves investigating methods to incorporate spatial heterogeneity into geo-foundation model frameworks, with the aim of enhancing the resulting model’s geographic generalizability across the globe. Moreover, another interesting question to ask is which role the classic machine learning models (e.g., random forest) can play in FM research. Classic machine learning methods such as random forest are powerful and commonly used approaches to leverage expert-designed features and capture highly nonlinear responses to these features. However, their structures are not very suitable to be used as an FM backbone. This is because, unlike neural networks, there are no clearly defined intermediate representation layers in random forests, which is usually needed for training FMs in an unsupervised or self-supervised fashion —training the model to predict part of unlabeled data from the rest of the data and then using the pre-trained intermediate representation for downstream tasks. Nevertheless, we believe that predictions made by FMs are well suited to be added to random forest models as extra features so that the benefits of random forest and FMs can be seamlessly combined in future GeoAI development.

Acknowledgments

Gengchen Mai would like to acknowledge the support from the UGA Presidential Interdisciplinary Seed Grant – “A Multimodal Foundation Model for Various Geospatial, Environmental, and Agricultural tasks”. Weiming Huang acknowledges the financial support from the Knut and Alice Wallenberg Foundation. Song Gao acknowledges the support by the National Science Foundation funded AI institute (Award No. 2112606) for Intelligent Cyberinfrastructure with Computational Learning in the Environment (ICICLE) and the H.I. Romnes Faculty Fellowship provided by the University of Wisconsin-Madison Office of the Vice Chancellor for Research and Graduate Education with funding from the Wisconsin Alumni Research Foundation. This research was also partially supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No. AISG2-TC-2021-001), and a Singapore MOE AcRF Tier-2 grant (No. MOE-T2EP20221-0015). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.

Footnotes

Many foundation models such as ChatGPT can only handle one data modality, such as text. Multimodal foundation models were developed to overcome this limitation that can handle multiple data modalities at the same time, such as text, image, video, audio, and more.

This work is a significant extension of our previous 4-page vision paper published in ACM SIGSPATIAL 2022 [118] by adding five additional tasks in Health Geography, Urban Geography, and Remote Sensing domains.

There is also a different variant that predicts masked spans in text [84, 153].

⁴

https://wonder.cdc.gov/ucd-icd10.html

⁵

https://github.com/ualsg/Visual-soundscapes

⁶

Note that the GPT-4 API still does not support visual question answering at the time we submit this paper.

⁷

https://laion.ai/blog/laion-5b/

⁸

https://huggingface.co/laion/CLIP-ViT-L-14-laion2B-s32B-b82K

⁹

https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k

¹⁰

https://huggingface.co/Salesforce/blip2-flan-t5-xl

¹¹

https://huggingface.co/openflamingo/OpenFlamingo-9B

¹²

https://en.wikipedia.org/wiki/Washington,_North_Carolina

¹³

Note that the generated answers to this question may vary at different times and different model runs. Sometimes, ChatGPT can answer this question correctly. However, we observe that FMs will generate geographic inaccurate results even with a simple question, as shown in Figure 11(a).

¹⁴

https://huggingface.co/spaces/stabilityai/stable-diffusion

¹⁵

https://en.wikipedia.org/wiki/San_Jose,_Batangas

¹⁶

https://en.wikipedia.org/wiki/Washington,_New_York

¹⁷

The well-known Modifiable Areal Unit Problem (MAUP) [30, 212] tells us that how we partition the space and the spatial granularity of the partition cells in model training and/or evaluation will significantly affect the model prediction results which might lead to different conclusions. This is further validated by Kulkarni et al. [94].

A Appendix

A.1 The Full Prompts Used in Various Experiment

Listing 7.

Listing 8.

Listing 9.

References

[1]

Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge graph based synthetic corpus generation for knowledge-enhanced language model pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 3554–3565.

Abstract

1 Introduction

2 Related Work

2.1 Language Foundation Model

2.2 Vision Foundation Model

2.3 Multimodal Foundation Model

3 Exploration of the Effectiveness of Existing FMs on Various Geospatial Domains

3.1 Geospatial Semantics

3.1.1 Toponym Recognition.

3.1.2 Location Description Recognition.

3.2 Health Geography

3.2.1 US State-Level Dementia Time Series Forecasting.

3.2.2 US County-Level Dementia Time Series Forecasting.

3.3 Urban Geography

3.3.1 POI-Based Urban Function Classification.

3.3.2 Street-View Image-Based Urban Noise Intensity Classification.

3.4 Remote Sensing

4 A Multimodal Foundation Model for GeoAI

4.1 Geo-Text Data

4.2 Geospatial Knowledge Graph

4.3 Street-View Image

4.4 Remote Sensing

4.5 Trajectory and Human Mobility

4.6 Geospatial Vector Data

4.7 A Multimodal FM for GeoAI

5 Risks and Challenges

5.1 Geographic Hallucination

5.2 Geographic Bias

5.3 Temporal Bias

5.4 Low Refreshment Rate

5.5 Spatial Scale

5.6 Generalizability versus Spatial Heterogeneity

6 Conclusion

Acknowledgments

Footnotes

A Appendix

A.1 The Full Prompts Used in Various Experiment

References

Cited By

Index Terms

Recommendations

Geo-Foundation Models: Reality, Gaps and Opportunities

Towards a foundation model for geospatial artificial intelligence (vision paper)

Foundation models in smart agriculture: Basics, opportunities, and challenges

Comments

Information

Published In

Publisher

Publication History

Check for updates

Author Tags

Qualifiers

Funding Sources

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

View options

PDF

eReader

Get Access

Login options

Full Access

Figures

Other

Share

Share this Publication link

Share on social media

Affiliations