
Zhenlong Li, Huan Ning | May 01, 2025

What is a Large Spatial Model?

LLMs are large models designed to process and generate human language, providing intelligent responses through generated text. Sequential communication media, such as meaningful text and, in the case of multimodal LLMs, images and videos, are critical to their success. In LLM research, human knowledge and patterns of reasoning are embedded within language, which is represented as sequences of tokens (the minimal units of language, such as words or subwords). The meaning of each token depends on its position relative to other tokens within the sequence, and the goal of an LLM is to generate tokens in a coherent and contextually appropriate order. Researchers are also pursuing LLM implementations built on other basic units, such as bytes (Pagnoni et al., 2024) or concepts (L. C. M. et al., 2024), rather than tokens. Overall, LLMs predict the most likely sequence of basic units based on patterns learned from their training data, producing relevant content in response to input prompts.
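As a toy illustration of the principle described above, the sketch below learns token-to-token statistics from a tiny corpus and then greedily generates the most likely continuation. Real LLMs use neural networks over subword vocabularies, not bigram counts; this is only a conceptual sketch with made-up text.

```python
# Toy sketch: learn which token tends to follow which, then generate
# the most likely continuation. Not a real LLM, just the core idea of
# predicting the next basic unit from patterns in training data.
from collections import Counter, defaultdict

corpus = "near things are more related than distant things".split()

# Count bigrams: how often does each token follow another?
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def generate(token, steps=3):
    """Greedily emit the most frequent successor at each step."""
    out = [token]
    for _ in range(steps):
        if token not in follows:
            break
        token = follows[token].most_common(1)[0][0]
        out.append(token)
    return " ".join(out)

print(generate("near"))  # -> "near things are more"
```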

What can GIScientists learn from the success of LLMs? Can we build a large model to predict geographic event sequences? The first law of geography, “everything is related to everything else, but near things are more related than distant things” (Tobler, 1970), suggests that a geographic entity is largely defined by its relations with nearby entities, much like a token in an LLM. Given the continuous nature of space and time (Goodchild, 1992), spatial dependence (Anselin, 1990), and the varying nature of geographic data, we assume there may be techniques to decompose geographic phenomena into basic units analogous to tokens, and then use massive numbers of these basic units to train a large model that generates meaningful unit sequences. The core idea is to convert geographic phenomena into simple, computable data structures and then process them via massive computing, i.e., using algorithms to simulate “natural computing” (Zenil, 2013).
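One possible "spatial tokenization" along these lines is sketched below: snapping continuous coordinates to a regular grid so each observation maps to a discrete cell ID, analogous to a token ID. Hierarchical indexes such as geohash or H3 would be more realistic choices; the 0.5-degree grid and the sample trajectory here are arbitrary assumptions for illustration.

```python
# A minimal sketch of decomposing continuous geographic space into
# discrete basic units: each (lon, lat) point maps to a grid-cell
# token. The 0.5-degree cell size is an illustrative assumption.
def spatial_token(lon, lat, cell_deg=0.5):
    """Map a (lon, lat) point to a discrete grid-cell token ID."""
    col = int((lon + 180) // cell_deg)
    row = int((lat + 90) // cell_deg)
    return f"cell_{row}_{col}"

# A GPS trajectory becomes a sequence of cell tokens that a sequence
# model could learn from, just as an LLM learns from word sequences.
track = [(-81.03, 34.00), (-81.02, 34.01), (-80.52, 34.30)]
tokens = [spatial_token(lon, lat) for lon, lat in track]
print(tokens)  # nearby points share a token; distant points differ
```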

To our knowledge, there is no clear answer to whether the success of large language models can be replicated in GIScience, a discipline that focuses not on text but on geospatial analysis and modeling. Unlike languages, the heterogeneity and complexity of geospatial data pose major challenges to transformation into tokens. Is the transformer architecture (Vaswani et al., 2023), or another architecture, suitable for geospatial data? How should the basic spatial units be defined? Is there a universal representation for multimodal geospatial data? Geospatial data has characteristics that differ from the text and vision data used for AGI training. For example, their scales, i.e., resolution and extent, vary among datasets and locations, and missing data are common in both space and time. Geospatial scientists have made various attempts to train large models on specific types of geospatial data. For example, remote sensing foundation models have been developed for downstream imagery applications such as object detection, semantic segmentation, and change detection (D. Hong et al., 2024; Jakubik et al., 2023; S. Lu et al., 2024; Szwarcman et al., 2024; Zheng et al., 2024). Detailed surveys on geography and Earth-related foundation models demonstrate the efforts in this direction (Hsu et al., 2024; W. Li et al., 2024; Mai et al., 2024; Xie et al., 2023; H. Zhang et al., 2024; W. Zhang et al., 2024).

In this paper, we advocate for the potential of a more general Large Spatial Model (LSM) trained on massive multimodal geospatial data (e.g., raster, vector, network, and attribute data), in the same way that LLMs are trained on extensive text corpora. Such an LSM would have billions of parameters and could accept and generate multimodal geospatial data, such as raster, vector, and text, incorporating general data features for various downstream geospatial applications. Simply put, an LSM is a multi-, hyper-, or ultra-modal LLM for geospatial science. Researchers suggest that large-scale interactions foster emergence, as exemplified by LLMs (Schaeffer et al., 2023). Similarly, we may expect emergence in a large spatial model by synthesizing large-scale spatial interactions during training. This could enable the model to develop a deeper understanding of geographic space, enhance spatial awareness, improve explanations of Earth system dynamics, and better model complex human-environment interactions.

Geospatial embedding

Geospatial embedding is the foundational step and one of the primary challenges in developing large spatial models. For text-based LLMs, the basic unit of input and output is a token, representing a word or subword. LLMs usually use dictionaries of tens or hundreds of thousands of tokens (DeepSeek-AI et al., 2024; J. Yang et al., 2024), and vision LLMs use pixel patches for embedding. The models learn the relationships among these basic units during training, as briefly discussed in section 5.4.1. If there are basic spatial units, how many units are enough to represent geospatial dynamics? Agarwal et al. (2024) demonstrated a graph-based embedding built on human dynamics for a population dynamics foundation model; Internet search trends, maps, POI businesses, weather, and air quality are its major embedding inputs. Using a similar method, Fan et al. (2024) utilized cellphone-based human mobility data to construct visitation connections between regions and then added these connections to the region embedding. These attempts based on graph convolutional networks seem a reasonable approach to geospatial embedding.
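A bare-bones sketch of this graph-based idea is shown below: regions are nodes, visitation flows define edges, and one graph-convolution step mixes each region's features with those of connected regions. The three regions, the visitation counts, and the feature values are made-up numbers; the cited systems use far richer inputs and learned weights.

```python
# Sketch of one propagation step of a graph convolutional network
# over a region-visitation graph. All numbers are illustrative.
import numpy as np

# Adjacency from hypothetical cellphone visitation counts between 3 regions.
A = np.array([[0.0, 120.0, 10.0],
              [120.0, 0.0, 80.0],
              [10.0, 80.0, 0.0]])

# Initial per-region features, e.g. [POI density, air quality index],
# both scaled to [0, 1] for this example.
H = np.array([[0.9, 0.2],
              [0.5, 0.6],
              [0.1, 0.8]])

# Row-normalized aggregation with self-loops: each region's new
# embedding is a weighted average of itself and its neighbors,
# weighted by visitation strength.
A_hat = A + np.eye(3)                      # keep each region's own signal
D_inv = np.diag(1.0 / A_hat.sum(axis=1))   # inverse degree matrix
H_new = D_inv @ A_hat @ H                  # neighborhood-averaged embedding
print(np.round(H_new, 3))
```

Stacking several such steps (with learned weight matrices and nonlinearities in between) lets information from more distant, strongly connected regions flow into each embedding.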

Location matters in GIScience, not only because of coordinates but also because of local context and interactions with neighbors. However, such an embedding is a static snapshot or a temporal average over a period, while many geographic phenomena are spatiotemporal (e.g., climate, weather, and traffic). Snapshot modeling (e.g., with a graph neural network) does not capture sequential patterns and seems to diverge from the temporal nature of GIScience. Researchers therefore tend to incorporate temporal information into neural networks, such as temporal convolutional networks (Bi et al., 2022). It is also challenging to account for varying spatiotemporal scales in modeling. The transfer of information and energy in human society is entangled with Earth system processes, and this complexity imposes further difficulties on embedding and modeling. Szwarcman et al. (2024) made a valuable attempt to incorporate spatiotemporal metadata (latitude, longitude, and day) into the Prithvi-EO-2.0 image foundation model.
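One plausible way to inject such (latitude, longitude, day-of-year) metadata into an embedding is sketched below: sine/cosine pairs give nearby locations and dates similar vectors, and the day encoding wraps around at year boundaries. The specific frequencies and dimensions are illustrative assumptions, not the encoding used by any cited model.

```python
# Sketch of a sinusoidal spatiotemporal encoding. Nearby locations
# and dates map to nearby vectors; day-of-year is cyclic, so Dec 31
# and Jan 1 encode almost identically. Frequencies are assumptions.
import math

def spacetime_encoding(lat, lon, day_of_year, n_freq=2):
    """Return a small sinusoidal feature vector for (lat, lon, day)."""
    feats = []
    # Map each value to an angle; day is cyclic over a 365-day year.
    angles = [math.pi * lat / 90, math.pi * lon / 180,
              2 * math.pi * day_of_year / 365]
    for a in angles:
        for k in range(n_freq):
            feats.append(math.sin((2 ** k) * a))
            feats.append(math.cos((2 ** k) * a))
    return feats

# Dec 31 and Jan 1 are distant as day numbers but nearly identical here.
v_a = spacetime_encoding(34.0, -81.0, 365)
v_b = spacetime_encoding(34.0, -81.0, 1)
```

Such a vector could be concatenated with, or added to, the patch or cell embeddings before training, so the model sees where and when each observation comes from.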

Challenges to building large spatial models

For the GIScience community, the high cost of trial-and-error and the complexity of geographic phenomena hinder attempts to develop large spatial models. For example, even the relatively inexpensive DeepSeek-AI (2024) spent millions of graphics processing unit (GPU) hours to train an LLM without a vision modality, and Szwarcman et al. (2024) trained Prithvi v2 on 90 GPUs. The reported scaling laws (Kaplan et al., 2020) favor increasing model size over training data size and the amount of computing, since larger model sizes are relatively easy to achieve. Given limited computing resources, the GIScience community may instead need to take advantage of geospatial big data to decrease model size. The rise of remote sensing foundation models reflects such a preference (S. Lu et al., 2024). Geospatial data of other modalities with long-term archives, such as smartphone-based human mobility and transportation data (Q. Wang et al., 2024), can be explored in the same way as remote sensing data. Quick and diverse iterations would help the geospatial community investigate the possibilities of large spatial models.
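A back-of-the-envelope sketch makes the cost of trial-and-error concrete, using the common approximation C ≈ 6·N·D FLOPs for training a transformer with N parameters on D tokens (Kaplan et al., 2020). The sustained per-GPU throughput figure below is an illustrative assumption, and the model and data sizes are hypothetical.

```python
# Rough training-cost estimate via the C = 6 * N * D approximation.
# flops_per_gpu_s is an assumed sustained throughput (1e14 FLOP/s),
# not a measurement of any specific hardware.
def training_gpu_hours(n_params, n_tokens, flops_per_gpu_s=1e14):
    """Estimate GPU hours: 6*N*D total FLOPs / per-GPU throughput."""
    total_flops = 6 * n_params * n_tokens
    return total_flops / flops_per_gpu_s / 3600

# e.g. a hypothetical 7-billion-parameter model on 1 trillion tokens:
hours = training_gpu_hours(7e9, 1e12)
print(f"{hours:,.0f} GPU hours")  # on the order of 10^5 GPU hours
```

Even this modest configuration lands in the hundreds of thousands of GPU hours, which is why cheap, fast iterations matter so much for exploring large spatial models.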

Meanwhile, a series of data privacy and information security risks can arise at the pre-training and fine-tuning stages of building large spatial models, stemming from training with geospatial data, centralized serving and tooling, prompt-based interaction, and feedback mechanisms (Rao et al., 2023). Cross-domain collaborative efforts are needed to develop responsible, privacy-preserving, and secure large spatial models. We advocate that the GIScience community explore solutions to these challenges and seek out the possibilities of large spatial models.