DynST: Dynamic Sparse Training for Resource-Constrained Spatio-Temporal Forecasting

Hao Wu easyluwu@tencent.com Tencent Inc. , Haomin Wen wenhaomin@bjtu.edu.cn Beijing Jiaotong University , Guibin Zhang bin2003@tongji.edu.cn Tongji University , Yutong Xia yutong.x@outlook.com National University of Singapore , Kai Wang kai.wang@comp.nus.edu.sg National University of Singapore , Yuxuan Liang yuxliang@outlook.com Hong Kong University of Science and Technology (Guangzhou) , Yu Zheng msyuzheng@outlook.com JD iCity, JD Technology and Kun Wang wk520529@mail.ustc.edu.cn University of Science and Technology of China


Abstract.

The ever-growing deployment of sensor services, while opening a valuable path and providing a deluge of earth system data for deep-learning-oriented earth science, introduces a daunting obstacle to industrial-level deployment. Concretely, earth science systems rely heavily on the extensive deployment of sensors; however, data collection from sensors is constrained by complex geographical and social factors, making it challenging to achieve comprehensive coverage and uniform deployment. To alleviate this obstacle, traditional approaches to sensor deployment use specific algorithms to design and deploy sensors. These methods dynamically adjust the activation times of sensors to optimize the detection process across each sub-region. Regrettably, the activation strategy is generally formulated from historical observations and geographic characteristics, which makes the methods and the resultant models neither simple nor practical. Worse still, the complex technical design may ultimately lead to a model with weak generalizability. In this paper, we introduce for the first time the concept of dynamic sparse training for spatio-temporal data, and we commit to adaptively and dynamically filtering important sensor distributions. To our knowledge, this is the first proposal (termed DynST) of an industry-level deployment optimization concept at the data level. However, due to the existence of the temporal dimension, pruning spatio-temporal data may lead to conflicts at different timestamps. To resolve these conflicts, we employ dynamic merge technology along with dimensional mapping to mitigate the potential impact of the temporal aspect. During training, DynST uses iterative pruning and sparse training, repeatedly identifying and dynamically removing the sensor perception areas that contribute the least to future predictions.

DynST demonstrates tremendous capability on industrial-grade data from JD Technology TaxiBJ+ and practical deployment scenarios such as meteorology, combustion dynamics, and turbulence. It seamlessly integrates with relevant models and efficiently prunes image and graph-type data, leading to significantly higher inference speeds without introducing noticeable performance degradation.

Sparse Training, Spatio-temporal Data Mining, Deep Learning

1. Introduction

Deep learning has revolutionized spatio-temporal (ST) forecasting, demonstrating remarkable proficiency in distilling valuable insights from extensive ST datasets (e.g., human mobility (Wu et al., 2023b; Pan et al., 2019), precipitation (Zhang et al., 2023b; Bi et al., 2023), frame dynamics (Li et al., 2020; Wu et al., 2023a), and meteorology (Pathak et al., 2022; Wu et al., 2023c)). In recent years, the widespread deployment of sensors has ushered in an unprecedented influx of earth system data from across the globe and outer space. However, this expansion comes at a significant cost. Worse still, the prolonged operation of sensors leads to significant power loss and hardware wear. To illustrate, the National Science Foundation (NSF) in the United States allocated over one billion dollars in its 2021 fiscal year budget to support research in these areas at numerous universities nationwide (Rissler et al., 2020).

Traditional approaches to sensor running time optimization (Priyadarshi et al., 2020; Zou and Chakrabarty, 2003; Yarinezhad and Hashemi, 2023; Xu, 2020; Kundu and Das, 2023), e.g., virtual force and Voronoi diagrams, utilize specific algorithms to dynamically activate sensors. These methods dynamically adjust the activation times of sensors to optimize the detection process across each sub-region. Unfortunately, generating an effective activation strategy using only pre-existing historical observation data or urban geographic characteristics is very tricky, as it often involves complex technical design (Zhang et al., 2023a). Furthermore, with numerous factors influencing sensor deployment, relying solely on single variables (such as urban layout or geographic features) does not accurately capture the optimal deployment strategy (Yan and Li, 2023; Zheng et al., 2023).

With this in mind, our aim in this paper is to speed up inference by proposing a novel sensor deactivation strategy based on historical observations. A promising direction involves adopting deep-learning-oriented metrics to adaptively and dynamically evaluate or verify the benefits brought by each sensor deployment. The rapidly developing field of dynamic sparse training (termed DST) (Evci et al., 2020; Liu et al., 2021; Huang et al., 2023; Liu et al., 2020), though opening a potential path toward automating effective deployment, still faces a daunting obstacle on the way to spatio-temporal on-device deployment. Concretely, DST demonstrates the potential to train a sparse sub-network from scratch that matches the performance of a fully dense network. In the real world, the training of models and the optimization of sensors still weigh heavily on both academia and industry. Transferring the concept of DST to the spatio-temporal forecasting realm is intuitively beneficial, as it can significantly accelerate model training while optimizing deployment.

Regrettably, the application of DST to the challenge of spatio-temporal sensor deployment necessitates a meticulously aligned methodology. This is primarily because there exists a pronounced and inherent disparity between conventional DST frameworks and the nuances of spatio-temporal forecasting. Specifically:

➠

DST operates primarily at the network level. If we abstract each sub-region of the data as the monitoring range of a sensor, DST methods struggle to dynamically select the most important sensors (or sub-counterparts of the dataset), because the data is a prerequisite input and is non-trainable.
➠

The complexity of the above issue is further amplified in time-series data, where the spatial collection of information is dynamic. This dynamic nature poses a significant challenge in determining from historical data which elements will have a more substantial impact on future outcomes.

To bridge the gap between industry and academia, this paper introduces for the first time the concept of dynamic sparse training for spatio-temporal data, termed DynST. DynST dynamically trains to filter out the crucial parts of data for future predictions, and eliminates non-essential services to achieve resource-constrained service management. Concretely, DynST employs dynamic training to apply masking to historical regions, with the aim of aggressively reducing the proliferation of sensor deployment. This approach is taken at the algorithmic level to more effectively mask individual regions (each corresponding to a sensor device). Given the dynamic nature of time-series data, we utilize explicit channel stacking to construct overlapping saliency maps of historical regions. This facilitates the scoring of the importance of sensors in each region.

DynST is both simple and efficient, demonstrating powerful optimization capabilities across a variety of industrial scenarios. It effectively reduces historically insignificant observation areas (i.e., sub-regions) in both regular and inherently irregular data environments, without impacting the performance of future predictions.

Summary of Contributions. This paper makes multiple contributions to address the questions raised. Unlike the pruning of convolutional networks, which are typically heavily over-parameterized (Gao et al., 2022b; Tan et al., 2022; Wang et al., 2018a, 2019; Gao et al., 2022a; Bai et al., 2022), directly pruning a less parameterized spatio-temporal model offers limited scope for improvement. Our first technical innovation is the introduction of an end-to-end optimization framework called DynST, which uniquely prunes the sub-counterparts of data input for the first time. DynST does not rely on any specific spatio-temporal regular architecture or irregular graph structure (Scarselli et al., 2008; Wu et al., 2020), allowing it to be flexibly applied across a wide range of spatio-temporal learning scenarios at scale. To the best of our knowledge, this is the first work to employ dynamic sparse training techniques for the optimization of industrial-level devices.

Viewing DynST as an advanced form of pruning for spatio-temporal datasets, our second technical breakthrough introduces a novel research direction. This direction involves the utilization of deep-learning-guided sparse training techniques for the strategic optimization of sensor deployments. Our methodology is inherently adaptive and data-driven, focusing on identifying and preserving the most vital monitoring areas within historical data. This approach significantly diverges from traditional sensor deployment strategies (Priyadarshi et al., 2020; Zou and Chakrabarty, 2003; Yarinezhad and Hashemi, 2023; Xu, 2020; Kundu and Das, 2023), which often employ specific algorithmic designs for sensor placement, like virtual force techniques and Voronoi diagrams. In contrast, our approach offers substantial real-world relevance and industrial applicability, representing a major leap forward in the field.

Our proposal has been experimentally verified across various industrial-grade datasets and diverse backbones. The key observations from our study are outlined below:

•

DynST Maintains Performance in Sparse Data. DynST integrates into various models and handles sparser input data without significantly affecting performance. For example, in the GNN architecture, DynST integration slightly increases the MAE on the Turbulence dataset from $4.35\rightarrow 4.37$ . In the Transformer architecture, DynST reduces the MAE from $3.67\rightarrow 3.59$ on the JD traffic benchmark.
•

Significantly Improves Inference Efficiency. DynST enhances inference speed across different architectures. On the Turbulence dataset, the STGCN architecture speeds up by 72%, to 1.721 times the original speed. On the Fire dataset, the GNN architecture reaches 1.541 times the original speed (about 54% faster). On the JD Taxibj+ dataset, the Transformer architecture nearly doubles in speed, reaching 1.987 times the original speed (about 99% faster). These examples demonstrate DynST’s ability to improve computational efficiency, speeding up inference and handling large datasets efficiently.
•

Meets Industrial Standards. DynST effectively meets industrial requirements, introducing minimal performance loss at sparsity levels ranging from $30\%\sim 60\%$ . Moreover, due to its model-agnostic nature, DynST is compatible with almost all industry-available models without conflict, showcasing strong transferability and plug-and-play characteristics.

2. Related Work

Our research is highly relevant to the following research themes:

ST predictive learning can be categorized into three main types. (i) Convolutional Neural Network (CNN)-based architectures focus on spatial feature extraction (Gao et al., 2022b; Tan et al., 2022; Wu et al., 2023c; Shi et al., 2015). These architectures use convolutional layers to effectively detect patterns in image and video data. Key advancements include deep convolutional networks for complex feature extraction and 3D convolutions for spatio-temporal analysis in video processing (Wang et al., 2018b). (ii) Recurrent Neural Network (RNN)-based architectures optimize the handling of temporal data (Wang et al., 2017, 2018a, 2019), which is key for tasks like sequence prediction and time-dependent data analysis. (iii) Transformer-based architectures handle spatio-temporal data (Gao et al., 2022a; Bai et al., 2022; Wu et al., 2023b, c) by employing the self-attention mechanism to manage sequence data effectively. They capture long-range dependencies in both spatial and temporal dimensions, making them suitable for complex sequence modeling and analysis. Notably, some models leverage graph neural networks primarily for ST graph management (Ji et al., 2023; Shao et al., 2022; Li et al., 2017); we discuss these below.

Graph Neural Networks (GNNs) & Graph Pooling. GNNs have emerged as a prominent subfield in machine learning, specifically tailored to manage and analyze graph-structured data (Wang et al., 2022; Yu et al., 2020; Thekumparampil et al., 2018; You et al., 2019). In general, GNNs owe their efficacy to a distinct “message-passing” mechanism, which seamlessly integrates topological structures with node characteristics to yield richer graph representations. Leveraging the powerful topological awareness capabilities of GNNs, many studies have customized and adapted GNNs for predictions in spatio-temporal scenarios (Ji et al., 2023; Shao et al., 2022; Li et al., 2017). Our method of dynamically filtering sensors can be understood as a form of graph pooling in the graph domain (Chen et al., 2018; Eden et al., 2018; Chen et al., 2021; Gao and Ji, 2019; Ranjan et al., 2020; Zhang et al., 2021). The distinction lies in the fact that traditional graph pooling is static, whereas our approach represents the first instance of addressing this kind of problem in dynamic temporal graphs.

Sensor Deployment. In the field of sensor deployment, traditional methods (Priyadarshi et al., 2020; Zou and Chakrabarty, 2003; Yarinezhad and Hashemi, 2023; Xu, 2020; Kundu and Das, 2023) often employ specific algorithms, such as virtual force and Voronoi diagrams, for sensor design and deployment. These strategies involve dynamically adjusting sensor activation times to optimize detection across various sub-regions. However, developing an effective activation strategy based solely on historical observation data or urban geographic features presents significant challenges, primarily due to the intricate technical design requirements (Zhang et al., 2023a). Additionally, as highlighted in (Yan and Li, 2023; Zheng et al., 2023), focusing only on single variables like urban layout or geographic characteristics fails to fully address the complexities of optimal deployment strategies.

3. Motivation

Figure 1. Motivation of our proposal.

In this section, we carefully examine the significance of our approach and establish the motivation behind DynST. Our analysis begins with empirical observations. Specifically, we use the large-scale dataset EAGLE (Janny et al., 2023), designed for learning complex fluid mechanics, as an example. EAGLE is represented as a graph, where each sub-region can be interpreted as the sensory area of a sensor. We identify the important regions using the attention maps from that study and apply masking to the non-essential areas. In each iteration, we randomly mask 15% of the less important areas and predict the future state of the regions with a 7-layer graph convolutional network (Kipf and Welling, 2016).

Insights & Reflections. As illustrated in Figure 1, we observe that for this dataset, identifying and removing 15% of the least important patches does not affect the model’s performance, which remains consistent, with a Root Mean Square Error of about $0.09$ . At the same time, the masking results in a noticeable speedup in model inference. This finding inspires us to dynamically eliminate non-essential information. By removing these less important regions, we can better identify the parts crucial for future predictions and accelerate inference, which corresponds to sensor deactivation in real-world applications.

4. Preliminary

As our research involves both graph and image-type data, we systematically present relevant definitions here to facilitate the demonstration of our model.

4.1. Graph Notations

In this study, we focus on an attributed graph, represented as $\mathcal{G}={{(\mathcal{V},\mathcal{E})}}$ . Here, $\mathcal{V}$ and $\mathcal{E}$ correspond to the node and edge sets, respectively. The graph $\mathcal{G}$ has an associated feature matrix $\mathbf{X}\in\mathbb{R}^{N\times D}$ , where $N=|\mathcal{V}|$ indicates the total number of nodes, and $D$ represents the feature dimensionality of each node. For any node $v_{i}\in\mathcal{V}$ , its feature vector is a $D$ -dimensional entity $\mathbf{x}_{i}=\mathbf{X}[i,\cdot]$ . The adjacency matrix $\mathbf{A}\in\mathbb{R}^{N\times N}$ defines the inter-node connections, assigning $\mathbf{A}[i,j]=1$ when a pair of nodes $(v_{i},v_{j})$ is connected in $\mathcal{E}$ and $0$ otherwise. To effectively learn node representations within $\mathcal{G}$ , the majority of GNNs utilize a neighborhood aggregation and message passing paradigm.

(1)			$\displaystyle\mathbf{h}_{i}^{(l)}=\text{COMB}\left(\mathbf{h}_{i}^{(l-1)},\text{AGGR}\left\{\mathbf{h}_{j}^{(l-1)}:v_{j}\in\mathcal{N}(v_{i})\right\}\right),\;1\leq l\leq L$

$L$ represents the number of layers in the GNN. The initial feature vector $\mathbf{h}_{i}^{(0)}=\mathbf{x}_{i}$ corresponds to the features of node $v_{i}$ . For each layer $l$ in the GNN, where $1\leq l\leq L$ , the node embedding of $v_{i}$ is denoted by $\mathbf{h}_{i}^{(l)}$ . Two critical functions in this process are AGGR and COMB. The AGGR function is responsible for aggregating information from a node’s neighborhood, while the COMB function is used to combine the representations of the ego-node and its neighbors.
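As a rough illustration of Eq. 1, the sketch below implements a single message-passing layer in NumPy. The choice of a mean for AGGR, a linear map followed by ReLU for COMB, and the names `gnn_layer`, `W_self`, and `W_neigh` are assumptions made for this example; the text above does not fix any of them.

```python
import numpy as np

def gnn_layer(H, A, W_self, W_neigh):
    """One layer of Eq. 1 with mean AGGR and linear + ReLU COMB (illustrative choices)."""
    deg = A.sum(axis=1, keepdims=True)
    # AGGR: average the embeddings of each node's neighbors (zero vector if isolated).
    neigh = (A @ H) / np.maximum(deg, 1.0)
    # COMB: merge the ego-node representation with the aggregated neighborhood.
    return np.maximum(H @ W_self + neigh @ W_neigh, 0.0)

# Toy path graph 0 - 1 - 2 with D = 2 features per node.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])
I2 = np.eye(2)              # identity weights keep the arithmetic easy to follow
H1 = gnn_layer(X, A, I2, I2)
```

Stacking $L$ such calls, each with its own weights, gives the embeddings $\mathbf{h}_{i}^{(L)}$ used downstream.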

4.2. Image-type Data Notations

For effective modeling in image-type data $\mathcal{X}$ , we initially divide the total urban area into $p\times p$ sub-regions (patches), with each patch encompassing $(H/p,W/p)$ pixels. $H$ and $W$ are the height and the width of the input images. It is worth noting that the choice of $p$ should balance the trade-off between practicality and spatial granularity. In our implementation, we partition the entire urban area into small squares, each comprising $p\times p$ sensors, adhering to practicality requirements.

4.3. Problem Formulation

The goal of our task is to identify the indices of the trivial sub-counterpart of the whole graph ${\mathcal{G}}$ or image $\mathcal{X}$ that can be sparsified away. For simplicity of presentation, we omit the temporal dimension $T$ of the spatio-temporal data. More formally, we seek a trainable mask $M_{e}\in\mathbb{R}^{N}$ (for masking graph nodes) or $M_{e}\in\mathbb{R}^{p\times p}$ (for masking image patches). When $M_{e}$ is applied to the original graph $\mathcal{G}$ ( $M_{e}\odot\mathcal{G}$ ) or image $\mathcal{X}$ ( $M_{e}\odot\mathcal{X}$ ), the objective is as follows:

(2)			$\displaystyle\mathop{\operatorname{maximize}}_{M_{e}}\;s_{g}=1-\frac{\|M_{e}\|_{0}}{\|\mathbf{A}\|_{0}}\;\;\text{(graph)}\quad\text{or}\quad s_{g}=1-\frac{\|M_{e}\|_{0}}{p\times p}\;\;\text{(image)},$
			$\displaystyle\operatorname{s.t.}\;\left\|\mathcal{R}_{DynST}\left(M_{e}\odot *;\Theta\right)-\mathcal{R}_{Ori}\left(*;\Theta\right)\right\|<\epsilon,$

where $s_{g}$ is the sparsity, $||\cdot||_{0}$ counts the number of non-zero elements, and $\epsilon$ is the threshold for permissible performance difference. $*$ denotes the graph or image inputs and $\mathcal{R}$ represents the evaluation metrics.
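The two ingredients of this objective, the sparsity $s_{g}$ and the tolerance check on the evaluation metric, can be sketched in a few lines. The helper names `sparsity` and `within_tolerance` are hypothetical, and the tolerance value `eps=0.05` is chosen only for illustration. By default the sketch normalizes by the number of mask entries (the image case); a different denominator, such as $\|\mathbf{A}\|_{0}$ for graphs, can be passed through `total`.

```python
import numpy as np

def sparsity(mask, total=None):
    """s_g = 1 - ||M||_0 / total; total defaults to the number of mask entries."""
    total = mask.size if total is None else total
    return 1.0 - np.count_nonzero(mask) / total

def within_tolerance(metric_pruned, metric_original, eps):
    """Constraint of the objective: |R_DynST - R_Ori| < eps."""
    return abs(metric_pruned - metric_original) < eps

# Image case: a 4 x 4 patch mask with 6 of the 16 patches switched off.
mask = np.ones((4, 4))
mask[0, :] = 0.0
mask[1, :2] = 0.0
s_g = sparsity(mask)                              # 1 - 10/16

# MAE values 4.35 (dense) and 4.37 (with DynST) as quoted in the introduction;
# eps is a hypothetical tolerance, not a value from the paper.
ok = within_tolerance(4.37, 4.35, eps=0.05)
```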

5. Method

Fig 2 illustrates the overview of the DynST framework. In Earth sciences, sensor deployment typically falls into two categories, i.e., image-type and graph-type. Image-type deployment ensures that each area (termed a “patch”) is well covered by a sensor, while in graph-type deployment, the information from a node can be understood as being collected by a single sensor. To demonstrate the universal capabilities of DynST, we systematically consider both of these deployment types and perform a patchify operation on the images (Wu et al., 2023a). For graph data, since nodes can be defined as sensors, we do not perform any operations at the data input stage in this study.

5.1. Stream Morph Operator

Consider an ST framework that receives continuous observation data $\mathcal{X}_{i}$ at different time steps ( $i=1,2,...,T$ ). Following the relevant literature (Arnab et al., 2021), we view this system as a unified four-dimensional structure, i.e., $\mathcal{X}_{i}\in\mathbb{R}^{[T_{\text{in}},C_{\text{in}},H,W]}$ . Similarly, the dimensions of a temporal graph can be represented as $\mathcal{G}\in\mathbb{R}^{[T_{\text{in}},N,D]}$ . Typically, in spatio-temporal scenarios, the information collected by sensors is expressed as dynamic temporal observations. However, while the positions of the sensors are fixed, the sensory data is subject to dynamic changes. To the best of our knowledge, traditional methods have primarily focused on the optimization of data (Anonymous, 2024). We are the first to consider this industrial scenario from the perspective of sensor deployment. As a result, conventional methods are not applicable in our domain. Taking image-type data as an example, the image is first tokenized into $N=HW/(p^{2})$ non-overlapping patches, and we then introduce the stream morph operator.

As shown in Fig 3, stream morph addresses this by merging the $H$ and $W$ channels of the image, and stacking the temporal ( $T$ ) channel with the $C$ channel. This approach effectively eliminates the interference of the $T$ dimension in model predictions. In this way, the training input time series can be deemed as $\tilde{\mathcal{X}}_{i}\in\mathbb{R}^{[H\times W,T_{\text{in}}\times C_{\text{in}}]}$ (the graph can be deemed as $\tilde{\mathcal{G}}_{in}\in\mathbb{R}^{[N,T_{\text{in}}\times C_{\text{in}}]}$ , where $N=HW/(p^{2})$ ), in which each rectangular block ( $\tilde{\mathcal{X}}_{in}^{(j)}\in\mathbb{R}^{[p^{2},T_{\text{in}}\times C_{\text{in}}]}$ ) and circle node ( $\tilde{\mathcal{G}}_{in}^{(j)}\in\mathbb{R}^{[1,T_{\text{in}}\times C_{\text{in}}]}$ ) can be interpreted as a sensor recorder. For ease of understanding, we will primarily use graph inputs as examples to illustrate the model process in subsequent sections. The distinctions between graph-type data and image data will be highlighted in the final Model Summary (Sec 5.4).

Then, the stream morph operator employs a parameterized graph mask $M_{g}\in\mathbb{R}^{[N,1]}$ to dynamically score all nodes, with its parameters shared across all nodes. Given the target graph sparsity $s_{g}\%$ , we first initialize a dense mask $M_{g}$ and apply it to the sensor regions as $M_{g}\odot{\tilde{\mathcal{G}}_{in}}$ ; we then rely on the training scheme described next to separate important regions from trivial ones.
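The reshaping performed by stream morph can be sketched with plain array operations. The function names `stream_morph`, `patchify`, and `apply_mask` are hypothetical, and the exact axis order inside the stacked $T\times C$ dimension is an assumption; the sketch only mirrors the tensor shapes stated in this section.

```python
import numpy as np

def stream_morph(x):
    """[T, C, H, W] -> [H*W, T*C]: merge H with W and stack T with C."""
    T, C, H, W = x.shape
    return x.transpose(2, 3, 0, 1).reshape(H * W, T * C)

def patchify(x, p):
    """[T, C, H, W] -> [N, p*p, T*C] with N = H*W / p**2 non-overlapping patches."""
    T, C, H, W = x.shape
    assert H % p == 0 and W % p == 0, "H and W must be divisible by p"
    x = x.reshape(T, C, H // p, p, W // p, p)
    x = x.transpose(2, 4, 3, 5, 0, 1)            # [H/p, W/p, p, p, T, C]
    return x.reshape((H // p) * (W // p), p * p, T * C)

def apply_mask(patches, mask):
    """Broadcast one mask value per patch (sensor) over its pixels and channels."""
    return patches * mask.reshape(-1, 1, 1)

# Toy input: T=2 frames, C=3 channels, a 4 x 4 grid, patch side p=2.
x = np.arange(2 * 3 * 4 * 4, dtype=float).reshape(2, 3, 4, 4)
morphed = stream_morph(x)                        # shape (16, 6)
patches = patchify(x, p=2)                       # shape (4, 4, 6)
masked = apply_mask(patches, np.array([1.0, 0.0, 1.0, 1.0]))
```

Because time is folded into the feature axis, a single mask entry switches a sensor off for every input time step at once, which is what keeps the pruning decision consistent across timestamps.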

5.2. Iterative Pruning towards High Sparsity

With $M_{g}$ at hand, we proceed to train the model together with the fixed input graph and the graph mask, denoted as $f(M_{g}\odot{\tilde{\mathcal{G}}_{in}},\mathbf{\Theta})$ , where $f$ denotes the mapping function of the input ST model. With the objective function in Eq. 2, we aim to gradually find a sparse sub-graph that better preserves semantics. One option is one-shot pruning (Ma et al., 2021; Frankle et al., 2020); however, the sparse mask acquired through one-shot pruning is suboptimal. In fact, assessing each sensor requires iterative testing to ensure that removing a specific area does not significantly impact future predictions. We therefore employ an iterative pruning strategy (Chen et al., 2021) to gradually increase sparsity. Assuming that each pruning iteration trims $p\%$ of the data parameters, after $\phi$ rounds of pruning the remaining regions have a distinct advantage over the one-shot approach: by iteratively pruning and retraining, the network can more effectively identify which parts are less important, because the remaining parameters have undergone $\phi$ rounds of repeated verification. Unlike previous iterative pruning literature, we alternately train the network and the mask $M_{g}$ so that the mask can fully assimilate the effective information from the training process:

(3)			$\displaystyle\operatorname{opt}_{\Theta}^{(R)}\,f\left(M_{g}\odot\tilde{\mathcal{G}}_{in},\Theta\right)\;\leftrightharpoons\;\operatorname{opt}_{M_{g}}^{(M)}\,f\left(M_{g}\odot\tilde{\mathcal{G}}_{in},\Theta^{*}\right)$

$\leftrightharpoons$ denotes the iterative alternation process. We first train the parameters $\Theta$ for $R$ iterations, then fix $\Theta$ as $\Theta^{*}$ and iteratively train the mask $M_{g}$ for $M$ iterations. Through this process, the mask $M_{g}$ potentially encapsulates the important information inherent in the data. Given the target sensor sparsity $s_{g}\%$ , we binarize the mask $M_{g}$ by zeroing out the parts with the smallest parameter values:

(4)			$\displaystyle\mathcal{D}_{o}\left({\rm{ArgTop}}\left(|M_{g}^{(\mu)}|;p\%\right)\Rightarrow\left\{0,1\right\}\right)$

$M_{g}^{(\mu)}$ represents the state of the mask $M_{g}$ at the $\mu^{th}$ iteration. The operation ${\rm{ArgTop}}(u,v)$ sets the top $u\%$ of the parameters in the matrix to 1 and the remaining $v\%$ to 0. The $\mathcal{D}_{o}$ operator forcibly assigns each mask entry a status of 0 or 1.
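The alternation between training $\Theta$ and training $M_{g}$, followed by magnitude-based binarization, can be sketched on a toy problem. Everything concrete below is an assumption made for illustration: the forecaster is a linear model, the data is synthetic with three informative sensors, the hyperparameter values are arbitrary, and an L1 penalty `lam` on the mask is added so that mask magnitudes separate useful sensors from useless ones. The source does not specify a regularizer, so this is a sketch of the idea rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, S = 8, 256                                   # N sensors (nodes), S samples
X = rng.normal(size=(S, N))
theta_true = np.array([3., 0., 2., 0., 0., 1., 0., 0.])   # sensors 0, 2, 5 matter
y = X @ theta_true

def grads(mask, theta):
    """Gradients of the MSE of f = (mask * X) @ theta w.r.t. theta and mask."""
    err = (X * mask) @ theta - y
    g_common = 2.0 * (X.T @ err) / S
    return g_common * mask, g_common * theta

def prune_smallest(mask, keep, frac):
    """Switch off the `frac` share of active entries with the smallest |mask|."""
    active = np.flatnonzero(keep)
    k = int(round(frac * active.size))
    drop = active[np.argsort(np.abs(mask[active]))[:k]]
    keep = keep.copy()
    keep[drop] = False
    return keep

mask, theta = np.ones(N), np.zeros(N)
keep = np.ones(N, dtype=bool)
R, M, lr, lam, p, rounds = 200, 50, 0.05, 0.1, 0.25, 3
for _ in range(rounds):
    for _ in range(R):                          # train Theta with the mask fixed
        g_theta, _ = grads(mask * keep, theta)
        theta -= lr * g_theta
    for _ in range(M):                          # train the mask with Theta* fixed
        _, g_mask = grads(mask * keep, theta)
        mask -= lr * (g_mask + lam * np.sign(mask)) * keep
    keep = prune_smallest(mask, keep, p)        # zero the smallest-magnitude entries
    mask = keep.astype(float)                   # binarize: survivors restart at 1
```

In this toy setting each round removes a quarter of the still-active sensors, so three rounds leave three of the eight sensors active.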

5.3. Dynamical Sparse Training

As depicted above, each sensor region requires meticulous verification to ensure reliability. To this end, we further introduce Dynamical Sparse Training (DST) techniques (Liu et al., 2021; Huang et al., 2023; Liu et al., 2020; Zhang et al., 2023c) to perform fine-tuning between two consecutive iterative pruning steps. Concretely, we selectively reactivate a portion of the regions that were previously pruned, while masking a portion of the areas that remain unpruned. After the $\omega^{th}$ round, we perform a drop and regrow process on the pruned mask $M_{g}^{({\omega(R+M)})}$ (i.e., drop ↔ regrow). We set the proportion of elements that are dropped and regrown to $q\%$ , typically with $q\ll p$ . We perform this “exchange of sensors” between the currently active regions $\mathcal{E}_{(\omega)}=\mathbf{M}_{g}\odot{\tilde{\mathcal{G}}_{in}}$ and their complement $\mathcal{E}_{(\omega)}^{C}=\neg\mathbf{M}_{g}\odot{\tilde{\mathcal{G}}_{in}}$ . Considering this process at the $\omega(R+M)$ time points, we proceed to train and adjust $M_{g}$ :

(5)			$\displaystyle M_{g}^{(\omega)}\left(\text{prune}\right)={\rm{ArgBottom}}\left\{\left(\left|\nabla\left(\bar{M}_{g}^{(\omega)}\right)\right|;q\%\right)\Rightarrow\left\{0,1\right\}\right\}$

In this context, ${\bar{M}_{g}^{\left(\omega\right)}}$ represents the elements of $M_{g}^{\left(\omega\right)}$ that have not been pruned. Here, we resort to gradient calculation $\nabla$ to identify and drop the elements with the lowest gradients ( ${\rm{ArgBottom}}$ operator). Generally, gradients can indicate elements with the potential to contribute to the loss function (Wang et al., 2023; Evci et al., 2020). We need to align this activation to further explore their effectiveness in future judgments. Going beyond this process, we identify and regrow elements with the highest gradients among those that have been pruned, effectively replacing parts that consist of dropped elements:

(6)			$\displaystyle M_{g}^{(\omega)}\left(\text{regrow}\right)=\neg{\rm{ArgTop}}\left\{\left(\left|-\nabla\left(\neg\bar{M}_{g}^{(\omega)}\right)\right|;q\%\right)\Rightarrow\left\{0,1\right\}\right\}$

In Eq. 6, we activate elements with larger gradients from the pruned set $\left({\neg\bar{M}_{g}^{\left(\omega\right)}}\right)$ . The operation $\neg{\rm{ArgTop}}$ serves as the inverse process of pruning, selecting elements with larger gradients for activation. This ensures that sensor regions with potential contributions are re-evaluated and validated.
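The drop and regrow steps of Eqs. 5 and 6, together with the exchange that swaps the two sets in the mask, can be sketched as follows. The function name `drop_and_regrow` is hypothetical. The sketch assumes that `grad` is a dense gradient of the loss with respect to every mask entry, including pruned ones, in the spirit of gradient-based regrowth (Evci et al., 2020), and that the same number of entries is dropped and regrown so the sparsity level stays unchanged.

```python
import numpy as np

def drop_and_regrow(keep, grad, q):
    """Swap a q share of active sensors for pruned ones using |gradient| scores."""
    active = np.flatnonzero(keep)
    pruned = np.flatnonzero(~keep)
    k = min(int(round(q * active.size)), pruned.size)
    score = np.abs(grad)
    drop = active[np.argsort(score[active])[:k]]           # lowest |grad| among active
    regrow = pruned[np.argsort(score[pruned])[::-1][:k]]   # highest |grad| among pruned
    new_keep = keep.copy()
    new_keep[drop] = False
    new_keep[regrow] = True
    return new_keep, drop, regrow

# Eight sensors, the first four currently active; the gradient values are made up.
keep = np.array([True, True, True, True, False, False, False, False])
grad = np.array([0.9, 0.05, 0.4, 0.7, 0.1, 0.8, 0.3, 0.2])
new_keep, drop, regrow = drop_and_regrow(keep, grad, q=0.25)
```

Here sensor 1 has the weakest gradient among the active set and is dropped, while sensor 5 has the strongest gradient among the pruned set and is reactivated.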

Following the completion of the aforementioned evaluation process, we reconstruct $M_{g}$ to form a more reliable regional mask:

(7)			$\displaystyle\mathbf{M}_{g}^{(\omega^{*})}\leftarrow\left(\mathbf{M}_{g}^{(\omega)}\setminus M_{g}^{(\omega)}\left(\text{prune}\right)\right)\cup M_{g}^{(\omega)}\left(\text{regrow}\right).$

Then, at the beginning of round $\omega+1$ , we continue to train and adjust the mask before passing it to the pruning step of round $\omega+1$ . We binarize the mask $\mathbf{M}_{g}^{(\omega+1)}$ after another $\Delta T$ training iterations. Without loss of generality, taking the semi-supervised node classification task as an example, our objective function can be expressed as follows:

(8)			$\displaystyle\mathcal{L}(M_{g}\odot\tilde{\mathcal{G}}_{in};\Theta)=\frac{1}{K}\sum_{i=1}^{K}\left\|\mathcal{Y}_{T+i}-f(M_{g}\odot\tilde{\mathcal{G}}_{in};\Theta)\right\|^{2}$

where $\mathcal{L}$ is the MSE loss calculated over the unmasked node set ${\tilde{\mathcal{G}}_{in}}$ , and $\mathcal{Y}_{T+i}$ denotes the ground-truth.
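A minimal sketch of this loss is given below. It assumes that the model returns all $K$ future steps at once with shape `[K, N, ...]`, and it uses a persistence forecaster as a stand-in for $f$; both the function name `masked_forecast_loss` and the stand-in model are illustrative assumptions rather than part of the method.

```python
import numpy as np

def masked_forecast_loss(model, g_in, mask, targets):
    """Mean over K horizons of the squared error of f(M ⊙ G; Θ) against the targets."""
    preds = model(g_in * mask[:, None])          # mask one row (sensor) at a time
    K = targets.shape[0]
    return np.sum((targets - preds) ** 2) / K

K = 2
# Stand-in for f: repeat the last input feature of every node for K future steps.
persistence = lambda g: np.repeat(g[None, :, -1:], K, axis=0)

g_in = np.array([[1., 2.],
                 [3., 4.],
                 [5., 6.]])                      # N = 3 nodes, T_in * C_in = 2
mask = np.array([1., 0., 1.])                    # sensor 1 is switched off
targets = np.array([[2., 4., 6.],
                    [2., 4., 6.]])[..., None]    # shape [K, N, 1]
loss = masked_forecast_loss(persistence, g_in, mask, targets)
```

With sensor 1 masked, the persistence forecast for that node is 0 instead of 4, so each horizon contributes a squared error of 16 and the averaged loss is 16.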

5.4. Model Summary & Complexity Analysis

For image-type data, we transform each sub-region into a patch, which plays the same role as a “node”. Therefore, by training in a similar manner, we can identify the important sub-regions accordingly. DynST can enhance the inference speed of the model, and the gain depends on the predefined sparsity $s_{g}\%$ . If the inference cost scales linearly with the number of retained regions, this results in an acceleration ratio of roughly $1/(1-s_{g})$ . We summarize our prospective system and algorithm in Fig 4 and Appendix C, respectively.

6. Experiments

In this section, we conduct extensive experiments to answer the following research questions ( $\mathcal{RQ}$ ):

$\mathcal{RQ}$ 1:

Can DynST effectively find the sparse sub-counterpart of the original input without performance degradation?
$\mathcal{RQ}$ 2:

What is the specific performance of DynST on image-type data?
$\mathcal{RQ}$ 3:

What is the specific performance of DynST on graph data?
$\mathcal{RQ}$ 4:

Can we combine the concept of the DynST with a different training scheme?

To answer these $\mathcal{RQ}$ , we orchestrate the following experiments:

•

Main experiment. We conduct a comprehensive comparative analysis on various scientific datasets, covering meteorology, combustion science, traffic studies, and turbulence dynamics. The study encompasses both mainstream Graph Neural Network (GNN) architectures and non-GNN structures. In Appendix B, we detail the methods for data preprocessing, including how to convert raw data into graph and image formats.
•

Multiple Training Strategies Experiments. We choose Weatherbench as the benchmark dataset, to evaluate the effectiveness of DynST when combining different training schemes. Specifically, in the training phase, we not only consider the impacts of parallel prediction and autoregressive iterative prediction but also introduce iterative pruning and one-shot pruning strategies. We focus on assessing the impacts of these strategies on model size, computational efficiency, and accuracy.
•

Ablation experiment. We carry out comprehensive ablation studies on the Jingdong Technology industry-level traffic dataset, Taxibj+, to validate the impact of various design choices on the practical implementation of our model. Through these experiments, we aim to deeply understand how the DynST concept affects data interpretability and the overall effectiveness.

Experimental settings. All experiments in this study are conducted on the NVIDIA-A100 40G configuration. To ensure consistency, we use the same settings in all experiments, including learning rate, optimizer, and more. We also apply a uniform training strategy. The loss function used in the experiments is set as Mean Squared Error (MSE) loss. For dataset division, we split the data into training, validation, and test sets in an 8:1:1 ratio. Specifically, for the Vision Transformer model (Ranftl et al., 2021), we replace the classification head from the original paper with three deconvolution layers.

6.1. Dataset & Backbones

Table 1. Performance comparisons on different GNN and non-GNN architectures, in which we report the best performance of these baselines. All results are obtained over ten runs. We report the MAE for all settings; the last column gives the average speedup.

Backbone	GNNs						non-GNNs								Avg Speedup
Backbone	STGCN	+ DynST	CLCRN	+ DynST	EGNN	+ DynST	ViT	+ DynST	Simvp	+ DynST	TAU	+ DynST	Earthfarseer	+ DynST	Avg Speedup
Model Performance Evaluation
WeatherBench ♣	4.35	4.37	1.17	1.22	2.98	3.00	0.72	0.73	0.74	0.73	0.73	0.77	0.58	0.62	1.721
WeatherBench ♠	2.02	2.04	1.49	1.52	3.39	3.42	0.27	0.29	0.27	0.29	0.26	0.25	0.24	0.25	1.522
WeatherBench ♥	0.79	0.75	0.45	0.47	0.66	0.72	0.24	0.26	0.25	0.26	0.23	0.24	0.22	0.24	1.119
WeatherBench ♠	3.64	3.67	1.33	1.31	2.31	2.33	0.51	0.54	0.51	0.52	0.49	0.50	0.48	0.50	1.398
FIT $\phi$	1.27	1.29	0.97	0.98	1.03	1.09	0.23	0.22	0.14	0.16	0.13	0.14	0.09	0.11	1.543
FIT $\varphi$	0.96	1.09	0.76	0.81	0.92	0.95	0.17	0.19	0.10	0.09	0.09	0.10	0.02	0.03	1.541
Taxibj+ Inflow	5.98	5.99	3.98	4.02	4.22	4.33	3.22	3.33	3.05	3.11	2.98	3.00	2.09	2.10	1.421
Taxibj+ Outflow	5.21	5.23	3.64	3.60	4.21	4.19	3.67	3.59	3.01	3.03	2.77	2.87	2.12	2.22	1.987
EAGLE	1.99	2.07	1.45	1.47	1.66	1.67	1.45	1.47	1.23	1.34	1.19	1.27	1.08	1.12	1.988

Datasets. We analyze multiple sensor-based datasets covering four main areas: meteorology, fires, turbulence, and traffic flow. In meteorology, we select the Weatherbench dataset. Following related work (Rasp et al., 2020), we consider four key variables: temperature (♣), humidity (♠), wind speed (♥), and cloud cover (♠), with the dataset containing 2048 nodes. For fire data, we choose the FIT dataset. Following existing settings (Anonymous, 2023), we focus on two variables: temperature ( $\phi$ ) and visibility ( $\varphi$ ), totaling 15360 data nodes. In turbulence, we use the EAGLE dataset (Janny et al., 2023), a large turbulence dataset involving velocity and pressure variables, presented on an irregular grid with 162760 nodes. For traffic flow, we use JD Technology’s Taxibj+ dataset (Wu et al., 2023b), which provides traffic flow statistics for Beijing and comprises 16384 data nodes. For convenience, each node is treated as an independent sensor.

Backbones. We use both GNN and non-GNN architectures to systematically validate the generalizability of our ideas. Concretely, the GNN-based backbones are STGCN (Han et al., 2020), CLCRN (Lin et al., 2022), and EGNN (Satorras et al., 2021); the non-GNN backbones are Vision Transformer (Dosovitskiy et al., 2021), SimVP-V2 (Tan et al., 2022), TAU (Tan et al., 2023), and Earthfarseer (Wu et al., 2023b). All GNNs have 7-layer encoder blocks, while non-GNNs use Transpose Conv2d for upsampling. Covering both families lets us analyze the capabilities of DynST across architecture types.

6.2. Main experiments ( $\mathcal{RQ}$ 1)

In this section, we test whether DynST can effectively remove non-essential areas (which correspond to sensors in the real world) without impacting the overall predictive performance of the model. To investigate the generalizability and optimization capabilities of DynST, we integrate it with existing general frameworks and run 10 rounds of iterative pruning, each removing 3% of the data. The main results are shown in Tab 1, from which we make the following observations:

Obs 1. Removing parts of the input data with DynST has little effect on model performance. Tab 1 compares each backbone with and without DynST (+DynST). For GNN architectures, adding DynST generally has a minimal impact on MAE. For example, the MAE of STGCN on WeatherBench ♣ increases slightly from 4.35 to 4.37, and that of EGNN on FIT $\varphi$ from 0.92 to 0.95. For non-GNN architectures, DynST keeps the MAE close to the baseline and occasionally reduces it. For instance, for ViT on the Taxibj+ Outflow dataset, the MAE decreases from 3.67 $\rightarrow$ 3.59. DynST also improves inference speed: the average speedup reported in Tab 1 is 1.721 times on WeatherBench ♣, 1.541 times on FIT $\varphi$ , and 1.987 times on Taxibj+ Outflow.

Obs 2. DynST shows high efficiency in several scenarios. DynST is also highly effective in improving inference efficiency. According to the Avg Speedup column of Tab 1, inference becomes 1.721 times faster on WeatherBench ♣, 1.541 times faster on FIT $\varphi$ , and almost twice as fast (1.987 times) on Taxibj+ Outflow. Across all datasets, the speedup ranges from 1.119 times (WeatherBench ♥) to 1.988 times (EAGLE). These results show that DynST can substantially reduce inference cost for various ST architectures.
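The iterative pruning schedule used in this section (10 rounds, 3% per round) can be sketched as follows. This is only an illustration under stated assumptions: the per-node importance `scores` are hypothetical placeholders for whatever the learned mask provides, each round is assumed to remove 3% of the original node count (so that 10 rounds give 30% sparsity), and the retraining between rounds is omitted.

```python
import numpy as np

def iterative_prune(scores, rounds=10, rate=0.03):
    """Iteratively drop the lowest-scoring nodes.

    scores: hypothetical per-node importance values (higher = more important).
    Each round removes `rate` of the ORIGINAL node count (an assumption),
    so the final sparsity is about rounds * rate.
    Returns a boolean keep-mask.
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.size
    mask = np.ones(n, dtype=bool)
    per_round = int(round(n * rate))
    for _ in range(rounds):
        # In a real pipeline the network and the scores would be retrained here.
        active = np.flatnonzero(mask)
        drop = active[np.argsort(scores[active])[:per_round]]
        mask[drop] = False
    return mask

rng = np.random.default_rng(0)
scores = rng.random(2048)          # 2048 nodes, as in Weatherbench
mask = iterative_prune(scores)
sparsity = 1.0 - mask.mean()       # close to 0.30
```

Because the number of nodes removed per round is rounded to an integer, the resulting sparsity is approximately, not exactly, 30%.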

6.3. Deep insights ( $\mathcal{RQ}$ 2 & $\mathcal{RQ}$ 3)

In this section, we conduct a more systematic study of DynST’s ability to accelerate inference. We select both graph-type and image-type data to observe model performance at various levels of sparsity. Concretely, for graph-type data, we choose Taxibj+ and EAGLE as benchmarks. For image-type data, we choose the temperature ( $\phi$ ) variable of the FIT dataset and the temperature (♣) variable of WeatherBench. We integrate DynST with existing general frameworks and run 10 rounds of iterative pruning, with each round reducing the data volume by a fixed rate chosen from $\{1\%,2\%,\cdots,6\%\}$ . This yields data sparsity levels of $\{10\%,20\%,\cdots,60\%\}$ . We employ the roll-out strategy (Luo et al., 2023) to iteratively predict long sequences and verify the long-term prediction ability of the baselines after incorporating DynST. We list the observations as follows.

Table 2. Comparison results among different benchmarks, considering different data sparsity levels and prediction lengths.

Benchmark Graph-type	Taxibj+			EAGLE
Benchmark Graph-type	4	8	12	30	40	50
$10\%$	${1.92_{\pm 0.01}}$	${1.99_{\pm 0.01}}$	${2.03_{\pm 0.01}}$	${1.14_{\pm 0.02}}$	${1.18_{\pm 0.02}}$	${1.19_{\pm 0.02}}$
$20\%$	${2.04_{\pm 0.03}}$	${2.12_{\pm 0.01}}$	${2.14_{\pm 0.01}}$	${1.17_{\pm 0.02}}$	${1.23_{\pm 0.02}}$	${1.24_{\pm 0.02}}$
$30\%$	${2.07_{\pm 0.01}}$	${2.17_{\pm 0.01}}$	${2.18_{\pm 0.02}}$	${1.21_{\pm 0.02}}$	${1.24_{\pm 0.01}}$	${1.26_{\pm 0.02}}$
$40\%$	${2.21_{\pm 0.02}}$	${2.24_{\pm 0.01}}$	${2.25_{\pm 0.01}}$	${1.25_{\pm 0.03}}$	${1.26_{\pm 0.01}}$	${1.28_{\pm 0.02}}$
$50\%$	${2.37_{\pm 0.03}}$	${2.39_{\pm 0.01}}$	${2.42_{\pm 0.03}}$	${1.27_{\pm 0.01}}$	${1.27_{\pm 0.02}}$	${1.29_{\pm 0.02}}$
$60\%$	${2.45_{\pm 0.02}}$	${2.48_{\pm 0.01}}$	${2.51_{\pm 0.01}}$	${1.29_{\pm 0.01}}$	${1.30_{\pm 0.01}}$	${1.33_{\pm 0.02}}$
Benchmark Image-type	FIT $\phi$			WeatherBench ♣
Benchmark Image-type	30	40	50	4	8	12
$10\%$	${0.13_{\pm 0.01}}$	${0.15_{\pm 0.01}}$	${0.15_{\pm 0.01}}$	${0.53_{\pm 0.01}}$	${0.57_{\pm 0.01}}$	${0.58_{\pm 0.02}}$
$20\%$	${0.15_{\pm 0.03}}$	${0.16_{\pm 0.01}}$	${0.17_{\pm 0.01}}$	${0.54_{\pm 0.01}}$	${0.59_{\pm 0.01}}$	${0.61_{\pm 0.03}}$
$30\%$	${0.15_{\pm 0.01}}$	${0.17_{\pm 0.01}}$	${0.18_{\pm 0.01}}$	${0.58_{\pm 0.01}}$	${0.62_{\pm 0.01}}$	${0.64_{\pm 0.01}}$
$40\%$	${0.17_{\pm 0.02}}$	${0.18_{\pm 0.03}}$	${0.19_{\pm 0.01}}$	${0.60_{\pm 0.01}}$	${0.65_{\pm 0.01}}$	${0.66_{\pm 0.02}}$
$50\%$	${0.19_{\pm 0.01}}$	${0.21_{\pm 0.01}}$	${0.22_{\pm 0.02}}$	${0.61_{\pm 0.01}}$	${0.67_{\pm 0.01}}$	${0.69_{\pm 0.01}}$
$60\%$	${0.21_{\pm 0.01}}$	${0.22_{\pm 0.01}}$	${0.23_{\pm 0.02}}$	${0.63_{\pm 0.01}}$	${0.69_{\pm 0.01}}$	${0.70_{\pm 0.01}}$

Obs 3. DynST effectively achieves long-term predictions without causing significant performance degradation. We tested the capability of long-term prediction with a combination of DynST and Earthfarseer and found that incorporating the concept of dynamic sparse training did not compromise the model’s performance. Even at a higher sparsity level of 60%, it still manages to deliver reasonably good predictive performance without a significant increase in RMSE.
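The roll-out strategy used for long-term prediction can be illustrated with a short sketch. It assumes a one-step predictor that maps a window of past frames to the next frame; `step_fn` and the persistence predictor below are hypothetical stand-ins, and the actual models may predict several frames per call.

```python
import numpy as np

def rollout(step_fn, history, horizon):
    """Autoregressive roll-out.

    step_fn: hypothetical one-step predictor mapping a window of shape
             (T, N) to the next frame of shape (N,).
    history: array of shape (T, N) holding the last T observed frames.
    horizon: number of future frames to predict.
    """
    window = np.asarray(history, dtype=float).copy()
    preds = []
    for _ in range(horizon):
        nxt = np.asarray(step_fn(window), dtype=float)
        preds.append(nxt)
        # Slide the window: drop the oldest frame, append the prediction.
        window = np.concatenate([window[1:], nxt[None, :]], axis=0)
    return np.stack(preds)

def persistence(window):
    """Toy predictor for illustration only: repeat the last frame."""
    return window[-1]

future = rollout(persistence, np.arange(12.0).reshape(4, 3), horizon=5)
```

Since each prediction is fed back as input, errors can accumulate with the horizon, which is why the longer prediction lengths in Table 2 are the harder setting.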

Obs 4. DynST effectively meets industrial-level requirements (30% sparsity), keeping inference demands manageable while reducing the burden of inference. In Fig 5 (top), the first row shows the observed temperature flow field and the second row shows the field predicted by Earthfarseer+DynST, allowing a direct comparison between observed and predicted states. In Fig 5 (bottom), the left panel compares the real and predicted temperature time series at coordinates (50,7), and the right panel shows the same comparison at coordinates (425,7). These results indicate that the DynST-enhanced model preserves high local fidelity, consistently keeping the prediction deviation within a 15% margin, which corresponds to industry-level reliability (Verda et al., 2021). Fig 6 also supports these findings.

6.4. Structural & Ablation study ( $\mathcal{RQ}$ 4)

We initially configure DynST to maintain the model at a moderate sparsity level (30%) to observe how well the model preserves structural integrity at this level of sparsity. Here, we employ two metrics, SSIM and PSNR, to measure the completeness of the model’s predictions. Higher values of SSIM and PSNR indicate more accurate structural predictions by the model. Additionally, we also observe the trend of SSIM performance at different levels of sparsity.
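For reference, the two metrics can be sketched in a few lines. The PSNR function follows the standard definition; the SSIM function is a deliberately simplified single-window variant computed over the whole field, whereas the standard metric averages over local windows. The `data_range` parameter (the value range of the data) and the constants 0.01 and 0.03 are the conventional defaults and are assumptions here, since the text does not give the settings used.

```python
import numpy as np

def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio in dB; `data_range` is max - min of the data."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / mse))

def global_ssim(pred, target, data_range=1.0):
    """Single-window SSIM over the whole field (a simplification:
    the standard metric averages SSIM over local windows)."""
    x = np.asarray(pred, dtype=float)
    y = np.asarray(target, dtype=float)
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = np.mean((x - mx) * (y - my))
    return float(((2 * mx * my + c1) * (2 * cov + c2))
                 / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))
```

Both functions return higher values for closer predictions, which matches the reading of the metrics given above.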

Table 3. SSIM and PSNR results on three research domains. Underlined values mark the best performance. Ori denotes the original results; +Dyn denotes the results after adding DynST at 30% sparsity.

Model (data)	SSIM (Ori / +Dyn)	PSNR (Ori / +Dyn)
SimVP (TaxiBJ+)	$0.94_{\pm 0.01}$ / $0.93_{\pm 0.01}$	$36.27_{\pm 0.01}$ / $35.43_{\pm 0.01}$
TAU (TaxiBJ+)	$0.96_{\pm 0.01}$ / $0.95_{\pm 0.01}$	$36.76_{\pm 0.01}$ / $35.62_{\pm 0.01}$
Earthfarseer (TaxiBJ+)	$\underline{0.98}_{\pm 0.01}$ / $\underline{0.96}_{\pm 0.01}$	$\underline{37.84}_{\pm 0.01}$ / $\underline{36.44}_{\pm 0.01}$
CLCRN (WeatherBench)	$0.94_{\pm 0.02}$ / $0.93_{\pm 0.02}$	$36.12_{\pm 0.02}$ / $35.22_{\pm 0.19}$
SimVP (WeatherBench)	$0.96_{\pm 0.01}$ / $0.95_{\pm 0.01}$	$37.33_{\pm 0.01}$ / $36.33_{\pm 0.17}$
Earthfarseer (WeatherBench)	$\underline{0.98}_{\pm 0.01}$ / $\underline{0.97}_{\pm 0.01}$	$\underline{39.27}_{\pm 0.11}$ / $\underline{38.12}_{\pm 0.03}$
ViT (FIT)	$0.90_{\pm 0.02}$ / $0.89_{\pm 0.02}$	$35.41_{\pm 0.02}$ / $33.33_{\pm 0.01}$
EGNN (FIT)	$0.83_{\pm 0.01}$ / $0.81_{\pm 0.01}$	$35.41_{\pm 0.01}$ / $34.68_{\pm 0.02}$
Earthfarseer (FIT)	$\underline{0.95}_{\pm 0.01}$ / $\underline{0.93}_{\pm 0.01}$	$\underline{37.23}_{\pm 0.01}$ / $\underline{36.31}_{\pm 0.01}$

Obs 5. Integrating DynST does not significantly impact SSIM and PSNR. As shown in Tab 3 and Fig 7, on the TaxiBJ+ dataset Earthfarseer achieves an SSIM of 0.98, and incorporating DynST has minimal effect on the prediction results ( $0.98\rightarrow 0.96$ ). The same pattern holds on the WeatherBench ( $0.98\rightarrow 0.97$ ) and FIT ( $0.95\rightarrow 0.93$ ) datasets, supporting the effectiveness of DynST. Further, examining SSIM under varying data sparsity levels (Fig 7), we note that SSIM gradually decreases as sparsity increases, which offers a trade-off between sparsity and structural fidelity for practical applications.

Finally, we select three training schemes (Earthfarseer as the base model) to explore the performance of our algorithm and the benefits of combining it with mainstream training approaches: (1) One-shot pruning (OP): We fully train the model and then train the mask for a single pruning step. (2) Iterative pruning (IP): As our work can be regarded as a pruning method, we adopt the widely recognized iterative pruning strategy (Frankle and Carbin, 2018) used in the main experiments; here we prune the data 10 times, removing 4% each time. (3) Dynamic sparse training (DST): We select a target sparsity level and keep training at this fixed sparsity, dynamically removing the smallest-magnitude entries of the mask and restoring the largest-magnitude ones (Evci et al., 2020). We set the sparsity to 40% for dynamic training. Table 4 shows the performance of STGCN, SimVP, and Earthfarseer under the IP, OP, and DST schemes. On EAGLE, their RMSEs are 0.5698, 0.5108, 0.3507 (IP), 0.6197, 0.5650, 0.4121 (OP), and 0.5792, 0.5261, 0.3495 (DST).
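The fixed-sparsity update in scheme (3) can be illustrated with a prune-and-regrow step in the spirit of Evci et al. (2020). This is a sketch under assumptions, not the authors' implementation: `magnitude` and `grow_score` are hypothetical per-node scores, and using a separate score for regrowth (gradient magnitude in RigL) is an assumption about how the idea carries over to a data mask.

```python
import numpy as np

def dst_update(mask, magnitude, grow_score, k):
    """One prune-and-regrow step that keeps the sparsity level fixed.

    mask:       boolean keep-mask over nodes.
    magnitude:  score used to drop active entries (smallest are removed).
    grow_score: score used to revive inactive entries (largest are restored);
                hypothetical, e.g. gradient magnitude as in RigL.
    k:          number of entries swapped in this step.
    """
    mask = np.asarray(mask, dtype=bool).copy()
    magnitude = np.asarray(magnitude, dtype=float)
    grow_score = np.asarray(grow_score, dtype=float)
    active = np.flatnonzero(mask)
    inactive = np.flatnonzero(~mask)
    k = min(k, active.size, inactive.size)
    if k == 0:
        return mask
    drop = active[np.argsort(magnitude[active])[:k]]
    grow = inactive[np.argsort(grow_score[inactive])[-k:]]
    mask[drop] = False
    mask[grow] = True
    return mask

# Toy example: 10 nodes at 40% sparsity, swapping 2 entries.
old_mask = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0], dtype=bool)
magnitude = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.6, 0.0, 0.0, 0.0, 0.0])
grow_score = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.9, 0.1, 0.3])
new_mask = dst_update(old_mask, magnitude, grow_score, k=2)
```

The number of kept nodes is the same before and after the step, so the sparsity stays at the chosen target while the set of kept nodes changes.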

Table 4. Performance across different training schemes (RMSE).

Dataset	Baseline	IP	OP	DST
EAGLE	STGCN	0.5698	0.6197	0.5792
EAGLE	SimVP	0.5108	0.5650	0.5261
EAGLE	Earthfarseer	0.3507	0.4121	0.3495
FIT $\phi$	STGCN	0.3245	0.3617	0.3123
FIT $\phi$	SimVP	0.2193	0.2565	0.2252
FIT $\phi$	Earthfarseer	0.1983	0.2293	0.1842

7. Conclusion

In this paper, we introduce the concept of dynamic sparse training in the context of sensor deployment, termed DynST, which adjusts sensor deployment dynamically through training without compromising the model’s predictive capabilities. DynST sidesteps the complexity posed by the temporal dimension through dimension mapping. Then, through dynamic training and mask operations, it identifies the less significant parts of the input data, which correspond to the areas detected by the sensors. DynST is both general and succinct, and is compatible with many mainstream training schemes, such as one-shot pruning, iterative pruning, and dynamic sparse training, boosting inference speed without significant performance degradation.

Acknowledgements.

This work is also supported by the Guangzhou-HKUST (GZ) Joint Funding Program (No. 2024A03J0620).

References

Anonymous (2023) Anonymous. 2023. Spatio-temporal Twins with A Cache for Modeling Long-term System Dynamics. In Submitted to The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=aE6HazMgRz under review.
Anonymous (2024) Anonymous. 2024. NuwaDynamics: Discovering and Updating in Causal Spatio-Temporal Modeling. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=sLdVl0q68X
Arnab et al. (2021) Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. 2021. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF international conference on computer vision. 6836–6846.
Bai et al. (2022) Cong Bai, Feng Sun, Jinglin Zhang, Yi Song, and Shengyong Chen. 2022. Rainformer: Features extraction balanced network for radar-based precipitation nowcasting. IEEE Geoscience and Remote Sensing Letters 19 (2022), 1–5.
Bi et al. (2023) Kaifeng Bi, Lingxi Xie, Hengheng Zhang, Xin Chen, Xiaotao Gu, and Qi Tian. 2023. Accurate medium-range global weather forecasting with 3D neural networks. Nature 619, 7970 (2023), 533–538.
Chen et al. (2018) Jie Chen, Tengfei Ma, and Cao Xiao. 2018. Fastgcn: fast learning with graph convolutional networks via importance sampling. arXiv preprint arXiv:1801.10247 (2018).
Chen et al. (2021) Tianlong Chen, Yongduo Sui, Xuxi Chen, Aston Zhang, and Zhangyang Wang. 2021. A unified lottery ticket hypothesis for graph neural networks. In International Conference on Machine Learning. PMLR, 1695–1706.
Dosovitskiy et al. (2021) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations. https://openreview.net/forum?id=YicbFdNTTy
Eden et al. (2018) Talya Eden, Shweta Jain, Ali Pinar, Dana Ron, and C Seshadhri. 2018. Provable and practical approximations for the degree distribution using sublinear graph samples. In Proceedings of the 2018 World Wide Web Conference. 449–458.
Evci et al. (2020) Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. 2020. Rigging the lottery: Making all tickets winners. In International Conference on Machine Learning. PMLR, 2943–2952.
Frankle and Carbin (2018) Jonathan Frankle and Michael Carbin. 2018. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635 (2018).
Frankle et al. (2020) Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M Roy, and Michael Carbin. 2020. Pruning neural networks at initialization: Why are we missing the mark? arXiv preprint arXiv:2009.08576 (2020).
Gao and Ji (2019) Hongyang Gao and Shuiwang Ji. 2019. Graph u-nets. In international conference on machine learning. PMLR, 2083–2092.
Gao et al. (2022a) Zhihan Gao, Xingjian Shi, Hao Wang, Yi Zhu, Yuyang Bernie Wang, Mu Li, and Dit-Yan Yeung. 2022a. Earthformer: Exploring space-time transformers for earth system forecasting. Advances in Neural Information Processing Systems 35 (2022), 25390–25403.
Gao et al. (2022b) Zhangyang Gao, Cheng Tan, Lirong Wu, and Stan Z Li. 2022b. Simvp: Simpler yet better video prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3170–3180.
Han et al. (2020) Haoyu Han, Mengdi Zhang, Min Hou, Fuzheng Zhang, Zhongyuan Wang, Enhong Chen, Hongwei Wang, Jianhui Ma, and Qi Liu. 2020. STGCN: a spatial-temporal aware graph learning method for POI recommendation. In 2020 IEEE International Conference on Data Mining (ICDM). IEEE, 1052–1057.
Huang et al. (2023) Shaoyi Huang, Bowen Lei, Dongkuan Xu, Hongwu Peng, Yue Sun, Mimi Xie, and Caiwen Ding. 2023. Dynamic sparse training via balancing the exploration-exploitation trade-off. In 2023 60th ACM/IEEE Design Automation Conference (DAC). IEEE, 1–6.
Janny et al. (2023) Steeven Janny, Aurélien Beneteau, Nicolas Thome, Madiha Nadri, Julie Digne, and Christian Wolf. 2023. Eagle: Large-scale learning of turbulent fluid dynamics with mesh transformers. arXiv preprint arXiv:2302.10803 (2023).
Ji et al. (2023) Jiahao Ji, Jingyuan Wang, Chao Huang, Junjie Wu, Boren Xu, Zhenhe Wu, Junbo Zhang, and Yu Zheng. 2023. Spatio-temporal self-supervised learning for traffic flow prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 4356–4364.
Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
Kundu and Das (2023) Srabani Kundu and Nabanita Das. 2023. A study on boundary detection in wireless sensor networks. Innovations in Systems and Software Engineering 19, 2 (2023), 217–225.
Li et al. (2017) Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. 2017. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv preprint arXiv:1707.01926 (2017).
Li et al. (2020) Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. 2020. Fourier neural operator for parametric partial differential equations. arXiv preprint arXiv:2010.08895 (2020).
Lin et al. (2022) Haitao Lin, Zhangyang Gao, Yongjie Xu, Lirong Wu, Ling Li, and Stan Z Li. 2022. Conditional local convolution for spatio-temporal meteorological forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 7470–7478.
Liu et al. (2020) Junjie Liu, Zhe Xu, Runbin Shi, Ray CC Cheung, and Hayden KH So. 2020. Dynamic sparse training: Find efficient sparse network from scratch with trainable masked layers. arXiv preprint arXiv:2005.06870 (2020).
Liu et al. (2021) Shiwei Liu, Lu Yin, Decebal Constantin Mocanu, and Mykola Pechenizkiy. 2021. Do we actually need dense over-parameterization? in-time over-parameterization in sparse training. In International Conference on Machine Learning. PMLR, 6989–7000.
Luo et al. (2023) Xiao Luo, Jingyang Yuan, Zijie Huang, Huiyu Jiang, Yifang Qin, Wei Ju, Ming Zhang, and Yizhou Sun. 2023. HOPE: High-order graph ODE for modeling interacting dynamics. In International Conference on Machine Learning. PMLR, 23124–23139.
Ma et al. (2021) Xiaolong Ma, Geng Yuan, Xuan Shen, Tianlong Chen, Xuxi Chen, Xiaohan Chen, Ning Liu, Minghai Qin, Sijia Liu, Zhangyang Wang, et al. 2021. Sanity checks for lottery tickets: Does your winning ticket really win the jackpot? Advances in Neural Information Processing Systems 34 (2021), 12749–12760.
Pan et al. (2019) Zheyi Pan, Yuxuan Liang, Weifeng Wang, Yong Yu, Yu Zheng, and Junbo Zhang. 2019. Urban traffic prediction from spatio-temporal data using deep meta learning. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. 1720–1730.
Pathak et al. (2022) Jaideep Pathak, Shashank Subramanian, Peter Harrington, Sanjeev Raja, Ashesh Chattopadhyay, Morteza Mardani, Thorsten Kurth, David Hall, Zongyi Li, Kamyar Azizzadenesheli, et al. 2022. Fourcastnet: A global data-driven high-resolution weather model using adaptive fourier neural operators. arXiv preprint arXiv:2202.11214 (2022).
Priyadarshi et al. (2020) Rahul Priyadarshi, Bharat Gupta, and Amulya Anurag. 2020. Wireless sensor networks deployment: a result oriented analysis. Wireless Personal Communications 113 (2020), 843–866.
Ranftl et al. (2021) René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. 2021. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 12179–12188.
Ranjan et al. (2020) Ekagra Ranjan, Soumya Sanyal, and Partha Talukdar. 2020. Asap: Adaptive structure aware pooling for learning hierarchical graph representations. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 5470–5477.
Rasp et al. (2020) Stephan Rasp, Peter D Dueben, Sebastian Scher, Jonathan A Weyn, Soukayna Mouatadid, and Nils Thuerey. 2020. WeatherBench: a benchmark data set for data-driven weather forecasting. Journal of Advances in Modeling Earth Systems 12, 11 (2020), e2020MS002203.
Rissler et al. (2020) Leslie J Rissler, Katherine L Hale, Nina R Joffe, and Nicholas M Caruso. 2020. Gender differences in grant submissions across science and engineering fields at the NSF. Bioscience 70, 9 (2020), 814–820.
Satorras et al. (2021) Víctor Garcia Satorras, Emiel Hoogeboom, and Max Welling. 2021. E (n) equivariant graph neural networks. In International conference on machine learning. PMLR, 9323–9332.
Scarselli et al. (2008) Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2008. The graph neural network model. IEEE transactions on neural networks 20, 1 (2008), 61–80.
Shao et al. (2022) Zezhi Shao, Zhao Zhang, Fei Wang, Wei Wei, and Yongjun Xu. 2022. Spatial-temporal identity: A simple yet effective baseline for multivariate time series forecasting. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 4454–4458.
Shi et al. (2015) Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. 2015. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. Advances in neural information processing systems 28 (2015).
Tan et al. (2022) Cheng Tan, Zhangyang Gao, Siyuan Li, and Stan Z Li. 2022. Simvp: Towards simple yet powerful spatiotemporal predictive learning. arXiv preprint arXiv:2211.12509 (2022).
Tan et al. (2023) Cheng Tan, Zhangyang Gao, Lirong Wu, Yongjie Xu, Jun Xia, Siyuan Li, and Stan Z Li. 2023. Temporal attention unit: Towards efficient spatiotemporal predictive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18770–18782.
Thekumparampil et al. (2018) Kiran K Thekumparampil, Chong Wang, Sewoong Oh, and Li-Jia Li. 2018. Attention-based graph neural network for semi-supervised learning. arXiv preprint arXiv:1803.03735 (2018).
Verda et al. (2021) Vittorio Verda, Romano Borchiellini, Sara Cosentino, Elisa Guelpa, and Jesus Mejias Tuni. 2021. Expanding the FDS simulation capabilities to fire tunnel scenarios through a novel multi-scale model. Fire Technology 57 (2021), 2491–2514.
Wang et al. (2023) Kun Wang, Yuxuan Liang, Xinglin Li, Guohao Li, Bernard Ghanem, Roger Zimmermann, Huahui Yi, Yudong Zhang, Yang Wang, et al. 2023. Brave the Wind and the Waves: Discovering Robust and Generalizable Graph Lottery Tickets. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
Wang et al. (2022) Kun Wang, Yuxuan Liang, Pengkun Wang, Xu Wang, Pengfei Gu, Junfeng Fang, and Yang Wang. 2022. Searching Lottery Tickets in Graph Neural Networks: A Dual Perspective. In The Eleventh International Conference on Learning Representations.
Wang et al. (2018a) Yunbo Wang, Zhihan Gao, Mingsheng Long, Jianmin Wang, and Philip S Yu. 2018a. PredRNN++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. In International Conference on Machine Learning. 5123–5132.
Wang et al. (2018b) Yunbo Wang, Lu Jiang, Ming-Hsuan Yang, Li-Jia Li, Mingsheng Long, and Li Fei-Fei. 2018b. Eidetic 3D LSTM: A model for video prediction and beyond. In International conference on learning representations.
Wang et al. (2017) Yunbo Wang, Mingsheng Long, Jianmin Wang, Zhifeng Gao, and Philip S Yu. 2017. Predrnn: Recurrent neural networks for predictive learning using spatiotemporal lstms. Advances in neural information processing systems 30 (2017).
Wang et al. (2019) Yunbo Wang, Jianjin Zhang, Hongyu Zhu, Mingsheng Long, Jianmin Wang, and Philip S Yu. 2019. Memory in memory: A predictive neural network for learning higher-order non-stationarity from spatiotemporal dynamics. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9154–9162.
Wu et al. (2023a) Haixu Wu, Tengge Hu, Huakun Luo, Jianmin Wang, and Mingsheng Long. 2023a. Solving High-Dimensional PDEs with Latent Spectral Models. arXiv preprint arXiv:2301.12664 (2023).
Wu et al. (2023b) Hao Wu, Shilong Wang, Yuxuan Liang, Zhengyang Zhou, Wei Huang, Wei Xiong, and Kun Wang. 2023b. Earthfarseer: Versatile Spatio-Temporal Dynamical Systems Modeling in One Model. arXiv preprint arXiv:2312.08403 (2023).
Wu et al. (2023c) Hao Wu, Wei Xiong, Fan Xu, Xiao Luo, Chong Chen, Xian-Sheng Hua, and Haixin Wang. 2023c. PastNet: Introducing Physical Inductive Biases for Spatio-temporal Video Prediction. arXiv preprint arXiv:2305.11421 (2023).
Wu et al. (2020) Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. 2020. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems 32, 1 (2020), 4–24.
Xu (2020) Sheng Xu. 2020. Optimal sensor placement for target localization using hybrid RSS, AOA and TOA measurements. IEEE Communications Letters 24, 9 (2020), 1966–1970.
Yan and Li (2023) Huan Yan and Yong Li. 2023. A Survey of Generative AI for Intelligent Transportation Systems. arXiv preprint arXiv:2312.08248 (2023).
Yarinezhad and Hashemi (2023) Ramin Yarinezhad and Seyed Naser Hashemi. 2023. A sensor deployment approach for target coverage problem in wireless sensor networks. Journal of Ambient Intelligence and Humanized Computing 14, 5 (2023), 5941–5956.
You et al. (2019) Jiaxuan You, Rex Ying, and Jure Leskovec. 2019. Position-aware graph neural networks. In International conference on machine learning. PMLR, 7134–7143.
Yu et al. (2020) Changqian Yu, Yifan Liu, Changxin Gao, Chunhua Shen, and Nong Sang. 2020. Representative graph neural network. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII 16. Springer, 379–396.
Zhang et al. (2023a) Yunke Zhang, Tong Li, Yuan Yuan, Fengli Xu, Fan Yang, Funing Sun, and Yong Li. 2023a. Demand-Driven Urban Facility Visit Prediction. ACM Transactions on Intelligent Systems and Technology (2023).
Zhang et al. (2023b) Yuchen Zhang, Mingsheng Long, Kaiyuan Chen, Lanxiang Xing, Ronghua Jin, Michael I Jordan, and Jianmin Wang. 2023b. Skilful nowcasting of extreme precipitation with NowcastNet. Nature 619, 7970 (2023), 526–532.
Zhang et al. (2023c) Yuxin Zhang, Lirui Zhao, Mingbao Lin, Yunyun Sun, Yiwu Yao, Xingjia Han, Jared Tanner, Shiwei Liu, and Rongrong Ji. 2023c. Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs. arXiv preprint arXiv:2310.08915 (2023).
Zhang et al. (2021) Zhen Zhang, Jiajun Bu, Martin Ester, Jianfeng Zhang, Zhao Li, Chengwei Yao, Dai Huifen, Zhi Yu, and Can Wang. 2021. Hierarchical multi-view graph pooling with structure learning. IEEE Transactions on Knowledge and Data Engineering (2021).
Zheng et al. (2023) Yu Zheng, Yuming Lin, Liang Zhao, Tinghai Wu, Depeng Jin, and Yong Li. 2023. Spatial planning of urban communities via deep reinforcement learning. Nature Computational Science 3, 9 (2023), 748–762.
Zou and Chakrabarty (2003) Yao Zou and Krishnendu Chakrabarty. 2003. Sensor deployment and target localization based on virtual forces. In IEEE INFOCOM 2003. Twenty-second Annual Joint Conference of the IEEE Computer and Communications Societies (IEEE Cat. No. 03CH37428), Vol. 2. IEEE, 1293–1303.

Appendix A Datasets and backbones descriptions.

Table 5. The statistics of the datasets.

Dataset	#Nodes	#Variables	#Input	#Output
Weatherbench	2048	4	12	12
FIT	15360	2	50	50
Taxibj+	16384	2	12	12
EAGLE	3388	2	50	50

In this study, we analyze four benchmark datasets. Each snapshot in these datasets serves as an independent graph structure. We summarize the statistical characteristics of these datasets in Table 5. Specifically, the datasets include:

1. Weatherbench Dataset: Each graph contains 2048 nodes, covering four variables: temperature, humidity, wind speed, and cloud concentration. The input and output duration for this dataset is 12 time steps.

2. FIT Dataset: Each graph in this dataset consists of 15360 nodes, with two variables: temperature and visibility. The input and output duration is 50 time steps.

3. Taxibj+ Dataset: Each graph has 16384 nodes, including two variables: Inflow and Outflow. The input and output duration is 12 time steps.

4. EAGLE Dataset: Each graph in this dataset comprises 3388 nodes, with two variables: pressure and speed. The input and output duration is 50 time steps.

These datasets provide diverse experimental scenarios and analytical perspectives for our research.

Appendix B Dataset preprocessing.

In this section, we detail the data processing shown in Figure 8, which converts raw data into graph and image formats through two distinct processes: Nodalization and Patchify. We use Weatherbench as a case study to illustrate these concepts:

1. Nodalization: This process involves the dimensional transformation of raw data from the format $(C,H,W)$ , where $C$ represents the number of physical variables, and $H$ and $W$ signify the data’s height and width, respectively. In this context, the data can be perceived as having $H\times W$ observation points, each containing $C$ variables. If we analogize each observation point to a sensor, these correspond to nodes in a graph structure. Consequently, the transformed graph data dimension is $(Num\_nodes,C)$ , where $Num\_nodes=H\times W$ . To alleviate memory pressure during training, a down-sampling of $H$ and $W$ can be implemented in practical applications.

2. Patchify: We follow the patching strategy from the literature and assume that each patch is of size $p\times p$ . This results in a total of $(H/p)\times(W/p)$ patches, each of dimension $(p\times p\times C)$ . This method enables us to leverage Transformer-based architectures for feature extraction. For convolutional structures, the raw data can be fed directly into the model without specialized preprocessing.

Through these two methodologies, we effectively transform the original data format into one that is conducive to deep learning model processing, thereby enhancing the efficiency of data handling and model training.
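The two transformations above can be sketched with plain array reshapes. The function names `nodalize` and `patchify`, the grid size (32 by 64, chosen so that $H\times W=2048$), and the patch size $p=8$ are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np

def nodalize(x):
    """(C, H, W) -> (H*W, C): one node per grid point, C features each."""
    c, h, w = x.shape
    return x.reshape(c, h * w).T

def patchify(x, p):
    """(C, H, W) -> ((H/p)*(W/p), p*p*C): non-overlapping p x p patches."""
    c, h, w = x.shape
    assert h % p == 0 and w % p == 0, "H and W must be divisible by p"
    x = x.reshape(c, h // p, p, w // p, p)
    x = x.transpose(1, 3, 2, 4, 0)          # (H/p, W/p, p, p, C)
    return x.reshape((h // p) * (w // p), p * p * c)

x = np.arange(4 * 32 * 64, dtype=float).reshape(4, 32, 64)  # C=4, H=32, W=64
nodes = nodalize(x)       # shape (2048, 4)
patches = patchify(x, 8)  # shape (32, 256)
```

In `nodalize`, the node at row-major index `i * W + j` holds the `C` variables observed at grid point `(i, j)`, which is what lets each grid point be treated as a sensor.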

Appendix C Algorithm

Algorithm 1 Dynamic Sparse Training (DynST) Framework

Require: Input graph $\mathcal{G}_{in}$ , Network $f$ , Target Sparsity $S_{g}\%$
Ensure: Sparse mask $M_{g}$
1: Initialize graph mask $M_{g}$
2: Stream Morph for input $\mathcal{G}_{in}\rightarrow\hat{\mathcal{G}}_{in}$
3: while $1-\frac{\|M_{g}\|_{0}}{\|\mathcal{G}_{in}[;*]\|_{0}}<S_{g}$
4:   Training network for $R$ iterations
5:   Training $M_{g}$ for $M$