License: CC BY 4.0
arXiv:2403.02914v1 [cs.AI] 05 Mar 2024

DynST: Dynamic Sparse Training for Resource-Constrained Spatio-Temporal Forecasting

Hao Wu easyluwu@tencent.com Tencent Inc. Haomin Wen wenhaomin@bjtu.edu.cn Beijing Jiaotong University Guibin Zhang bin2003@tongji.edu.cn Tongji University Yutong Xia yutong.x@outlook.com National University of Singapore Kai Wang kai.wang@comp.nus.edu.sg National University of Singapore Yuxuan Liang yuxliang@outlook.com Hong Kong University of Science and Technology (Guangzhou) Yu Zheng msyuzheng@outlook.com JD iCity, JD Technology  and  Kun Wang wk520529@mail.ustc.edu.cn University of Science and Technology of China
Abstract.

The ever-increasing sensor service, though opening a precious path and providing a deluge of earth system data for deep-learning-oriented earth science, sadly introduces a daunting obstacle to its industrial-level deployment. Concretely, earth science systems rely heavily on the extensive deployment of sensors; however, data collection from sensors is constrained by complex geographical and social factors, making it challenging to achieve comprehensive coverage and uniform deployment. To alleviate this obstacle, traditional approaches to sensor deployment utilize specific algorithms to design and deploy sensors. These methods dynamically adjust the activation times of sensors to optimize the detection process across each sub-region. Regrettably, formulating an activation strategy is generally based on historical observations and geographic characteristics, which makes the methods and resultant models neither simple nor practical. Worse still, the complex technical design may ultimately lead to a model with weak generalizability. In this paper, we introduce for the first time the concept of dynamic sparse training for spatio-temporal data and are committed to adaptively and dynamically filtering important sensor distributions. To our knowledge, this is the first proposal (termed DynST) of an industry-level deployment optimization concept at the data level. However, due to the existence of the temporal dimension, pruning of spatio-temporal data may lead to conflicts at different timestamps. To resolve these conflicts, we employ dynamic merge technology, along with dimensional mapping, to mitigate potential impacts caused by the temporal aspect. During the training process, DynST utilizes iterative pruning and sparse training, repeatedly identifying and dynamically removing sensor perception areas that contribute the least to future predictions.

DynST demonstrates tremendous capability on industrial-grade data from JD Technology TaxiBJ+ and practical deployment scenarios such as meteorology, combustion dynamics, and turbulence. It seamlessly integrates with relevant models and efficiently prunes image and graph-type data, leading to significantly higher inference speeds without introducing noticeable performance degradation.

Sparse Training, Spatio-temporal Data Mining, Deep Learning

1. Introduction

Deep learning has revolutionized spatio-temporal (ST) forecasting, demonstrating remarkable proficiency in distilling valuable insights from extensive ST datasets (e.g., human mobility (Wu et al., 2023b; Pan et al., 2019), precipitation (Zhang et al., 2023b; Bi et al., 2023), frame dynamics (Li et al., 2020; Wu et al., 2023a), and meteorology (Pathak et al., 2022; Wu et al., 2023c)). In recent years, the widespread deployment of sensors has ushered in an unprecedented influx of earth system data from across the globe and outer space. However, this expansion comes at a significant cost. Worse still, the prolonged operation of sensors leads to significant power loss and hardware wear. To illustrate, the National Science Foundation (NSF) in the United States allocated over one billion dollars in its 2021 fiscal year budget to support research in these areas at numerous universities nationwide (Rissler et al., 2020).

Traditional approaches to sensor running time optimization (Priyadarshi et al., 2020; Zou and Chakrabarty, 2003; Yarinezhad and Hashemi, 2023; Xu, 2020; Kundu and Das, 2023), e.g., virtual force and Voronoi diagrams, utilize specific algorithms to dynamically activate sensors. These methods dynamically adjust the activation times of sensors to optimize the detection process across each sub-region. Unfortunately, generating an effective activation strategy using only pre-existing historical observation data or urban geographic characteristics is very tricky, as it often involves complex technical design (Zhang et al., 2023a). Furthermore, with numerous factors influencing sensor deployment, relying solely on single variables (such as urban layout or geographic features) does not accurately capture the optimal deployment strategy (Yan and Li, 2023; Zheng et al., 2023).

With this in mind, our aim in this paper is to speed up inference by proposing a novel sensor deactivation strategy based on historical observations. A promising direction involves adopting deep-learning-oriented metrics to adaptively and dynamically evaluate or verify the benefits brought by each sensor deployment. The ever-growing literature on dynamic sparse training (termed DST) (Evci et al., 2020; Liu et al., 2021; Huang et al., 2023; Liu et al., 2020), though opening a potential path toward automating effective deployment, sadly leaves a daunting obstacle on the way to spatio-temporal on-device deployment. Concretely, DST demonstrates the potential to train a sparse sub-network from scratch that matches the performance of a fully dense network. In the real world, the training of models and the optimization of sensors remain heavy burdens in both academia and industry. Transferring the concept of DST to the spatio-temporal forecasting realm is intuitively beneficial, as it can significantly accelerate model training while optimizing deployment.

Regrettably, the application of DST to the challenge of spatio-temporal sensor deployment necessitates a meticulously aligned methodology. This is primarily because there exists a pronounced and inherent disparity between conventional DST frameworks and the nuances of spatio-temporal forecasting. Specifically:

  • DST focuses primarily on the network level; if we abstract each sub-region of the data as the monitoring range of a sensor, DST methods struggle to dynamically select the most important sensors (or sub-counterparts of the dataset) because the data is a prerequisite and is non-trainable.

  • The complexity of the above issue is further amplified in time-series data, where the spatial collection of information is dynamic. This dynamic nature poses a significant challenge in determining from historical data which elements will have a more substantial impact on future outcomes.

To bridge the gap between industry and academia, this paper introduces for the first time the concept of dynamic sparse training for spatio-temporal data, termed DynST. DynST dynamically trains to filter out the crucial parts of data for future predictions, and eliminates non-essential services to achieve resource-constrained service management. Concretely, DynST employs dynamic training to apply masking to historical regions, with the aim of aggressively reducing the proliferation of sensor deployment. This approach is taken at the algorithmic level to more effectively mask individual regions (each corresponding to a sensor device). Given the dynamic nature of time-series data, we utilize explicit channel stacking to construct overlapping saliency maps of historical regions. This facilitates the scoring of the importance of sensors in each region.

DynST is both simple and efficient, demonstrating powerful optimization capabilities across a variety of industrial scenarios. It effectively reduces historically insignificant observation areas (i.e., sub-regions) in both regular and inherently irregular data environments, without impacting the performance of future predictions.

Summary of Contributions. This paper makes multiple contributions to address the questions raised. Unlike the pruning of convolutional networks, which are typically heavily over-parameterized (Gao et al., 2022b; Tan et al., 2022; Wang et al., 2018a, 2019; Gao et al., 2022a; Bai et al., 2022), directly pruning a less parameterized spatio-temporal model offers limited scope for improvement. Our first technical innovation is the introduction of an end-to-end optimization framework called DynST, which uniquely prunes the sub-counterparts of data input for the first time. DynST does not rely on any specific spatio-temporal regular architecture or irregular graph structure (Scarselli et al., 2008; Wu et al., 2020), allowing it to be flexibly applied across a wide range of spatio-temporal learning scenarios at scale. To the best of our knowledge, this is the first work to employ dynamic sparse training techniques for the optimization of industrial-level devices.

Viewing DynST as an advanced form of pruning for spatio-temporal datasets, our second technical breakthrough introduces a novel research direction. This direction involves the utilization of deep-learning-guided sparse training techniques for the strategic optimization of sensor deployments. Our methodology is inherently adaptive and data-driven, focusing on identifying and preserving the most vital monitoring areas within historical data. This approach significantly diverges from traditional sensor deployment strategies (Priyadarshi et al., 2020; Zou and Chakrabarty, 2003; Yarinezhad and Hashemi, 2023; Xu, 2020; Kundu and Das, 2023), which often employ specific algorithmic designs for sensor placement, like virtual force techniques and Voronoi diagrams. In contrast, our approach offers substantial real-world relevance and industrial applicability, representing a major leap forward in the field.

Our proposal has been experimentally verified across various industrial-grade datasets and diverse backbones. The key observations from our study are outlined below:

  • DynST Maintains Performance in Sparse Data. DynST integrates into various models and handles sparser input data without significantly affecting performance. For example, in the GNN architecture, DynST integration slightly increases the MAE on the Turbulence dataset from $4.35 \rightarrow 4.37$. In the Transformer architecture, DynST reduces the MAE from $3.67 \rightarrow 3.59$ on the JD traffic benchmark.

  • Significantly Improves Inference Efficiency. DynST enhances inference speed across different architectures. On the Turbulence dataset, the STGCN architecture speeds up by 72% to 1.721 times with DynST. On the Fire dataset, the GNN architecture speeds up by about 14.5% to 1.541 times. On the JD TaxiBJ+ dataset, the Transformer architecture nearly doubles in speed, increasing by about 34.5% to 1.987 times. These examples demonstrate DynST’s ability to improve computational efficiency, speeding up inference and handling large datasets efficiently.

  • Meets Industrial Standards. DynST effectively meets industrial requirements, introducing minimal performance loss at sparsity levels ranging from $30\% \sim 60\%$. Moreover, due to its model-agnostic nature, DynST is compatible with almost all industry-available models without conflict, showcasing strong transferability and plug-and-play characteristics.

2. Related Work

Our research is highly relevant to the following research themes:

ST predictive learning can be categorized into three main types. Convolutional Neural Network (CNN)-based architectures: this line of research focuses on spatial feature extraction using CNN-based structures (Gao et al., 2022b; Tan et al., 2022; Wu et al., 2023c; Shi et al., 2015). These architectures use convolutional layers to effectively detect patterns in image and video data. Key advancements include deep convolutional networks for complex feature extraction and 3D convolutions for spatial-temporal analysis in video processing (Wang et al., 2018b). Recurrent Neural Network (RNN)-based architectures: RNNs are used to optimize temporal data handling (Wang et al., 2017, 2018a, 2019), which is key for tasks like sequence prediction and time-dependent data analysis. Transformer-based architectures: these works handle spatio-temporal data (Gao et al., 2022a; Bai et al., 2022; Wu et al., 2023b, c) by employing the self-attention mechanism to effectively manage sequence data. They capture long-range dependencies in both spatial and temporal dimensions, making them suitable for complex sequence modeling and analysis. Notably, some models leverage graph neural networks primarily for ST graph management (Ji et al., 2023; Shao et al., 2022; Li et al., 2017), which we discuss below.

Graph Neural Networks (GNNs) & Graph Pooling. GNNs have emerged as a prominent subfield in machine learning, specifically tailored to manage and analyze graph-structured data (Wang et al., 2022; Yu et al., 2020; Thekumparampil et al., 2018; You et al., 2019). In general, GNNs owe their efficacy to a distinct “message-passing” mechanism, which seamlessly integrates topological structures with node characteristics to yield richer graph representations. Leveraging the powerful topological awareness capabilities of GNNs, many studies have customized and adapted GNNs for predictions in spatio-temporal scenarios (Ji et al., 2023; Shao et al., 2022; Li et al., 2017). Our method of dynamically filtering sensors can be understood as a form of graph pooling in the graph domain (Chen et al., 2018; Eden et al., 2018; Chen et al., 2021; Gao and Ji, 2019; Ranjan et al., 2020; Zhang et al., 2021). The distinction lies in the fact that traditional graph pooling is static, whereas our approach represents the first instance of addressing this kind of problem in dynamic temporal graphs.

Sensor Deployment. In the field of sensor deployment, traditional methods (Priyadarshi et al., 2020; Zou and Chakrabarty, 2003; Yarinezhad and Hashemi, 2023; Xu, 2020; Kundu and Das, 2023) often employ specific algorithms, such as virtual force and Voronoi diagrams, for sensor design and deployment. These strategies involve dynamically adjusting sensor activation times to optimize detection across various sub-regions. However, developing an effective activation strategy based solely on historical observation data or urban geographic features presents significant challenges, primarily due to the intricate technical design requirements (Zhang et al., 2023a). Additionally, as highlighted in (Yan and Li, 2023; Zheng et al., 2023), focusing only on single variables like urban layout or geographic characteristics fails to fully address the complexities of optimal deployment strategies.

3. Motivation

Figure 1. Motivation of our proposal.

In this section, we carefully examine the significance of our approach and establish the motivation behind DynST. Our analysis begins with empirical observations. Specifically, we use the large-scale dataset EAGLE (Janny et al., 2023), designed for learning complex fluid mechanics, as an example. EAGLE is represented as a graph, where each sub-region can be interpreted as the sensory area of a sensor. We identify the important regions using the attention maps from that study and apply masking to the non-essential areas. In each iteration, we randomly mask 15% of the less important areas and predict the future state of the regions with a 7-layer graph convolutional network (Kipf and Welling, 2016).

Insights & Reflections. As illustrated in Figure 1, we observe that for this dataset, identifying and removing 15% of the least important patches does not affect the model’s performance, which remains consistent with a Root Mean Square Error value of about $\sim 0.09$. However, the implementation of DynST results in a noticeable speedup in model inference. This finding inspires us to dynamically eliminate non-essential information. By removing these less important regions, we can better identify the parts crucial for future predictions and accelerate inference, which corresponds to sensor deactivation in real-world applications.

4. Preliminary

As our research involves both graph and image-type data, we systematically present relevant definitions here to facilitate the demonstration of our model.

4.1. Graph Notations

In this study, we focus on an attributed graph, represented as $\mathcal{G}=(\mathcal{V},\mathcal{E})$. Here, $\mathcal{V}$ and $\mathcal{E}$ correspond to the node and edge sets, respectively. The graph $\mathcal{G}$ has an associated feature matrix $\mathbf{X}\in\mathbb{R}^{N\times D}$, where $N=|\mathcal{V}|$ indicates the total number of nodes, and $D$ represents the feature dimensionality of each node. For any node $v_i\in\mathcal{V}$, its feature vector is a $D$-dimensional entity $\mathbf{x}_i=\mathbf{X}[i,\cdot]$. The adjacency matrix $\mathbf{A}\in\mathbb{R}^{N\times N}$ defines the inter-node connections, assigning $\mathbf{A}[i,j]=1$ when a pair of nodes $(v_i,v_j)$ is connected in $\mathcal{E}$ and $0$ otherwise. To effectively learn node representations within $\mathcal{G}$, the majority of GNNs utilize a neighborhood aggregation and message passing paradigm:

(1) $\mathbf{h}_i^{(l)}=\text{COMB}\left(\mathbf{h}_i^{(l-1)},\ \text{AGGR}\{\mathbf{h}_j^{(l-1)}: v_j\in\mathcal{N}(v_i)\}\right),\; 1\leq l\leq L$

$L$ represents the number of layers in the GNN. The initial feature vector $\mathbf{h}_i^{(0)}=\mathbf{x}_i$ corresponds to the features of node $v_i$. For each layer $l$ in the GNN, where $1\leq l\leq L$, the node embedding of $v_i$ is denoted by $\mathbf{h}_i^{(l)}$. Two critical functions in this process are AGGR and COMB. The AGGR function is responsible for aggregating information from a node’s neighborhood, while the COMB function is used to combine the representations of the ego-node and its neighbors.
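The paper leaves AGGR and COMB abstract. As a minimal sketch of one layer of Eq. (1), the block below assumes a mean over neighbors for AGGR and a linear mix followed by ReLU for COMB; the function name `message_passing_layer` and the weights `W_self` and `W_neigh` are hypothetical and chosen only for illustration.

```python
import numpy as np

def message_passing_layer(H, A, W_self, W_neigh):
    """One layer of Eq. (1), assuming mean AGGR and linear + ReLU COMB.

    H: [N, D_in] embeddings h^{(l-1)}; A: [N, N] binary adjacency matrix.
    W_self, W_neigh: [D_in, D_out] weights (hypothetical parameter names).
    """
    deg = A.sum(axis=1, keepdims=True)   # number of neighbors per node
    deg = np.maximum(deg, 1.0)           # guard against isolated nodes
    neigh = (A @ H) / deg                # AGGR: mean over N(v_i)
    # COMB: mix the ego-node and neighborhood representations, then ReLU.
    return np.maximum(H @ W_self + neigh @ W_neigh, 0.0)

# Toy path graph 0 - 1 - 2 with D = 2 features per node.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
I = np.eye(2)
H1 = message_passing_layer(X, A, I, I)   # h^{(0)} = x_i
```

With identity weights, each output row is simply the node's own features plus the mean of its neighbors' features, which makes the aggregation step easy to check by hand.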

4.2. Image-type Data Notations

For effective modeling of image-type data $\mathcal{X}$, we initially divide the total urban area into $p\times p$ sub-regions (patches), with each patch encompassing $(H/p, W/p)$ pixels, where $H$ and $W$ are the height and the width of the input images. It is worth noting that the choice of $p$ should balance the trade-off between practicality and spatial granularity. In our implementation, we partition the entire urban area into small squares, each comprising $p\times p$ sensors, adhering to practicality requirements.
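The partition described above can be sketched with a reshape and a transpose. This is only an illustration under the reading that the image is cut into a $p\times p$ grid whose cells each hold $(H/p, W/p)$ pixels; the function name `patchify` and the divisibility check are assumptions, not the authors' code.

```python
import numpy as np

def patchify(image, p):
    """Split an [H, W] image into a p x p grid of patches of shape (H/p, W/p)."""
    H, W = image.shape
    if H % p or W % p:
        raise ValueError("H and W must be divisible by p")
    ph, pw = H // p, W // p
    # [p, ph, p, pw] -> [p, p, ph, pw] -> [p*p, ph, pw], patches in row-major order
    return image.reshape(p, ph, p, pw).transpose(0, 2, 1, 3).reshape(p * p, ph, pw)

image = np.arange(16).reshape(4, 4)
patches = patchify(image, p=2)   # four patches of 2 x 2 pixels each
```

Each entry along the first axis of `patches` then plays the role of one sub-region, i.e. one candidate sensor area.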

4.3. Problem Formulation

Figure 2. Overview of our proposed DynST framework.

The target of our task is to identify the index of the sparse trivial sub-counterpart of the whole graph $\mathcal{G}$ or image $\mathcal{X}$. For the sake of simplicity in presentation, we omit the temporal dimension $T$ from the spatio-temporal data. More formally, we attempt to obtain a trainable mask $M_e\in\mathbb{R}^{N}$ (for masking graph nodes) or $M_e\in\mathbb{R}^{p\times p}$ (for masking image patches). When we attach $M_e$ to the original $\mathcal{G}$ ($M_e\odot\mathcal{G}$) or to the image $\mathcal{X}$ ($M_e\odot\mathcal{X}$), the objective is as follows:

(2) $\mathop{\operatorname{maximize}}_{\mathbf{M}_g}\; s_g=1-\frac{\|\mathbf{M}_g\|_0}{\|\mathbf{A}\|_0};\;\; \text{or}\; =1-\frac{\|\mathbf{M}_g\|_0}{p\times p}$
$\text{s.t.}\;\left|\mathcal{R}_{DynST}\left(M_e\odot *;\Theta\right)-\mathcal{R}_{Ori}\left(*;\Theta\right)\right|<\epsilon,$

where $s_g$ is the sparsity, $\|\cdot\|_0$ counts the number of non-zero elements, and $\epsilon$ is the threshold for permissible performance difference. $*$ denotes the graph or image inputs and $\mathcal{R}$ represents the evaluation metrics.
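To make the two parts of Eq. (2) concrete, the following sketch computes the sparsity of a binary mask and checks the performance constraint. The helper names `sparsity` and `within_tolerance` and the tolerance `eps=0.05` are hypothetical; the two MAE values are the Turbulence GNN figures quoted in the introduction.

```python
import numpy as np

def sparsity(mask, total):
    """s = 1 - ||mask||_0 / total, where total is the denominator of Eq. (2)."""
    return 1.0 - np.count_nonzero(mask) / total

def within_tolerance(metric_masked, metric_original, eps):
    """Constraint of Eq. (2): |R_DynST - R_Ori| < eps."""
    return abs(metric_masked - metric_original) < eps

p = 4
mask = np.ones((p, p))
mask[:2, :] = 0                      # deactivate the top half of the patch grid
s = sparsity(mask, total=p * p)      # image case: denominator p x p

# eps is an assumed tolerance, not a value taken from the paper.
ok = within_tolerance(4.37, 4.35, eps=0.05)
```

Here half of the sixteen patches are switched off, so `s` is 0.5, and the constraint holds because the MAE gap of 0.02 is below the assumed tolerance.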

5. Method

Fig 2 illustrates the overview of the DynST framework. In Earth sciences, sensor deployment typically falls into two categories, i.e., image-type and graph-type. Image-type deployment ensures that each area (termed a ‘patch’) is well covered by a sensor, while in graph-type deployment, the information from a node can be understood as being collected by a single sensor. To demonstrate the universal capabilities of DynST, we systematically consider both of these deployment types and perform a patchify operation on the images (Wu et al., 2023a). For graph data, since nodes can be defined as sensors, we do not perform any operations at the data input stage in this study.

5.1. Stream Morph Operator

Consider an ST framework that receives continuous observation data $\mathcal{X}_i$ at different time steps ($i=1,2,\dots,T$). Following the relevant literature (Arnab et al., 2021), we view this system as a unified four-dimensional structure, i.e., $\mathcal{X}_i\in\mathbb{R}^{[T_{\text{in}},C_{\text{in}},H,W]}$. Similarly, the dimensions of a temporal graph can be represented as $\mathcal{G}\in\mathbb{R}^{[T_{\text{in}},N,D]}$. Typically, in spatio-temporal scenarios, the information collected by sensors is expressed as dynamic temporal observations. However, while the positions of the sensors are fixed, the sensory data is subject to dynamic changes. To the best of our knowledge, traditional methods have primarily focused on the optimization of data (Anonymous, 2024). We are the first to consider this industrial scenario from the perspective of sensor deployment. As a result, conventional methods are not applicable in our domain. Taking image-type data as an example, the image is first tokenized into $N=HW/(p^{2})$ non-overlapping patches, after which we apply the stream morph operator.

Figure 3. The process of stream morph operator. Each rectangular block and circle node can be interpreted as a sensor recorder.

As shown in Fig 3, stream morph addresses this by merging the $H$ and $W$ channels of the image, and stacking the temporal ($T$) channel with the $C$ channel. This approach effectively eliminates the interference of the $T$ dimension in model predictions. In this way, the training input time series can be deemed as $\tilde{\mathcal{X}}_i\in\mathbb{R}^{[H\times W,\,T_{\text{in}}\times C_{\text{in}}]}$ (a graph can be deemed as $\tilde{\mathcal{G}}_{in}\in\mathbb{R}^{[N,\,T_{\text{in}}\times C_{\text{in}}]}$, where $N=HW/(p^{2})$), in which each rectangular block ($\tilde{\mathcal{X}}_{in}^{(j)}\in\mathbb{R}^{[p^{2},\,T_{\text{in}}\times C_{\text{in}}]}$) and circle node ($\tilde{\mathcal{G}}_{in}^{(j)}\in\mathbb{R}^{[1,\,T_{\text{in}}\times C_{\text{in}}]}$) can be interpreted as a sensor recorder. For ease of understanding, we will primarily use graph inputs as examples to illustrate the model process in subsequent sections. The distinctions between graph-type data and image data will be highlighted in the final Model Summary (Sec 5.4).
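The dimension merging described above amounts to a transpose followed by a reshape. The sketch below is one plausible reading of the operator, assuming the time axis is stacked as the slower-varying index within the merged channel; the function names are hypothetical.

```python
import numpy as np

def stream_morph_image(x):
    """[T, C, H, W] -> [H*W, T*C]: one row per spatial location."""
    T, C, H, W = x.shape
    return x.transpose(2, 3, 0, 1).reshape(H * W, T * C)

def stream_morph_graph(g):
    """[T, N, D] -> [N, T*D]: one row per node (sensor)."""
    T, N, D = g.shape
    return g.transpose(1, 0, 2).reshape(N, T * D)

x = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)   # T=2, C=3, H=4, W=5
x_tilde = stream_morph_image(x)                    # shape [20, 6]

g = np.arange(2 * 4 * 3).reshape(2, 4, 3)          # T=2, N=4, D=3
g_tilde = stream_morph_graph(g)                    # shape [4, 6]
```

After this step every row holds the full history of one location or node, so a single score per row can decide whether that sensor is kept, independently of the timestamp.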

Then, the stream morph operator employs a parameterized graph mask $M_g\in\mathbb{R}^{[N,1]}$ to dynamically score all nodes, with its parameters shared across all nodes. Given the target graph sparsity $s_g\%$, we first initialize $M_g$ and attach the dense mask $M_g$ to the sensor regions as $M_g\odot\tilde{\mathcal{G}}_{in}$; we then resort to the training scheme of Sec 5.2 to separate important regions from trivial ones.
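Because $M_g$ has shape $[N,1]$, attaching it to the morphed input is a broadcast over the stacked $T_{\text{in}}\times C_{\text{in}}$ columns. The snippet below illustrates this with random data; initializing the dense mask to all ones is an assumption, since the text does not state the initial values.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, C = 5, 4, 2
g_tilde = rng.normal(size=(N, T * C))   # stream-morphed graph input [N, T*C]

M_g = np.ones((N, 1))                   # dense mask (assumed all-ones start)
masked = M_g * g_tilde                  # M_g broadcasts across the T*C columns

M_g[2, 0] = 0.0                         # switching off sensor 2 ...
masked_off = M_g * g_tilde              # ... zeroes its entire history at once
```

A single scalar per node therefore controls every timestamp of that sensor, which is what lets the mask avoid conflicting decisions across time.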

5.2. Iterative Pruning towards High Sparsity

With $M_g$ at hand, we proceed to train the model together with the fixed input graph and the graph mask, denoted as $f(M_g\odot\tilde{\mathcal{G}}_{in},\mathbf{\Theta})$, where $f$ denotes the mapping function of the input ST model. With the objective function in Eq. 2, we aim to gradually find the sparse sub-graph that best preserves semantics. One candidate approach is one-shot pruning (Ma et al., 2021; Frankle et al., 2020); however, the sparse mask acquired through one-shot pruning is suboptimal. In fact, the assessment of each sensor necessitates iterative testing to ensure that the removal of a specific area does not significantly impact future predictions. To achieve our objectives, we employ an iterative pruning strategy (Chen et al., 2021) to gradually increase network sparsity. Assuming that each pruning iteration trims $p\%$ of the data parameters, after $\phi$ rounds of pruning the remaining regions exhibit a distinct advantage over the one-shot approach: by iteratively pruning and retraining, the network can more effectively identify which parts are less important, as the remaining parameters have undergone $\phi$ rounds of repeated verification. Unlike previous iterative pruning literature, we alternately train the network and the mask $M_g$ to ensure that the mask can fully assimilate the effective information from the training process:

(3) $\mathrm{opt}_{\Theta}^{(R)}\, f\left(M_g \odot \tilde{\mathcal{G}}_{in}, \Theta\right) \;\leftrightharpoons\; \mathrm{opt}_{M_g}^{(M)}\, f\left(M_g \odot \tilde{\mathcal{G}}_{in}, \Theta^{*}\right)$

Here $\leftrightharpoons$ denotes the iterative alternation process. We first train the parameters $\Theta$ for $R$ iterations, then fix $\Theta$ as $\Theta^{*}$ and train the mask $M_g$ for $M$ iterations. Through this process, the mask $M_g$ potentially encapsulates the important information inherent in the data. Given the target sensor sparsity $s_g\%$, we binarize the mask $M_g$ by zeroing out the parts with the smallest parameter values:

(4) $\mathcal{D}_o\left(\mathrm{ArgTop}\left(|M_g^{(\mu)}|;\, p\%\right) \Rightarrow \{0,1\}\right)$

$M_g^{(\mu)}$ represents the state of the mask $M_g$ at the $\mu^{th}$ iteration. The operation $\mathrm{ArgTop}(u, v)$ denotes the process of setting the top $u\%$ parameters in the matrix to 1, while the remaining $v\%$ are set to 0. The $\mathcal{D}_o$ operator forcefully assigns each mask entry a status of 0 or 1.
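The binarization in Eq. 4 can be illustrated with a minimal sketch in plain Python. The function name `prune_round` and the toy scores are hypothetical. The sketch also assumes that each round removes $p\%$ of the original $N$ nodes, which is how the pruning rounds add up to the target sparsity in Section 6.3; the paper's own implementation may differ.

```python
def prune_round(scores, active, p_percent):
    """One magnitude-based pruning round in the spirit of Eq. 4.

    scores    : learned mask values M_g, one float per node
    active    : current binary mask (1 = sensor kept, 0 = pruned)
    p_percent : share of the original N nodes removed in this round
                (assumption: measured against N, not the remaining nodes)
    """
    n_total = len(scores)
    n_drop = min(sum(active), int(n_total * p_percent / 100))
    # Rank the still-active nodes by |score|, smallest first.
    ranked = sorted((i for i in range(n_total) if active[i]),
                    key=lambda i: abs(scores[i]))
    new_active = list(active)
    for i in ranked[:n_drop]:
        new_active[i] = 0  # force the weakest entries to 0
    return new_active


# Toy example: 10 sensors, two rounds that each remove 20% of N.
scores = [0.9, -0.1, 0.5, 0.05, -0.7, 0.3, 0.2, -0.8, 0.6, 0.4]
mask = [1] * len(scores)
for _ in range(2):
    mask = prune_round(scores, mask, 20)
print(mask, sum(mask))
```

In a real training loop the scores would be retrained between rounds, so the ranking can change from one round to the next.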

5.3. Dynamical Sparse Training

As described above, each sensor region requires meticulous verification to ensure reliability. To this end, we further introduce Dynamical Sparse Training (DST) techniques (Liu et al., 2021; Huang et al., 2023; Liu et al., 2020; Zhang et al., 2023c) to perform fine-tuning between two consecutive iterative pruning steps. Concretely, we selectively reactivate a portion of the regions that were previously pruned, while masking a portion of the areas that remain unpruned. After the $\omega^{th}$ round, we perform a drop-and-regrow process on the pruned mask $M_g^{(\omega(R+M))}$ (i.e., drop $\leftrightarrow$ regrow). We set the proportion of exchanged elements to $q\%$, typically with $q \ll p$, to control the drop and regrow of elements. We perform this “exchange of sensors” between the current activation regions $\mathcal{E}_{(\omega)} = \mathbf{M}_g \odot \tilde{\mathcal{G}}_{in}$ and their complement $\mathcal{E}_{(\omega)}^{C} = \neg\mathbf{M}_g \odot \tilde{\mathcal{G}}_{in}$.
Considering this process at the $\omega(R+M)$ time points, we proceed to train and adjust $M_g$:

(5) $M_g^{(\omega)}(prune) = \mathrm{ArgBottom}\left\{\left(\left|\nabla\left(\bar{M}_g^{(\omega)}\right)\right|;\, q\%\right) \Rightarrow \{0,1\}\right\}$

In this context, $\bar{M}_g^{(\omega)}$ represents the elements of $M_g^{(\omega)}$ that have not been pruned. Here, we use the gradient $\nabla$ to identify and drop the elements with the lowest gradients (the $\mathrm{ArgBottom}$ operator). Generally, gradients can indicate elements with the potential to contribute to the loss function (Wang et al., 2023; Evci et al., 2020). We need to align this activation to further explore their effectiveness in future judgments. Going beyond this process, we identify and regrow the elements with the highest gradients among those that have been pruned, effectively replacing the dropped elements:

(6) $M_g^{(\omega)}(regrow) = \neg\mathrm{ArgTop}\left\{\left(\left|-\nabla\left(\neg\bar{M}_g^{(\omega)}\right)\right|;\, q\%\right) \Rightarrow \{0,1\}\right\}$

In Eq. 6, we activate elements with larger gradients from the pruned set $\left(\neg\bar{M}_g^{(\omega)}\right)$. The operation $\neg\mathrm{ArgTop}$ serves as the inverse process of pruning, selecting elements with larger gradients for activation. This ensures that sensor regions with potential contributions are re-evaluated and validated.

Following the completion of the aforementioned evaluation process, we reconstruct $M_g$ to form a more reliable regional mask:

(7) $\mathbf{M}_g^{(\omega^{*})} \leftarrow \left(\mathbf{M}_g^{(\omega)} \setminus M_g^{(\omega)}(prune)\right) \cup M_g^{(\omega)}(regrow),$
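The drop-and-regrow exchange of Eqs. 5 to 7 can be sketched as follows. This is an illustrative reading, not the authors' code: the function name `drop_and_regrow` is hypothetical, $q\%$ is assumed to be measured against the total number of nodes, and the gradients of pruned entries are assumed to be available (for example, the loss gradient with respect to mask entries that are currently zero).

```python
def drop_and_regrow(active, grads, q_percent):
    """Swap the weakest kept sensors for the most promising pruned ones.

    active    : binary mask (1 = kept, 0 = pruned)
    grads     : gradient of the loss w.r.t. every mask entry
    q_percent : share of nodes exchanged (assumption: relative to N)
    """
    n_total = len(active)
    kept = [i for i in range(n_total) if active[i]]
    pruned = [i for i in range(n_total) if not active[i]]
    # The exchange cannot be larger than either side of the mask.
    n_swap = min(int(n_total * q_percent / 100), len(kept), len(pruned))

    # Eq. 5: among kept entries, drop those with the smallest |gradient|.
    drop = sorted(kept, key=lambda i: abs(grads[i]))[:n_swap]
    # Eq. 6: among pruned entries, regrow those with the largest |gradient|.
    regrow = sorted(pruned, key=lambda i: abs(grads[i]), reverse=True)[:n_swap]

    # Eq. 7: (mask minus dropped) union regrown.
    new_active = list(active)
    for i in drop:
        new_active[i] = 0
    for i in regrow:
        new_active[i] = 1
    return new_active, drop, regrow


active = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
grads = [0.5, 0.01, 0.3, 0.2, 0.9, 0.05, 0.4, 0.02, 0.6, 0.7]
new_active, dropped, regrown = drop_and_regrow(active, grads, 10)
print(new_active, dropped, regrown)
```

Because the same number of entries is dropped and regrown, the exchange leaves the overall sparsity unchanged; only the location of the active sensors moves.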

Then, at the beginning of round $\omega+1$, we continue to train and adjust the mask before passing it to the $(\omega+1)$-th round of pruning. We binarize the mask $\mathbf{M}_g^{(\omega+1)}$ after another $\Delta T$ iterations of training. Without loss of generality, taking the semi-supervised node classification task as an example, our objective function can be expressed as follows:

(8) $\mathcal{L}(M_g \odot \tilde{\mathcal{G}}_{in}; \Theta) = \frac{1}{K}\sum_{i=1}^{K} \left\|\mathcal{Y}_{T+i} - f(M_g \odot \tilde{\mathcal{G}}_{in}; \Theta)\right\|^{2}$

where $\mathcal{L}$ is the MSE loss calculated over the unmasked node set $\tilde{\mathcal{G}}_{in}$, and $\mathcal{Y}_{T+i}$ denotes the ground truth.
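To make the alternating scheme of Eq. 3 and the masked MSE objective of Eq. 8 concrete, here is a toy sketch in plain Python. Everything in it is an assumption made for illustration: the model $f$ is replaced by a linear readout of the masked sensors, the data are synthetic, and the names `predict`, `gradients`, and `alternate` are hypothetical. It only shows the control flow: a few updates of the parameters with the mask fixed, then a few updates of the mask with the parameters fixed.

```python
def predict(x, mask, theta):
    """Toy stand-in for f(M_g * G_in; Theta): linear readout of masked sensors."""
    return sum(t * m * v for t, m, v in zip(theta, mask, x))


def mse(data, mask, theta):
    """Eq. 8 for scalar targets: mean squared error over the samples."""
    return sum((y - predict(x, mask, theta)) ** 2 for x, y in data) / len(data)


def gradients(data, mask, theta):
    """Analytic gradients of the MSE with respect to theta and the mask."""
    n = len(mask)
    g_theta, g_mask = [0.0] * n, [0.0] * n
    for x, y in data:
        err = predict(x, mask, theta) - y
        for j in range(n):
            g_theta[j] += 2.0 * err * mask[j] * x[j] / len(data)
            g_mask[j] += 2.0 * err * theta[j] * x[j] / len(data)
    return g_theta, g_mask


def alternate(data, mask, theta, rounds=20, r_steps=5, m_steps=5, lr=0.05):
    """Eq. 3: r_steps updates of theta (mask fixed), then m_steps of the mask."""
    for _ in range(rounds):
        for _ in range(r_steps):
            g_theta = gradients(data, mask, theta)[0]
            theta = [t - lr * g for t, g in zip(theta, g_theta)]
        for _ in range(m_steps):
            g_mask = gradients(data, mask, theta)[1]
            mask = [m - lr * g for m, g in zip(mask, g_mask)]
    return mask, theta


# Three sensors; the target depends on sensors 0 and 1 only (y = 2*x0 - x1).
DATA = [([1, 0, 0], 2), ([0, 1, 0], -1), ([0, 0, 1], 0),
        ([1, 1, 1], 1), ([1, 2, 0], 0), ([2, 0, 1], 4)]
mask0, theta0 = [1.0, 1.0, 1.0], [0.1, 0.1, 0.1]
mask1, theta1 = alternate(DATA, mask0, theta0)
print(mse(DATA, mask0, theta0), mse(DATA, mask1, theta1))
```

In the full method the continuous mask produced by such a loop would then be binarized and passed to the next pruning round.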

Figure 4. An overview of the anticipated JD Technology platform. We represent the importance of sensors with a gradient from light to dark blue, and remove deployment in the white areas to emphasize this gradation of significance.

5.4. Model Summary & Complexity Analysis

For image-type data, we transform each sub-region into a patch, which can also be understood as a “node”. Therefore, by training in a similar manner, we can identify the important sub-regions accordingly. DynST can enhance the inference speed of the model, with the gain depending on the predefined sparsity $s_g\%$. Typically, this results in an acceleration ratio of $1/s_g\%$. We summarize our prospective system and algorithm in Fig 4 and Appendix C, respectively.
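The patch-as-node view for image-type data can be sketched as below. The helper names `patchify` and `apply_patch_mask` are hypothetical, the patches are assumed to be non-overlapping squares, and masked patches are simply zeroed here; an actual implementation might instead drop the corresponding tokens before they reach the backbone.

```python
def patchify(grid, patch):
    """Split an H x W grid (list of rows) into non-overlapping square patches.

    Patches are returned in row-major order, so patch k plays the role of node k.
    """
    h, w = len(grid), len(grid[0])
    if h % patch or w % patch:
        raise ValueError("grid size must be a multiple of the patch size")
    patches = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            patches.append([row[c:c + patch] for row in grid[r:r + patch]])
    return patches


def apply_patch_mask(grid, patch, mask):
    """Zero every cell that belongs to a patch whose mask entry is 0."""
    h, w = len(grid), len(grid[0])
    per_row = w // patch
    out = [list(row) for row in grid]
    for r in range(h):
        for c in range(w):
            node = (r // patch) * per_row + (c // patch)
            if not mask[node]:
                out[r][c] = 0
    return out


grid = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(patchify(grid, 2)[1])
print(apply_patch_mask(grid, 2, [1, 0, 0, 1]))
```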

6. Experiments

In this section, we conduct extensive experiments to answer the following research questions ($\mathcal{RQ}$):

  1. $\mathcal{RQ}$1: Can DynST effectively find the sparse sub-counterpart of the original input without performance degradation?

  2. $\mathcal{RQ}$2: What is the specific performance of DynST on image-type data?

  3. $\mathcal{RQ}$3: What is the specific performance of DynST on graph data?

  4. $\mathcal{RQ}$4: Can we combine the concept of DynST with a different training scheme?

To answer these $\mathcal{RQ}$s, we orchestrate the following experiments:

  • Main experiment. We conduct a comprehensive comparative analysis on various scientific datasets, covering meteorology, combustion science, traffic studies, and turbulence dynamics. The study encompasses both mainstream Graph Neural Network (GNN) architectures and non-GNN structures. In Appendix B, we detail the methods for data preprocessing, including how to convert raw data into graph and image formats.

  • Multiple Training Strategies Experiments. We choose Weatherbench as the benchmark dataset, to evaluate the effectiveness of DynST when combining different training schemes. Specifically, in the training phase, we not only consider the impacts of parallel prediction and autoregressive iterative prediction but also introduce iterative pruning and one-shot pruning strategies. We focus on assessing the impacts of these strategies on model size, computational efficiency, and accuracy.

  • Ablation experiment. We carry out comprehensive ablation studies on the Jingdong Technology industry-level traffic dataset, Taxibj+, to validate the impact of various design choices on the practical implementation of our model. Through these experiments, we aim to deeply understand how the DynST concept affects data interpretability and the overall effectiveness.

Experimental settings. All experiments in this study are conducted on the NVIDIA-A100 40G configuration. To ensure consistency, we use the same settings in all experiments, including learning rate, optimizer, and more. We also apply a uniform training strategy. The loss function used in the experiments is set as Mean Squared Error (MSE) loss. For dataset division, we split the data into training, validation, and test sets in an 8:1:1 ratio. Specifically, for the Vision Transformer model (Ranftl et al., 2021), we replace the classification head from the original paper with three deconvolution layers.
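The 8:1:1 division mentioned above can be sketched as a simple index split. The function name `split_811` is hypothetical, and the sketch assumes a contiguous split that preserves sample order, which is a common choice for time series but is not stated in the text.

```python
def split_811(samples):
    """Split an ordered sequence into train/validation/test parts (8:1:1).

    Assumption: the split is contiguous and keeps temporal order; any
    remainder from integer division goes to the test part.
    """
    n = len(samples)
    n_train = n * 8 // 10
    n_val = n // 10
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]
    return train, val, test


train, val, test = split_811(list(range(100)))
print(len(train), len(val), len(test))
```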

6.1. Dataset & Backbones

Table 1. Performance comparisons on different GNN and non-GNN architectures, in which we report the best performance of these baselines. All experimental results are over ten runs. We show the MAE metric for all settings. GNN backbones: STGCN, CLCRN, EGNN; non-GNN backbones: ViT, SimVP, TAU, Earthfarseer. Each "+DynST" column refers to the backbone on its left.
| Dataset | STGCN | +DynST | CLCRN | +DynST | EGNN | +DynST | ViT | +DynST | SimVP | +DynST | TAU | +DynST | Earthfarseer | +DynST | Avg Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WeatherBench ♣ | 4.35 | 4.37 | 1.17 | 1.22 | 2.98 | 3.00 | 0.72 | 0.73 | 0.74 | 0.73 | 0.73 | 0.77 | 0.58 | 0.62 | 1.721 |
| WeatherBench ♠ | 2.02 | 2.04 | 1.49 | 1.52 | 3.39 | 3.42 | 0.27 | 0.29 | 0.27 | 0.29 | 0.26 | 0.25 | 0.24 | 0.25 | 1.522 |
| WeatherBench ♥ | 0.79 | 0.75 | 0.45 | 0.47 | 0.66 | 0.72 | 0.24 | 0.26 | 0.25 | 0.26 | 0.23 | 0.24 | 0.22 | 0.24 | 1.119 |
| WeatherBench ♠ | 3.64 | 3.67 | 1.33 | 1.31 | 2.31 | 2.33 | 0.51 | 0.54 | 0.51 | 0.52 | 0.49 | 0.50 | 0.48 | 0.50 | 1.398 |
| FIT $\phi$ | 1.27 | 1.29 | 0.97 | 0.98 | 1.03 | 1.09 | 0.23 | 0.22 | 0.14 | 0.16 | 0.13 | 0.14 | 0.09 | 0.11 | 1.543 |
| FIT $\varphi$ | 0.96 | 1.09 | 0.76 | 0.81 | 0.92 | 0.95 | 0.17 | 0.19 | 0.10 | 0.09 | 0.09 | 0.10 | 0.02 | 0.03 | 1.541 |
| Taxibj+ Inflow | 5.98 | 5.99 | 3.98 | 4.02 | 4.22 | 4.33 | 3.22 | 3.33 | 3.05 | 3.11 | 2.98 | 3.00 | 2.09 | 2.10 | 1.421 |
| Taxibj+ Outflow | 5.21 | 5.23 | 3.64 | 3.60 | 4.21 | 4.19 | 3.67 | 3.59 | 3.01 | 3.03 | 2.77 | 2.87 | 2.12 | 2.22 | 1.987 |
| EAGLE | 1.99 | 2.07 | 1.45 | 1.47 | 1.66 | 1.67 | 1.45 | 1.47 | 1.23 | 1.34 | 1.19 | 1.27 | 1.08 | 1.12 | 1.988 |

Datasets. In this study, we conduct thorough analyses of multiple sensor-loaded datasets covering four main areas: meteorology, fires, turbulence, and traffic flow. In meteorology, we select the Weatherbench dataset. Following the design framework of related papers (Rasp et al., 2020), we consider four key variables: temperature (♣), humidity (♠), wind speed (♥), and cloud cover (♠), with the dataset containing 2048 nodes. For fire data, we choose the FIT dataset. Adhering to existing paper settings (Anonymous, 2023), we focus primarily on two variables: temperature ($\phi$) and visibility ($\varphi$), totaling 15360 data nodes. In turbulence, we refer to the EAGLE dataset (Janny et al., 2023), a large turbulence dataset involving velocity and pressure variables, presented in an irregular grid form with 162760 nodes. Regarding traffic flow, we use JD Technology’s Taxibj+ dataset (Wu et al., 2023b), which provides traffic flow statistics for Beijing city, comprising 16384 data nodes. For the convenience of this study, each node is considered an independent sensor.

Backbones. We use both GNN and non-GNN architectures to systematically validate the generalizability of our ideas. Concretely, we use GNN-based models as our backbone, such as STGCN (Han et al., 2020), CLCRN (Lin et al., 2022) and EGNN (Satorras et al., 2021), as well as non-GNNs such as Vision Transformer (Dosovitskiy et al., 2021), SimVP-V2 (Tan et al., 2022), TAU (Tan et al., 2023) and Earthfarseer (Wu et al., 2023b). All GNNs have 7-layer encoder blocks, while non-GNNs use Transpose Conv2d for upsampling. This detailed categorization method greatly helps in deeply understanding and accurately analyzing the capabilities of DynST.

6.2. Main experiments ($\mathcal{RQ}$1)

In this section, we test whether DynST can effectively remove non-essential areas (corresponding to the concept of sensors in the real world) without impacting the overall predictive performance of the model. To thoroughly investigate the generalizability and optimization capabilities of DynST, we integrate it with existing general frameworks and set the iterative pruning process to occur 10 times, each time reducing the data by 3%. We showcase the main results in Tab 1 and we can list the observations:

Obs 1. DynST demonstrates that removing certain parts of the input data has little effect on the model’s performance. As shown in Tab 1, we can readily compare the outcomes before and after integrating the DynST concept into each model (+DynST). In the GNN architectures, the addition of DynST generally has a minimal impact on MAE. For example, the MAE slightly increases from 4.35 to 4.37 for STGCN on WeatherBench ♣ and from 0.92 to 0.95 for EGNN on FIT $\varphi$. In non-GNN architectures, DynST keeps the MAE within a small margin of the original and in some cases reduces it. For instance, in the ViT architecture on the Taxibj+ Outflow dataset, the MAE decreases from 3.67 to 3.59. In particular, DynST enhances the inference speed across architectures: Tab 1 reports average speedups of 1.721 times on WeatherBench ♣, 1.541 times on FIT $\varphi$, and 1.987 times on Taxibj+ Outflow, effectively boosting the efficiency of inference.

Obs 2. DynST shows high efficiency in several scenarios. DynST is also highly effective in improving the inference efficiency of various architectures. For example, on the WeatherBench ♣ dataset, inference with DynST runs 1.721 times faster than the original on average, a 72.1% increase in speed. Similarly, on the FIT $\varphi$ dataset, DynST reaches 1.541 times the original speed (a 54.1% increase). Moreover, on the Taxibj+ Outflow dataset, the inference speed almost doubles, rising to 1.987 times the original (a 98.7% increase). These examples collectively show DynST’s capability to enhance computational efficiency in various scenarios and its advantage in accelerating the inference of various ST architectures.
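The arithmetic behind this subsection can be checked with a short sketch. It assumes that each of the 10 pruning rounds removes 3% of the original node count (so the rounds add up to 30% sparsity), and it uses the standard conversion from a speedup factor $s$ to a relative speed increase of $(s - 1) \times 100\%$. The function names are hypothetical; the node counts and speedup factors are those given in the Datasets paragraph and in Table 1.

```python
def sensors_kept(n_nodes, rounds, p_percent):
    """Sensors remaining after `rounds` pruning rounds of p_percent each.

    Assumption: every round removes p_percent of the ORIGINAL node count,
    so the total sparsity is rounds * p_percent.
    """
    removed = n_nodes * rounds * p_percent // 100
    return n_nodes - removed


def speedup_to_percent(speedup):
    """Relative speed increase (in percent) implied by a speedup factor."""
    return (speedup - 1.0) * 100.0


for name, n in [("WeatherBench", 2048), ("FIT", 15360),
                ("Taxibj+", 16384), ("EAGLE", 162760)]:
    print(name, sensors_kept(n, rounds=10, p_percent=3))

for factor in (1.721, 1.541, 1.987):
    print(factor, round(speedup_to_percent(factor), 1))
```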

6.3. Deep insights ($\mathcal{RQ}$2 & $\mathcal{RQ}$3)

In this section, we conduct a more systematic study of DynST’s ability to accelerate inference. We select both graph-type and image-type data to observe model performance at various levels of sparsity. Concretely, for graph-type data, we choose Taxibj+ and EAGLE as benchmarks. For image-type data, we choose the temperature ($\phi$) variable of the FIT dataset and the temperature (♣) variable of WeatherBench. We integrate DynST with existing general frameworks and set the iterative pruning process to occur 10 times, with each iteration reducing the data volume by $\{1\%, 2\%, \cdots, 6\%\}$, which yields data sparsity levels of $\{10\%, 20\%, \cdots, 60\%\}$. We employ a roll-out strategy (Luo et al., 2023) to iteratively predict long sequences and verify the long-term prediction ability of the baselines after incorporating DynST. We list the observations as follows.

Table 2. Comparison results among different benchmarks, considering different data sparsity levels and prediction lengths. The number after each benchmark name is the prediction length.
Graph-type benchmarks:
| Sparsity | Taxibj+ 4 | Taxibj+ 8 | Taxibj+ 12 | EAGLE 30 | EAGLE 40 | EAGLE 50 |
| --- | --- | --- | --- | --- | --- | --- |
| 10% | $1.92_{\pm 0.01}$ | $1.99_{\pm 0.01}$ | $2.03_{\pm 0.01}$ | $1.14_{\pm 0.02}$ | $1.18_{\pm 0.02}$ | $1.19_{\pm 0.02}$ |
| 20% | $2.04_{\pm 0.03}$ | $2.12_{\pm 0.01}$ | $2.14_{\pm 0.01}$ | $1.17_{\pm 0.02}$ | $1.23_{\pm 0.02}$ | $1.24_{\pm 0.02}$ |
| 30% | $2.07_{\pm 0.01}$ | $2.17_{\pm 0.01}$ | $2.18_{\pm 0.02}$ | $1.21_{\pm 0.02}$ | $1.24_{\pm 0.01}$ | $1.26_{\pm 0.02}$ |
| 40% | $2.21_{\pm 0.02}$ | $2.24_{\pm 0.01}$ | $2.25_{\pm 0.01}$ | $1.25_{\pm 0.03}$ | $1.26_{\pm 0.01}$ | $1.28_{\pm 0.02}$ |
| 50% | $2.37_{\pm 0.03}$ | $2.39_{\pm 0.01}$ | $2.42_{\pm 0.03}$ | $1.27_{\pm 0.01}$ | $1.27_{\pm 0.02}$ | $1.29_{\pm 0.02}$ |
| 60% | $2.45_{\pm 0.02}$ | $2.48_{\pm 0.01}$ | $2.51_{\pm 0.01}$ | $1.29_{\pm 0.01}$ | $1.30_{\pm 0.01}$ | $1.33_{\pm 0.02}$ |
Image-type benchmarks:
| Sparsity | FIT $\phi$ 30 | FIT $\phi$ 40 | FIT $\phi$ 50 | WeatherBench ♣ 4 | WeatherBench ♣ 8 | WeatherBench ♣ 12 |
| --- | --- | --- | --- | --- | --- | --- |
| 10% | $0.13_{\pm 0.01}$ | $0.15_{\pm 0.01}$ | $0.15_{\pm 0.01}$ | $0.53_{\pm 0.01}$ | $0.57_{\pm 0.01}$ | $0.58_{\pm 0.02}$ |
| 20% | $0.15_{\pm 0.03}$ | $0.16_{\pm 0.01}$ | $0.17_{\pm 0.01}$ | $0.54_{\pm 0.01}$ | $0.59_{\pm 0.01}$ | $0.61_{\pm 0.03}$ |
| 30% | $0.15_{\pm 0.01}$ | $0.17_{\pm 0.01}$ | $0.18_{\pm 0.01}$ | $0.58_{\pm 0.01}$ | $0.62_{\pm 0.01}$ | $0.64_{\pm 0.01}$ |
| 40% | $0.17_{\pm 0.02}$ | $0.18_{\pm 0.03}$ | $0.19_{\pm 0.01}$ | $0.60_{\pm 0.01}$ | $0.65_{\pm 0.01}$ | $0.66_{\pm 0.02}$ |
| 50% | $0.19_{\pm 0.01}$ | $0.21_{\pm 0.01}$ | $0.22_{\pm 0.02}$ | $0.61_{\pm 0.01}$ | $0.67_{\pm 0.01}$ | $0.69_{\pm 0.01}$ |
| 60% | $0.21_{\pm 0.01}$ | $0.22_{\pm 0.01}$ | $0.23_{\pm 0.02}$ | $0.63_{\pm 0.01}$ | $0.69_{\pm 0.01}$ | $0.70_{\pm 0.01}$ |

Obs 3. DynST effectively achieves long-term predictions without causing significant performance degradation. We tested the capability of long-term prediction with a combination of DynST and Earthfarseer and found that incorporating the concept of dynamic sparse training did not compromise the model’s performance. Even at a higher sparsity level of 60%, it still manages to deliver reasonably good predictive performance without a significant increase in RMSE.
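The roll-out evaluation behind these long-horizon results can be sketched generically: a model that predicts the next frame is applied repeatedly, and each prediction is fed back as input. The function `rollout` and the toy `linear_extrapolation` step below are hypothetical stand-ins; the actual backbone and window length used in the experiments are not specified here.

```python
def rollout(step_fn, history, horizon):
    """Autoregressively predict `horizon` future frames.

    step_fn : callable mapping a window of past frames to the next frame
    history : list of observed frames used as the initial window
    """
    window = list(history)
    predictions = []
    for _ in range(horizon):
        nxt = step_fn(window)
        predictions.append(nxt)
        # Slide the window: drop the oldest frame, append the new prediction.
        window = window[1:] + [nxt]
    return predictions


def linear_extrapolation(window):
    """Toy one-step model: continue the trend of the last two frames."""
    return 2 * window[-1] - window[-2]


print(rollout(linear_extrapolation, [1.0, 2.0, 3.0], horizon=4))
```

Because every step consumes earlier predictions, errors can accumulate with the horizon, which is why long roll-outs are a demanding test for a sparsified input.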

Figure 5. The performance visualization of FIT datasets.
Figure 6. The performance visualization of JD Technology Taxibj+ datasets.

Obs 4. DynST effectively meets industrial-level requirements (30% sparsity), keeping inference demands manageable while reducing the inference burden. In Fig 5, the first row shows the actual observed temperature flow field, and the second row shows the temperature flow field predicted by the Earthfarseer+DynST model, allowing a direct comparison between observed and predicted states. At the bottom, the left image compares the real and predicted temperature time series at coordinates (50,7), and the right image does the same at coordinates (425,7), showing how the model captures the temporal evolution of temperature in each area. These results show that the DynST-enhanced model preserves high local fidelity and reaches industry-level reliability, consistently keeping the prediction deviation within a 15% margin (Verda et al., 2021). This underscores the robustness of the Earthfarseer+DynST model and its potential for applications that demand high precision and reliability (Fig 6 also supports our findings).

6.4. Structural & Ablation study ($\mathcal{RQ}$4)

We initially configure DynST to maintain the model at a moderate sparsity level (30%) to observe how well the model preserves structural integrity at this level of sparsity. Here, we employ two metrics, SSIM and PSNR, to measure the completeness of the model’s predictions. Higher values of SSIM and PSNR indicate more accurate structural predictions by the model. Additionally, we also observe the trend of SSIM performance at different levels of sparsity.
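For reference, the two metrics can be sketched in plain Python on flattened frames. PSNR follows its standard definition, $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The SSIM variant below is a simplified, single-window (global) version with the commonly used constants $C_1 = (0.01 L)^2$ and $C_2 = (0.03 L)^2$; the usual SSIM averages this quantity over local sliding windows, so values from this sketch are not directly comparable to Table 3. The function names and the `data_range` default are assumptions.

```python
import math


def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio (dB) between two flattened frames."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(data_range ** 2 / mse)


def global_ssim(pred, target, data_range=1.0):
    """Simplified SSIM computed over one global window (no sliding window)."""
    n = len(pred)
    mu_x = sum(pred) / n
    mu_y = sum(target) / n
    var_x = sum((p - mu_x) ** 2 for p in pred) / n
    var_y = sum((t - mu_y) ** 2 for t in target) / n
    cov = sum((p - mu_x) * (t - mu_y) for p, t in zip(pred, target)) / n
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


truth = [0.0, 0.25, 0.5, 0.75, 1.0]
forecast = [0.05, 0.2, 0.5, 0.8, 0.95]
print(psnr(forecast, truth), global_ssim(forecast, truth))
```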

Table 3. SSIM and PSNR results on three research domains. Underlined values mark the best performance. Ori denotes the original results; +Dyn denotes adding DynST at 30% sparsity.
Model (data) | SSIM (Ori $\leftrightarrow$ +Dyn) | PSNR (Ori $\leftrightarrow$ +Dyn)
SimVP (TaxiBJ+) | $0.94_{\pm 0.01}$ / $0.93_{\pm 0.01}$ | $36.27_{\pm 0.01}$ / $35.43_{\pm 0.01}$
TAU (TaxiBJ+) | $0.96_{\pm 0.01}$ / $0.95_{\pm 0.01}$ | $36.76_{\pm 0.01}$ / $35.62_{\pm 0.01}$
Earthfarseer (TaxiBJ+) | $\underline{0.98}_{\pm 0.01}$ / $\underline{0.96}_{\pm 0.01}$ | $\underline{37.84}_{\pm 0.01}$ / $\underline{36.44}_{\pm 0.01}$
CLCRN (WeatherBench) | $0.94_{\pm 0.02}$ / $0.93_{\pm 0.02}$ | $36.12_{\pm 0.02}$ / $35.22_{\pm 0.19}$
SimVP (WeatherBench) | $0.96_{\pm 0.01}$ / $0.95_{\pm 0.01}$ | $37.33_{\pm 0.01}$ / $36.33_{\pm 0.17}$
Earthfarseer (WeatherBench) | $\underline{0.98}_{\pm 0.01}$ / $\underline{0.97}_{\pm 0.01}$ | $\underline{39.27}_{\pm 0.11}$ / $\underline{38.12}_{\pm 0.03}$
ViT (FIT) | $0.90_{\pm 0.02}$ / $0.89_{\pm 0.02}$ | $35.41_{\pm 0.02}$ / $33.33_{\pm 0.01}$
EGNN (FIT) | $0.83_{\pm 0.01}$ / $0.81_{\pm 0.01}$ | $35.41_{\pm 0.01}$ / $34.68_{\pm 0.02}$
Earthfarseer (FIT) | $\underline{0.95}_{\pm 0.01}$ / $\underline{0.93}_{\pm 0.01}$ | $\underline{37.23}_{\pm 0.01}$ / $\underline{36.31}_{\pm 0.01}$

Obs 5. As shown in Tab 3 and Fig 7, integrating DynST into the model does not significantly affect the SSIM and PSNR metrics. On the TaxiBJ+ dataset, Earthfarseer's SSIM stays close to 0.97 ($0.98 \rightarrow 0.96$ in Tab 3), so incorporating DynST has minimal effect on the prediction results. The same holds on the WeatherBench ($0.98 \rightarrow 0.97$) and FIT ($0.95 \rightarrow 0.93$ in Tab 3) datasets, which validates the effectiveness of DynST. Further, looking at the model's SSIM under varying data sparsity levels (Fig 7), we note that SSIM decreases gradually as sparsity increases, which offers a trade-off between sparsity and structural accuracy for practical applications.

Figure 7. SSIM of the three plug-and-play models with DynST at different sparsity levels.

Finally, we select three training schemes (with Earthfarseer as the base model) to explore the performance of our algorithm and the benefits of combining it with mainstream training approaches: (1) One-shot pruning (OS): We fully train the model and then train the mask for a single pruning step. (2) Iterative pruning (IP): As our work can be regarded as a pruning method, we adopt the widely used iterative pruning strategy (Frankle and Carbin, 2018) in the main manuscript; we prune the data 10 times, removing 4% each time. (3) Dynamic sparse training (DST): We select a target sparsity level and keep training at this fixed sparsity, dynamically removing the smallest-magnitude entries of the mask and restoring the largest-magnitude ones (Evci et al., 2020). We set the sparsity to 40% for dynamic training. Table 4 shows the performance of STGCN, SimVP, and Earthfarseer under the IP, OS, and DST schemes. On EAGLE, their RMSEs are 0.5698, 0.5108, 0.3507 (IP), 0.6197, 0.5650, 0.4121 (OS), and 0.5792, 0.5261, 0.3495 (DST).
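The IP and DST schemes can be sketched on a flat data mask as follows. This is a hedged illustration rather than the training code used here: the helper names `iterative_prune` and `dst_step`, the random importance scores, and the reading of "4%" as 4% of the original number of positions per round are all assumptions.

```python
import random

def iterative_prune(scores, rounds=10, rate=0.04):
    """Drop the lowest-|score| active positions, `rate` of the total per round.

    A real run would retrain the network and the scores between rounds;
    the scores are held fixed here to keep the sketch short.
    """
    n = len(scores)
    mask = [1] * n
    per_round = round(rate * n)
    for _ in range(rounds):
        active = sorted((i for i in range(n) if mask[i] == 1),
                        key=lambda i: abs(scores[i]))
        for i in active[:per_round]:
            mask[i] = 0
    return mask

def dst_step(scores, mask, k):
    """Prune-and-regrow at fixed sparsity: deactivate the k active positions
    with the smallest |score| and reactivate the k inactive positions with
    the largest |score|."""
    active = sorted((i for i, m in enumerate(mask) if m == 1),
                    key=lambda i: abs(scores[i]))
    inactive = sorted((i for i, m in enumerate(mask) if m == 0),
                      key=lambda i: abs(scores[i]), reverse=True)
    k = min(k, len(active), len(inactive))
    new_mask = list(mask)
    for i in active[:k]:
        new_mask[i] = 0
    for i in inactive[:k]:
        new_mask[i] = 1
    return new_mask

rng = random.Random(0)
scores = [rng.uniform(-1.0, 1.0) for _ in range(100)]  # hypothetical scores

ip_mask = iterative_prune(scores)  # 10 rounds x 4% of 100 positions
ip_sparsity = 1 - sum(ip_mask) / len(ip_mask)

# DST: start from a random mask at 40% sparsity and swap 5 positions.
dst_mask = [1] * 60 + [0] * 40
rng.shuffle(dst_mask)
updated = dst_step(scores, dst_mask, k=5)
dst_sparsity = 1 - sum(updated) / len(updated)
```

Under these assumptions both schemes end at the same sparsity; the difference is that IP reaches it gradually, while DST holds it constant and only changes which positions are active.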

Table 4. Performance across different training schemes (RMSE).
Dataset | Baseline | IP | OS | DST
EAGLE | STGCN | 0.5698 | 0.6197 | 0.5792
EAGLE | SimVP | 0.5108 | 0.5650 | 0.5261
EAGLE | Earthfarseer | 0.3507 | 0.4121 | 0.3495
FIT $\phi$ | STGCN | 0.3245 | 0.3617 | 0.3123
FIT $\phi$ | SimVP | 0.2193 | 0.2565 | 0.2252
FIT $\phi$ | Earthfarseer | 0.1983 | 0.2293 | 0.1842

7. Conclusion

In this paper, we introduce the concept of dynamic sparse training in the context of sensor deployment, termed DynST, which adjusts sensor deployment dynamically through training without compromising the model's predictive capabilities. DynST circumvents the complexity introduced by the temporal dimension through dimension mapping. Then, through dynamic training and mask operations, it identifies the less significant parts of the output data, which correspond to the areas detected by the sensors. DynST is both general and succinct, and it is compatible with many mainstream training schemes, such as one-shot pruning, iterative pruning, and dynamic sparse training, boosting inference speed without significant performance degradation.

Acknowledgements.
This work is also supported by the Guangzhou-HKUST (GZ) Joint Funding Program (No. 2024A03J0620).

References

  • Anonymous (2023) Anonymous. 2023. Spatio-temporal Twins with A Cache for Modeling Long-term System Dynamics. In Submitted to The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=aE6HazMgRz under review.
  • Anonymous (2024) Anonymous. 2024. NuwaDynamics: Discovering and Updating in Causal Spatio-Temporal Modeling. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=sLdVl0q68X
  • Arnab et al. (2021) Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. 2021. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF international conference on computer vision. 6836–6846.
  • Bai et al. (2022) Cong Bai, Feng Sun, Jinglin Zhang, Yi Song, and Shengyong Chen. 2022. Rainformer: Features extraction balanced network for radar-based precipitation nowcasting. IEEE Geoscience and Remote Sensing Letters 19 (2022), 1–5.
  • Bi et al. (2023) Kaifeng Bi, Lingxi Xie, Hengheng Zhang, Xin Chen, Xiaotao Gu, and Qi Tian. 2023. Accurate medium-range global weather forecasting with 3D neural networks. Nature 619, 7970 (2023), 533–538.
  • Chen et al. (2018) Jie Chen, Tengfei Ma, and Cao Xiao. 2018. Fastgcn: fast learning with graph convolutional networks via importance sampling. arXiv preprint arXiv:1801.10247 (2018).
  • Chen et al. (2021) Tianlong Chen, Yongduo Sui, Xuxi Chen, Aston Zhang, and Zhangyang Wang. 2021. A unified lottery ticket hypothesis for graph neural networks. In International Conference on Machine Learning. PMLR, 1695–1706.
  • Dosovitskiy et al. (2021) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations. https://openreview.net/forum?id=YicbFdNTTy
  • Eden et al. (2018) Talya Eden, Shweta Jain, Ali Pinar, Dana Ron, and C Seshadhri. 2018. Provable and practical approximations for the degree distribution using sublinear graph samples. In Proceedings of the 2018 World Wide Web Conference. 449–458.
  • Evci et al. (2020) Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. 2020. Rigging the lottery: Making all tickets winners. In International Conference on Machine Learning. PMLR, 2943–2952.
  • Frankle and Carbin (2018) Jonathan Frankle and Michael Carbin. 2018. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635 (2018).
  • Frankle et al. (2020) Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M Roy, and Michael Carbin. 2020. Pruning neural networks at initialization: Why are we missing the mark? arXiv preprint arXiv:2009.08576 (2020).
  • Gao and Ji (2019) Hongyang Gao and Shuiwang Ji. 2019. Graph u-nets. In international conference on machine learning. PMLR, 2083–2092.
  • Gao et al. (2022a) Zhihan Gao, Xingjian Shi, Hao Wang, Yi Zhu, Yuyang Bernie Wang, Mu Li, and Dit-Yan Yeung. 2022a. Earthformer: Exploring space-time transformers for earth system forecasting. Advances in Neural Information Processing Systems 35 (2022), 25390–25403.
  • Gao et al. (2022b) Zhangyang Gao, Cheng Tan, Lirong Wu, and Stan Z Li. 2022b. Simvp: Simpler yet better video prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3170–3180.
  • Han et al. (2020) Haoyu Han, Mengdi Zhang, Min Hou, Fuzheng Zhang, Zhongyuan Wang, Enhong Chen, Hongwei Wang, Jianhui Ma, and Qi Liu. 2020. STGCN: a spatial-temporal aware graph learning method for POI recommendation. In 2020 IEEE International Conference on Data Mining (ICDM). IEEE, 1052–1057.
  • Huang et al. (2023) Shaoyi Huang, Bowen Lei, Dongkuan Xu, Hongwu Peng, Yue Sun, Mimi Xie, and Caiwen Ding. 2023. Dynamic sparse training via balancing the exploration-exploitation trade-off. In 2023 60th ACM/IEEE Design Automation Conference (DAC). IEEE, 1–6.
  • Janny et al. (2023) Steeven Janny, Aurélien Beneteau, Nicolas Thome, Madiha Nadri, Julie Digne, and Christian Wolf. 2023. Eagle: Large-scale learning of turbulent fluid dynamics with mesh transformers. arXiv preprint arXiv:2302.10803 (2023).
  • Ji et al. (2023) Jiahao Ji, Jingyuan Wang, Chao Huang, Junjie Wu, Boren Xu, Zhenhe Wu, Junbo Zhang, and Yu Zheng. 2023. Spatio-temporal self-supervised learning for traffic flow prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 4356–4364.
  • Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
  • Kundu and Das (2023) Srabani Kundu and Nabanita Das. 2023. A study on boundary detection in wireless sensor networks. Innovations in Systems and Software Engineering 19, 2 (2023), 217–225.
  • Li et al. (2017) Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. 2017. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv preprint arXiv:1707.01926 (2017).
  • Li et al. (2020) Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. 2020. Fourier neural operator for parametric partial differential equations. arXiv preprint arXiv:2010.08895 (2020).
  • Lin et al. (2022) Haitao Lin, Zhangyang Gao, Yongjie Xu, Lirong Wu, Ling Li, and Stan Z Li. 2022. Conditional local convolution for spatio-temporal meteorological forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 7470–7478.
  • Liu et al. (2020) Junjie Liu, Zhe Xu, Runbin Shi, Ray CC Cheung, and Hayden KH So. 2020. Dynamic sparse training: Find efficient sparse network from scratch with trainable masked layers. arXiv preprint arXiv:2005.06870 (2020).
  • Liu et al. (2021) Shiwei Liu, Lu Yin, Decebal Constantin Mocanu, and Mykola Pechenizkiy. 2021. Do we actually need dense over-parameterization? in-time over-parameterization in sparse training. In International Conference on Machine Learning. PMLR, 6989–7000.
  • Luo et al. (2023) Xiao Luo, Jingyang Yuan, Zijie Huang, Huiyu Jiang, Yifang Qin, Wei Ju, Ming Zhang, and Yizhou Sun. 2023. HOPE: High-order graph ODE for modeling interacting dynamics. In International Conference on Machine Learning. PMLR, 23124–23139.
  • Ma et al. (2021) Xiaolong Ma, Geng Yuan, Xuan Shen, Tianlong Chen, Xuxi Chen, Xiaohan Chen, Ning Liu, Minghai Qin, Sijia Liu, Zhangyang Wang, et al. 2021. Sanity checks for lottery tickets: Does your winning ticket really win the jackpot? Advances in Neural Information Processing Systems 34 (2021), 12749–12760.
  • Pan et al. (2019) Zheyi Pan, Yuxuan Liang, Weifeng Wang, Yong Yu, Yu Zheng, and Junbo Zhang. 2019. Urban traffic prediction from spatio-temporal data using deep meta learning. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. 1720–1730.
  • Pathak et al. (2022) Jaideep Pathak, Shashank Subramanian, Peter Harrington, Sanjeev Raja, Ashesh Chattopadhyay, Morteza Mardani, Thorsten Kurth, David Hall, Zongyi Li, Kamyar Azizzadenesheli, et al. 2022. Fourcastnet: A global data-driven high-resolution weather model using adaptive fourier neural operators. arXiv preprint arXiv:2202.11214 (2022).
  • Priyadarshi et al. (2020) Rahul Priyadarshi, Bharat Gupta, and Amulya Anurag. 2020. Wireless sensor networks deployment: a result oriented analysis. Wireless Personal Communications 113 (2020), 843–866.
  • Ranftl et al. (2021) René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. 2021. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF international conference on computer vision. 12179–12188.
  • Ranjan et al. (2020) Ekagra Ranjan, Soumya Sanyal, and Partha Talukdar. 2020. Asap: Adaptive structure aware pooling for learning hierarchical graph representations. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 5470–5477.
  • Rasp et al. (2020) Stephan Rasp, Peter D Dueben, Sebastian Scher, Jonathan A Weyn, Soukayna Mouatadid, and Nils Thuerey. 2020. WeatherBench: a benchmark data set for data-driven weather forecasting. Journal of Advances in Modeling Earth Systems 12, 11 (2020), e2020MS002203.
  • Rissler et al. (2020) Leslie J Rissler, Katherine L Hale, Nina R Joffe, and Nicholas M Caruso. 2020. Gender differences in grant submissions across science and engineering fields at the NSF. Bioscience 70, 9 (2020), 814–820.
  • Satorras et al. (2021) Víctor Garcia Satorras, Emiel Hoogeboom, and Max Welling. 2021. E(n) equivariant graph neural networks. In International conference on machine learning. PMLR, 9323–9332.
  • Scarselli et al. (2008) Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2008. The graph neural network model. IEEE transactions on neural networks 20, 1 (2008), 61–80.
  • Shao et al. (2022) Zezhi Shao, Zhao Zhang, Fei Wang, Wei Wei, and Yongjun Xu. 2022. Spatial-temporal identity: A simple yet effective baseline for multivariate time series forecasting. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 4454–4458.
  • Shi et al. (2015) Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. 2015. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. Advances in neural information processing systems 28 (2015).
  • Tan et al. (2022) Cheng Tan, Zhangyang Gao, Siyuan Li, and Stan Z Li. 2022. Simvp: Towards simple yet powerful spatiotemporal predictive learning. arXiv preprint arXiv:2211.12509 (2022).
  • Tan et al. (2023) Cheng Tan, Zhangyang Gao, Lirong Wu, Yongjie Xu, Jun Xia, Siyuan Li, and Stan Z Li. 2023. Temporal attention unit: Towards efficient spatiotemporal predictive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18770–18782.
  • Thekumparampil et al. (2018) Kiran K Thekumparampil, Chong Wang, Sewoong Oh, and Li-Jia Li. 2018. Attention-based graph neural network for semi-supervised learning. arXiv preprint arXiv:1803.03735 (2018).
  • Verda et al. (2021) Vittorio Verda, Romano Borchiellini, Sara Cosentino, Elisa Guelpa, and Jesus Mejias Tuni. 2021. Expanding the FDS simulation capabilities to fire tunnel scenarios through a novel multi-scale model. Fire Technology 57 (2021), 2491–2514.
  • Wang et al. (2023) Kun Wang, Yuxuan Liang, Xinglin Li, Guohao Li, Bernard Ghanem, Roger Zimmermann, Huahui Yi, Yudong Zhang, Yang Wang, et al. 2023. Brave the Wind and the Waves: Discovering Robust and Generalizable Graph Lottery Tickets. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
  • Wang et al. (2022) Kun Wang, Yuxuan Liang, Pengkun Wang, Xu Wang, Pengfei Gu, Junfeng Fang, and Yang Wang. 2022. Searching Lottery Tickets in Graph Neural Networks: A Dual Perspective. In The Eleventh International Conference on Learning Representations.
  • Wang et al. (2018a) Yunbo Wang, Zhihan Gao, Mingsheng Long, Jianmin Wang, and Philip S Yu. 2018a. PredRNN++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. In International Conference on Machine Learning. 5123–5132.
  • Wang et al. (2018b) Yunbo Wang, Lu Jiang, Ming-Hsuan Yang, Li-Jia Li, Mingsheng Long, and Li Fei-Fei. 2018b. Eidetic 3D LSTM: A model for video prediction and beyond. In International conference on learning representations.
  • Wang et al. (2017) Yunbo Wang, Mingsheng Long, Jianmin Wang, Zhifeng Gao, and Philip S Yu. 2017. Predrnn: Recurrent neural networks for predictive learning using spatiotemporal lstms. Advances in neural information processing systems 30 (2017).
  • Wang et al. (2019) Yunbo Wang, Jianjin Zhang, Hongyu Zhu, Mingsheng Long, Jianmin Wang, and Philip S Yu. 2019. Memory in memory: A predictive neural network for learning higher-order non-stationarity from spatiotemporal dynamics. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9154–9162.
  • Wu et al. (2023a) Haixu Wu, Tengge Hu, Huakun Luo, Jianmin Wang, and Mingsheng Long. 2023a. Solving High-Dimensional PDEs with Latent Spectral Models. arXiv preprint arXiv:2301.12664 (2023).
  • Wu et al. (2023b) Hao Wu, Shilong Wang, Yuxuan Liang, Zhengyang Zhou, Wei Huang, Wei Xiong, and Kun Wang. 2023b. Earthfarseer: Versatile Spatio-Temporal Dynamical Systems Modeling in One Model. arXiv preprint arXiv:2312.08403 (2023).
  • Wu et al. (2023c) Hao Wu, Wei Xion, Fan Xu, Xiao Luo, Chong Chen, Xian-Sheng Hua, and Haixin Wang. 2023c. PastNet: Introducing Physical Inductive Biases for Spatio-temporal Video Prediction. arXiv preprint arXiv:2305.11421 (2023).
  • Wu et al. (2020) Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. 2020. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems 32, 1 (2020), 4–24.
  • Xu (2020) Sheng Xu. 2020. Optimal sensor placement for target localization using hybrid RSS, AOA and TOA measurements. IEEE Communications Letters 24, 9 (2020), 1966–1970.
  • Yan and Li (2023) Huan Yan and Yong Li. 2023. A Survey of Generative AI for Intelligent Transportation Systems. arXiv preprint arXiv:2312.08248 (2023).
  • Yarinezhad and Hashemi (2023) Ramin Yarinezhad and Seyed Naser Hashemi. 2023. A sensor deployment approach for target coverage problem in wireless sensor networks. Journal of Ambient Intelligence and Humanized Computing 14, 5 (2023), 5941–5956.
  • You et al. (2019) Jiaxuan You, Rex Ying, and Jure Leskovec. 2019. Position-aware graph neural networks. In International conference on machine learning. PMLR, 7134–7143.
  • Yu et al. (2020) Changqian Yu, Yifan Liu, Changxin Gao, Chunhua Shen, and Nong Sang. 2020. Representative graph neural network. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII 16. Springer, 379–396.
  • Zhang et al. (2023a) Yunke Zhang, Tong Li, Yuan Yuan, Fengli Xu, Fan Yang, Funing Sun, and Yong Li. 2023a. Demand-Driven Urban Facility Visit Prediction. ACM Transactions on Intelligent Systems and Technology (2023).
  • Zhang et al. (2023b) Yuchen Zhang, Mingsheng Long, Kaiyuan Chen, Lanxiang Xing, Ronghua Jin, Michael I Jordan, and Jianmin Wang. 2023b. Skilful nowcasting of extreme precipitation with NowcastNet. Nature 619, 7970 (2023), 526–532.
  • Zhang et al. (2023c) Yuxin Zhang, Lirui Zhao, Mingbao Lin, Yunyun Sun, Yiwu Yao, Xingjia Han, Jared Tanner, Shiwei Liu, and Rongrong Ji. 2023c. Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs. arXiv preprint arXiv:2310.08915 (2023).
  • Zhang et al. (2021) Zhen Zhang, Jiajun Bu, Martin Ester, Jianfeng Zhang, Zhao Li, Chengwei Yao, Dai Huifen, Zhi Yu, and Can Wang. 2021. Hierarchical multi-view graph pooling with structure learning. IEEE Transactions on Knowledge and Data Engineering (2021).
  • Zheng et al. (2023) Yu Zheng, Yuming Lin, Liang Zhao, Tinghai Wu, Depeng Jin, and Yong Li. 2023. Spatial planning of urban communities via deep reinforcement learning. Nature Computational Science 3, 9 (2023), 748–762.
  • Zou and Chakrabarty (2003) Yao Zou and Krishnendu Chakrabarty. 2003. Sensor deployment and target localization based on virtual forces. In IEEE INFOCOM 2003. Twenty-second Annual Joint Conference of the IEEE Computer and Communications Societies (IEEE Cat. No. 03CH37428), Vol. 2. IEEE, 1293–1303.

Appendix A Datasets and backbones descriptions.

Table 5. The statistics of the datasets.
Dataset | #Nodes | #Variables | #Input | #Output
WeatherBench | 2048 | 4 | 12 | 12
FIT | 15360 | 2 | 50 | 50
TaxiBJ+ | 16384 | 2 | 12 | 12
EAGLE | 3388 | 2 | 50 | 50

In this study, we analyze four benchmark datasets. Each snapshot in these datasets serves as an independent graph structure. We summarize the statistical characteristics of these datasets in Table 5. Specifically, the datasets include:

1. WeatherBench Dataset: Each graph contains 2048 nodes, covering four variables: temperature, humidity, wind speed, and cloud concentration. The input and output duration for this dataset is 12 time steps.

2. FIT Dataset: Each graph in this dataset consists of 15360 nodes, with two variables: temperature and visibility. The input and output duration is 50 time steps.

3. TaxiBJ+ Dataset: Each graph has 16384 nodes, including two variables: inflow and outflow. The input and output duration is 12 time steps.

4. EAGLE Dataset: Each graph in this dataset comprises 3388 nodes, with two variables: pressure and speed. The input and output duration is 50 time steps.

These datasets provide diverse experimental scenarios and analytical perspectives for our research.

Appendix B Dataset preprocessing.

Figure 8. Transforming Raw Data into Graph and Image Structures.

In this section, we detail the data processing shown in Figure 8, which converts raw data into graph and image formats through two distinct processes: Nodalization and Patchify. We use WeatherBench as a case study to illustrate these concepts:

1. Nodalization: This process involves the dimensional transformation of raw data from the format $(C, H, W)$, where $C$ represents the number of physical variables, and $H$ and $W$ signify the data's height and width, respectively. In this context, the data can be perceived as having $H \times W$ observation points, each containing $C$ variables. If we analogize each observation point to a sensor, these correspond to nodes in a graph structure. Consequently, the transformed graph data dimension is $(Num\_nodes, C)$, where $Num\_nodes = H \times W$. To alleviate memory pressure during training, a down-sampling of $H$ and $W$ can be implemented in practical applications.

2. Patchify: In the Patchify process, we adhere to the strategy outlined in the literature, assuming that each Patch is of size $p \times p$. This results in a total of $(H/p) \times (W/p)$ Patches. The dimension of each Patch is $(p \times p \times C)$. This method enables us to leverage Transformer-based architectures for data feature extraction. At the same time, for convolutional structures, the raw data can be directly inputted into the model without the need for specialized data preprocessing.

Through these two methodologies, we effectively transform the original data format into one that is conducive to deep learning model processing, thereby enhancing the efficiency of data handling and model training.
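The two transformations can be sketched on a small toy field. This is a minimal, dependency-free illustration using nested lists; the function names `nodalize` and `patchify`, the row-major node ordering, and the ordering of values inside each patch are assumptions made for the example, not details stated above.

```python
def nodalize(field):
    """(C, H, W) nested lists -> (H*W, C): one node per observation point."""
    channels = len(field)
    height, width = len(field[0]), len(field[0][0])
    return [
        [field[c][h][w] for c in range(channels)]
        for h in range(height)
        for w in range(width)
    ]

def patchify(field, p):
    """(C, H, W) nested lists -> (H/p)*(W/p) patches of p*p*C values each."""
    channels = len(field)
    height, width = len(field[0]), len(field[0][0])
    if height % p or width % p:
        raise ValueError("H and W must be divisible by the patch size p")
    patches = []
    for top in range(0, height, p):
        for left in range(0, width, p):
            patches.append([
                field[c][top + i][left + j]
                for i in range(p)
                for j in range(p)
                for c in range(channels)
            ])
    return patches

# Toy field with C=2 variables on a 4x6 grid; each value encodes (c, h, w).
C, H, W = 2, 4, 6
field = [[[100 * c + 10 * h + w for w in range(W)] for h in range(H)]
         for c in range(C)]

nodes = nodalize(field)          # H*W = 24 nodes with C = 2 variables each
patches = patchify(field, p=2)   # (4/2)*(6/2) = 6 patches of 2*2*2 = 8 values
```

With NumPy arrays, the first transformation reduces to a reshape followed by a transpose, for example `x.reshape(C, H * W).T` for an array `x` of shape `(C, H, W)`.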

Appendix C Algorithm

Algorithm 1 Dynamic Sparse Training (DynST) Framework
Require: Input graph $\mathcal{G}_{in}$, Network $f$, Target Sparsity $S_g\%$
Ensure: Sparse mask $M_g$
1:  Initialize graph mask $M_g$
2:  Stream Morph for input $\mathcal{G}_{in} \rightarrow \hat{\mathcal{G}}_{in}$
3:  while $1 - \frac{\|M_g\|_0}{\|\mathcal{G}_{in}[;*]\|_0} < S_g$ do
4:     Training network for $R$ iterations
5:     Training $M_g$ for $M$ iterations
6:     Dynamical sparse training using Eq.5 and Eq.6
7:     Adjust $M_g$ using Eq.7
8:  end while
9:  $\hat{\mathcal{G}}_{in} \leftarrow \hat{\mathcal{G}}_{in} \odot M_g$
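The control flow of Algorithm 1 can be sketched as below. This is an illustration under stated assumptions, not the DynST implementation: Eq.5 to Eq.7 are not reproduced in this appendix, so steps 6 and 7 are replaced by a plain magnitude-based pruning stand-in, the Stream Morph step is omitted, and `train_network`, `train_mask_scores`, and `prune_per_round` are hypothetical placeholders.

```python
import random

def sparsity(mask):
    """1 - ||M_g||_0 / N, the quantity tested in the while condition."""
    return 1.0 - sum(1 for m in mask if m != 0) / len(mask)

def dynst_loop(num_nodes, target_sparsity, prune_per_round,
               train_network, train_mask_scores):
    mask = [1] * num_nodes                          # step 1: initialize M_g
    scores = [1.0] * num_nodes                      # scores behind the mask
    while sparsity(mask) < target_sparsity:         # step 3
        train_network(mask)                         # step 4 (placeholder)
        scores = train_mask_scores(scores, mask)    # step 5 (placeholder)
        # Stand-in for steps 6-7: drop the lowest-|score| active nodes.
        active = sorted((i for i in range(num_nodes) if mask[i]),
                        key=lambda i: abs(scores[i]))
        for i in active[:prune_per_round]:
            mask[i] = 0
    return mask

rng = random.Random(0)

def train_network(mask):
    """Placeholder: a real implementation would run R optimizer steps on f."""
    return None

def train_mask_scores(scores, mask):
    """Placeholder: random scores standing in for M mask-training steps."""
    return [rng.random() if m else 0.0 for m in mask]

mask = dynst_loop(num_nodes=50, target_sparsity=0.3, prune_per_round=5,
                  train_network=train_network,
                  train_mask_scores=train_mask_scores)
final_sparsity = sparsity(mask)

# Step 9: apply the mask to (hypothetical) per-node input values.
node_values = [float(i + 1) for i in range(50)]
masked_input = [x * m for x, m in zip(node_values, mask)]
```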