Leveraging Temporal Graph Networks Using Module Decoupling

Or Feldman
Department of Computer Science
Technion – Israel Institute of Technology
orfeldman@campus.technion.ac.il
Chaim Baskin
Department of Computer Science
Technion – Israel Institute of Technology
chaimbaskin@technion.ac.il

Abstract

Current memory-based methods for dynamic graph learning rely on batch processing to handle rapid updates efficiently. However, batching introduces a phenomenon we term missing updates, which adversely affects the performance of memory-based models. In this work, we analyze the impact of missing updates on dynamic graph learning models and propose a decoupling strategy to mitigate these effects. Implementing this strategy, we present the Lightweight Decoupled Temporal Graph Network (LDTGN), a memory-based model with a minimal number of learnable parameters that can keep up with a high frequency of updates. We validated our approach across diverse dynamic graph benchmarks. LDTGN surpassed the average precision of previous methods by over 20% in scenarios demanding frequent graph updates, such as US Legis. or UN Trade. In the vast majority of the benchmarks, LDTGN achieves state-of-the-art or comparable results while operating with significantly higher throughput than existing baselines. The code to replicate our experiments is available at this url.

1 Introduction

Dynamic graphs are commonly used to describe real-world dynamic systems, where the interacting elements are modeled as nodes, and the interactions between two elements are represented as edges. Each edge is usually labeled with a timestamp indicating its time of occurrence. Item recommendation on e-commerce platforms (Ding et al., 2019), friendship suggestion on social networks (Backstrom & Leskovec, 2011; Haghani & Keyvanpour, 2019), anomaly detection on communication networks (Yu et al., 2018), and traffic forecasting (Cini et al., 2023) are all practical tasks that can be modeled using dynamic graphs.

Although most graph-related real-world tasks are time-evolving, deep learning approaches usually focus on problems described using static graphs, which do not change over time. Moreover, it has also been shown that ignoring the dynamic nature of a system by abstracting it with static graphs is suboptimal (Rossi et al., 2020; Xu et al., 2020). A dynamic representation of a system, on the other hand, is often able to define the evolving behavior of the latter (Simmel, 1950; Granovetter, 1973; Mangan & Alon, 2003; Toivonen et al., 2007; Gorochowski et al., 2018).

Dynamic graph approaches are often based on discrete-time (Liben-Nowell & Kleinberg, 2003; Sankar et al., 2020; Pareja et al., 2020) or continuous-time (Trivedi et al., 2019; Ma et al., 2020; Cong et al., 2023) settings. In discrete-time settings, data are received as a sequence of snapshots describing the full graph structure at specific times, while in the more flexible continuous-time setting, a single update to the graph can happen at any moment. The setting in which deep learning models for dynamic graphs operate at inference time can be roughly divided into the following types: streaming, deployed, and live update (Huang et al., 2023). In this work, we focus on continuous-time dynamic graphs in the streaming context, in which the models may be updated upon receiving new information, but cannot perform backpropagation due to the high throughput required.

Memory-based models for dynamic graphs are designed to support the assimilation of new information through graph updates during the inference phase. To do this, they manage a memory unit that represents the current state of the dynamic graph. This memory unit usually includes the current structure of the dynamic graph, data-specific information such as node and edge features, timestamps of previous updates, and learnable information computed by the model.

In the streaming setting for continuous-time dynamic graphs, memory-based deep networks have to use batches to keep up with the stream of incoming updates, which means they process multiple updates in parallel. This introduces a new problem: updates are not taken into account when the model predicts inputs that share their batch. In Section 3, we formally define this undesirable phenomenon as missing updates. In this work, we suggest a decoupling strategy that minimizes the negative impact of missing updates while still using batches. Guided by this strategy, we have built the Lightweight Decoupled Temporal Graph Network (LDTGN), an efficient memory-based model for dynamic graph learning that outperforms most established baselines both in terms of running time and performance.

To summarize, this work makes the following contributions:

• We introduce and analyze the problem of missing updates when using memory-based models.
• We suggest a novel decoupling methodology for building deep learning memory-based models for dynamic graph learning.
• Based on the suggested methodology, we propose a new lightweight model for dynamic graph learning tasks that can operate at high streaming rates and with a significantly smaller number of parameters compared to other baselines.
• We evaluated LDTGN on various transductive and inductive benchmarks for dynamic graph learning and achieved state-of-the-art or comparable performance on most of the tested benchmarks, while outperforming previous methods in terms of throughput.

2 Background

A static graph ${\mathcal{G}}=({\mathcal{V}},{\mathcal{E}})$ is a tuple of a vertex set ${\mathcal{V}}$ and an edge set ${\mathcal{E}}$, s.t. each ${\bm{e}}\in{\mathcal{E}}$ is a tuple of two vertices from ${\mathcal{V}}$. ${\mathcal{G}}$ is often equipped with feature functions $F_{\mathcal{V}}:{\mathcal{V}}\rightarrow\mathbb{R}^{n}$ and $F_{\mathcal{E}}:{\mathcal{E}}\rightarrow\mathbb{R}^{n}$ that map a vertex or an edge to an $n$-dimensional vector representing its features. A Continuous-Time Dynamic Graph (CTDG) is a sequence ${\mathcal{Q}}=\{{u}_{t_{1}},{u}_{t_{2}},\dots,{u}_{t_{m}}\}$ of $m$ timestamped updates on the graph. An update ${u}_{t}$ that occurs at time $t$ can be one of the following: node addition, node removal, edge addition, or edge removal. ${\mathcal{G}}[{\mathcal{Q}}(t)]$ is the static graph obtained by applying to ${\mathcal{G}}$ all the updates from ${\mathcal{Q}}$ that have occurred until time $t$. The $k$-hop neighborhood of a node $v_{i}$ at time $t$ is defined by:

	$\displaystyle\mathcal{N}_{i}^{0}(t)=\{v_{i}\}$		(1)
	$\displaystyle\mathcal{N}_{i}^{k}(t)=\{v_{j}\mid v_{u}\in\mathcal{N}_{i}^{k-1}(t),\ (v_{u},v_{j})\in{\mathcal{G}}[{\mathcal{Q}}(t)]\}$		(2)
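The two definitions above can be illustrated with a minimal Python sketch. It is not taken from the paper's code: the function names `graph_at` and `k_hop` and the `(time, kind, u, v)` update format are hypothetical, only edge additions and removals are handled, edges are treated as directed pairs exactly as written in Equation 2, and "until time $t$" is assumed to be inclusive.

```python
def graph_at(updates, t):
    """Edge set of G[Q(t)]: apply every edge update with timestamp <= t.

    updates: iterable of (time, kind, u, v) with kind in {"add", "remove"}.
    This tuple format is an assumption made for the illustration.
    """
    edges = set()
    for ts, kind, u, v in sorted(updates, key=lambda x: x[0]):
        if ts > t:
            break
        if kind == "add":
            edges.add((u, v))
        elif kind == "remove":
            edges.discard((u, v))
    return edges


def k_hop(edges, node, k):
    """N_i^k following Eqs. (1)-(2): start from {v_i} and follow edges k times."""
    frontier = {node}                      # Eq. (1): N_i^0 = {v_i}
    for _ in range(k):
        # Eq. (2): keep every v_j that has an edge (v_u, v_j) with v_u in N_i^{k-1}
        frontier = {v for (u, v) in edges if u in frontier}
    return frontier


updates = [(1, "add", 0, 1), (2, "add", 1, 2), (3, "remove", 0, 1), (4, "add", 2, 3)]
edges_t2 = graph_at(updates, 2)            # edges (0, 1) and (1, 2)
two_hop = k_hop(edges_t2, 0, 2)            # nodes two steps away from node 0
```

Note that, read literally, Equation 2 yields the nodes reached after exactly $k$ steps rather than the union over all shorter distances; the sketch follows that literal reading.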

As a result of the growing interest in CTDG with a stream of updates, several techniques were recently developed (Kumar et al., 2019; Trivedi et al., 2019; Xu et al., 2020; Wang et al., 2021a, b; Cong et al., 2023). Many of these methods are specific cases of the Temporal Graph Network (TGN, Rossi et al., 2020) model. TGN is a general memory-based deep learning architecture designed to learn on CTDG while achieving throughput suitable for streaming tasks. The primary concept of TGN is to maintain states, namely node features, which are updated with each modification to the graph. To achieve this objective, TGN utilizes two central modules: memory and prediction.

Memory module

The memory module is responsible for applying the updates to the graph and updating the states accordingly. When a new batch of updates arrives, the memory module applies a message function that generates a vector for each node involved in each update. If the update is the addition of edge $e_{i,j}$ from node $v_{i}$ to node $v_{j}$ at time $t$, then the corresponding messages of $v_{i}$ and $v_{j}$ are:

	$\displaystyle m_{i}(t)=\mathrm{msg_{s}}(s_{i}(t^{-}),s_{j}(t^{-}),\Delta t_{i},F_{{\mathcal{E}}}(e_{i,j}))$		(3)
	$\displaystyle m_{j}(t)=\mathrm{msg_{d}}(s_{j}(t^{-}),s_{i}(t^{-}),\Delta t_{j},F_{{\mathcal{E}}}(e_{i,j}))$		(4)

where $s_{u}(t^{-})$ is the state of $v_{u}$ prior to $t$, and $\Delta t_{u}$ is the time elapsed since $v_{u}$ last received an update. $\mathrm{msg_{s}}$ and $\mathrm{msg_{d}}$ may have learnable parameters. Then, all the messages in the batch are aggregated into a single message per node, s.t. if node $v_{i}$ is involved in updates at times $t_{1}\leq t_{2}\leq\dots\leq t_{n}=t$, then:

	$\displaystyle\overline{m}_{i}(t)=\mathrm{agg}(m_{i}(t_{1}),m_{i}(t_{2}),\dots,m_{i}(t_{n}))$		(5)

The aggregation function can, for example, take only $m_{i}(t_{n})$ and neglect any previous messages in the batch. Finally, the memory updater updates the state of the nodes:

	$\displaystyle s_{i}(t)=\mathrm{mem}(\overline{m}_{i}(t),s_{i}(t^{-}))$		(6)

The $\mathrm{mem}$ function is a memory-based neural network such as LSTM (Hochreiter & Schmidhuber, 1997) or GRU (Cho et al., 2014).
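The flow of Equations 3 to 6 over one batch can be sketched in a few lines of Python. This is an illustrative sketch under several assumptions, not the TGN implementation: states are scalars, the same placeholder function stands in for both $\mathrm{msg_{s}}$ and $\mathrm{msg_{d}}$, a simple averaging rule stands in for the LSTM or GRU used as $\mathrm{mem}$, and the names `process_batch`, `toy_msg`, `last_message`, and `toy_mem` are hypothetical.

```python
def process_batch(states, last_update, batch, msg, agg, mem, default=0.0):
    """Apply one batch of edge additions following Eqs. (3)-(6).

    states:      dict node -> state s(t^-) before the batch
    last_update: dict node -> time of the node's previous update
    batch:       list of (i, j, t, edge_feature), sorted by time t
    """
    messages = {}
    for i, j, t, feat in batch:
        # Every message reads the pre-batch states s(t^-), never states
        # produced by earlier updates of the same batch.
        s_i, s_j = states.get(i, default), states.get(j, default)
        dt_i = t - last_update.get(i, 0.0)
        dt_j = t - last_update.get(j, 0.0)
        messages.setdefault(i, []).append(msg(s_i, s_j, dt_i, feat))  # Eq. (3)
        messages.setdefault(j, []).append(msg(s_j, s_i, dt_j, feat))  # Eq. (4)

    new_states = dict(states)
    for node, node_msgs in messages.items():
        m_bar = agg(node_msgs)                                    # Eq. (5)
        new_states[node] = mem(m_bar, states.get(node, default))  # Eq. (6)

    new_last = dict(last_update)
    for i, j, t, _ in batch:
        new_last[i] = new_last[j] = t
    return new_states, new_last


# Placeholder choices for illustration only, not the learned functions of TGN.
toy_msg = lambda s_self, s_other, dt, feat: s_self + s_other + feat
last_message = lambda msgs: msgs[-1]          # keep only the most recent message
toy_mem = lambda m_bar, s: 0.5 * s + 0.5 * m_bar

states, last = process_batch(
    {}, {}, [(0, 1, 1.0, 1.0), (0, 2, 2.0, 3.0)], toy_msg, last_message, toy_mem
)
```

In this toy run, the second update of node 0 still reads the pre-batch state of node 0, which is the behavior that the batch-level formulation implies.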

Prediction module

The prediction module computes the predictions for the inputs in a given batch, e.g., potential edges in a link prediction task. First, the prediction module reads from the memory module all the states of the nodes in the neighborhood of any node involved in the input. Then it generates a new embedding for each node based on its state and the states of its neighbors. For example, denote $[\bm{\cdot}||\bm{\cdot}]$ as the operation of vector concatenation, then the embedding formulation based on $1$ -hop neighborhood of node $v_{i}$ is:

	$\displaystyle z_{i}(t)=\sum_{v_{j}\in\mathcal{N}_{i}^{1}(t)}h(v_{i}(t),v_{j}(t),F_{{\mathcal{E}}}(e_{i,j}))$		(7)

where $v_{u}(t)=[F_{\mathcal{V}}(v_{u})||s_{u}(t^{-})||\Delta t_{u}]$ and $h$ is a learnable function. Using the neighborhood of a node in the graph to compute its embedding averts the staleness problem (Kazemi et al., 2020).

For the task of future link prediction of edge $e_{i,j}$ at time $t$ , TGN computes the edge’s probability to exist by:

	$\displaystyle p_{i,j}(t)=\mathrm{merge}(z_{i}(t),z_{j}(t))$		(8)

where $\mathrm{merge}$ is a learnable function such as an $\mathrm{MLP}$ .

3 Problem statement

The processing sequence of memory-based models for dynamic graph learning involves receiving a new batch containing both graph updates and inputs for prediction. This use of batches allows memory-based models to achieve a reasonable throughput during inference time (Kumar et al., 2019; Rossi et al., 2020; Wang et al., 2021b; Cong et al., 2023). In the streaming setting, where the graph receives new updates at extremely high speeds, it is crucial for the model to have sufficient throughput. Otherwise, the buffer that holds the incoming updates for the model will overflow.

Memory-based models have a well-defined flow of operation upon receiving a new batch. Initially, they compute their predictions for the inputs in the batch. This operation is performed in parallel by using the current states and the current graph structure as saved in their memory. Subsequently, they process all the updates in the batch and update their inner memory accordingly. This flow of operations introduces the undesirable phenomenon defined as missing updates.

Formally, consider a batch ${\mathcal{Q}}=\{{x}_{t_{1}},{x}_{t_{2}},\dots,{x}_{t_{m}}\}$ of size $m$, where each ${x}_{t_{i}}$ can be an update to the graph ${u}_{t_{i}}$ or an input to predict ${p}_{t_{i}}$. We say that ${u}_{t_{i}}$ is a missing update if there exists an input ${p}_{t_{j}}$, s.t. $i<j$ and the nodes involved in ${u}_{t_{i}}$ are in the neighborhood of the nodes involved in ${p}_{t_{j}}$. Inputs to the model that depend on missing updates are harder to predict, since the model's memory of their neighborhood is outdated at the time of the prediction.
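To make the definition concrete, the following Python sketch counts the fraction of inputs in one batch that depend on at least one missing update. It is only an illustration under stated assumptions: the function name `ratio_of_affected_inputs` and the `(kind, u, v)` item format are hypothetical, the neighborhood is taken to be the input nodes plus their $1$-hop neighbors in an undirected view of the graph, and an update counts as relevant if any of its nodes falls in that neighborhood.

```python
def ratio_of_affected_inputs(adj, batch):
    """Fraction of inputs in a batch that depend on >= 1 missing update.

    adj:   dict node -> set of neighbors as stored in memory before the batch
    batch: time-ordered list of (kind, u, v), kind in {"update", "predict"}
    """
    adj = {n: set(nb) for n, nb in adj.items()}   # work on a copy
    pending = []        # updates seen earlier in this batch, unseen by the model
    affected = 0
    n_inputs = 0
    for kind, u, v in batch:
        if kind == "predict":
            n_inputs += 1
            # Input nodes together with their current 1-hop neighbors.
            hood = {u, v} | adj.get(u, set()) | adj.get(v, set())
            if any(a in hood or b in hood for a, b in pending):
                affected += 1
        else:
            pending.append((u, v))
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    return affected / n_inputs if n_inputs else 0.0


memory_graph = {1: {2}, 2: {1}}
batch = [("update", 2, 3), ("predict", 1, 4), ("predict", 5, 6)]
ratio = ratio_of_affected_inputs(memory_graph, batch)
```

Here the input on nodes 1 and 4 is affected because node 2, a neighbor of node 1, was updated earlier in the same batch, while the input on nodes 5 and 6 is not.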

3.1 Empirical analysis

We examined the incidence and impact of the missing updates phenomenon on various real-world datasets for dynamic graphs. We measured the average ratio of inputs in a batch that depend on at least a single missing update. In addition, we measured the average number of missing updates that affect a single input to the model. In both cases, we examined missing updates within the $1$-hop neighborhood of the input nodes for different batch sizes and reported the results in Figure 1(a) and Figure 1(b), respectively. We also tested a standard TGN trained on these datasets with different batch sizes. In Figure 1(c), we report the average precision of TGN for each batch size, normalized by the average precision achieved with a batch size of 10. In Appendix A, we supply the full missing updates statistics for all the datasets used in this work.

Figure 1: The incidence of missing updates in real-world datasets as a function of the batch size and their impact on the performance of TGN. In (a), the ratio of inputs that depend on at least a single missing update increases significantly as the batch size increases. In (b), the average number of missing updates per input increases as the batch size increases. In (c), the performance of TGN corresponds to the extent of missing updates, where a high incidence of missing updates indicates a significant performance decrease.

Figures 1(a) and 1(b) show that an increase in batch size correlates with a higher incidence of missing updates. Furthermore, the occurrence of missing updates varies across different datasets. The findings from Figure 1(c) suggest a negative connection between the occurrence of missing updates in a dataset and the performance of a model trained on it. Consequently, achieving optimal performance in memory-based models necessitates a smaller batch size. This result indicates a trade-off with respect to the batch size in memory-based models, as a large batch size is required to attain high throughput in the streaming scenario.

4 Related work

Handling missing updates

The t-Batch algorithm (Kumar et al., 2019) was initially intended to improve the running-time performance of memory-based deep networks for dynamic graphs that process updates one after the other (i.e., with a batch size equal to 1). The logic motivating t-Batch is that these networks can combine multiple updates into a single batch and apply them in parallel if they do not contain the same nodes, where the batches are temporally sorted. Using t-Batch, JODIE's memory-based model becomes 9.2× faster than similar methods without suffering from missing updates (Kumar et al., 2019). The t-Batch algorithm, however, suffers from two main flaws. First, large batch sizes for t-Batch are often impossible since temporal locality is a common characteristic of dynamic graphs (Poursafaei et al., 2022). In addition, many modern deep learning networks for dynamic graphs, such as TGN, depend on the neighborhood of the nodes to give an appropriate prediction. This forces t-Batch to build complicated neighborhood-independent batches, which are significantly smaller than node-independent batches.

Efficient methods for streaming

According to Huang et al. (2023), EdgeBank is currently an order of magnitude faster than other well-known techniques for dynamic graphs. EdgeBank (Poursafaei et al., 2022) is a memorization algorithm that saves any seen update and predicts according to a simple decision rule that can be one of the following: whether the input was seen in the last few iterations or whether the input has already been seen a sufficient number of times. The algorithm’s simplicity allows it to perform extremely fast, even without batches, thus not suffering from missing updates. Nevertheless, EdgeBank was developed to serve as a baseline for testing and comparing other methods for dynamic graphs (Poursafaei et al., 2022), and, therefore, its performance lags significantly behind the state-of-the-art (Yu et al., 2023; Huang et al., 2023).

5 Proposed method

In this section, we describe our method to balance the batch size trade-off discussed in Section 3. The method decouples the TGN modules by ensuring that each module uses a different batch size. In general, the memory module will utilize smaller batch sizes for frequent updates, while the prediction module will employ larger batch sizes for efficiency.

Following that, we describe our proposed lightweight model for dynamic graph learning tasks. The model is a TGN with decoupled modules implemented using efficient functions. Specifically, we parameterize the EdgeBank (Poursafaei et al., 2022) model to allow it to learn. Then, we add extra parameters to consider single-node information in the prediction instead of solely relying on edge temporal information.

5.1 The decoupling strategy

We propose to decouple the core modules of TGN: the prediction and memory modules. The decoupled modules operate on different data and with different batch sizes. Given a batch containing updates to apply and inputs to predict, the model divides the batch into smaller consecutive batches termed memory batches. The memory module operates on the memory batches, and thus, it can perform memory updates more frequently. After processing a memory batch but before proceeding to the next one, the memory module extracts and temporarily saves the temporal neighborhood information. This information encompasses the neighborhood state relevant to the nodes in the subsequent memory batch, preventing it from being overwritten. The neighborhood state is defined by:

	$\displaystyle S_{\mathcal{N}_{i}^{k}(t)}(t)=\{s_{j}(t)\mid v_{j}\in\mathcal{N}_{i}^{k}(t)\}$		(9)

This creates a view of the model’s memory for each time a memory batch starts. After processing all memory batches, the prediction module reads the extracted states for each node associated with a given input from the view before that input’s timestamp. Subsequently, the prediction module simultaneously computes predictions for all inputs within the complete batch.
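The procedure described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the "state" of a node is reduced to the time of its last update, only the states of the nodes themselves ($0$-hop) are saved, every event is assumed to be both an input to predict and an update to apply, and the names `run_memory_batches` and `states_for_input` are hypothetical.

```python
def run_memory_batches(memory, batch, memory_batch_size):
    """Split a batch into memory batches and save one view per memory batch.

    memory: dict node -> state (here: time of the node's last update)
    batch:  time-ordered list of (u, v, t) edge events
    Returns a list with one saved view per memory batch.
    """
    views = []
    for start in range(0, len(batch), memory_batch_size):
        chunk = batch[start:start + memory_batch_size]
        # Save the states relevant to this memory batch before the updates
        # of the same memory batch overwrite them.
        nodes = {n for u, v, _ in chunk for n in (u, v)}
        views.append({n: memory.get(n) for n in nodes})
        for u, v, t in chunk:
            memory[u] = memory[v] = t
    return views


def states_for_input(views, index, memory_batch_size):
    """View that the prediction module reads for the input at position index."""
    return views[index // memory_batch_size]


batch = [(1, 2, 10), (2, 3, 20), (3, 6, 30)]
decoupled = run_memory_batches({}, batch, memory_batch_size=1)
coupled = run_memory_batches({}, batch, memory_batch_size=3)
```

With a memory batch size of 1, the input on nodes 3 and 6 reads a view in which node 3 was already updated at time 20. With a memory batch size of 3, which corresponds to a standard coupled model on this batch, the same input sees no earlier update of node 3. After all views are collected, predictions for the complete batch can be computed in parallel.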

Figure 3 demonstrates the effectiveness of a decoupled model compared to a standard memory-based model. The edge at $t_{4}$ in Figure 3 is given as an input to the models. A standard memory-based model computes the embeddings of $v_{3}$ and $v_{6}$ based on their neighborhood states before $t_{1}$ and only then updates its inner memory with the edges at $t_{1}$, $t_{2}$ and $t_{3}$. A decoupled model, on the other hand, first performs the memory updates of the two memory batches. Then, the prediction module uses the states extracted before $t_{3}$, which include the updates in the first memory batch. In Figure 3, the missing update that affects the interaction between $v_{3}$ and $v_{6}$ is avoided by using the decoupling strategy, since the prediction module is aware of the interaction between $v_{2}$ and $v_{3}$.

Decoupling the modules of TGNs offers two immediate benefits. First, by decoupling the memory module from the prediction module and setting the memory batch size to 1, we completely solve the missing updates problem. Second, we can accelerate the execution time of an existing model without compromising its accuracy. This can be achieved by decoupling its modules, setting the memory batch size to match the model's original batch size, and substantially increasing the new batch size of the prediction module. Using the original batch size for the memory batches ensures the same frequency of missing updates, and the new, larger batch size improves the running time. Figure 3 details the running time improvement of a decoupled TGN with a memory batch size of 50 when using growing batch sizes. Notably, the decoupling strategy improves TGN's running time by 12.5% without compromising its performance, as the frequency of missing updates depends only on the memory batch size. Furthermore, actively transferring additional computations from the memory module to the prediction module leads to an additional improvement in running time. The potential speedup of the decoupling strategy is analyzed further in Appendix E.

5.2 Lightweight Decoupled Temporal Graph Network

We propose the Lightweight Decoupled Temporal Graph Network (LDTGN), an efficient model designed for dynamic graph learning tasks. LDTGN operates with high throughput, crucial for the streaming setting, while also achieving superior performance in dynamic graph learning tasks.

In this subsection, we develop LDTGN step-by-step by enhancing EdgeBank and incorporating the decoupled strategy. Despite EdgeBank’s performance falling short of the current state-of-the-art, it demonstrates commendable results with exceptionally high throughput, making it a suitable foundation for our model. We proceed to delineate the deficiencies in EdgeBank that need addressing to attain top-tier performance. Subsequently, we integrate improvements with EdgeBank to resolve these issues effectively, thereby constructing the LDTGN model. For the simplicity of the presentation, we describe LDTGN for future edge prediction tasks and assume only edge addition updates. Comprehensive details about applying node addition, node removal, and edge removal updates, as well as adjustments for node classification task, are provided in Appendix C.

The EdgeBank model can be formulated as a memory-based algorithm as presented by Poursafaei et al. (2022) where an edge that did not get an update in the past $T$ updates is considered negative. We can also describe this memory-based prediction rule as a linear function that maps a time-based difference into a prediction. Equation 10 details the linear function of EdgeBank with a decision function that considers any edge $e_{i,j}$ that was updated in the last $T$ updates as positive.

	$\displaystyle p_{i,j}(t)=-(t-t_{i,j})+T$		(10)

In Equation 10, $t_{i,j}$ is the last time the edge $e_{i,j}$ received an update, and $t$ is the current time. $t_{i,j}$ is set to $0$ if $e_{i,j}$ has not been received yet. Poursafaei et al. (2022) suggested using a constant value of 1000 for $T$. The equation should be parameterized to allow the model to learn the most appropriate value of $T$ for every dataset. To do this, we added a bias $b$ and a coefficient $w$ as detailed in Equation 11.

	$\displaystyle p_{i,j}(t)=(t-t_{i,j})w+b$		(11)
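As a short consistency check, Equation 11 contains Equation 10 as a special case. The derivation below is an added illustration and assumes that an edge is predicted as positive when $p_{i,j}(t)>0$, which is one reading of the decision rule described above:

```latex
% Choosing w = -1 and b = T in Equation 11 recovers Equation 10.
\begin{align*}
p_{i,j}(t) &= (t - t_{i,j})\,w + b \\
           &= -(t - t_{i,j}) + T \qquad (w = -1,\ b = T), \\
p_{i,j}(t) > 0 &\iff t - t_{i,j} < T .
\end{align*}
```

Under the same assumption, any learned pair with $w<0$ corresponds to the threshold $-b/w$ on the time difference $t-t_{i,j}$.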

Using Equation 11, we can learn the suitable threshold for each dataset. As in EdgeBank, this function does not incorporate the nodes themselves into the prediction. This can easily be solved by adding the time differences of each node in the potential edge and appropriate coefficients as in Equation 12.

	$\displaystyle p_{i,j}(t)=(t-t_{i,j})w_{1}+(t-t_{i})w_{2}+(t-t_{j})w_{3}+b$		(12)

In Equation 12, $t_{i}$ and $t_{j}$ are the last times the nodes $v_{i}$ and $v_{j}$ received an update, respectively. Equation 12 is missing topological and data-specific information such as node and edge features. Moreover, the prediction function is linear, which often causes the learned function to be distant from the ground truth prediction function. To solve this issue, we first create embeddings for the nodes in the potential edge and a preliminary embedding for the edge itself as in Equations 13 and 14:

	$\displaystyle z_{i}(t)=\sum_{v_{k}\in\mathcal{N}_{i}^{1}(t)}\alpha_{k}[v_{i}(t)||v_{k}(t)||F_{{\mathcal{E}}}(e_{i,k})]$		(13)

	$\displaystyle z_{i,j}(t)=\mathrm{TDE}(t-t_{i,j})$		(14)

where $v_{i}(t)=[F_{\mathcal{V}}(v_{i})||\mathrm{TDE}(t-t_{i})]$ and $\alpha_{k}$ is the attention weight computed as in GAT (Veličković et al., 2018). $\mathrm{TDE}$ is a non-linear time difference embedding function such as Time2Vec (Kazemi et al., 2019). Equation 15 uses the embeddings of the nodes, the edge time difference embedding, and a non-linear $\mathrm{merge}$ function to give the final prediction.

	$\displaystyle p_{i,j}(t)=\mathrm{merge}(z_{i}(t),z_{j}(t),z_{i,j}(t))$		(15)

Equations 13, 14 and 15 constitute the prediction module of LDTGN. The prediction module only requires $t_{i}$, $t_{j}$ and $t_{i,j}$ from the memory module. Hence, these timestamps are the sole data of the memory module. In the experiments, we implemented LDTGN with a memory batch size of 1, thus eliminating the adverse effects associated with missing updates. This design choice also obviates the need for the message aggregator required in traditional TGNs. LDTGN operates with a minimal memory batch size and with a high throughput thanks to the removal of $\mathrm{msg}_{s}$, $\mathrm{msg}_{d}$ and $\mathrm{mem}$ from the memory module. Standard TGNs save states only for the nodes, but LDTGN also saves states for the edges. This does not add memory overhead to LDTGN compared to other TGNs, since TGNs already save the full graph to incorporate topological information in the prediction.
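A timestamp-only memory of this kind can be sketched in a few lines of Python. The sketch is illustrative rather than the released implementation: the class and function names are hypothetical, unseen nodes are assumed to default to time $0$ in the same way the text specifies for unseen edges, edges are keyed by the ordered pair $(i,j)$, and the linear rule of Equation 12 is used in place of the learned prediction module of Equations 13 to 15.

```python
class TimestampMemory:
    """Memory that stores only the last update time of each node and edge."""

    def __init__(self):
        self.node_time = {}
        self.edge_time = {}

    def read(self, i, j, t):
        """Return the time differences (t - t_ij, t - t_i, t - t_j)."""
        t_ij = self.edge_time.get((i, j), 0.0)   # 0 if the edge is unseen
        t_i = self.node_time.get(i, 0.0)         # assumed default for nodes
        t_j = self.node_time.get(j, 0.0)
        return (t - t_ij, t - t_i, t - t_j)

    def update(self, i, j, t):
        """Apply one edge addition; no message, aggregation, or GRU step."""
        self.edge_time[(i, j)] = t
        self.node_time[i] = t
        self.node_time[j] = t


def linear_score(deltas, w, b):
    """Equation 12: weighted sum of the three time differences plus a bias."""
    return sum(d * w_k for d, w_k in zip(deltas, w)) + b


def stream(memory, events, w, b):
    """Memory batch size of 1: read for the prediction, then update."""
    scores = []
    for i, j, t in events:
        scores.append(linear_score(memory.read(i, j, t), w, b))
        memory.update(i, j, t)
    return scores


mem = TimestampMemory()
# w = (-1, 0, 0) and b = 1000 reproduce Equation 10 with T = 1000.
scores = stream(mem, [(1, 2, 5.0), (1, 2, 8.0), (1, 3, 9.0)],
                w=(-1.0, 0.0, 0.0), b=1000.0)
```

Because each event is read and then written immediately, every prediction sees all earlier updates, which is what a memory batch size of 1 guarantees.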

In scenarios where a lower throughput is acceptable and the negative effects of missing updates are negligible for a small memory batch size, LDTGN can incorporate long-term dependencies. We refer to this variant of LDTGN as LDTGN-mem. To capture long-term dependencies, LDTGN-mem is implemented with a heavier memory module. This memory module generates the following messages:

	$\displaystyle m_{i}(t)=[s_{i}(t^{-})||s_{j}(t^{-})||\mathrm{TDE}(t-t_{i})]$		(16)
	$\displaystyle m_{j}(t)=[s_{j}(t^{-})||s_{i}(t^{-})||\mathrm{TDE}(t-t_{j})]$		(17)

The aggregation function for the messages takes only the most recent message per node, and the $\mathrm{mem}$ function is set to be a $\mathrm{GRU}$ cell:

	$\displaystyle s_{i}(t)=\mathrm{GRU}(\overline{m}_{i}(t),s_{i}(t^{-}))$		(18)

To incorporate the long-term memory in the prediction module, LDTGN-mem adds the current learned state to the data of each node:

	$\displaystyle v_{i}(t)=[F_{\mathcal{V}}(v_{i})||\mathrm{TDE}(t-t_{i})||s_{i}(t^{-})]$		(19)

In contrast to LDTGN, LDTGN-mem has to operate with a memory batch size larger than 1 to ensure a reasonable throughput. We chose to implement LDTGN-mem with a memory batch size of 50, because we observed earlier in Figure 1 that the incidence of missing updates with a batch size of 50 is not severe. In addition, LDTGN-mem operates with an acceptable throughput when using this memory batch size. The adjustments required for LDTGN-mem are detailed in the illustration of our model in Figure 4.

6 Experiments

This section describes the experiments we used to evaluate the performance of our model. All the experiments were performed using DyGLib (Yu et al., 2023), a unified library for dynamic graph learning evaluation. DyGLib contains various real-world datasets, including large-scale dynamic graphs with millions of edges. The experiments are for future edge prediction with random negative edge sampling on the following datasets, which were collected by Poursafaei et al. (2022): Wikipedia, Reddit, MOOC, lastFM, Enron, Social Evo., UCI, Flights, Can. Parl., US Legis., UN Trade, UN Vote, and Contacts. Additional information and statistics regarding the datasets can be found in Appendix A. We used seven well-known methods as baselines for the task of future edge prediction: DyRep (Trivedi et al., 2019), TGAT (Xu et al., 2020), TGN (Rossi et al., 2020), CAWN (Wang et al., 2021b), EdgeBank (Poursafaei et al., 2022), GraphMixer (Cong et al., 2023) and DyGFormer (Yu et al., 2023). Additional information regarding the baselines can be found in Appendix B. We adopted the approach used in previous works and split each dataset into training, validation, and test sets by performing a chronological split of 70%–15%–15%. We report the mean and standard deviation of the Average Precision (AP) on the test set. Results for the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) are detailed in Appendix D.
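For readers unfamiliar with chronological splitting, the idea can be sketched as below. This is an assumption-laden illustration, not DyGLib's code: the function name `chronological_split` is hypothetical, events are `(source, destination, time)` tuples, and the split is made by event count, whereas a library may instead cut at quantiles of the timestamps.

```python
def chronological_split(events, train_pct=70, val_pct=15):
    """Split time-stamped events into train/validation/test sets by time order.

    events: iterable of (source, destination, time) tuples.
    Percentages are integers so that the cut points avoid float rounding.
    """
    ordered = sorted(events, key=lambda e: e[2])
    n = len(ordered)
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Toy stream with 20 events in scrambled time order.
events = [(0, 1, float(t)) for t in (7, 3, 19, 0, 12, 5, 1, 16, 9, 14,
                                      2, 18, 6, 11, 4, 17, 8, 13, 10, 15)]
train, val, test = chronological_split(events)
```

Every training event precedes every validation event, and every validation event precedes every test event, so the model is never evaluated on interactions older than those it was trained on.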

6.1 Future edge prediction

In the first experiment, we tested transductive future edge prediction with random negative edge sampling, i.e., for each positive edge in the datasets, a negative edge with the same source and a random destination is sampled. The results are presented in Table 1. We also performed an experiment for the inductive future edge prediction setting, in which all the edges in the validation and test sets must contain nodes that have not been previously seen in the training set. The results for this experiment are reported in Table 2. The baselines' results were computed with DyGLib using the hyperparameter configurations described by Yu et al. (2023). Additional implementation-specific details of LDTGN and LDTGN-mem and their training methodology are provided in Appendix C. LDTGN achieves state-of-the-art or comparable results compared to the baselines in both the transductive and inductive future edge prediction settings. In benchmarks where the negative effects of missing updates are insignificant for small batch sizes, LDTGN and LDTGN-mem achieve comparable performance to DyGFormer. In the benchmarks where missing updates have a substantial influence, such as US Legis., LDTGN considerably outperforms the compared baselines since it completely removes all the missing updates by using a memory batch size of 1.
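The random negative sampling used in this experiment can be sketched as follows. The sketch is illustrative and makes its own assumptions: the function name `sample_negatives` is hypothetical, destinations are drawn uniformly from a given candidate set, and a sampled destination is not checked against the true one, so a negative may occasionally coincide with a positive edge.

```python
import random


def sample_negatives(positives, destinations, seed=0):
    """For each positive edge, keep its source and time, draw a random destination.

    positives:    list of (source, destination, time) tuples
    destinations: collection of candidate destination nodes
    """
    rng = random.Random(seed)          # seeded for reproducibility
    candidates = sorted(destinations)  # fixed order so the seed is meaningful
    return [(src, rng.choice(candidates), t) for src, _, t in positives]


positives = [(1, 10, 0.5), (2, 11, 0.7)]
negatives = sample_negatives(positives, {10, 11, 12}, seed=0)
```

Each negative edge shares the source and timestamp of its positive counterpart, so the model has to rank the true destination above a random one at the same moment in time.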

Table 1: AP for transductive future edge prediction with random negative sampling over five runs. The significantly best result for each benchmark appears in bold font.

Dataset	DyRep	TGAT	TGN	CAWN	EdgeBank	GraphMixer	DyGFormer	LDTGN (ours)	LDTGN-mem (ours)
Wikipedia	94.86±0.06	96.94±0.06	98.45±0.06	98.76±0.03	90.37±0.00	97.25±0.03	99.03±0.02	98.86±0.02	98.99±0.03
Reddit	98.22±0.04	98.52±0.02	98.63±0.06	99.11±0.01	94.86±0.00	97.31±0.01	99.22±0.01	98.61±0.01	99.28±0.02
MOOC	81.97±0.49	85.84±0.15	89.15±1.60	80.15±0.25	57.97±0.00	82.78±0.15	87.52±0.49	83.34±1.47	91.73±0.65
lastFM	71.92±2.21	73.42±0.21	77.07±3.97	86.99±0.06	79.29±0.00	75.61±0.24	93.00±0.12	90.81±0.01	91.22±0.31
Enron	82.38±3.36	71.12±0.97	86.53±1.11	89.56±0.09	83.53±0.00	82.25±0.16	92.47±0.12	98.10±0.01	92.28±0.32
Social Evo.	88.87±0.30	93.16±0.17	93.57±0.17	84.96±0.09	74.95±0.00	93.37±0.07	94.73±0.01	95.45±0.51	94.02±0.16
UCI	65.14±2.30	79.63±0.70	92.34±1.04	95.18±0.06	76.20±0.00	93.25±0.57	95.79±0.17	97.05±0.01	95.75±0.04
Flights	95.29±0.72	94.03±0.18	97.95±0.14	98.51±0.01	89.35±0.00	90.99±0.05	98.91±0.01	97.50±0.07	98.76±0.06
Can. Parl.	66.54±2.76	70.73±0.72	70.88±2.34	69.82±2.34	64.55±0.00	77.04±0.46	97.36±0.45	99.47±0.03	72.82±9.17
US Legis.	75.34±0.39	68.52±3.16	75.99±0.58	70.58±0.48	58.39±0.00	70.74±1.02	71.11±0.59	92.08±0.09	80.93±0.48
UN Trade	63.21±0.93	61.47±0.18	65.03±1.37	65.39±0.12	60.41±0.00	62.61±0.27	66.46±1.29	97.82±0.07	96.65±0.19
UN Vote	62.81±0.80	52.21±0.98	65.72±2.17	52.84±0.10	58.49±0.00	52.11±0.16	55.55±0.42	80.94±1.43	71.21±1.14
Contacts	95.98±0.15	96.28±0.09	96.89±0.56	90.26±0.28	92.58±0.00	91.92±0.03	98.29±0.01	98.19±0.03	98.78±0.04

Table 2: AP for inductive future edge prediction with random negative sampling over five different runs. The significantly best result for each benchmark appears in bold font.

Dataset	DyRep	TGAT	TGN	CAWN	GraphMixer	DyGFormer	LDTGN (ours)	LDTGN-mem (ours)
Wikipedia	92.43±0.37	96.22±0.07	97.83±0.04	98.24±0.03	96.65±0.02	98.59±0.03	98.74±0.02	98.40±0.04
Reddit	96.09±0.11	97.09±0.04	97.50±0.07	98.62±0.01	95.26±0.02	98.84±0.02	98.00±0.04	98.86±0.02
MOOC	81.07±0.44	85.50±0.19	89.04±1.17	81.42±0.24	81.41±0.21	86.96±0.43	82.73±1.52	90.61±0.32
lastFM	83.02±1.48	78.63±0.31	81.45±4.29	89.42±0.07	82.11±0.42	94.23±0.09	92.17±0.01	92.62±0.59
Enron	74.55±3.95	67.05±1.51	77.94±1.02	86.35±0.51	75.88±0.48	89.76±0.34	96.06±0.09	88.07±0.56
Social Evo.	90.04±0.47	91.41±0.16	90.77±0.86	79.94±0.18	91.86±0.06	93.14±0.04	94.37±0.68	91.31±0.22
UCI	57.48±1.87	79.54±0.48	88.12±2.05	92.73±0.06	91.19±0.42	94.54±0.12	94.92±0.01	93.00±0.12
Flights	92.88±0.73	88.73±0.33	95.03±0.60	97.06±0.02	83.03±0.05	97.79±0.02	95.60±0.10	97.31±0.16
Can. Parl.	54.02±0.76	55.18±0.79	54.10±0.93	55.80±0.69	55.91±0.82	87.74±0.71	97.83±0.06	58.05±3.08
US Legis.	57.28±0.71	51.00±3.11	58.63±0.37	53.17±1.20	50.71±0.76	54.28±2.87	83.76±0.44	65.75±1.57
UN Trade	57.02±0.69	61.03±0.18	58.31±3.15	65.24±0.21	62.17±0.31	64.55±0.62	97.43±0.07	89.21±1.15
UN Vote	54.62±2.22	52.24±1.46	58.85±2.51	49.94±0.45	50.68±0.44	55.93±0.39	81.29±1.41	63.54±2.09
Contacts	92.18±0.41	95.87±0.11	93.82±0.99	89.55±0.30	90.59±0.05	98.03±0.02	97.85±0.03	97.94±0.13

6.2 Memory and running time performance

We calculated the average number of learnable parameters required for each model to achieve its best performance and reported it in Figure 6. We also measured the average throughput at inference time for each model over all the datasets, where the throughput is defined as the number of edges the model can process in a single second. The results are shown in Figure 6. In both comparisons, LDTGN surpasses the other baselines by a large margin in terms of efficiency. Note that the throughput of the baselines was measured using a batch size at least as large as the one used for LDTGN; hence, the results in Figure 6 are also proportionate to the latency of LDTGN compared to the other baselines.

7 Conclusion

In this work, we introduced the missing updates phenomenon caused by using batches in memory-based models for dynamic graph learning. We showed a strict negative connection between the frequency of missing updates in datasets and the performance of the memory-based models, causing a trade-off with respect to their batch size. To balance this trade-off, we presented the decoupling strategy for designing temporal graph networks. Decoupling enables two types of batches – one for the memory module and the other for the prediction module. In this way, temporal graph networks can increase the frequency of the updates while still handling their arrival streams. In addition, we introduced LDTGN – a lightweight model for the task of future edge prediction that is highly efficient in terms of time and memory. LDTGN can be equipped with a heavier memory module when possible, allowing it to better capture long-term dependencies. We also showed by extensive experiments that LDTGN has outstanding performance for both transductive and inductive tasks, achieving state-of-the-art or comparable performance on most of the tested benchmarks.

References

Backstrom & Leskovec (2011) Backstrom, L. and Leskovec, J. Supervised random walks: predicting and recommending links in social networks. In Proceedings of the fourth ACM international conference on Web search and data mining, pp. 635–644, 2011.
Cho et al. (2014) Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using rnn encoder-decoder for statistical machine translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pp. 1724–1734, 2014.
Cini et al. (2023) Cini, A., Marisca, I., Bianchi, F. M., and Alippi, C. Scalable spatiotemporal graph neural networks. In Proceedings of the AAAI conference on artificial intelligence, volume 37(6), pp. 7218–7226, 2023.
Cong et al. (2023) Cong, W., Zhang, S., Kang, J., Yuan, B., Wu, H., Zhou, X., Tong, H., and Mahdavi, M. Do we really need complicated model architectures for temporal networks? The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023.
Ding et al. (2019) Ding, L., Han, B., Wang, S., Li, X., and Song, B. User-centered recommendation using us-elm based on dynamic graph model in e-commerce. International Journal of Machine Learning and Cybernetics, 10:693–703, 2019.
Fowler (2006) Fowler, J. H. Legislative cosponsorship networks in the us house and senate. Social networks, 28(4):454–465, 2006.
Gorochowski et al. (2018) Gorochowski, T. E., Grierson, C. S., and Di Bernardo, M. Organization of feed-forward loop motifs reveals architectural principles in natural and engineered networks. Science advances, 4(3):eaap9751, 2018.
Granovetter (1973) Granovetter, M. S. The strength of weak ties. American journal of sociology, 78(6):1360–1380, 1973.
Haghani & Keyvanpour (2019) Haghani, S. and Keyvanpour, M. R. A systemic analysis of link prediction in social network. Artificial Intelligence Review, 52:1961–1995, 2019.
Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
Huang et al. (2020) Huang, S., Hitti, Y., Rabusseau, G., and Rabbany, R. Laplacian change point detection for dynamic graphs. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 349–358, 2020.
Huang et al. (2023) Huang, S., Poursafaei, F., Danovitch, J., Fey, M., Hu, W., Rossi, E., Leskovec, J., Bronstein, M., Rabusseau, G., and Rabbany, R. Temporal graph benchmark for machine learning on temporal graphs. Advances in Neural Information Processing Systems, 36, 2023.
Kazemi et al. (2019) Kazemi, S. M., Goel, R., Eghbali, S., Ramanan, J., Sahota, J., Thakur, S., Wu, S., Smyth, C., Poupart, P., and Brubaker, M. Time2vec: Learning a vector representation of time. CoRR, abs/1907.05321, 2019.
Kazemi et al. (2020) Kazemi, S. M., Goel, R., Jain, K., Kobyzev, I., Sethi, A., Forsyth, P., and Poupart, P. Representation learning for dynamic graphs: A survey. The Journal of Machine Learning Research, 21(1):2648–2720, 2020.
Kumar et al. (2019) Kumar, S., Zhang, X., and Leskovec, J. Predicting dynamic embedding trajectory in temporal interaction networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1269–1278, 2019.
Liben-Nowell & Kleinberg (2003) Liben-Nowell, D. and Kleinberg, J. The link prediction problem for social networks. In Proceedings of the twelfth international conference on Information and knowledge management, pp. 556–559, 2003.
Ma et al. (2020) Ma, Y., Guo, Z., Ren, Z., Tang, J., and Yin, D. Streaming graph neural networks. In Proceedings of the 43rd international ACM SIGIR conference on research and development in information retrieval, pp. 719–728, 2020.
MacDonald et al. (2015) MacDonald, G. K., Brauman, K. A., Sun, S., Carlson, K. M., Cassidy, E. S., Gerber, J. S., and West, P. C. Rethinking agricultural trade relationships in an era of globalization. BioScience, 65(3):275–289, 2015.
Madan et al. (2011) Madan, A., Cebrian, M., Moturu, S., Farrahi, K., et al. Sensing the "health state" of a community. IEEE Pervasive Computing, 11(4):36–45, 2011.
Mangan & Alon (2003) Mangan, S. and Alon, U. Structure and function of the feed-forward loop network motif. Proceedings of the National Academy of Sciences, 100(21):11980–11985, 2003.
Nair & Hinton (2010) Nair, V. and Hinton, G. E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807–814, 2010.
Panzarasa et al. (2009) Panzarasa, P., Opsahl, T., and Carley, K. M. Patterns and dynamics of users’ behavior and interaction: Network analysis of an online community. Journal of the American Society for Information Science and Technology, 60(5):911–932, 2009.
Pareja et al. (2020) Pareja, A., Domeniconi, G., Chen, J., Ma, T., Suzumura, T., Kanezashi, H., Kaler, T., Schardl, T., and Leiserson, C. Evolvegcn: Evolving graph convolutional networks for dynamic graphs. In Proceedings of the AAAI conference on artificial intelligence, volume 34(04), pp. 5363–5370, 2020.
Pennebaker et al. (2001) Pennebaker, J. W., Francis, M. E., and Booth, R. J. Linguistic inquiry and word count: Liwc 2001. Mahway: Lawrence Erlbaum Associates, 71(2001):2001, 2001.
Poursafaei et al. (2022) Poursafaei, F., Huang, S., Pelrine, K., and Rabbany, R. Towards better evaluation for dynamic link prediction. Advances in Neural Information Processing Systems, 35:32928–32941, 2022.
Rossi et al. (2020) Rossi, E., Chamberlain, B., Frasca, F., Eynard, D., Monti, F., and Bronstein, M. Temporal graph networks for deep learning on dynamic graphs. CoRR, abs/2006.10637, 2020.
Sankar et al. (2020) Sankar, A., Wu, Y., Gou, L., Zhang, W., and Yang, H. Dysat: Deep neural representation learning on dynamic graphs via self-attention networks. In Proceedings of the 13th international conference on web search and data mining, pp. 519–527, 2020.
Sapiezynski et al. (2019) Sapiezynski, P., Stopczynski, A., Lassen, D. D., and Lehmann, S. Interaction data from the copenhagen networks study. Scientific Data, 6(1):315, 2019.
Shetty & Adibi (2004) Shetty, J. and Adibi, J. The enron email dataset database schema and brief statistical report. Information sciences institute technical report, University of Southern California, 4(1):120–128, 2004.
Simmel (1950) Simmel, G. The sociology of georg simmel, volume 92892. Simon and Schuster, 1950.
Strohmeier et al. (2021) Strohmeier, M., Olive, X., Lübbe, J., Schäfer, M., and Lenders, V. Crowdsourced air traffic data from the opensky network 2019–2020. Earth System Science Data, 13(2):357–366, 2021.
Toivonen et al. (2007) Toivonen, R., Kumpula, J. M., Saramäki, J., Onnela, J.-P., Kertész, J., and Kaski, K. The role of edge weights in social networks: modelling structure and dynamics. In Noise and Stochastics in Complex Systems and Finance, volume 6601, pp. 48–55. SPIE, 2007.
Trivedi et al. (2019) Trivedi, R., Farajtabar, M., Biswal, P., and Zha, H. Dyrep: Learning representations over dynamic graphs. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
Veličković et al. (2018) Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. Graph attention networks. 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.
Voeten et al. (2009) Voeten, E., Strezhnev, A., and Bailey, M. United Nations General Assembly Voting Data. Harvard Dataverse, 2009. URL https://doi.org/10.7910/DVN/LEJUQZ.
Wang et al. (2021a) Wang, L., Chang, X., Li, S., Chu, Y., Li, H., Zhang, W., He, X., Song, L., Zhou, J., and Yang, H. Tcl: Transformer-based dynamic graph modelling via contrastive learning. CoRR, abs/2105.07944, 2021a.
Wang et al. (2021b) Wang, Y., Chang, Y.-Y., Liu, Y., Leskovec, J., and Li, P. Inductive representation learning in temporal networks via causal anonymous walks. 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021b.
Xu et al. (2020) Xu, D., Ruan, C., Korpeoglu, E., Kumar, S., and Achan, K. Inductive representation learning on temporal graphs. 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020.
Yu et al. (2023) Yu, L., Sun, L., Du, B., and Lv, W. Towards better dynamic graph learning: New architecture and unified library. Advances in Neural Information Processing Systems, 36:67686–67700, 2023.
Yu et al. (2018) Yu, W., Cheng, W., Aggarwal, C. C., Zhang, K., Chen, H., and Wang, W. Netwalk: A flexible deep embedding approach for anomaly detection in dynamic networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2672–2681, 2018.

Appendix A Datasets statistics and descriptions

In our experiments we used the following dynamic graph datasets:

• Wikipedia (Kumar et al., 2019): Wikipedia edit requests log over one month, where the editing users and Wikipedia pages are represented as nodes and the edit requests are modeled as edges. The edges are timestamped and contain LIWC feature vectors (Pennebaker et al., 2001) of the requested text to post.

• Reddit (Kumar et al., 2019): Reddit post requests log over one month where the posting users and subreddits are represented as nodes and the posting requests are modeled as edges.

• MOOC (Kumar et al., 2019): Students’ access records to MOOC online courses, where students and content units (e.g., videos, answers, etc.) are described as nodes and the access actions (viewing a video, submitting an answer, etc.) are modeled as edges. The edges are timestamped and have four features describing the action.

• LastFM (Kumar et al., 2019): LastFM listening records over one month, where the LastFM users and the songs are represented as nodes and there is an edge between the users and the songs to which they listened. The edges are timestamped and do not contain any features.

• Enron (Shetty & Adibi, 2004): Email logs of the Enron employees over a period of three years, where the employees are modeled as nodes and a single edge represents an email sent between two employees. The edges are timestamped and do not contain any features.

• Social Evo. (Madan et al., 2011): Documentation of the everyday life of undergraduate students living in dormitories from October 2008 to May 2009. The data is represented as a mobile phone proximity network in which each edge has two features.

• UCI (Panzarasa et al., 2009): Message logs of the online community of students from the University of California, Irvine, where the students are modeled as nodes and a single edge represents a message sent between two students. The edges are timestamped with a granularity of seconds.

• Flights (Strohmeier et al., 2021): Tracked air traffic during the COVID-19 pandemic, where the airports are modeled as nodes and the edges are the tracked flights between two airports. The edges are timestamped and weighted. The weight of the edges indicates the number of flights between the airports in a day.

• Can. Parl. (Huang et al., 2020): Documented interactions between Canadian Members of Parliament from 2006 to 2019, where the Members of Parliament are described as nodes, two of which are connected by an edge if they both voted “yes” on a bill. The edges are timestamped and weighted. The weight of the edges indicates the number of times that one member voted “yes” for another member’s bill within one year.

• US Legis. (Fowler, 2006): Documented interactions in the US Senate, where legislators are modeled as nodes, two of which are connected by an edge if they co-sponsored a bill. The edges are timestamped and weighted. The weight of the edges indicates the number of times that two members of the US Congress co-sponsored a bill in a given term.

• UN Trade (MacDonald et al., 2015): Documented global food and agriculture trading connections spanning over 30 years, where nations are represented as nodes, two of which are connected by an edge if they have an agricultural import or export relation. The edges are timestamped and weighted. The weight of the edges is the sum of normalized agricultural import or export values between two countries.

• UN Vote (Voeten et al., 2009): Documentation of roll-call votes in the United Nations General Assembly from 1946 to 2020 where nations are represented as nodes, two of which are connected by an edge if they both voted “yes” for an item. The edges are timestamped and weighted. The weight of the edges is the number of times the two countries vote “yes” on a call.

• Contact (Sapiezynski et al., 2019): Physical proximity records documenting around 700 university students over a period of four weeks, where the students are modeled as nodes, two of which are connected by an edge if they are within close proximity of each other. The edges are timestamped and weighted. The weight of the edges specifies the physical proximity between two students.

The full statistics of the datasets as collected by Yu et al. (2023) are reported in Table 3.

Table 3: Datasets statistics.

Dataset	Domain	#Nodes	#Edges	#Edge Features	Bipartite	Duration
Wikipedia	Social	9,227	157,474	172	True	1 month
Reddit	Social	10,984	672,447	172	True	1 month
MOOC	Interaction	7,144	411,749	4	True	17 months
LastFM	Interaction	1,980	1,293,103	–	True	1 month
Enron	Social	184	125,235	–	False	3 years
Social Evo.	Proximity	74	2,099,519	2	False	8 months
UCI	Social	1,899	59,835	–	False	196 days
Flights	Transport	13,169	1,927,145	1	False	4 months
Can. Parl.	Politics	734	74,478	1	False	14 years
US Legis.	Politics	225	60,396	1	False	12 terms
UN Trade	Economics	255	507,497	1	False	32 years
UN Vote	Politics	201	1,035,742	1	False	72 years
Contact	Proximity	692	2,426,279	1	False	1 month

In Table 4 we report the ratio of inputs that depend on at least a single missing update in their 1-hop neighborhood. In Table 5 we report the average number of missing updates affecting the 1-hop neighborhood of the nodes. Tables 4 and 5 contain the missing updates statistics for all the datasets used in this work for various batch sizes.

Table 4: Ratio of inputs that depend on at least a single missing update in their 1-hop neighborhood.

Dataset	10	25	50	100	200
Wikipedia	0.23	0.42	0.55	0.67	0.76
Reddit	0.31	0.52	0.67	0.78	0.86
MOOC	0.88	0.95	0.98	0.99	0.99
LastFM	0.74	0.88	0.94	0.97	0.98
Enron	0.85	0.92	0.95	0.98	0.99
Social Evo.	0.90	0.96	0.98	0.99	0.99
UCI	0.70	0.85	0.91	0.95	0.97
Flights	0.82	0.90	0.94	0.96	0.98
Can. Parl.	0.90	0.96	0.98	0.99	0.99
US Legis.	0.90	0.96	0.98	0.99	0.99
UN Trade	0.90	0.96	0.98	0.99	0.99
UN Vote	0.90	0.96	0.98	0.99	0.99
Contacts	0.86	0.94	0.97	0.98	0.99

Table 5: Average number of missing updates affecting the 1-hop neighborhood of each input node.

Dataset	10	25	50	100	200
Wikipedia	0.25	0.65	1.19	2.02	3.28
Reddit	0.15	0.39	0.78	1.53	2.99
MOOC	0.72	1.45	2.38	3.90	6.57
LastFM	0.45	1.16	2.10	3.59	5.93
Enron	2.95	6.07	9.81	15.55	24.95
Social Evo.	1.86	3.15	5.22	9.77	19.22
UCI	0.95	2.12	3.67	5.98	9.31
Flights	2.82	5.77	8.79	11.97	14.55
Can. Parl.	4.41	11.46	22.33	40.65	66.87
US Legis.	4.19	10.10	17.41	25.26	31.09
UN Trade	4.37	11.21	21.35	37.41	57.00
UN Vote	3.75	8.56	14.35	21.47	28.01
Contacts	1.74	2.60	3.15	3.85	5.13

Appendix B Baselines descriptions

We used the following baselines for the evaluation experiments of dynamic graph learning:

• DyRep (Trivedi et al., 2019): DyRep is an RNN-based architecture that utilizes a temporal attention mechanism to exploit the dynamic structure of the graphs.

• TGAT (Xu et al., 2020): TGAT uses a time-encoding function and aggregates neighborhood information using self-attention to compute the embedding for each node.

• TGN (Rossi et al., 2020): TGN is a general architecture for CTDG learning tasks. It uses both a prediction module and a memory module to produce relevant and accurate predictions for each input at each moment in time. It does this by aggregating information from the neighborhood of each node and maintaining a learnable, RNN-based memory that is continually updated, which also addresses the staleness problem.

• CAWN (Wang et al., 2021b): The CAWN model is based on causal anonymous walks that are generated for each node. The walks are encoded using RNNs and aggregated to achieve the node representation.

• EdgeBank (Poursafaei et al., 2022): EdgeBank is a memorization algorithm that stores every update it has seen and, given an input, predicts according to a simple decision rule. The rule can be one of the following: whether the input was seen in the last few iterations (EdgeBank_th), whether it was seen in the last few time units (EdgeBank_tw), or whether it has already been seen a sufficient number of times (EdgeBank_re). EdgeBank can also use a decision rule based on infinite memory, i.e., it predicts positive for any previously seen edge and negative otherwise (EdgeBank_inf). The algorithm’s simplicity makes it significantly faster than any other model for dynamic graph learning. In our experiments, we report the best results of EdgeBank among all of its decision rule variations.

• GraphMixer (Cong et al., 2023): GraphMixer uses three components for the task of future edge prediction: a link-encoder based on an MLP and a fixed time-encoding function, a node-encoder that only performs neighborhood mean-pooling, and another MLP for edge prediction.

• DyGFormer (Yu et al., 2023): DyGFormer is a transformer-based architecture. To generate an encoding for a given interaction, DyGFormer generates a co-occurrence embedding of the interaction in addition to a neighborhood representation for each interacting node. Then it uses a patching technique on historical representations of the interacting nodes to better capture long-term temporal dependencies. The patches are then sent to a transformer and its outputs are averaged to create the final representation.
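
To illustrate how little machinery the EdgeBank baseline described above needs, the following is a minimal sketch of two of its decision rules, the infinite-memory rule and the time-window rule. The class name, method names, and the `window` parameter are hypothetical choices made for this illustration; this is not the reference implementation of Poursafaei et al. (2022).

```python
class EdgeBankSketch:
    """Toy memorization baseline: remembers when each edge was last seen."""

    def __init__(self, window=None):
        # window=None mimics the infinite-memory rule (EdgeBank_inf).
        # Otherwise only edges seen within the last `window` time units
        # are predicted as positive (time-window rule, EdgeBank_tw).
        self.window = window
        self.last_seen = {}  # (src, dst) -> timestamp of the latest update

    def update(self, src, dst, t):
        """Store an observed edge together with its timestamp."""
        self.last_seen[(src, dst)] = t

    def predict(self, src, dst, t):
        """Return 1.0 if the rule considers the edge positive, else 0.0."""
        seen_at = self.last_seen.get((src, dst))
        if seen_at is None:
            return 0.0
        if self.window is None:
            return 1.0
        return 1.0 if t - seen_at <= self.window else 0.0
```

Because both operations are single dictionary accesses, the cost per update or prediction does not depend on the size of the graph, which is consistent with the speed advantage noted above.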

Appendix C Additional Implementation Details

C.1 Supporting additional update types

In Section 5.2 we described how to handle an edge addition update. To also support the removal of the edge $e_{i,j}$, $t_{i,j}$ should be set to $0$. Similarly, when the node $v_{i}$ is added, $t_{i}$ should be set to the current time, and when it is removed, $t_{i}$ should be set to $0$.
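
The four update types can be sketched as operations on a pair of timestamp tables. This is only an illustration under stated assumptions: the class and method names are hypothetical, and the edge addition rule (storing the current time in $t_{i,j}$) is assumed here by analogy with node addition rather than taken from the text above.

```python
class TimestampMemory:
    """Toy store for the timestamps t_{i,j} (edges) and t_i (nodes)."""

    def __init__(self):
        self.edge_time = {}  # (i, j) -> t_{i,j}
        self.node_time = {}  # i -> t_i

    def add_edge(self, i, j, now):
        # Assumption: an edge addition records the current time.
        self.edge_time[(i, j)] = now

    def remove_edge(self, i, j):
        # Edge removal resets t_{i,j} to 0.
        self.edge_time[(i, j)] = 0

    def add_node(self, i, now):
        # Node addition sets t_i to the current time.
        self.node_time[i] = now

    def remove_node(self, i):
        # Node removal resets t_i to 0.
        self.node_time[i] = 0
```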

C.2 Node classification

To adjust LDTGN for dynamic node classification, the $\mathrm{MERGE}$ function needs to be removed so that the prediction operation is applied directly to the node embedding. The Wikipedia dataset can also be used for dynamic node classification; we therefore used it to evaluate our model against the other baselines:

Table 6: AUC-ROC for the node classification task on the Wikipedia dataset.

Dataset	DyRep	TGAT	TGN	CAWN	GraphMixer	DyGFormer	LDTGN (ours)
Wikipedia	86.39±0.98	84.09±1.27	86.38±2.34	84.88±1.33	86.80±0.79	87.44±1.08	86.71±0.44

For the node classification task, LDTGN achieves performance comparable to the previous state of the art while still being the most efficient in terms of throughput and latency.

C.3 Further implementation details

For the $\mathrm{TDE}$ of LDTGN we used an $\mathrm{MLP}$ with two hidden layers and two ReLU activation layers (Nair & Hinton, 2010). Each linear layer of $\mathrm{TDE}$ outputs a vector of length 100. Before applying $\mathrm{TDE}$, the time difference needs to be normalised to ease the learning process. We used Equation 20 to normalise the time difference, where $C$ is the length of the dataset.

\mathrm{normalise}(t)=\frac{\log(1+t)}{\log(1+C)}

(20)
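
As a small worked illustration of Equation 20, the sketch below computes the normalised time difference. The function name and the input checks are illustrative additions, and $C$ is treated simply as a positive constant supplied by the caller.

```python
import math


def normalise(t, C):
    """Equation 20: log(1 + t) / log(1 + C)."""
    if t < 0 or C <= 0:
        raise ValueError("expected t >= 0 and C > 0")
    # log1p(x) computes log(1 + x) accurately for small x.
    return math.log1p(t) / math.log1p(C)
```

With this form, a time difference of zero maps to 0 and a time difference of $C$ maps to 1, so differences between these two values fall in the unit interval on a logarithmic scale.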

For LDTGN-mem, we used Time2Vec as the $\mathrm{TDE}$ function. Time2Vec utilizes the cosine function, thus omitting the need for normalization.

The merge function of LDTGN and LDTGN-mem is an $\mathrm{MLP}$ that maps multiple input vectors into a single value representing the probability that the edge is positive. The $\mathrm{MLP}$ first applies a linear layer that maps the three vectors to a single vector. It then reduces the vector’s dimension to 80, then 10, and finally 1. A ReLU is applied after each dimensionality reduction. Finally, a sigmoid function is applied to the result to obtain the probability that the edge is positive.

In practice, it is challenging to use the full neighborhood of the input nodes to compute predictions while sustaining a reasonable throughput, since the neighborhood of each node is expected to grow over time. Thus, we implemented our models using the recent-neighbors sampling strategy suggested by Rossi et al. (2020), in which only the $k$ neighbors of each hop that were most recently involved in an update are used for computing the predictions. For our models, we used $k=20$.
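
One simple way to realise recent-neighbor sampling is a bounded buffer per node, sketched below. This is a hedged illustration and not the implementation of Rossi et al. (2020): the class name is hypothetical, edges are assumed to be undirected, and the buffer keeps the $k$ most recent interactions, so a neighbor that interacted several times may appear more than once.

```python
from collections import defaultdict, deque


class RecentNeighbors:
    """Keeps the k most recent interactions of every node."""

    def __init__(self, k=20):
        self.k = k
        # deque(maxlen=k) silently drops the oldest entry once it is full.
        self.buffers = defaultdict(lambda: deque(maxlen=k))

    def update(self, src, dst, t):
        """Record an interaction on both endpoints (undirected assumption)."""
        self.buffers[src].append((dst, t))
        self.buffers[dst].append((src, t))

    def neighbors(self, node):
        """Return up to k (neighbor, timestamp) pairs, oldest first."""
        return list(self.buffers.get(node, ()))
```

The bounded buffer keeps the per-node memory and the per-prediction neighborhood size constant, regardless of how large the true neighborhood grows.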

C.4 Training

We trained the models for 100 epochs with a patience of 20 epochs before early stopping. We used binary cross-entropy loss as the objective function and optimized the models using the Adam optimizer with a learning rate of $10^{-4}$.
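
The training schedule above can be sketched as a generic early-stopping loop. The sketch assumes a hypothetical callback `run_epoch` that trains for one epoch and returns a validation score for which higher is better; neither the callback nor the choice of validation metric is specified in the text.

```python
def train_with_early_stopping(run_epoch, max_epochs=100, patience=20):
    """Run up to max_epochs, stopping after `patience` epochs without improvement."""
    best_score = float("-inf")
    best_epoch = -1
    for epoch in range(max_epochs):
        score = run_epoch(epoch)  # hypothetical: train one epoch, return validation score
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            # No improvement during the last `patience` epochs.
            break
    return best_epoch, best_score
```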

All the experiments were performed on Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz and NVIDIA GeForce RTX 3090.

Appendix D Additional results

In Table 7 and Table 8 we report the AUC-ROC of our proposed model and baselines for the transductive and inductive future edge prediction tasks, respectively.

Table 7: AUC-ROC for transductive future edge prediction with random negative sampling over five runs. The significantly best result for each benchmark appears in bold font.

Dataset	DyRep	TGAT	TGN	CAWN	EdgeBank	GraphMixer	DyGFormer	LDTGN (ours)	LDTGN-mem (ours)
Wikipedia	94.37 ± 0.09	96.67 ± 0.07	98.37 ± 0.07	98.54 ± 0.04	90.78 ± 0.00	96.92 ± 0.03	98.91 ± 0.02	98.67±0.01	98.90±0.05
Reddit	98.17 ± 0.05	98.47 ± 0.02	98.60 ± 0.06	99.01 ± 0.01	95.37 ± 0.00	97.17 ± 0.02	99.15 ± 0.01	98.20±0.02	99.25±0.02
MOOC	85.03 ± 0.58	87.11 ± 0.19	91.21 ± 1.15	80.38 ± 0.26	60.86 ± 0.00	84.01 ± 0.17	87.91 ± 0.58	82.43±1.72	93.33±0.54
LastFM	71.16 ± 1.89	71.59 ± 0.18	78.47 ± 2.94	85.92 ± 0.10	83.77 ± 0.00	73.53 ± 0.12	93.05 ± 0.10	90.79±0.01	91.68±0.51
Enron	84.89 ± 3.00	68.89 ± 1.10	88.32 ± 0.99	90.45 ± 0.14	87.05 ± 0.00	84.38 ± 0.21	93.33 ± 0.13	98.31±0.01	93.35±0.42
Social Evo.	90.76 ± 0.21	94.76 ± 0.16	95.39 ± 0.17	87.34 ± 0.08	81.60 ± 0.00	95.23 ± 0.07	96.30 ± 0.01	96.82±0.25	95.93±0.06
UCI	68.77 ± 2.34	78.53 ± 0.74	92.03 ± 1.13	93.87 ± 0.08	77.30 ± 0.00	91.81 ± 0.67	94.49 ± 0.26	96.22±0.03	94.79±0.07
Flights	95.95 ± 0.62	94.13 ± 0.17	98.22 ± 0.13	98.45 ± 0.01	90.23 ± 0.00	91.13 ± 0.01	98.93 ± 0.01	96.98±0.09	98.82±0.07
Can. Parl.	73.35 ± 3.67	75.69 ± 0.78	76.99 ± 1.80	75.70 ± 3.27	64.14 ± 0.00	83.17 ± 0.53	97.76 ± 0.41	99.68±0.02	77.66±7.92
US Legis.	82.28 ± 0.32	75.84 ± 1.99	83.34 ± 0.43	77.16 ± 0.39	62.57 ± 0.00	76.96 ± 0.79	77.90 ± 0.58	94.88±0.10	87.96±0.53
UN Trade	67.44 ± 0.83	64.01 ± 0.12	69.10 ± 1.67	68.54 ± 0.18	66.75 ± 0.00	65.52 ± 0.51	70.20 ± 1.44	97.91±0.06	97.16±0.17
UN Vote	67.18 ± 1.04	52.83 ± 1.12	69.71 ± 2.65	53.09 ± 0.22	62.97 ± 0.00	52.46 ± 0.27	57.12 ± 0.62	86.81±0.87	77.33±1.04
Contact	96.48 ± 0.14	96.95 ± 0.08	97.54 ± 0.35	89.99 ± 0.34	94.34 ± 0.00	93.94 ± 0.02	98.53 ± 0.01	98.58±0.01	99.06±0.04

Table 8: AUC-ROC for inductive future edge prediction with random negative sampling over five runs. The significantly best result for each benchmark appears in bold font.

Dataset	DyRep	TGAT	TGN	CAWN	GraphMixer	DyGFormer	LDTGN (ours)	LDTGN-mem (ours)
Wikipedia	91.49±0.45	95.90±0.09	97.72±0.03	98.03±0.04	95.57±0.20	98.48±0.03	98.23±0.00	98.30±0.06
Reddit	96.05±0.12	96.98±0.04	97.39±0.07	98.42±0.02	93.80±0.07	98.71±0.01	97.30±0.03	98.56±0.05
MOOC	84.03±0.49	86.84±0.17	91.24±0.99	81.86±0.25	81.43±0.19	87.62±0.51	81.88±1.74	92.36±0.30
LastFM	82.24±1.51	76.99±0.29	82.61±3.15	87.82±0.12	70.84±0.85	94.08±0.08	91.75±0.01	92.57±0.86
Enron	76.34±4.20	64.63±1.74	78.83±1.11	87.02±0.50	72.33±0.99	90.69±0.26	95.77±0.13	88.46±0.79
Social Evo.	91.18±0.49	93.41±0.19	93.43±0.59	84.73±0.27	93.71±0.18	95.29±0.03	96.03±0.37	94.01±0.2
UCI	58.08±1.81	77.64±0.38	86.68±2.29	90.40±0.11	84.49±1.82	92.63±0.13	92.83±0.02	90.83±0.21
Flights	93.56±0.70	88.64±0.35	95.92±0.43	96.86±0.02	82.48±0.01	97.80±0.02	94.44±0.21	97.39±0.21
Can. Parl.	55.27±0.49	56.51±0.75	55.86±0.75	58.83±1.13	55.83±1.07	89.33±0.48	98.73±0.05	58.59±4.42
US Legis.	61.07±0.56	48.27±3.50	62.38±0.48	51.49±1.13	50.43±1.48	53.21±3.04	88.19±0.24	72.45±1.31
UN Trade	58.82±0.98	62.72±0.12	59.99±3.50	67.05±0.21	63.76±0.07	67.25±1.05	97.47±0.07	90.26±1.28
UN Vote	55.13±3.46	51.83±1.35	61.23±2.71	48.34±0.76	50.51±1.05	56.73±0.69	86.99±0.86	68.99±1.66
Contact	91.89±0.38	96.53±0.10	94.84±0.75	89.07±0.34	93.05±0.09	98.30±0.02	98.26±0.02	98.22±0.14

Appendix E Decoupling potential speedup analysis

In this section we analyze the potential speedup that models can achieve by using the decoupling strategy. The decoupling strategy does not affect running time directly; rather, it helps accelerate models without compromising their precision. Figure 3 demonstrates this idea: one can decouple a temporal model and increase its batch size significantly while maintaining a constant memory batch size. This leads to a running time improvement without compromising the precision of the decoupled model. Consider a sequence containing updates for a model to apply and inputs for it to predict. Denote the time it takes the memory module to apply all the updates in the sequence as $t_{memory}$, and the time it takes the prediction module to compute the predictions for all the inputs in the sequence as $t_{prediction}$. The total time it takes the model to process the sequence is:

T_{total}=t_{memory}+t_{prediction}

(21)

By decoupling the network one can reduce this time to

T_{decouple\_total}=\frac{t_{prediction}}{BS_{new}/BS_{old}}+t_{memory}

(22)

where $BS_{old}$ is the batch size of the model before decoupling and $BS_{new}$ is the batch size of the decoupled model. Hence, the potential speedup of the model is:

speedup=\frac{BS_{new}\cdot t_{prediction}+BS_{new}\cdot t_{memory}}{BS_{old}\cdot t_{prediction}+BS_{new}\cdot t_{memory}}

(23)
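
To make Equations 21 to 23 concrete, the short sketch below evaluates the speedup as the ratio of the total time before decoupling to the total time after decoupling. The function name and the example timings are illustrative assumptions, not measurements from the experiments.

```python
def decoupling_speedup(t_memory, t_prediction, bs_old, bs_new):
    """Potential speedup from decoupling, following Equations 21-23."""
    total = t_memory + t_prediction                          # Equation 21
    decoupled = t_prediction / (bs_new / bs_old) + t_memory  # Equation 22
    return total / decoupled                                 # Equation 23


# Hypothetical timings: prediction dominates (9 s vs. 1 s for memory),
# and the prediction batch size grows from 100 to 900.
example = decoupling_speedup(t_memory=1.0, t_prediction=9.0, bs_old=100, bs_new=900)
```

In this hypothetical setting the prediction time drops from 9 s to 1 s while the memory time stays at 1 s, so the total falls from 10 s to 2 s, a speedup of 5. As $BS_{new}$ grows, the speedup approaches $(t_{memory}+t_{prediction})/t_{memory}$, so the memory module bounds the achievable gain.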