
Leveraging Temporal Graph Networks Using Module Decoupling

Or Feldman
Department of Computer Science
Technion – Israel Institute of Technology
orfeldman@campus.technion.ac.il
Chaim Baskin
Department of Computer Science
Technion – Israel Institute of Technology
chaimbaskin@technion.ac.il
Abstract

Current memory-based methods for dynamic graph learning utilize batch processing to handle rapid updates efficiently. However, the adoption of batches introduces a phenomenon we term missing updates, which adversely affects the performance of memory-based models. In this work, we analyze the impact of missing updates on dynamic graph learning models and propose a decoupling strategy to mitigate these effects. Implementing this strategy, we present the Lightweight Decoupled Temporal Graph Network (LDTGN), a memory-based model with a minimal number of learnable parameters that can handle a high frequency of updates. We validated our approach across diverse dynamic graph benchmarks. LDTGN surpassed the average precision of previous methods by over 20% in scenarios demanding frequent graph updates, such as US Legis or UN Trade. In the vast majority of the benchmarks, LDTGN achieves state-of-the-art or comparable results while operating with significantly higher throughput than existing baselines. The code to replicate our experiments is available at this url.

1 Introduction

Dynamic graphs are commonly used to describe real-world dynamic systems, where the interacting elements are modeled as nodes, and the interactions between two elements are represented as edges. Each edge is usually labeled with a timestamp indicating its time of occurrence. Item recommendation on e-commerce platforms (Ding et al., 2019), friendship suggestion on social networks (Backstrom & Leskovec, 2011; Haghani & Keyvanpour, 2019), anomaly detection on communication networks (Yu et al., 2018), and traffic forecasting (Cini et al., 2023) are all practical tasks that can be modeled using dynamic graphs.

Although most graph-related real-world tasks are time-evolving, deep learning approaches usually focus on problems described using static graphs, which do not change over time. Moreover, it has been shown that ignoring the dynamic nature of a system by abstracting it with static graphs is suboptimal (Rossi et al., 2020; Xu et al., 2020). A dynamic representation of a system, on the other hand, is often able to capture its evolving behavior (Simmel, 1950; Granovetter, 1973; Mangan & Alon, 2003; Toivonen et al., 2007; Gorochowski et al., 2018).

Dynamic graph approaches are often based on discrete-time (Liben-Nowell & Kleinberg, 2003; Sankar et al., 2020; Pareja et al., 2020) or continuous-time (Trivedi et al., 2019; Ma et al., 2020; Cong et al., 2023) settings. In discrete-time settings, data are received as a sequence of snapshots describing the full graph structure at specific times, while in the more flexible continuous-time setting, a single update to the graph can happen at any moment. The setting in which deep learning models for dynamic graphs operate at inference time can be roughly divided into the following types: streaming, deployed, and live update (Huang et al., 2023). In this work, we focus on continuous-time dynamic graphs in the streaming context, in which the models may be updated upon receiving new information, but cannot perform backpropagation due to the high throughput required.

Memory-based models for dynamic graphs are designed to support the assimilation of new information through graph updates during the inference phase. To do this, they manage a memory unit that represents the current state of the dynamic graph. This memory unit usually includes the current structure of the dynamic graph, data-specific information such as node and edge features, timestamps of previous updates, and learnable information computed by the model.

In the streaming setting for continuous-time dynamic graphs, memory-based deep networks have to use batches to keep up with the stream of incoming updates, which means they process multiple updates in parallel. This introduces a new problem: updates are not taken into account when the model predicts inputs that share their batch. In Section 3, we formally define this undesirable phenomenon as missing updates. In this work, we suggest a decoupling strategy that minimizes the negative impact of missing updates while still using batches. Guided by this strategy, we have built the Lightweight Decoupled Temporal Graph Network (LDTGN), an efficient memory-based model for dynamic graph learning that outperforms most established baselines in terms of both running time and performance.

To summarize, this work makes the following contributions:

  • We introduce and analyze the problem of missing updates when using memory-based models.

  • We suggest a novel decoupling methodology for building deep learning memory-based models for dynamic graph learning.

  • Based on the suggested methodology, we propose a new lightweight model for dynamic graph learning tasks that can operate at high streaming rates and with a significantly smaller number of parameters than existing baselines.

  • We evaluated LDTGN on various transductive and inductive benchmarks for dynamic graph learning and achieved state-of-the-art or comparable performance on most of the tested benchmarks, while outperforming previous methods in terms of throughput.

2 Background

A static graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ is a tuple of a vertex set $\mathcal{V}$ and an edge set $\mathcal{E}$, s.t., each $\bm{e}\in\mathcal{E}$ is a tuple of two vertices from $\mathcal{V}$. $\mathcal{G}$ is often equipped with feature functions $F_{\mathcal{V}}:\mathcal{V}\rightarrow\mathbb{R}^{n}$ and $F_{\mathcal{E}}:\mathcal{E}\rightarrow\mathbb{R}^{n}$ that map a vertex or an edge to an $n$-dimensional vector representing its features. A Continuous-Time Dynamic Graph (CTDG) is a sequence $\mathcal{Q}=\{u_{t_{1}},u_{t_{2}},\ldots,u_{t_{m}}\}$ of $m$ timestamped updates on the graph. An update $u_{t}$ that occurs at time $t$ can be one of the following: node addition, node removal, edge addition, or edge removal. $\mathcal{G}[\mathcal{Q}(t)]$ is the static graph obtained by applying to $\mathcal{G}$ all the updates from $\mathcal{Q}$ that have occurred until time $t$.
The $k$-hop neighborhood of a node $v_{i}$ at time $t$ is defined by:

$$\mathcal{N}_{i}^{0}(t)=\{v_{i}\} \tag{1}$$
$$\mathcal{N}_{i}^{k}(t)=\{v_{j}\mid v_{u}\in\mathcal{N}_{i}^{k-1}(t),\ (v_{u},v_{j})\in\mathcal{G}[\mathcal{Q}(t)]\} \tag{2}$$
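
As a rough illustration of Equations 1 and 2, the following Python sketch builds the edge set of the graph at time $t$ from a stream of updates and expands the neighborhood hop by hop. The update format `(timestamp, source, destination)`, the restriction to edge additions, the inclusive treatment of "until time $t$", and the function names are assumptions made for this example only.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical update format (not from the paper): (timestamp, source, destination).
# Only edge additions are modeled to keep the sketch short.
Update = Tuple[float, int, int]


def edges_until(updates: Iterable[Update], t: float) -> Set[Tuple[int, int]]:
    """Edge set of G[Q(t)]: every edge addition with timestamp <= t (assumed inclusive)."""
    return {(u, v) for (ts, u, v) in updates if ts <= t}


def k_hop(updates: List[Update], node: int, k: int, t: float) -> Set[int]:
    """N_i^k(t), following Equations (1) and (2) literally on directed tuples."""
    edges = edges_until(updates, t)
    frontier = {node}  # Equation (1): N_i^0(t) = {v_i}
    for _ in range(k):
        # Equation (2): destinations of edges whose source lies in N_i^{k-1}(t)
        frontier = {v for (u, v) in edges if u in frontier}
    return frontier


stream = [(1.0, 1, 2), (2.0, 2, 3), (3.0, 4, 5), (4.0, 3, 6)]
print(k_hop(stream, node=1, k=1, t=2.5))  # {2}
print(k_hop(stream, node=1, k=2, t=2.5))  # {3}
print(k_hop(stream, node=1, k=3, t=2.5))  # set(): edge (3, 6) arrives after t
```

Note that, read literally, Equation 2 yields the nodes reached after exactly $k$ steps; an implementation that needs the cumulative neighborhood would take the union over all hops.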

As a result of the growing interest in CTDG with a stream of updates, several techniques were recently developed (Kumar et al., 2019; Trivedi et al., 2019; Xu et al., 2020; Wang et al., 2021a, b; Cong et al., 2023). Many of these methods are specific cases of the Temporal Graph Network (TGN, Rossi et al., 2020) model. TGN is a general memory-based deep learning architecture designed to learn on CTDG while achieving throughput suitable for streaming tasks. The primary concept of TGN is to maintain states, namely node features, which are updated with each modification to the graph. To achieve this objective, TGN utilizes two central modules: memory and prediction.

Memory module

The memory module is responsible for applying the updates to the graph and updating the states accordingly. When a new batch of updates arrives, the memory module applies a message function that generates a vector for each node involved in each update. If the update is the addition of edge $e_{i,j}$ from node $v_{i}$ to node $v_{j}$ at time $t$, then the messages of $v_{i}$ and $v_{j}$ are:

$$m_{i}(t)=\mathrm{msg_{s}}(s_{i}(t^{-}),s_{j}(t^{-}),\Delta t_{i},F_{\mathcal{E}}(e_{i,j})) \tag{3}$$
$$m_{j}(t)=\mathrm{msg_{d}}(s_{j}(t^{-}),s_{i}(t^{-}),\Delta t_{j},F_{\mathcal{E}}(e_{i,j})) \tag{4}$$

where $s_{u}(t^{-})$ is the state of $v_{u}$ prior to $t$, and $\Delta t_{u}$ is the time elapsed since $v_{u}$ last received an update. $\mathrm{msg_{s}}$ and $\mathrm{msg_{d}}$ may have learnable parameters. Then, all the messages in the batch are aggregated into a single message per node, s.t., if node $v_{i}$ is involved in updates at times $t_{1}\leq t_{2}\leq\ldots\leq t_{n}=t$, then:

$$\overline{m}_{i}(t)=\mathrm{agg}(m_{i}(t_{1}),m_{i}(t_{2}),\ldots,m_{i}(t_{n})) \tag{5}$$

The aggregation function can, for example, keep only $m_{i}(t_{n})$ and neglect any previous messages in the batch. Finally, the memory updater updates the state of the nodes:

$$s_{i}(t)=\mathrm{mem}(\overline{m}_{i}(t),s_{i}(t^{-})) \tag{6}$$

The $\mathrm{mem}$ function is a memory-based neural network such as LSTM (Hochreiter & Schmidhuber, 1997) or GRU (Cho et al., 2014).
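
The flow of Equations 3 to 6 can be sketched in a few lines of dependency-free Python. This is only an illustration under stated assumptions: the message function is plain concatenation without learnable parameters, the aggregator keeps the last message, and `mem` is a fixed averaging rule standing in for the GRU or LSTM cell. The names `apply_batch`, `DIM`, and the edge tuple layout are hypothetical.

```python
from typing import Dict, List, Tuple

State = List[float]
Edge = Tuple[float, int, int, List[float]]  # hypothetical: (t, src, dst, edge features)

DIM = 2  # hypothetical state size


def msg(s_self: State, s_other: State, dt: float, feat: List[float]) -> List[float]:
    """Equations (3)/(4) with a concatenation message (no learnable parameters)."""
    return s_self + s_other + [dt] + feat


def agg(messages: List[List[float]]) -> List[float]:
    """Equation (5): keep only the most recent message m_i(t_n)."""
    return messages[-1]


def mem(message: List[float], s_prev: State) -> State:
    """Equation (6) stand-in: average the old state with the partner state carried
    in the message. A real model would use a GRU or LSTM cell here."""
    partner = message[DIM:2 * DIM]
    return [0.5 * a + 0.5 * b for a, b in zip(s_prev, partner)]


def apply_batch(states: Dict[int, State], last_seen: Dict[int, float],
                batch: List[Edge]) -> None:
    """Update `states` and `last_seen` in place with one temporally sorted batch."""
    inbox: Dict[int, List[List[float]]] = {}
    latest: Dict[int, float] = {}
    for t, i, j, feat in batch:
        # Every message reads s(t^-): the states as they were before this batch.
        inbox.setdefault(i, []).append(msg(states[i], states[j], t - last_seen[i], feat))
        inbox.setdefault(j, []).append(msg(states[j], states[i], t - last_seen[j], feat))
        latest[i] = latest[j] = t
    for node, messages in inbox.items():
        states[node] = mem(agg(messages), states[node])
        last_seen[node] = latest[node]


states = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [0.0, 0.0]}
last_seen = {1: 0.0, 2: 0.0, 3: 0.0}
apply_batch(states, last_seen, [(1.0, 1, 2, []), (2.0, 2, 3, [])])
print(states)  # node 2 keeps only its message from the later edge (2, 3)
```

In this toy run, node 2 takes part in both updates, so the last-message aggregator discards what it received from the edge with node 1; this is the kind of information loss that grows with the batch size.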

Prediction module

The prediction module computes the predictions for the inputs in a given batch, e.g., potential edges in a link prediction task. First, the prediction module reads from the memory module all the states of the nodes in the neighborhood of any node involved in the input. Then it generates a new embedding for each node based on its state and the states of its neighbors. For example, let $[\bm{\cdot}\,||\,\bm{\cdot}]$ denote vector concatenation; then the embedding based on the $1$-hop neighborhood of node $v_{i}$ is:

$$z_{i}(t)=\sum_{v_{j}\in\mathcal{N}_{i}^{1}(t)}h(v_{i}(t),v_{j}(t),F_{\mathcal{E}}(e_{i,j})) \tag{7}$$

where $v_{u}(t)=[F_{\mathcal{V}}(v_{u})\,||\,s_{u}(t^{-})\,||\,\Delta t_{u}]$ and $h$ is a learnable function. Using the neighborhood of a node in the graph to compute its embedding averts the staleness problem (Kazemi et al., 2020).

For the task of future link prediction of edge $e_{i,j}$ at time $t$, TGN computes the probability that the edge exists by:

$$p_{i,j}(t)=\mathrm{merge}(z_{i}(t),z_{j}(t)) \tag{8}$$

where $\mathrm{merge}$ is a learnable function such as an $\mathrm{MLP}$.

3 Problem statement

The processing sequence of memory-based models for dynamic graph learning involves receiving a new batch containing both graph updates and inputs for prediction. This utilization of batches allows memory-based models to achieve a reasonable throughput during inference time (Kumar et al., 2019; Rossi et al., 2020; Wang et al., 2021b; Cong et al., 2023). In the streaming setting, where the graph receives new updates at extremely high rates, it is crucial for the model to have sufficient throughput. Otherwise, the buffer that holds the pending updates for the model will overflow.

Memory-based models have a well-defined flow of operation upon receiving a new batch. Initially, they compute their predictions for the inputs in the batch. This operation is performed in parallel by using the current states and the current graph structure as saved in their memory. Subsequently, they process all the updates in the batch and update their inner memory accordingly. This flow of operations introduces the undesirable phenomenon defined as missing updates.

Formally, given a batch $\mathcal{Q}=\{x_{t_{1}},x_{t_{2}},\ldots,x_{t_{m}}\}$ of size $m$, where $x_{t_{i}}$ can be an update to the graph $u_{t_{i}}$ or an input to predict $p_{t_{i}}$, we say that $u_{t_{i}}$ is a missing update if there exists an input $p_{t_{j}}$, s.t., $i<j$ and the nodes involved in $u_{t_{i}}$ are in the neighborhood of the nodes involved in $p_{t_{j}}$. Inputs to the model that depend on missing updates are harder to predict since the model's memory of their neighborhood is outdated at the time of the prediction.
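
The definition can be checked mechanically. The sketch below is a minimal illustration, assuming a hypothetical batch format of `(kind, src, dst)` items in temporal order and a caller-supplied `neighborhood` map; it also reads "are in the neighborhood" as "at least one endpoint is in the neighborhood", which is an interpretation rather than something the definition states.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical batch item: (kind, src, dst) with kind in {"update", "input"}.
# The list order stands for the temporal order inside the batch.
Item = Tuple[str, int, int]


def missing_updates_per_input(batch: List[Item],
                              neighborhood: Dict[int, Set[int]]) -> Dict[int, List[int]]:
    """Map each input position j to the positions i < j of the updates it misses."""
    result: Dict[int, List[int]] = {}
    for j, (kind, c, d) in enumerate(batch):
        if kind != "input":
            continue
        # Nodes whose state the prediction for (c, d) reads.
        region = {c, d} | neighborhood.get(c, set()) | neighborhood.get(d, set())
        result[j] = [i for i, (k, a, b) in enumerate(batch[:j])
                     if k == "update" and (a in region or b in region)]
    return result


batch = [("update", 1, 2), ("update", 2, 3), ("update", 4, 5), ("input", 3, 6)]
missed = missing_updates_per_input(batch, neighborhood={})
print(missed)  # {3: [1]}: the input (3, 6) misses the update (2, 3)

inputs_affected = sum(1 for hits in missed.values() if hits) / len(missed)
average_missed = sum(len(hits) for hits in missed.values()) / len(missed)
print(inputs_affected, average_missed)  # 1.0 1.0
```

The two summary numbers at the end correspond to the quantities one would track per batch: the share of inputs that depend on at least one missing update, and the average number of missing updates per input.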

3.1 Empirical analysis

We examined the incidence and impact of the missing updates phenomenon on various real-world datasets for dynamic graphs. We measured the average ratio of inputs in a batch that depend on at least a single missing update. In addition, we measured the average number of missing updates that affect a single input to the model. In both cases, we examined missing updates within the $1$-hop neighborhood of the input nodes for different batch sizes and reported the results in Figure 1(a) and Figure 1(b), respectively. We also tested a standard TGN trained on these datasets with different batch sizes. In Figure 1(c), we report the average precision of TGN for each batch size, normalized by the average precision achieved with a batch size of 10. In Appendix A, we supply the full missing updates statistics for all the datasets used in this work.

Figure 1: The incidence of missing updates in real-world datasets as a function of the batch size and their impact on the performance of TGN. In (a), the ratio of inputs that depend on at least a single missing update increases significantly as the batch size increases. In (b), the average number of missing updates per input increases as the batch size increases. In (c), the performance of TGN corresponds to the extent of missing updates, where a high incidence of missing updates indicates a significant performance decrease.

In Figure 1, we observe that an increase in batch size correlates with a higher incidence of missing updates. Furthermore, the occurrence of missing updates varies across different datasets. The findings from Figure 1(c) suggest a negative correlation between the occurrence of missing updates in a dataset and the performance of a model trained on it. Consequently, achieving optimal performance in memory-based models necessitates a smaller batch size. This result indicates a trade-off with respect to the batch size in memory-based models, as a large batch size is required to attain high throughput in the streaming scenario.

4 Related work

Handling missing updates

The t-Batch algorithm (Kumar et al., 2019) was initially intended to improve the running-time performance of memory-based deep networks for dynamic graphs that process updates one after the other (i.e., batch size equal to 1). The logic motivating t-Batch is that these networks can combine multiple updates into a single batch and apply them in parallel if they do not contain the same nodes, where the batches are temporally sorted. Using t-Batch, JODIE's memory-based model becomes 9.2× faster than similar methods without suffering from missing updates (Kumar et al., 2019). The t-Batch algorithm, however, suffers from two main flaws. First, large batch sizes for t-Batch are often impossible since temporal locality is a common characteristic of dynamic graphs (Poursafaei et al., 2022). In addition, many modern deep learning networks for dynamic graphs, such as TGN, depend on the neighborhood of the nodes to give an appropriate prediction. This forces t-Batch to form complicated neighborhood-independent batches instead of node-independent batches, and the former are significantly smaller.
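
The node-independent batching idea can be sketched as follows. This is a simplified illustration in the spirit of t-Batch, not the reference implementation from Kumar et al. (2019): each edge is placed in the earliest batch that comes after every batch already containing one of its endpoints. The function name and the single shared node-id space are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[int, int]  # (src, dst), listed in temporal order


def node_independent_batches(edges: List[Edge]) -> List[List[Edge]]:
    """Greedy batching: no node appears twice in a batch, and for every node the
    batch order respects the temporal order of its interactions."""
    last_batch: Dict[int, int] = defaultdict(lambda: -1)
    batches: List[List[Edge]] = []
    for u, v in edges:
        idx = max(last_batch[u], last_batch[v]) + 1
        if idx == len(batches):
            batches.append([])
        batches[idx].append((u, v))
        last_batch[u] = last_batch[v] = idx
    return batches


edges = [(1, 2), (3, 4), (2, 3), (1, 2), (5, 6)]
print(node_independent_batches(edges))
# [[(1, 2), (3, 4), (5, 6)], [(2, 3)], [(1, 2)]]
```

The example shows why temporal locality limits this approach: nodes that interact repeatedly within a short window force new batches, so batches after the first one contain a single edge each.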

Efficient methods for streaming

According to Huang et al. (2023), EdgeBank is currently an order of magnitude faster than other well-known techniques for dynamic graphs. EdgeBank (Poursafaei et al., 2022) is a memorization algorithm that saves any seen update and predicts according to a simple decision rule that can be one of the following: whether the input was seen in the last few iterations or whether the input has already been seen a sufficient number of times. The algorithm’s simplicity allows it to perform extremely fast, even without batches, thus not suffering from missing updates. Nevertheless, EdgeBank was developed to serve as a baseline for testing and comparing other methods for dynamic graphs (Poursafaei et al., 2022), and, therefore, its performance lags significantly behind the state-of-the-art (Yu et al., 2023; Huang et al., 2023).
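
A memorization baseline of this kind fits in a few lines. The sketch below is written from the description above rather than from the EdgeBank code: the class name, the use of a time window to express "seen in the last few iterations", and the `min_count` threshold are assumptions for illustration.

```python
from typing import Dict, Tuple

Edge = Tuple[int, int]


class EdgeBankSketch:
    """Remember every seen edge and predict from that memory alone."""

    def __init__(self) -> None:
        self.last_seen: Dict[Edge, float] = {}
        self.count: Dict[Edge, int] = {}

    def update(self, edge: Edge, t: float) -> None:
        """Store the latest timestamp and the number of occurrences of `edge`."""
        self.last_seen[edge] = t
        self.count[edge] = self.count.get(edge, 0) + 1

    def predict_recent(self, edge: Edge, t: float, window: float) -> bool:
        """Positive if the edge was seen within the last `window` time units."""
        return edge in self.last_seen and t - self.last_seen[edge] <= window

    def predict_frequent(self, edge: Edge, min_count: int) -> bool:
        """Positive if the edge has been seen at least `min_count` times."""
        return self.count.get(edge, 0) >= min_count


bank = EdgeBankSketch()
bank.update((1, 2), 1.0)
bank.update((3, 4), 2.0)
bank.update((1, 2), 5.0)
print(bank.predict_recent((1, 2), t=6.0, window=2.0))  # True
print(bank.predict_recent((3, 4), t=6.0, window=2.0))  # False
print(bank.predict_frequent((1, 2), min_count=2))      # True
```

Both the update and the prediction are dictionary lookups, which is why such a method can process a stream one update at a time and never encounters missing updates.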

5 Proposed method

In this section, we describe our method to balance the batch size trade-off discussed in Section 3. The method decouples the TGN modules by letting each module use a different batch size. In general, the memory module will utilize smaller batch sizes for frequent updates, while the prediction module will employ larger batch sizes for efficiency.

Following that, we describe our proposed lightweight model for dynamic graph learning tasks. The model is a TGN with decoupled modules implemented using efficient functions. Specifically, we parameterize the EdgeBank (Poursafaei et al., 2022) model to allow it to learn. Then, we add extra parameters to consider single-node information in the prediction instead of solely relying on edge temporal information.

5.1 The decoupling strategy

We propose to decouple the core modules of TGN: the prediction and memory modules. The decoupled modules will operate on different data and different batch sizes. Given a batch containing updates to apply and inputs to predict, the model divides the batch into smaller consecutive batches termed memory batches. The memory module operates on the memory batches, and thus, it can perform memory updates more frequently. After processing a memory batch but before proceeding to the next one, the memory module extracts and temporarily saves the temporal neighborhood information. This information encompasses the neighborhood state relevant to the nodes in the subsequent memory batch, preventing it from being overwritten. The neighborhood state is defined by:

$$S_{\mathcal{N}_{i}^{k}(t)}(t)=\{s_{j}(t)\mid v_{j}\in\mathcal{N}_{i}^{k}(t)\} \tag{9}$$

This creates a view of the model’s memory for each time a memory batch starts. After processing all memory batches, the prediction module reads the extracted states for each node associated with a given input from the view before that input’s timestamp. Subsequently, the prediction module simultaneously computes predictions for all inputs within the complete batch.
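
The procedure described above can be sketched as follows, under simplifying assumptions that are ours and not the paper's: the memory is reduced to an adjacency map, each edge in the batch serves first as an input and then as an update, and the function only returns what the prediction module would read instead of computing a prediction. The name `decoupled_views` is hypothetical.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int]  # temporally sorted; each edge is an input, then an update


def decoupled_views(batch: List[Edge], memory_batch_size: int,
                    adjacency: Dict[int, Set[int]]) -> List[Tuple[Set[int], Set[int]]]:
    """For every input (u, v), return the neighbor sets of u and v taken from the
    view saved when the input's memory batch started. `adjacency` is updated in place."""
    views: List[Dict[int, Set[int]]] = []
    for start in range(0, len(batch), memory_batch_size):
        chunk = batch[start:start + memory_batch_size]
        # Save the state of the nodes in this memory batch before it is overwritten.
        nodes = {n for edge in chunk for n in edge}
        views.append({n: set(adjacency.get(n, set())) for n in nodes})
        # Memory module: apply the updates of this memory batch.
        for u, v in chunk:
            adjacency.setdefault(u, set()).add(v)
            adjacency.setdefault(v, set()).add(u)
    # Prediction module: a single pass over the complete batch, reading the views.
    return [(views[k // memory_batch_size][u], views[k // memory_batch_size][v])
            for k, (u, v) in enumerate(batch)]


batch = [(1, 2), (2, 3), (4, 5), (3, 6)]
coupled = decoupled_views(batch, memory_batch_size=4, adjacency={})
decoupled = decoupled_views(batch, memory_batch_size=2, adjacency={})
print(coupled[3])    # (set(), set()): the input (3, 6) sees no earlier update
print(decoupled[3])  # ({2}, set()): the earlier edge (2, 3) is now visible
```

Setting `memory_batch_size` equal to the full batch size reproduces the standard flow, while smaller values expose updates from earlier memory batches to later inputs without shrinking the batch used for prediction.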

Figure 2 demonstrates the effectiveness of a decoupled model compared to a standard memory-based model. The edge at $t_{4}$ in Figure 2 is given as an input for the models. A standard memory-based model computes the embeddings of $v_{3}$ and $v_{6}$ based on their neighborhood states before $t_{1}$ and only then updates its inner memory with the edges at $t_{1}$, $t_{2}$, and $t_{3}$. On the other hand, a decoupled model initially performs memory updates of the two memory batches. Then, the prediction module uses the states extracted before $t_{3}$, which include the updates in the first memory batch. In Figure 2, the missing update that affects the interaction between $v_{3}$ and $v_{6}$ is avoided by using the decoupling strategy since the prediction module is aware of the interaction between $v_{2}$ and $v_{3}$.

Decoupling the modules of TGNs offers two immediate benefits. First, by decoupling the memory module from the prediction module and setting the memory batch size to 1, we completely solve the missing updates problem. Second, we can accelerate the execution time of an existing model without compromising its accuracy. This can be achieved by decoupling its modules, setting the memory batch size to match the model's original batch size, and substantially increasing the new batch size of the prediction module. Using the original batch size for the memory batches ensures the same frequency of missing updates, and the new larger batch size improves the running time. Figure 3 details the running time improvement of decoupled TGN with a memory batch size of 50 when using growing batch sizes. Notably, the decoupling strategy enhances TGN's running time by 12.5% without compromising its performance, as the frequency of missing updates depends only on the memory batch size. Furthermore, actively transferring additional computations from the memory module to the prediction module will lead to an additional improvement in running time. Further analysis of the potential speedup of the decoupling strategy is provided in Appendix E.

Figure 2: Illustration of a dynamic graph at $t_{4}$ for the task of predicting the edge $(v_{3},v_{6})$. The state of a memory-based model is compared to the state of a model operating using the proposed decoupling strategy. The memory-based model was updated prior to $t_{1}$ and, therefore, does not contain $(v_{1},v_{2})$, $(v_{2},v_{3})$, and $(v_{4},v_{5})$. The model that follows the decoupling strategy and applies inner batch updates was previously updated at $t_{2}$ and, therefore, closely resembles the ground truth and is missing only $(v_{4},v_{5})$.
Figure 3: Comparison of running times for decoupled TGN with a constant memory batch size of 50 and varying batch sizes on the test set of the Wikipedia dataset. The running times are normalized by the baseline scenario where both the memory batch size and the batch size are set to 50.

5.2 Lightweight Decoupled Temporal Graph Network



Figure 4: Framework of the proposed model. The batch of updates and inputs is first divided into memory batches and a single batch of inputs. Then, the new edges and their timestamps are saved in the memory. In LDTGN-mem, the state of each node in the memory batch is updated using the $\mathrm{msg}$, $\mathrm{agg}$, and $\mathrm{mem}$ functions. Before each update, the relevant information is saved in a memory view to prevent it from being overwritten. Next, the information of each input node is extracted from the appropriate memory view. Then, $\mathrm{TDE}$ is applied to the time differences between the inputs and the extracted timestamps. Neighborhood information is aggregated using learnable attention weights to create a single encoding for each node. Finally, the node encodings and the edge encoding are merged using the $\mathrm{merge}$ function, and the combined encoding is used to get the final prediction.

We propose the Lightweight Decoupled Temporal Graph Network (LDTGN), an efficient model designed for dynamic graph learning tasks. LDTGN operates with high throughput, crucial for the streaming setting, while also achieving superior performance in dynamic graph learning tasks.

In this subsection, we develop LDTGN step by step by enhancing EdgeBank and incorporating the decoupling strategy. Although EdgeBank's performance falls short of the current state of the art, it achieves good results with exceptionally high throughput, making it a suitable foundation for our model. We first identify the deficiencies in EdgeBank that must be addressed to reach state-of-the-art performance, and then integrate improvements into EdgeBank that resolve them, thereby constructing the LDTGN model. For simplicity of presentation, we describe LDTGN for future edge prediction tasks and assume only edge addition updates. Details about applying node addition, node removal, and edge removal updates, as well as adjustments for the node classification task, are provided in Appendix C.

The EdgeBank model can be formulated as a memory-based algorithm as presented by Poursafaei et al. (2022) where an edge that did not get an update in the past $T$ updates is considered negative. We can also describe this memory-based prediction rule as a linear function that maps a time-based difference into a prediction. Equation 10 details the linear function of EdgeBank with a decision function that considers any edge $e_{i,j}$ that was updated in the last $T$ updates as positive.

$$p_{i,j}(t) = -(t - t_{i,j}) + T \tag{10}$$

In Equation 10, $t_{i,j}$ is the last time the edge $e_{i,j}$ received an update, and $t$ is the current time. $t_{i,j}$ is set to $0$ if $e_{i,j}$ has not been received yet. Poursafaei et al. (2022) suggested using a constant value of 1000 for $T$. The equation should be parameterized to allow the model to learn the most appropriate value of $T$ for every dataset. To do this, we added a bias $b$ and a coefficient $w$ as detailed in Equation 11.

$$p_{i,j}(t) = (t - t_{i,j})\,w + b \tag{11}$$
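As a minimal sketch of Equations 10 and 11, the EdgeBank rule can be written as a linear score that the parameterized form recovers when $w = -1$ and $b = T$. The function names and the convention that a strictly positive score means a positive edge are illustrative assumptions, not taken from the released code.

```python
def edgebank_score(t, t_ij, T=1000):
    """Equation 10: linear score of the time since edge e_ij was last updated."""
    return -(t - t_ij) + T


def parameterized_score(t, t_ij, w, b):
    """Equation 11: the same form with a learnable coefficient w and bias b."""
    return (t - t_ij) * w + b


def is_positive(score):
    # Assumed decision function: predict a positive edge when the score is > 0.
    return score > 0


recent = edgebank_score(t=1500, t_ij=1200)   # edge updated 300 steps ago
stale = edgebank_score(t=3000, t_ij=1200)    # edge updated 1800 steps ago

# With w = -1 and b = T, Equation 11 reduces to Equation 10.
same_as_recent = parameterized_score(t=1500, t_ij=1200, w=-1, b=1000)
```

Here `recent` is positive and `stale` is negative, so only the recently updated edge would be predicted. Learning `w` and `b` lets the threshold move per dataset instead of being fixed at 1000.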

Using Equation 11, we can learn the suitable threshold for each dataset. As in EdgeBank, this function does not incorporate the nodes themselves into the prediction. This can easily be solved by adding the time differences of each node in the potential edge and appropriate coefficients as in Equation 12.

$$p_{i,j}(t) = (t - t_{i,j})\,w_1 + (t - t_i)\,w_2 + (t - t_j)\,w_3 + b \tag{12}$$

In Equation 12, $t_i$ and $t_j$ are the last times the nodes $v_i$ and $v_j$ received an update, respectively. Equation 12 is missing topological and data-specific information such as node and edge features. Moreover, the prediction function is linear, which often causes the learned function to be distant from the ground truth prediction function. To solve these issues, we first create embeddings for the nodes in the potential edge and a preliminary embedding for the edge itself as in Equations 13 and 14:

$$z_i(t) = \sum_{k \in \mathcal{N}_i^1(t)} \alpha_k \left[ v_i(t) \,\|\, v_k(t) \,\|\, F_{\mathcal{E}}(e_{i,k}) \right] \tag{13}$$
$$z_{i,j}(t) = \mathrm{TDE}(t - t_{i,j}) \tag{14}$$

where $v_i(t) = [F_{\mathcal{V}}(v_i) \,\|\, \mathrm{TDE}(t - t_i)]$ and $\alpha_k$ is an attention weight computed as in GAT (Veličković et al., 2018). $\mathrm{TDE}$ is a non-linear time difference embedding function such as Time2Vec (Kazemi et al., 2019). Equation 15 uses the embeddings of the nodes, the edge time difference embedding, and a non-linear $\mathrm{merge}$ function to give the final prediction.

$$p_{i,j}(t) = \mathrm{merge}\left(z_i(t), z_j(t), z_{i,j}(t)\right) \tag{15}$$
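The following sketch traces Equations 13 to 15 on toy values. Several parts are simplifying assumptions made only for illustration: `tde` is a Time2Vec-style embedding with fixed rather than learned frequencies, `attn` is a hypothetical scoring vector standing in for the GAT attention parameters, and `merge` is a single linear layer followed by a sigmoid, since the exact form of the merge function is not specified in this section.

```python
import math


def tde(dt, freqs=(1.0, 0.1)):
    # Time2Vec-style embedding: a linear term followed by periodic terms.
    # The frequencies are fixed constants here; a trained model would learn them.
    return [dt] + [math.sin(f * dt) for f in freqs]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def node_encoding(v_i, neighbors, attn):
    # Equation 13: z_i = sum_k alpha_k [v_i || v_k || F_E(e_ik)].
    rows = [v_i + v_k + edge_feat for v_k, edge_feat in neighbors]
    scores = [leaky_relu(sum(a * x for a, x in zip(attn, row))) for row in rows]
    alphas = softmax(scores)
    width = len(rows[0])
    return [sum(al * row[d] for al, row in zip(alphas, rows)) for d in range(width)]


def merge(z_i, z_j, z_ij, weights, bias):
    # Hypothetical stand-in for the merge of Equation 15:
    # one linear layer over the concatenation, then a sigmoid.
    x = z_i + z_j + z_ij
    logit = sum(w * v for w, v in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


t = 40.0
v_i = [1.0] + tde(t - 20.0)   # one node feature, last updated at time 20
v_j = [0.0] + tde(t - 10.0)
v_k = [0.5] + tde(t - 30.0)   # a shared neighbor
attn = [0.1] * 9              # scoring vector over [v_i || v_k || edge feature]

z_i = node_encoding(v_i, [(v_k, [1.0])], attn)
z_j = node_encoding(v_j, [(v_k, [1.0]), (v_i, [0.0])], attn)
z_ij = tde(t - 0.0)           # Equation 14 for an edge not seen yet (t_ij = 0)
p = merge(z_i, z_j, z_ij, weights=[0.01] * 21, bias=0.0)
```

With a single neighbor the attention weight is 1, so `z_i` is simply the concatenated row. With several neighbors the weights sum to 1 and the encoding keeps a fixed width regardless of the neighborhood size.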

Equations 13, 14 and 15 constitute the prediction module of LDTGN. The prediction module only requires $t_i$, $t_j$ and $t_{i,j}$ from the memory module. Hence, these timestamps are the sole data of the memory module. In the experiments, we implemented LDTGN with a memory batch size of 1, thus eliminating the adverse effects associated with missing updates. This design choice not only mitigates these negative impacts but also obviates the need for the message aggregator required in traditional TGNs. LDTGN operates with a minimal memory batch size and with a high throughput thanks to the removal of $\mathrm{msg}_s$, $\mathrm{msg}_d$ and $\mathrm{mem}$ from the memory module. Standard TGNs save states only for the nodes, but LDTGN also saves states for the edges. This does not add memory overhead to LDTGN compared with other TGNs, since TGNs save the full graph to incorporate topological information in the prediction.
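A memory module that holds only timestamps can be sketched as two dictionaries. The class and method names below are hypothetical, and the edge key is assumed to be the ordered pair of source and destination. Unseen nodes and edges default to time 0, following the convention stated for Equation 10.

```python
class TimestampMemory:
    """Stores only the last update time of each node and each edge."""

    def __init__(self):
        self.node_time = {}   # node id -> last update time t_i
        self.edge_time = {}   # (source, destination) -> last update time t_ij

    def update(self, src, dst, t):
        # Memory batch size of 1: every edge is written as soon as it arrives.
        self.node_time[src] = t
        self.node_time[dst] = t
        self.edge_time[(src, dst)] = t

    def time_differences(self, src, dst, t):
        # Returns (t - t_ij, t - t_i, t - t_j); unseen entries default to time 0.
        return (
            t - self.edge_time.get((src, dst), 0),
            t - self.node_time.get(src, 0),
            t - self.node_time.get(dst, 0),
        )


memory = TimestampMemory()
stream = [(1, 2, 10), (2, 3, 20), (4, 5, 30)]   # (source, destination, time)
for src, dst, t in stream:
    memory.update(src, dst, t)

seen_edge = memory.time_differences(1, 2, 40)
new_edge = memory.time_differences(3, 6, 40)
```

Because each update is a constant-time dictionary write with no learned function involved, writing edge by edge is cheap, which is what makes a memory batch size of 1 practical in this design.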

In scenarios where a lower throughput is acceptable and the negative effects of missing updates are negligible for a small memory batch size, LDTGN can incorporate long-term dependencies. We refer to this variant of LDTGN as LDTGN-mem. To achieve long-term dependencies, LDTGN-mem is implemented with a heavier memory module. This memory module generates the following messages:

$$m_i(t) = \left[ s_i(t^-) \,\|\, s_j(t^-) \,\|\, \mathrm{TDE}(t - t_i) \right] \tag{16}$$
$$m_j(t) = \left[ s_j(t^-) \,\|\, s_i(t^-) \,\|\, \mathrm{TDE}(t - t_j) \right] \tag{17}$$

The aggregation function for the messages takes only the most recent message per node, and the $\mathrm{mem}$ function is set to be a $\mathrm{GRU}$ cell:

$$s_i(t) = \mathrm{GRU}\left(\overline{m}_i(t), s_i(t^-)\right) \tag{18}$$
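The last-message aggregation and the state update of Equation 18 can be sketched as follows. To keep the example self-contained, states and messages are scalars and `toy_mem` is a simple moving average standing in for the GRU cell. Both are illustrative assumptions, as are the function names.

```python
def last_message_per_node(messages):
    """agg: keep only the most recent message of each node in a memory batch.

    messages is a list of (node, time, payload) tuples.
    """
    latest = {}
    for node, t, payload in messages:
        if node not in latest or t >= latest[node][0]:
            latest[node] = (t, payload)
    return {node: payload for node, (_t, payload) in latest.items()}


def apply_memory_batch(state, messages, mem):
    """Update each touched node once, starting from its state before the batch."""
    aggregated = last_message_per_node(messages)
    new_state = dict(state)
    for node, payload in aggregated.items():
        new_state[node] = mem(payload, state.get(node, 0.0))
    return new_state


def toy_mem(message, previous_state):
    # Stand-in for the GRU cell of Equation 18 (not a real GRU).
    return 0.5 * previous_state + 0.5 * message


state = {"a": 0.0, "b": 0.0}
batch = [("a", 1, 2.0), ("b", 1, 4.0), ("a", 2, 6.0)]
state = apply_memory_batch(state, batch, toy_mem)
```

Node `a` receives two messages in the batch but only the later one reaches the memory function, which shows why larger memory batches discard more intermediate information.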

To incorporate the long-term memory in the prediction module, LDTGN-mem adds the current learned state to the data of each node:

$$v_i(t) = \left[ F_{\mathcal{V}}(v_i) \,\|\, \mathrm{TDE}(t - t_i) \,\|\, s_i(t^-) \right] \tag{19}$$

In contrast to LDTGN, LDTGN-mem has to operate with a memory batch size larger than 1 to ensure a reasonable throughput. We chose to implement LDTGN-mem with a memory batch size of 50, because we observed earlier in Figure 1 that the incidence of missing updates with a batch size of 50 is not severe. In addition, LDTGN-mem operates with an acceptable throughput when using this memory batch size. The adjustments required for LDTGN-mem are shown in the illustration of our model in Figure 4.
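The effect of the memory batch size can be illustrated with a short sketch. It assumes that memory is written once per memory batch, so an input only sees the memory batches that fully precede it. The helper names are hypothetical, and `pending_events` counts every earlier event in the same memory batch, which is an upper bound rather than the exact definition of missing updates used in the analysis.

```python
def split_into_memory_batches(batch, memory_batch_size):
    """Divide one prediction batch into consecutive memory batches."""
    return [
        batch[i:i + memory_batch_size]
        for i in range(0, len(batch), memory_batch_size)
    ]


def pending_events(position, memory_batch_size):
    """Earlier events in the same memory batch that are not yet written
    to memory when the event at `position` is predicted."""
    return position % memory_batch_size


batch = list(range(200))                         # 200 events in arrival order
chunks = split_into_memory_batches(batch, 50)    # four memory batches of 50
worst_case_coupled = pending_events(199, 200)    # memory batch equals the batch
worst_case_decoupled = pending_events(199, 50)   # memory batch size of 50
no_pending = pending_events(199, 1)              # memory batch size of 1
```

In this toy setting the last event of a 200-event batch has 199 unwritten predecessors without decoupling, 49 with a memory batch size of 50, and none with a memory batch size of 1.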

6 Experiments

This section describes the experiments we used to evaluate the performance of our model. All the experiments were performed using DyGLib (Yu et al., 2023), the unified library for dynamic graph learning evaluation. DyGLib contains various real-world datasets, including large-scale dynamic graphs with millions of edges. The experiments are for future edge prediction with random negative edge sampling on the following datasets: Wikipedia, Reddit, MOOC, lastFM, Enron, Social Evo., UCI, Flights, Can. Parl., US Legis., UN Trade, UN Vote, and Contacts, which were collected by Poursafaei et al. (2022). Additional information and statistics regarding the datasets can be found in Appendix A. We used seven well-known methods as baselines for the task of future edge prediction: DyRep (Trivedi et al., 2019), TGAT (Xu et al., 2020), TGN (Rossi et al., 2020), CAWN (Wang et al., 2021b), EdgeBank (Poursafaei et al., 2022), GraphMixer (Cong et al., 2023) and DyGFormer (Yu et al., 2023). Additional information regarding the baselines can be found in Appendix B. We adopted the approach used in previous works and split each dataset into training, validation, and test sets by performing a chronological split of 70%–15%–15%. We report the mean and standard deviation of the Average Precision (AP) on the test set. Results for the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) are detailed in Appendix D.
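A chronological 70%–15%–15% split can be sketched as below. This version splits by edge index after sorting by timestamp, which is an assumption for illustration. The library may instead cut at timestamp quantiles, which differs when many edges share a timestamp.

```python
def chronological_split(edges, train_pct=70, val_pct=15):
    """Split (source, destination, time) edges by time into train/val/test."""
    ordered = sorted(edges, key=lambda e: e[2])
    n = len(ordered)
    # Integer arithmetic avoids floating-point rounding at the boundaries.
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Toy stream of 100 edges given in reverse time order.
edges = [(i % 7, (i * 3) % 11, t) for i, t in enumerate(range(100, 0, -1))]
train, val, test = chronological_split(edges)
```

Every training edge precedes every validation edge, and every validation edge precedes every test edge, so no future information leaks into training.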

6.1 Future edge prediction

In the first experiment, we tested transductive future edge prediction with random negative edge sampling, i.e., for each positive edge in the datasets, a negative edge with the same source and a random destination is sampled. The results are presented in Table 1. We also performed an experiment for the inductive future edge prediction setting, in which all the edges in the validation and test sets must contain nodes that have not been previously seen in the training set. The results for this experiment are reported in Table 2. The baselines' results were computed with DyGLib using the hyperparameter configurations described by Yu et al. (2023). Additional implementation-specific details of LDTGN and LDTGN-mem and their training methodology are given in Appendix C. LDTGN achieves state-of-the-art or comparable results relative to the baselines in both the transductive and inductive future edge prediction settings. In benchmarks where the negative effects of missing updates are insignificant for small batch sizes, LDTGN and LDTGN-mem achieve performance comparable to DyGFormer. In the benchmarks where missing updates have substantial influence, such as US Legis., LDTGN considerably outperforms the compared baselines since it completely removes all the missing updates when using a memory batch size of 1.
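Random negative edge sampling as described above can be sketched in a few lines. The function name and the fixed seed are illustrative assumptions, and this simplified version does not check whether a sampled destination happens to coincide with a true edge.

```python
import random


def sample_negatives(positive_edges, num_nodes, seed=0):
    """For each positive (source, destination, time) edge, draw a negative edge
    with the same source and time and a uniformly random destination."""
    rng = random.Random(seed)
    negatives = []
    for src, _dst, t in positive_edges:
        negatives.append((src, rng.randrange(num_nodes), t))
    return negatives


positives = [(0, 1, 10), (2, 3, 20), (0, 4, 30)]
negatives = sample_negatives(positives, num_nodes=5)
```

Each negative shares its source and timestamp with a positive edge, so the model must distinguish true from false destinations at the same moment in time.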

Table 1: AP for transductive future edge prediction with random negative sampling over five runs. The significantly best result for each benchmark appears in bold font.
| Dataset | DyRep | TGAT | TGN | CAWN | EdgeBank | GraphMixer | DyGFormer | LDTGN (ours) | LDTGN-mem (ours) |
|---|---|---|---|---|---|---|---|---|---|
| Wikipedia | 94.86±0.06 | 96.94±0.06 | 98.45±0.06 | 98.76±0.03 | 90.37±0.00 | 97.25±0.03 | 99.03±0.02 | 98.86±0.02 | 98.99±0.03 |
| Reddit | 98.22±0.04 | 98.52±0.02 | 98.63±0.06 | 99.11±0.01 | 94.86±0.00 | 97.31±0.01 | 99.22±0.01 | 98.61±0.01 | 99.28±0.02 |
| MOOC | 81.97±0.49 | 85.84±0.15 | 89.15±1.60 | 80.15±0.25 | 57.97±0.00 | 82.78±0.15 | 87.52±0.49 | 83.34±1.47 | 91.73±0.65 |
| lastFM | 71.92±2.21 | 73.42±0.21 | 77.07±3.97 | 86.99±0.06 | 79.29±0.00 | 75.61±0.24 | 93.00±0.12 | 90.81±0.01 | 91.22±0.31 |
| Enron | 82.38±3.36 | 71.12±0.97 | 86.53±1.11 | 89.56±0.09 | 83.53±0.00 | 82.25±0.16 | 92.47±0.12 | 98.10±0.01 | 92.28±0.32 |
| Social Evo. | 88.87±0.30 | 93.16±0.17 | 93.57±0.17 | 84.96±0.09 | 74.95±0.00 | 93.37±0.07 | 94.73±0.01 | 95.45±0.51 | 94.02±0.16 |
| UCI | 65.14±2.30 | 79.63±0.70 | 92.34±1.04 | 95.18±0.06 | 76.20±0.00 | 93.25±0.57 | 95.79±0.17 | 97.05±0.01 | 95.75±0.04 |
| Flights | 95.29±0.72 | 94.03±0.18 | 97.95±0.14 | 98.51±0.01 | 89.35±0.00 | 90.99±0.05 | 98.91±0.01 | 97.50±0.07 | 98.76±0.06 |
| Can. Parl. | 66.54±2.76 | 70.73±0.72 | 70.88±2.34 | 69.82±2.34 | 64.55±0.00 | 77.04±0.46 | 97.36±0.45 | 99.47±0.03 | 72.82±9.17 |
| US Legis. | 75.34±0.39 | 68.52±3.16 | 75.99±0.58 | 70.58±0.48 | 58.39±0.00 | 70.74±1.02 | 71.11±0.59 | 92.08±0.09 | 80.93±0.48 |
| UN Trade | 63.21±0.93 | 61.47±0.18 | 65.03±1.37 | 65.39±0.12 | 60.41±0.00 | 62.61±0.27 | 66.46±1.29 | 97.82±0.07 | 96.65±0.19 |
| UN Vote | 62.81±0.80 | 52.21±0.98 | 65.72±2.17 | 52.84±0.10 | 58.49±0.00 | 52.11±0.16 | 55.55±0.42 | 80.94±1.43 | 71.21±1.14 |
| Contacts | 95.98±0.15 | 96.28±0.09 | 96.89±0.56 | 90.26±0.28 | 92.58±0.00 | 91.92±0.03 | 98.29±0.01 | 98.19±0.03 | 98.78±0.04 |
Table 2: AP for inductive future edge prediction with random negative sampling over five different runs. The significantly best result for each benchmark appears in bold font.
| Dataset | DyRep | TGAT | TGN | CAWN | GraphMixer | DyGFormer | LDTGN (ours) | LDTGN-mem (ours) |
|---|---|---|---|---|---|---|---|---|
| Wikipedia | 92.43±0.37 | 96.22±0.07 | 97.83±0.04 | 98.24±0.03 | 96.65±0.02 | 98.59±0.03 | 98.74±0.02 | 98.40±0.04 |
| Reddit | 96.09±0.11 | 97.09±0.04 | 97.50±0.07 | 98.62±0.01 | 95.26±0.02 | 98.84±0.02 | 98.00±0.04 | 98.86±0.02 |
| MOOC | 81.07±0.44 | 85.50±0.19 | 89.04±1.17 | 81.42±0.24 | 81.41±0.21 | 86.96±0.43 | 82.73±1.52 | 90.61±0.32 |
| LastFM | 83.02±1.48 | 78.63±0.31 | 81.45±4.29 | 89.42±0.07 | 82.11±0.42 | 94.23±0.09 | 92.17±0.01 | 92.62±0.59 |
| Enron | 74.55±3.95 | 67.05±1.51 | 77.94±1.02 | 86.35±0.51 | 75.88±0.48 | 89.76±0.34 | 96.06±0.09 | 88.07±0.56 |
| Social Evo. | 90.04±0.47 | 91.41±0.16 | 90.77±0.86 | 79.94±0.18 | 91.86±0.06 | 93.14±0.04 | 94.37±0.68 | 91.31±0.22 |
| UCI | 57.48±1.87 | 79.54±0.48 | 88.12±2.05 | 92.73±0.06 | 91.19±0.42 | 94.54±0.12 | 94.92±0.01 | 93.00±0.12 |
| Flights | 92.88±0.73 | 88.73±0.33 | 95.03±0.60 | 97.06±0.02 | 83.03±0.05 | 97.79±0.02 | 95.60±0.10 | 97.31±0.16 |
| Can. Parl. | 54.02±0.76 | 55.18±0.79 | 54.10±0.93 | 55.80±0.69 | 55.91±0.82 | 87.74±0.71 | 97.83±0.06 | 58.05±3.08 |
| US Legis. | 57.28±0.71 | 51.00±3.11 | 58.63±0.37 | 53.17±1.20 | 50.71±0.76 | 54.28±2.87 | 83.76±0.44 | 65.75±1.57 |
| UN Trade | 57.02±0.69 | 61.03±0.18 | 58.31±3.15 | 65.24±0.21 | 62.17±0.31 | 64.55±0.62 | 97.43±0.07 | 89.21±1.15 |
| UN Vote | 54.62±2.22 | 52.24±1.46 | 58.85±2.51 | 49.94±0.45 | 50.68±0.44 | 55.93±0.39 | 81.29±1.41 | 63.54±2.09 |
| Contacts | 92.18±0.41 | 95.87±0.11 | 93.82±0.99 | 89.55±0.30 | 90.59±0.05 | 98.03±0.02 | 97.85±0.03 | 97.94±0.13 |

6.2 Memory and running time performance

We calculated the average number of learnable parameters required for each model to achieve its best performance and reported it in Figure 5. We also measured the average throughput at inference time for each model over all the datasets, where the throughput is defined as the number of edges the model can process in a single second. The results are shown in Figure 6. In both Figure 5 and Figure 6, LDTGN surpasses the other baselines by a large margin in terms of efficiency. Note that the throughput of the baselines was measured using a batch size that is at least the batch size used for LDTGN; hence, the results in Figure 6 are also proportionate to the latency of LDTGN compared to the other baselines.
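Throughput as defined above, edges processed per second, can be measured with a simple timing loop. This is a sketch under stated assumptions: `process_batch` is a hypothetical callable standing in for a model's inference step, and wall-clock timing with `time.perf_counter` is one reasonable choice rather than the measurement procedure used for the reported figures.

```python
import time


def measure_throughput(process_batch, edges, batch_size):
    """Return the number of edges processed per second."""
    start = time.perf_counter()
    for i in range(0, len(edges), batch_size):
        process_batch(edges[i:i + batch_size])
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse clocks.
    return len(edges) / elapsed if elapsed > 0 else float("inf")


processed = []
edges = [(i, i + 1, i) for i in range(1000)]
throughput = measure_throughput(processed.extend, edges, batch_size=50)
```

Comparing models at different batch sizes with such a measurement confounds throughput with latency, which is why the note about baseline batch sizes matters when reading the results.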

Figure 5: Average number of learnable parameters used by the baselines and our model. The black ranges indicate the standard deviation of the average number of learnable parameters.
Figure 6: Average throughput (processed edges per second) of the baselines and our model. The black ranges indicate the standard deviation of the average throughput.

7 Conclusion

In this work, we introduced the missing updates phenomenon caused by using batches in memory-based models for dynamic graph learning. We showed a strict negative connection between the frequency of missing updates in datasets and the performance of the memory-based models, causing a trade-off with respect to their batch size. To balance this trade-off, we presented the decoupling strategy for designing temporal graph networks. Decoupling enables two types of batches – one for the memory module and the other for the prediction module. In this way, temporal graph networks can increase the frequency of the updates while still handling their arrival streams. In addition, we introduced LDTGN – a lightweight model for the task of future edge prediction that is highly efficient in terms of time and memory. LDTGN can be equipped with a heavier memory module when possible, allowing it to better capture long-term dependencies. We also showed by extensive experiments that LDTGN has outstanding performance for both transductive and inductive tasks, achieving state-of-the-art or comparable performance on most of the tested benchmarks.

References

  • Backstrom & Leskovec (2011) Backstrom, L. and Leskovec, J. Supervised random walks: predicting and recommending links in social networks. In Proceedings of the fourth ACM international conference on Web search and data mining, pp.  635–644, 2011.
  • Cho et al. (2014) Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using rnn encoder-decoder for statistical machine translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pp.  1724–1734, 2014.
  • Cini et al. (2023) Cini, A., Marisca, I., Bianchi, F. M., and Alippi, C. Scalable spatiotemporal graph neural networks. In Proceedings of the AAAI conference on artificial intelligence, volume 37(6), pp.  7218–7226, 2023.
  • Cong et al. (2023) Cong, W., Zhang, S., Kang, J., Yuan, B., Wu, H., Zhou, X., Tong, H., and Mahdavi, M. Do we really need complicated model architectures for temporal networks? The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023.
  • Ding et al. (2019) Ding, L., Han, B., Wang, S., Li, X., and Song, B. User-centered recommendation using us-elm based on dynamic graph model in e-commerce. International Journal of Machine Learning and Cybernetics, 10:693–703, 2019.
  • Fowler (2006) Fowler, J. H. Legislative cosponsorship networks in the us house and senate. Social networks, 28(4):454–465, 2006.
  • Gorochowski et al. (2018) Gorochowski, T. E., Grierson, C. S., and Di Bernardo, M. Organization of feed-forward loop motifs reveals architectural principles in natural and engineered networks. Science advances, 4(3):eaap9751, 2018.
  • Granovetter (1973) Granovetter, M. S. The strength of weak ties. American journal of sociology, 78(6):1360–1380, 1973.
  • Haghani & Keyvanpour (2019) Haghani, S. and Keyvanpour, M. R. A systemic analysis of link prediction in social network. Artificial Intelligence Review, 52:1961–1995, 2019.
  • Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Huang et al. (2020) Huang, S., Hitti, Y., Rabusseau, G., and Rabbany, R. Laplacian change point detection for dynamic graphs. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp.  349–358, 2020.
  • Huang et al. (2023) Huang, S., Poursafaei, F., Danovitch, J., Fey, M., Hu, W., Rossi, E., Leskovec, J., Bronstein, M., Rabusseau, G., and Rabbany, R. Temporal graph benchmark for machine learning on temporal graphs. Advances in Neural Information Processing Systems, 36, 2023.
  • Kazemi et al. (2019) Kazemi, S. M., Goel, R., Eghbali, S., Ramanan, J., Sahota, J., Thakur, S., Wu, S., Smyth, C., Poupart, P., and Brubaker, M. Time2vec: Learning a vector representation of time. CoRR, abs/1907.05321, 2019.
  • Kazemi et al. (2020) Kazemi, S. M., Goel, R., Jain, K., Kobyzev, I., Sethi, A., Forsyth, P., and Poupart, P. Representation learning for dynamic graphs: A survey. The Journal of Machine Learning Research, 21(1):2648–2720, 2020.
  • Kumar et al. (2019) Kumar, S., Zhang, X., and Leskovec, J. Predicting dynamic embedding trajectory in temporal interaction networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp.  1269–1278, 2019.
  • Liben-Nowell & Kleinberg (2003) Liben-Nowell, D. and Kleinberg, J. The link prediction problem for social networks. In Proceedings of the twelfth international conference on Information and knowledge management, pp.  556–559, 2003.
  • Ma et al. (2020) Ma, Y., Guo, Z., Ren, Z., Tang, J., and Yin, D. Streaming graph neural networks. In Proceedings of the 43rd international ACM SIGIR conference on research and development in information retrieval, pp.  719–728, 2020.
  • MacDonald et al. (2015) MacDonald, G. K., Brauman, K. A., Sun, S., Carlson, K. M., Cassidy, E. S., Gerber, J. S., and West, P. C. Rethinking agricultural trade relationships in an era of globalization. BioScience, 65(3):275–289, 2015.
  • Madan et al. (2011) Madan, A., Cebrian, M., Moturu, S., Farrahi, K., et al. Sensing the "health state" of a community. IEEE Pervasive Computing, 11(4):36–45, 2011.
  • Mangan & Alon (2003) Mangan, S. and Alon, U. Structure and function of the feed-forward loop network motif. Proceedings of the National Academy of Sciences, 100(21):11980–11985, 2003.
  • Nair & Hinton (2010) Nair, V. and Hinton, G. E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pp.  807–814, 2010.
  • Panzarasa et al. (2009) Panzarasa, P., Opsahl, T., and Carley, K. M. Patterns and dynamics of users’ behavior and interaction: Network analysis of an online community. Journal of the American Society for Information Science and Technology, 60(5):911–932, 2009.
  • Pareja et al. (2020) Pareja, A., Domeniconi, G., Chen, J., Ma, T., Suzumura, T., Kanezashi, H., Kaler, T., Schardl, T., and Leiserson, C. Evolvegcn: Evolving graph convolutional networks for dynamic graphs. In Proceedings of the AAAI conference on artificial intelligence, volume 34(04), pp.  5363–5370, 2020.
  • Pennebaker et al. (2001) Pennebaker, J. W., Francis, M. E., and Booth, R. J. Linguistic inquiry and word count: Liwc 2001. Mahway: Lawrence Erlbaum Associates, 71(2001):2001, 2001.
  • Poursafaei et al. (2022) Poursafaei, F., Huang, S., Pelrine, K., and Rabbany, R. Towards better evaluation for dynamic link prediction. Advances in Neural Information Processing Systems, 35:32928–32941, 2022.
  • Rossi et al. (2020) Rossi, E., Chamberlain, B., Frasca, F., Eynard, D., Monti, F., and Bronstein, M. Temporal graph networks for deep learning on dynamic graphs. CoRR, abs/2006.10637, 2020.
  • Sankar et al. (2020) Sankar, A., Wu, Y., Gou, L., Zhang, W., and Yang, H. Dysat: Deep neural representation learning on dynamic graphs via self-attention networks. In Proceedings of the 13th international conference on web search and data mining, pp.  519–527, 2020.
  • Sapiezynski et al. (2019) Sapiezynski, P., Stopczynski, A., Lassen, D. D., and Lehmann, S. Interaction data from the copenhagen networks study. Scientific Data, 6(1):315, 2019.
  • Shetty & Adibi (2004) Shetty, J. and Adibi, J. The enron email dataset database schema and brief statistical report. Information sciences institute technical report, University of Southern California, 4(1):120–128, 2004.
  • Simmel (1950) Simmel, G. The sociology of georg simmel, volume 92892. Simon and Schuster, 1950.
  • Strohmeier et al. (2021) Strohmeier, M., Olive, X., Lübbe, J., Schäfer, M., and Lenders, V. Crowdsourced air traffic data from the opensky network 2019–2020. Earth System Science Data, 13(2):357–366, 2021.
  • Toivonen et al. (2007) Toivonen, R., Kumpula, J. M., Saramäki, J., Onnela, J.-P., Kertész, J., and Kaski, K. The role of edge weights in social networks: modelling structure and dynamics. In Noise and Stochastics in Complex Systems and Finance, volume 6601, pp.  48–55. SPIE, 2007.
  • Trivedi et al. (2019) Trivedi, R., Farajtabar, M., Biswal, P., and Zha, H. Dyrep: Learning representations over dynamic graphs. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
  • Veličković et al. (2018) Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. Graph attention networks. 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.
  • Voeten et al. (2009) Voeten, E., Strezhnev, A., and Bailey, M. United Nations General Assembly Voting Data. Harvard Dataverse, 2009. URL https://doi.org/10.7910/DVN/LEJUQZ.
  • Wang et al. (2021a) Wang, L., Chang, X., Li, S., Chu, Y., Li, H., Zhang, W., He, X., Song, L., Zhou, J., and Yang, H. Tcl: Transformer-based dynamic graph modelling via contrastive learning. CoRR, abs/2105.07944, 2021a.
  • Wang et al. (2021b) Wang, Y., Chang, Y.-Y., Liu, Y., Leskovec, J., and Li, P. Inductive representation learning in temporal networks via causal anonymous walks. 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021b.
  • Xu et al. (2020) Xu, D., Ruan, C., Korpeoglu, E., Kumar, S., and Achan, K. Inductive representation learning on temporal graphs. 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020.
  • Yu et al. (2023) Yu, L., Sun, L., Du, B., and Lv, W. Towards better dynamic graph learning: New architecture and unified library. Advances in Neural Information Processing Systems, 36:67686–67700, 2023.
  • Yu et al. (2018) Yu, W., Cheng, W., Aggarwal, C. C., Zhang, K., Chen, H., and Wang, W. Netwalk: A flexible deep embedding approach for anomaly detection in dynamic networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp.  2672–2681, 2018.

Appendix A Datasets statistics and descriptions

In our experiments we used the following dynamic graph datasets:

• Wikipedia (Kumar et al., 2019): Wikipedia edit requests log over one month, where the editing users and Wikipedia pages are represented as nodes and the edit requests are modeled as edges. The edges are timestamped and contain LIWC feature vectors (Pennebaker et al., 2001) of the requested text to post.

• Reddit (Kumar et al., 2019): Reddit post requests log over one month where the posting users and subreddits are represented as nodes and the posting requests are modeled as edges.

• MOOC (Kumar et al., 2019): Students’ access records to MOOC online courses, where students and content units (e.g., videos, answers, etc.) are described as nodes and the access actions (viewing a video, submitting an answer, etc.) are modeled as edges. The edges are timestamped and have four features describing the action.

• LastFM (Kumar et al., 2019): LastFM listening records over one month, where the LastFM users and the songs are represented as nodes and there is an edge between the users and the songs to which they listened. The edges are timestamped and do not contain any features.

• Enron (Shetty & Adibi, 2004): Email logs of the Enron employees over a period of three years, where the employees are modeled as nodes and a single edge represents an email sent between two employees. The edges are timestamped and do not contain any features.

• Social Evo. (Madan et al., 2011): Documentation of the everyday life of undergraduate students living in dormitories from October 2008 to May 2009. Represented as a mobile phone proximity network where each edge has two features.

• UCI (Panzarasa et al., 2009): Message logs of the online community of students from the University of California, Irvine, where the students are modeled as nodes and a single edge represents a message sent between two students. The edges are timestamped with a granularity of seconds.

• Flights (Strohmeier et al., 2021): Tracked air traffic during the COVID-19 pandemic, where the airports are modeled as nodes and the edges are the tracked flights between two airports. The edges are timestamped and weighted. The weight of the edges indicates the number of flights between the airports in a day.

• Can. Parl. (Huang et al., 2020): Documented interactions between Canadian members of parliament from 2006 to 2019, where the members of parliament are described as nodes, two of which are connected by an edge if they both voted “yes” on a bill. The edges are timestamped and weighted. The weight of the edges indicates the number of times that one member voted “yes” for another member’s bill within one year.

• US Legis. (Fowler, 2006): Documented interactions in the US Senate, where legislators are modeled as nodes, two of which are connected by an edge if they co-sponsored a bill. The edges are timestamped and weighted. The weight of the edges indicates the number of times that two members of the US Congress co-sponsored a bill in a given term.

• UN Trade (MacDonald et al., 2015): Documented global food and agriculture trading connections spanning over 30 years, where nations are represented as nodes, two of which are connected by an edge if they have agricultural import or export relations. The edges are timestamped and weighted. The weight of the edges is the sum of normalized agriculture import or export values between two countries.

• UN Vote (Voeten et al., 2009): Documentation of roll-call votes in the United Nations General Assembly from 1946 to 2020, where nations are represented as nodes, two of which are connected by an edge if they both voted “yes” for an item. The edges are timestamped and weighted. The weight of the edges is the number of times the two countries voted “yes” on a call.

• Contact (Sapiezynski et al., 2019): Physical proximity records documenting around 700 university students over a period of four weeks, where the students are modeled as nodes, two of which are connected by an edge if they are within close proximity of each other. The edges are timestamped and weighted. The weight of the edges specifies the physical proximity between two students.

The full statistics of the datasets as collected by Yu et al. (2023) are reported in Table 3.

Table 3: Dataset statistics. A dash marks datasets whose edges carry no features.

| Dataset | Domain | #Nodes | #Edges | #Edge Features | Bipartite | Duration |
|---|---|---|---|---|---|---|
| Wikipedia | Social | 9,227 | 157,474 | 172 | True | 1 month |
| Reddit | Social | 10,984 | 672,447 | 172 | True | 1 month |
| MOOC | Interaction | 7,144 | 411,749 | 4 | True | 17 months |
| LastFM | Interaction | 1,980 | 1,293,103 | - | True | 1 month |
| Enron | Social | 184 | 125,235 | - | False | 3 years |
| Social Evo. | Proximity | 74 | 2,099,519 | 2 | False | 8 months |
| UCI | Social | 1,899 | 59,835 | - | False | 196 days |
| Flights | Transport | 13,169 | 1,927,145 | 1 | False | 4 months |
| Can. Parl. | Politics | 734 | 74,478 | 1 | False | 14 years |
| US Legis. | Politics | 225 | 60,396 | 1 | False | 12 terms |
| UN Trade | Economics | 255 | 507,497 | 1 | False | 32 years |
| UN Vote | Politics | 201 | 1,035,742 | 1 | False | 72 years |
| Contact | Proximity | 692 | 2,426,279 | 1 | False | 1 month |

In Table 4 we report the ratio of inputs that depend on at least a single missing update in their 1-hop neighborhood. In Table 5 we report the average number of missing updates affecting the 1-hop neighborhood of the nodes. Tables 4 and 5 contain the missing updates statistics for all the datasets used in this work for various batch sizes.

Table 4: Ratio of inputs that depend on at least a single missing update in their 1-hop neighborhood. Columns are batch sizes.

| Dataset | 1 | 10 | 25 | 50 | 100 | 200 |
|---|---|---|---|---|---|---|
| Wikipedia | 0 | 0.23 | 0.42 | 0.55 | 0.67 | 0.76 |
| Reddit | 0 | 0.31 | 0.52 | 0.67 | 0.78 | 0.86 |
| MOOC | 0 | 0.88 | 0.95 | 0.98 | 0.99 | 0.99 |
| LastFM | 0 | 0.74 | 0.88 | 0.94 | 0.97 | 0.98 |
| Enron | 0 | 0.85 | 0.92 | 0.95 | 0.98 | 0.99 |
| Social Evo. | 0 | 0.90 | 0.96 | 0.98 | 0.99 | 0.99 |
| UCI | 0 | 0.70 | 0.85 | 0.91 | 0.95 | 0.97 |
| Flights | 0 | 0.82 | 0.90 | 0.94 | 0.96 | 0.98 |
| Can. Parl. | 0 | 0.90 | 0.96 | 0.98 | 0.99 | 0.99 |
| US Legis. | 0 | 0.90 | 0.96 | 0.98 | 0.99 | 0.99 |
| UN Trade | 0 | 0.90 | 0.96 | 0.98 | 0.99 | 0.99 |
| UN Vote | 0 | 0.90 | 0.96 | 0.98 | 0.99 | 0.99 |
| Contact | 0 | 0.86 | 0.94 | 0.97 | 0.98 | 0.99 |

Table 5: Average number of missing updates affecting the 1-hop neighborhood of each input node. Columns are batch sizes.

| Dataset | 1 | 10 | 25 | 50 | 100 | 200 |
|---|---|---|---|---|---|---|
| Wikipedia | 0 | 0.25 | 0.65 | 1.19 | 2.02 | 3.28 |
| Reddit | 0 | 0.15 | 0.39 | 0.78 | 1.53 | 2.99 |
| MOOC | 0 | 0.72 | 1.45 | 2.38 | 3.90 | 6.57 |
| LastFM | 0 | 0.45 | 1.16 | 2.10 | 3.59 | 5.93 |
| Enron | 0 | 2.95 | 6.07 | 9.81 | 15.55 | 24.95 |
| Social Evo. | 0 | 1.86 | 3.15 | 5.22 | 9.77 | 19.22 |
| UCI | 0 | 0.95 | 2.12 | 3.67 | 5.98 | 9.31 |
| Flights | 0 | 2.82 | 5.77 | 8.79 | 11.97 | 14.55 |
| Can. Parl. | 0 | 4.41 | 11.46 | 22.33 | 40.65 | 66.87 |
| US Legis. | 0 | 4.19 | 10.10 | 17.41 | 25.26 | 31.09 |
| UN Trade | 0 | 4.37 | 11.21 | 21.35 | 37.41 | 57.00 |
| UN Vote | 0 | 3.75 | 8.56 | 14.35 | 21.47 | 28.01 |
| Contact | 0 | 1.74 | 2.60 | 3.15 | 3.85 | 5.13 |
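
To make the two statistics concrete, the following is a minimal sketch of how they could be counted on a toy event stream. It rests on a simplifying assumption that is ours rather than the paper's exact procedure: an update is treated as missing for an input if it appears earlier in the same batch and shares an endpoint with that input. The averages here are taken per input rather than per input node, and the name `missing_update_stats` is hypothetical.

```python
from typing import List, Tuple


def missing_update_stats(
    events: List[Tuple[int, int]], batch_size: int
) -> Tuple[float, float]:
    """Return (ratio of inputs with at least one missing update,
    average number of missing updates per input).

    Assumption: events are (src, dst) pairs in chronological order, and an
    earlier event in the same batch is "missing" for a later input when the
    two share an endpoint, because the memory is only updated per batch.
    """
    affected = 0
    total_missing = 0
    for start in range(0, len(events), batch_size):
        batch = events[start:start + batch_size]
        for i, (u, v) in enumerate(batch):
            # Count earlier in-batch events touching either endpoint.
            missing = sum(1 for (a, b) in batch[:i] if {a, b} & {u, v})
            total_missing += missing
            if missing > 0:
                affected += 1
    n = len(events)
    if n == 0:
        return 0.0, 0.0
    return affected / n, total_missing / n


toy_events = [(0, 1), (1, 2), (3, 4), (0, 2)]
print(missing_update_stats(toy_events, batch_size=1))
print(missing_update_stats(toy_events, batch_size=4))
```

With a batch size of 1 no earlier in-batch event exists, so both statistics are zero, which matches the first column of Tables 4 and 5.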

Appendix B Baselines descriptions

We used the following baselines for the evaluation experiments of dynamic graph learning:

• DyRep (Trivedi et al., 2019): DyRep is an RNN-based architecture that utilizes a temporal attention mechanism to exploit the dynamic structure of the graphs.

• TGAT (Xu et al., 2020): TGAT uses a time-encoding function and aggregates neighborhood information using self-attention to compute the embedding for each node.

• TGN (Rossi et al., 2020): TGN is a general architecture for CTDG learning tasks. It uses both a prediction module and a memory module to get relevant and accurate predictions for each input at each moment in time. It does this by aggregating information from the neighborhood of each node and by maintaining a learnable, RNN-based memory that is updated over time, which also addresses the staleness problem.

• CAWN (Wang et al., 2021b): The CAWN model is based on causal anonymous walks that are generated for each node. The walks are encoded using RNNs and aggregated to obtain the node representation.

• EdgeBank (Poursafaei et al., 2022): EdgeBank is a memorization algorithm that stores every seen update and, given an input, predicts according to a simple decision rule. The rule can be one of the following: whether the input was seen in the last few iterations (EdgeBank$_{th}$), whether it was seen in the last few time units (EdgeBank$_{tw}$), or whether it has already been seen a sufficient number of times (EdgeBank$_{re}$). EdgeBank can also use a decision rule based on infinite memory (EdgeBank$_{inf}$), i.e., it predicts positive for any previously seen edge and negative otherwise. The algorithm’s simplicity makes it significantly faster than any other model for dynamic graph learning. In our experiments, we report the best results of EdgeBank among all of its decision rule variations.

• GraphMixer (Cong et al., 2023): GraphMixer uses three components for the task of future edge prediction: a link-encoder that is based on an MLP and a fixed time-encoding function, a node-encoder that only performs neighborhood mean-pooling, and another MLP for edge prediction.

• DyGFormer (Yu et al., 2023): DyGFormer is a transformer-based architecture. To generate an encoding for a given interaction, DyGFormer generates a co-occurrence embedding of the interaction in addition to a neighborhood representation for each interacting node. Then it uses a patching technique on historical representations of the interacting nodes to better capture long-term temporal dependencies. The patches are then sent to a transformer and its outputs are averaged to create the final representation.
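
As an illustration of the simplest baseline above, the infinite-memory EdgeBank rule can be sketched in a few lines. This is a minimal sketch rather than the reference implementation: the class name `EdgeBankInf` is hypothetical, and treating edges as directed `(src, dst)` pairs is an assumption.

```python
from typing import Set, Tuple


class EdgeBankInf:
    """Infinite-memory memorization: positive only for edges seen before."""

    def __init__(self) -> None:
        self.seen: Set[Tuple[int, int]] = set()

    def update(self, src: int, dst: int) -> None:
        # Store the observed edge (assumed directed).
        self.seen.add((src, dst))

    def predict(self, src: int, dst: int) -> float:
        # 1.0 for any previously seen edge, 0.0 otherwise.
        return 1.0 if (src, dst) in self.seen else 0.0


bank = EdgeBankInf()
bank.update(1, 2)
print(bank.predict(1, 2), bank.predict(2, 1))
```

Because both operations are constant-time set operations, the rule has no learnable parameters and no neighborhood aggregation, which is consistent with the speed advantage noted above.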

Appendix C Additional Implementation Details

C.1 Supporting additional update types

In Section 5.2 we described how to handle an edge addition update. To also support the removal of an edge $e_{i,j}$, $t_{i,j}$ should be set to $0$. Similarly, when a node $v_i$ is added, $t_i$ should be set to the current time, and when it is removed, $t_i$ should be set to $0$.
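
The bookkeeping for these update types can be sketched as follows. This is only an illustration under stated assumptions: the class `TimestampStore` and its dictionary layout are hypothetical, and the edge addition rule (setting $t_{i,j}$ to the current time) is our reading of Section 5.2 rather than something restated here.

```python
from typing import Dict, Tuple


class TimestampStore:
    """Hypothetical container for the per-edge and per-node timestamps."""

    def __init__(self) -> None:
        self.edge_time: Dict[Tuple[int, int], float] = {}  # (i, j) -> t_{i,j}
        self.node_time: Dict[int, float] = {}              # i -> t_i

    def add_edge(self, i: int, j: int, now: float) -> None:
        # Assumed behavior for edge addition (see Section 5.2).
        self.edge_time[(i, j)] = now

    def remove_edge(self, i: int, j: int) -> None:
        # Edge removal: t_{i,j} is set to 0.
        self.edge_time[(i, j)] = 0.0

    def add_node(self, i: int, now: float) -> None:
        # Node addition: t_i is set to the current time.
        self.node_time[i] = now

    def remove_node(self, i: int) -> None:
        # Node removal: t_i is set to 0.
        self.node_time[i] = 0.0


store = TimestampStore()
store.add_node(3, now=10.0)
store.add_edge(3, 7, now=12.5)
store.remove_edge(3, 7)
print(store.node_time[3], store.edge_time[(3, 7)])
```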

C.2 Node classification

To adjust LDTGN for dynamic node classification, the $\mathrm{MERGE}$ function needs to be removed, so that the prediction operation is applied directly to the node embedding. The Wikipedia dataset can also be used for dynamic node classification; we therefore used it to evaluate our model against the other baselines:

Table 6: AUC-ROC for node prediction task on the Wikipedia dataset.

| Dataset | DyRep | TGAT | TGN | CAWN | GraphMixer | DyGFormer | LDTGN (ours) |
|---|---|---|---|---|---|---|---|
| Wikipedia | 86.39 ± 0.98 | 84.09 ± 1.27 | 86.38 ± 2.34 | 84.88 ± 1.33 | 86.80 ± 0.79 | 87.44 ± 1.08 | 86.71 ± 0.44 |

For this task of node classification LDTGN achieves comparable performance to previous state-of-the-art while still being the most efficient in terms of throughput and latency.

C.3 Further implementation details

For the $\mathrm{TDE}$ of LDTGN we used an $\mathrm{MLP}$ with two hidden layers and two ReLU activation layers (Nair & Hinton, 2010). Each linear layer of $\mathrm{TDE}$ outputs a vector of length 100. Before applying $\mathrm{TDE}$, the time difference needs to be normalised to ease the learning process. We used Equation 20 to normalise the time difference, where $C$ is the length of the dataset.

$$\mathrm{normalise}(t)=\frac{\log(1+t)}{\log(1+C)} \tag{20}$$
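
Equation 20 maps a time difference of $0$ to $0$ and a time difference of $C$ to $1$, compressing large gaps logarithmically. A minimal sketch of the computation is shown below; the parameter name `c` stands for $C$, and treating it as a positive scalar is an assumption.

```python
import math


def normalise(t: float, c: float) -> float:
    """Equation 20: log(1 + t) / log(1 + C), for t >= 0 and C > 0."""
    return math.log1p(t) / math.log1p(c)


# Illustrative values only; c is not taken from any dataset.
for t in (0.0, 10.0, 1000.0):
    print(t, normalise(t, c=1000.0))
```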

For LDTGN-mem, we used Time2Vec as the $\mathrm{TDE}$ function. Time2Vec utilizes the cosine function, thus omitting the need for normalization.

The merge function of LDTGN and LDTGN-mem is an $\mathrm{MLP}$ that maps multiple input vectors into a single value that represents the probability of the edge being positive. The $\mathrm{MLP}$ first applies a linear layer that maps the three vectors to a single vector. It then reduces the vector’s dimension to 80, then to 10, and finally to 1. A ReLU is applied after each dimensionality reduction. Finally, a sigmoid function is applied to the result to obtain the probability of the edge being positive.

In practice, it is challenging to use the full neighborhood of the input nodes to compute predictions while sustaining a reasonable throughput, since the neighborhood of each node is expected to grow over time. Thus, we implemented our models using the recent neighbors sampling strategy suggested by Rossi et al. (2020), in which only the $k$ neighbors of each hop that were most recently involved in an update are used for computing the predictions. For our models, we used $k=20$.
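
The recent neighbors strategy can be illustrated with a bounded buffer per node. The sketch below is not the implementation used in our code: the class name `RecentNeighbors` is hypothetical, it covers only the 1-hop case, and it assumes that updates arrive in chronological order.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple


class RecentNeighbors:
    """Keep only the k most recent (neighbor, timestamp) pairs per node."""

    def __init__(self, k: int = 20) -> None:
        self.k = k
        self.buffers: Dict[int, Deque[Tuple[int, float]]] = defaultdict(
            lambda: deque(maxlen=self.k)
        )

    def update(self, src: int, dst: int, t: float) -> None:
        # A deque with maxlen drops its oldest entry automatically.
        self.buffers[src].append((dst, t))
        self.buffers[dst].append((src, t))

    def neighbors(self, node: int) -> List[Tuple[int, float]]:
        # Oldest first, newest last; empty list for unseen nodes.
        return list(self.buffers.get(node, ()))


sampler = RecentNeighbors(k=2)
for dst, t in [(1, 1.0), (2, 2.0), (3, 3.0)]:
    sampler.update(0, dst, t)
print(sampler.neighbors(0))
```

Bounding each buffer keeps the per-prediction cost independent of how large a node's full neighborhood has grown.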

C.4 Training

We trained the models for 100 epochs with a patience of 20 epochs before early stopping. We used binary cross-entropy loss as the objective function and optimized the models using Adam with a learning rate of $10^{-4}$.
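
The stopping schedule can be sketched as a generic loop. This is an illustration only: `train_epoch` and `validate` are hypothetical callables standing in for the actual training and validation steps, and the assumption that the monitored validation score is higher-is-better is ours.

```python
from typing import Callable, Tuple


def train_with_early_stopping(
    train_epoch: Callable[[int], None],
    validate: Callable[[int], float],
    max_epochs: int = 100,
    patience: int = 20,
) -> Tuple[int, float]:
    """Stop once `patience` epochs pass without a better validation score."""
    best_score = float("-inf")
    best_epoch = -1
    for epoch in range(max_epochs):
        train_epoch(epoch)
        score = validate(epoch)
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_score


# Toy usage with a fixed list of validation scores.
scores = [0.50, 0.60, 0.55, 0.54, 0.53]
print(train_with_early_stopping(lambda e: None, lambda e: scores[e],
                                max_epochs=len(scores), patience=2))
```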

All the experiments were performed on an Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz and an NVIDIA GeForce RTX 3090.

Appendix D Additional results

In Table 7 and Table 8 we report the AUC-ROC of our proposed model and baselines for the transductive and inductive future edge prediction tasks, respectively.

Table 7: AUC-ROC for transductive future edge prediction with random negative sampling over five runs. The significantly best result for each benchmark appears in bold font.

| Dataset | DyRep | TGAT | TGN | CAWN | EdgeBank | GraphMixer | DyGFormer | LDTGN (ours) | LDTGN-mem (ours) |
|---|---|---|---|---|---|---|---|---|---|
| Wikipedia | 94.37 ± 0.09 | 96.67 ± 0.07 | 98.37 ± 0.07 | 98.54 ± 0.04 | 90.78 ± 0.00 | 96.92 ± 0.03 | 98.91 ± 0.02 | 98.67 ± 0.01 | 98.90 ± 0.05 |
| Reddit | 98.17 ± 0.05 | 98.47 ± 0.02 | 98.60 ± 0.06 | 99.01 ± 0.01 | 95.37 ± 0.00 | 97.17 ± 0.02 | 99.15 ± 0.01 | 98.20 ± 0.02 | 99.25 ± 0.02 |
| MOOC | 85.03 ± 0.58 | 87.11 ± 0.19 | 91.21 ± 1.15 | 80.38 ± 0.26 | 60.86 ± 0.00 | 84.01 ± 0.17 | 87.91 ± 0.58 | 82.43 ± 1.72 | 93.33 ± 0.54 |
| LastFM | 71.16 ± 1.89 | 71.59 ± 0.18 | 78.47 ± 2.94 | 85.92 ± 0.10 | 83.77 ± 0.00 | 73.53 ± 0.12 | 93.05 ± 0.10 | 90.79 ± 0.01 | 91.68 ± 0.51 |
| Enron | 84.89 ± 3.00 | 68.89 ± 1.10 | 88.32 ± 0.99 | 90.45 ± 0.14 | 87.05 ± 0.00 | 84.38 ± 0.21 | 93.33 ± 0.13 | 98.31 ± 0.01 | 93.35 ± 0.42 |
| Social Evo. | 90.76 ± 0.21 | 94.76 ± 0.16 | 95.39 ± 0.17 | 87.34 ± 0.08 | 81.60 ± 0.00 | 95.23 ± 0.07 | 96.30 ± 0.01 | 96.82 ± 0.25 | 95.93 ± 0.06 |
| UCI | 68.77 ± 2.34 | 78.53 ± 0.74 | 92.03 ± 1.13 | 93.87 ± 0.08 | 77.30 ± 0.00 | 91.81 ± 0.67 | 94.49 ± 0.26 | 96.22 ± 0.03 | 94.79 ± 0.07 |
| Flights | 95.95 ± 0.62 | 94.13 ± 0.17 | 98.22 ± 0.13 | 98.45 ± 0.01 | 90.23 ± 0.00 | 91.13 ± 0.01 | 98.93 ± 0.01 | 96.98 ± 0.09 | 98.82 ± 0.07 |
| Can. Parl. | 73.35 ± 3.67 | 75.69 ± 0.78 | 76.99 ± 1.80 | 75.70 ± 3.27 | 64.14 ± 0.00 | 83.17 ± 0.53 | 97.76 ± 0.41 | 99.68 ± 0.02 | 77.66 ± 7.92 |
| US Legis. | 82.28 ± 0.32 | 75.84 ± 1.99 | 83.34 ± 0.43 | 77.16 ± 0.39 | 62.57 ± 0.00 | 76.96 ± 0.79 | 77.90 ± 0.58 | 94.88 ± 0.10 | 87.96 ± 0.53 |
| UN Trade | 67.44 ± 0.83 | 64.01 ± 0.12 | 69.10 ± 1.67 | 68.54 ± 0.18 | 66.75 ± 0.00 | 65.52 ± 0.51 | 70.20 ± 1.44 | 97.91 ± 0.06 | 97.16 ± 0.17 |
| UN Vote | 67.18 ± 1.04 | 52.83 ± 1.12 | 69.71 ± 2.65 | 53.09 ± 0.22 | 62.97 ± 0.00 | 52.46 ± 0.27 | 57.12 ± 0.62 | 86.81 ± 0.87 | 77.33 ± 1.04 |
| Contact | 96.48 ± 0.14 | 96.95 ± 0.08 | 97.54 ± 0.35 | 89.99 ± 0.34 | 94.34 ± 0.00 | 93.94 ± 0.02 | 98.53 ± 0.01 | 98.58 ± 0.01 | 99.06 ± 0.04 |

Table 8: AUC-ROC for inductive future edge prediction with random negative sampling over five runs. The significantly best result for each benchmark appears in bold font.

| Dataset | DyRep | TGAT | TGN | CAWN | GraphMixer | DyGFormer | LDTGN (ours) | LDTGN-mem (ours) |
|---|---|---|---|---|---|---|---|---|
| Wikipedia | 91.49 ± 0.45 | 95.90 ± 0.09 | 97.72 ± 0.03 | 98.03 ± 0.04 | 95.57 ± 0.20 | 98.48 ± 0.03 | 98.23 ± 0.00 | 98.30 ± 0.06 |
| Reddit | 96.05 ± 0.12 | 96.98 ± 0.04 | 97.39 ± 0.07 | 98.42 ± 0.02 | 93.80 ± 0.07 | 98.71 ± 0.01 | 97.30 ± 0.03 | 98.56 ± 0.05 |
| MOOC | 84.03 ± 0.49 | 86.84 ± 0.17 | 91.24 ± 0.99 | 81.86 ± 0.25 | 81.43 ± 0.19 | 87.62 ± 0.51 | 81.88 ± 1.74 | 92.36 ± 0.30 |
| LastFM | 82.24 ± 1.51 | 76.99 ± 0.29 | 82.61 ± 3.15 | 87.82 ± 0.12 | 70.84 ± 0.85 | 94.08 ± 0.08 | 91.75 ± 0.01 | 92.57 ± 0.86 |
| Enron | 76.34 ± 4.20 | 64.63 ± 1.74 | 78.83 ± 1.11 | 87.02 ± 0.50 | 72.33 ± 0.99 | 90.69 ± 0.26 | 95.77 ± 0.13 | 88.46 ± 0.79 |
| Social Evo. | 91.18 ± 0.49 | 93.41 ± 0.19 | 93.43 ± 0.59 | 84.73 ± 0.27 | 93.71 ± 0.18 | 95.29 ± 0.03 | 96.03 ± 0.37 | 94.01 ± 0.2 |
| UCI | 58.08 ± 1.81 | 77.64 ± 0.38 | 86.68 ± 2.29 | 90.40 ± 0.11 | 84.49 ± 1.82 | 92.63 ± 0.13 | 92.83 ± 0.02 | 90.83 ± 0.21 |
| Flights | 93.56 ± 0.70 | 88.64 ± 0.35 | 95.92 ± 0.43 | 96.86 ± 0.02 | 82.48 ± 0.01 | 97.80 ± 0.02 | 94.44 ± 0.21 | 97.39 ± 0.21 |
| Can. Parl. | 55.27 ± 0.49 | 56.51 ± 0.75 | 55.86 ± 0.75 | 58.83 ± 1.13 | 55.83 ± 1.07 | 89.33 ± 0.48 | 98.73 ± 0.05 | 58.59 ± 4.42 |
| US Legis. | 61.07 ± 0.56 | 48.27 ± 3.50 | 62.38 ± 0.48 | 51.49 ± 1.13 | 50.43 ± 1.48 | 53.21 ± 3.04 | 88.19 ± 0.24 | 72.45 ± 1.31 |
| UN Trade | 58.82 ± 0.98 | 62.72 ± 0.12 | 59.99 ± 3.50 | 67.05 ± 0.21 | 63.76 ± 0.07 | 67.25 ± 1.05 | 97.47 ± 0.07 | 90.26 ± 1.28 |
| UN Vote | 55.13 ± 3.46 | 51.83 ± 1.35 | 61.23 ± 2.71 | 48.34 ± 0.76 | 50.51 ± 1.05 | 56.73 ± 0.69 | 86.99 ± 0.86 | 68.99 ± 1.66 |
| Contact | 91.89 ± 0.38 | 96.53 ± 0.10 | 94.84 ± 0.75 | 89.07 ± 0.34 | 93.05 ± 0.09 | 98.30 ± 0.02 | 98.26 ± 0.02 | 98.22 ± 0.14 |

Appendix E Decoupling potential speedup analysis

In this section we analyze the potential speedup that models can achieve by using the decoupling strategy. The decoupling strategy does not affect running time directly; rather, it allows models to run faster without compromising their precision. Figure 3 demonstrates this idea: one can decouple a temporal model and increase its batch size significantly while maintaining a constant memory batch size. This improves the running time without compromising the precision of the decoupled model. Consider a sequence containing updates for a model to apply and inputs for it to predict. Denote by $t_{memory}$ the time it takes the memory module to apply all the updates in the sequence, and by $t_{prediction}$ the time it takes the prediction module to compute the predictions for all the inputs in the sequence. The total time it takes the model to process the sequence is:

$$T_{total}=t_{memory}+t_{prediction} \tag{21}$$

By decoupling the network one can reduce this time to

$$T_{decouple\_total}=\frac{t_{prediction}}{BS_{new}/BS_{old}}+t_{memory} \tag{22}$$

where $BS_{old}$ is the batch size of the model before decoupling and $BS_{new}$ is the batch size of the decoupled model. Hence the potential speedup of the model, $T_{total}/T_{decouple\_total}$, is:

$$speedup=\frac{BS_{new}\cdot t_{prediction}+BS_{new}\cdot t_{memory}}{BS_{old}\cdot t_{prediction}+BS_{new}\cdot t_{memory}} \tag{23}$$
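
As a worked example of Equations 21 to 23, the sketch below evaluates the speedup for illustrative timings. The numbers are made up for the example and are not measurements; the function name `decoupling_speedup` is hypothetical.

```python
def decoupling_speedup(
    t_prediction: float, t_memory: float, bs_old: int, bs_new: int
) -> float:
    """Ratio of Equation 21 to Equation 22, i.e. Equation 23."""
    t_total = t_prediction + t_memory                        # Equation 21
    t_decoupled = t_prediction / (bs_new / bs_old) + t_memory  # Equation 22
    return t_total / t_decoupled


# Illustrative timings: prediction dominates, batch size grows 9x.
print(decoupling_speedup(t_prediction=9.0, t_memory=1.0, bs_old=100, bs_new=900))
```

With these values the prediction time shrinks from 9 to 1 time units while the memory time stays at 1, so the total drops from 10 to 2. As $BS_{new}$ grows, the speedup approaches $(t_{prediction}+t_{memory})/t_{memory}$, so the memory module bounds the achievable gain.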