License: CC BY-NC-ND 4.0
arXiv:2303.03379v3 [cs.LG] 27 Dec 2023

SUREL+: Moving from Walks to Sets for Scalable Subgraph-based Graph Representation Learning

Haoteng Yin†, Muhan Zhang‡, Jianguo Wang†, Pan Li†§
†Department of Computer Science, Purdue University
‡Institute for Artificial Intelligence, Peking University
§School of Electrical and Computer Engineering, Georgia Institute of Technology
†{yinht, csjgwang}@purdue.edu ‡muhan@pku.edu.cn §panli@gatech.edu
Abstract.

Subgraph-based graph representation learning (SGRL) has recently emerged as a powerful tool in many prediction tasks on graphs due to its advantages in model expressiveness and generalization ability. Most previous SGRL models face computational challenges associated with the high cost of subgraph extraction for each training or test query. Recently, SUREL was proposed to accelerate SGRL; it samples random walks offline and joins these walks online as a proxy of the subgraph for representation learning. Thanks to the reusability of sampled walks across different queries, SUREL achieves state-of-the-art performance in terms of scalability and prediction accuracy. However, SUREL still suffers from high computational overhead caused by node duplication in sampled walks. In this work, we propose a novel framework SUREL+ that upgrades SUREL by using node sets instead of walks to represent subgraphs. This set-based representation eliminates repeated nodes by definition but can also be irregular in size. To address this issue, we design a customized sparse data structure to efficiently store and access node sets and provide a specialized operator to join them in parallel batches. SUREL+ is modularized to support multiple types of set samplers, structural features, and neural encoders to complement the structural information loss after the reduction from walks to sets. Extensive experiments have been performed to validate SUREL+ in the prediction tasks of links, relation types, and higher-order patterns. SUREL+ achieves 3-11× speedups over SUREL while maintaining comparable or even better prediction performance; compared to other SGRL baselines, SUREL+ achieves ~20× speedups and significantly improves the prediction accuracy.

PVLDB Reference Format:
Haoteng Yin, Muhan Zhang, Jianguo Wang, Pan Li. SUREL+: Moving from Walks to Sets for Scalable Subgraph-based Graph Representation Learning. PVLDB, 16(11): 2939-2948, 2023.
doi:10.14778/3611479.3611499 This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 16, No. 11 ISSN 2150-8097.

PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at https://github.com/Graph-COM/SUREL_Plus.

1. Introduction

Graphs are widely used to model interactions in natural sciences and relationships in social life (Koller et al., 2007; Jumper et al., 2021). Graph-structured data in the real world are highly irregular and often large-scale. To solve inference tasks on graphs, graph representation learning (GRL) that studies quantitative representations of graph-structured data has attracted much attention (Hamilton et al., 2017b; Hamilton, 2020; Wu et al., 2022). Recently, subgraph-based GRL (SGRL) has become an important research direction for researchers studying GRL algorithms and systems, as it achieves far better prediction performance than other approaches on many GRL tasks, especially those involving a set of nodes. Given a set of queried nodes, SGRL models such as SEAL (Zhang and Chen, 2018; Zhang et al., 2021), GraIL (Teru et al., 2020), and SubGNN (Alsentzer et al., 2020) first extract a subgraph around the queried node set (called query-induced subgraph), and then use neural networks to encode extracted subgraphs for prediction. Extensive work shows that SGRL models are more robust (Zeng et al., 2021) and more expressive (Bouritsas et al., 2022; Frasca et al., 2022); while canonical graph neural networks (GNNs) including GCN (Kipf and Welling, 2017) and GraphSAGE (Hamilton et al., 2017a) usually fail to make accurate predictions, due to their limited expressive power (Zhang et al., 2021; Garg et al., 2020; Chen et al., 2020), incapability of capturing intra-node distance information (Srinivasan and Ribeiro, 2020; Li et al., 2020), and improper entanglement between receptive field size and model depth (Huang and Zitnik, 2020; Zeng et al., 2021; Yin et al., 2022). An example in Fig. 2 illustrates how SGRL works for link prediction and demonstrates its advantages over GNNs. 
Here, canonical GNNs generate and aggregate node-wise representations to predict links, which would map structurally symmetric nodes without distinct features into the same representation and thus lead to the ambiguity issue (Xu et al., 2019; Zhang et al., 2021). So far, the advantages of SGRL methods have been verified in many applications, such as link and relation prediction (Zhang and Chen, 2018; Zhang et al., 2021; Teru et al., 2020), higher-order pattern prediction (Meng et al., 2018; Liu et al., 2022), temporal network modeling (Wang et al., 2021), recommender systems (Zhang and Chen, 2020), anomaly detection (Alsentzer et al., 2020; Cai et al., 2021), graph meta-learning (Huang and Zitnik, 2020), subgraph matching (Liu et al., 2020; Lou et al., 2020), and molecular/protein study in life sciences (Wang and Zhang, 2021).

Figure 1. GNNs cannot correctly predict whether $x$ is more likely linked with $y$ or $z$: $y$ and $z$ have the same representation without distinct features. The representations based on one-hop neighbors (query-induced subgraph) are more expressive in distinguishing node pairs $(x, y)$ and $(x, z)$.
Figure 2. Overview of SUREL+: a side-by-side comparison with previous SGRL models. For a query, SGRL first prepares the input subgraph attached with structural features, which are fed into a neural encoder to obtain embeddings for prediction. SUREL+ samples node sets as input, while SEAL (Zhang and Chen, 2018; Zhang et al., 2021) extracts query-induced subgraphs and SUREL (Yin et al., 2022) runs random walks. To compensate for the structure loss by reducing subgraphs to node sets, SUREL+ provides various types of set samplers, structure encoders, and set neural encoders. To better serve set-based representations, SUREL+ designs a customized sparse data structure SpG to store sampled node sets with fast access, and supports their online joining in parallel via a sparse join operator SpJoin.

Despite their algorithmic benefits, SGRL methods currently face two major computational challenges: (1) Query Dependency. A subgraph must be extracted for each queried node set, which is not reusable across different queries, and cannot be preprocessed if the query is unknown; (2) Irregularity. The extracted subgraphs are irregularly sized, resulting in poor batch processing and load-balancing performance. As Fig. 3 (a) shows, subgraph extraction in SEAL (Zhang and Chen, 2018; Zhang et al., 2021) is prohibitively slow for practical deployment. This inspired recent work on dedicated hardware acceleration for extracting subgraphs (DGL, 2022; PyG, 2022). However, how to fundamentally improve the scalability and efficiency of SGRL methods remains largely unexplored.

Figure 3. (a) Subgraph extraction in SEAL (Zhang and Chen, 2018; Zhang et al., 2021) has much higher complexity than samplers of other simplified forms. The size of query-induced subgraphs is irregular, growing exponentially w.r.t. the number of extracted hops (Chen et al., 2018; Zeng et al., 2021). (b) Breaking subgraphs into reusable walks faces a serious node duplication issue.

SUREL (Yin et al., 2022) is the state-of-the-art (SOTA) framework that applies algorithm and system co-design to implement SGRL. It decouples the input from specific queries by sampling node-level random walks and uses the joint walks of queried nodes as a proxy of subgraphs. Specifically, SUREL treats each node in the graph as a seed and runs multiple random walks from the seed offline. Given a queried node set, SUREL joins and encodes the sampled walks of all queried nodes online for prediction. The join operation builds the connection between sampled walks of queried nodes so that their joins can function as the query-induced subgraph. To compensate for the structure loss incurred by representing subgraphs as walks, a structural feature termed relative position encoding (RPE) is adopted to record the relative distance between each distinct node in sampled walks and the seed. The RPE is pre-computed offline and attached to sampled walks before being fed into neural networks (NNs) to make predictions. Sampled walks from one seed can be reused for multiple queries whenever that seed node is involved. Through this walk-sharing mechanism, SUREL significantly improves the efficiency of SGRL. The regularity of walks also enables highly parallel walk sampling and online joining with a dedicated system design. However, SUREL still suffers from an inherent drawback of the walk-based representation, namely high node redundancy in sampled walks (over 55% of nodes are duplicated, see Fig. 3 (b)). This further raises the following issues: (1) extra space for hosting walks and positional encoding in memory, (2) extra time for operations on duplicated nodes in subsequent routines of walk joining and NN-based encoding, and (3) high workload of data transfer between CPU and GPU.

In this work, we upgrade SUREL and develop a novel framework SUREL+ that again benefits from algorithm and system co-design. The core concept of SUREL+ is simple: instead of using walks to represent subgraphs, we now employ node sets, thereby obviating node duplication. However, this new idea based on node sets also introduces several algorithm and system design difficulties. From an algorithmic perspective, as explored in this study, the transition from induced subgraphs to walks and then to node sets results in a considerable loss of structural information. Therefore, the first priority is to develop a method that can compensate for such loss while maintaining performance. On the system side, SUREL (Yin et al., 2022) utilizes walks, which can be easily stored and processed in an aligned format by controlling sampling parameters. In contrast, node sets are irregular in size, creating difficulties for efficient storage and fast access. To sum up, how to coordinate the designs of both sides constitutes the primary challenge of this work.

SUREL+ tackles the above challenges through its entire pipeline. To avoid node duplication in walk sampling, SUREL+ only keeps unique nodes from neighborhood sampling of seed nodes during preprocessing. To compensate for structural information loss, SUREL+ incorporates various types of set samplers and structure encoders to preserve local graph structures: set samplers employ different graph metrics to measure node importance and determine sampling rules; structure encoders support landing probabilities of random walks (Li et al., 2019), shortest path distances, and personalized PageRank scores (Jeh and Widom, 2003), covering most of the structural features used by previous SGRL models (Zhang and Chen, 2018; Zhang et al., 2021; Li et al., 2020; Teru et al., 2020; Yin et al., 2022). Furthermore, SUREL+ designs a customized sparse data structure, namely SpG, which can efficiently store sampled node sets and achieve fast access. A sparse operator SpJoin is developed accordingly to perform join operations on the sampled node sets and associated structural features for serving queries online. To capture diverse levels of interactions between node and structural features, SUREL+ introduces multiple set neural encoders, such as a multi-layer perceptron (MLP) with mean pooling, set attention (Veličković et al., 2018), and LSTM (Hamilton et al., 2017a), which ensure sufficient expressive power and consistent performance across various types of SGRL tasks.

Overall, our contributions can be summarized as follows:

  • Algorithm: SUREL+ is a novel SGRL framework (open source), which utilizes reusable node sets associated with various structural features to represent query-induced subgraphs via online joining. Compared with the SOTA baselines, the proposed set-based subgraph representation greatly reduces memory and computation costs without degrading prediction performance.

  • System: SUREL+ designs a customized sparse data structure SpG and a sparse join operator SpJoin to support efficient storage and fast access of node sets, which achieves much lower latency and higher throughput than previous SGRL methods.

  • We conduct extensive experiments on 9 real-world graphs, with millions/billions of nodes/edges, and demonstrate the advantages of SUREL+ in link/relation-type/motif prediction tasks. SUREL+ is 3-11× faster than the current SOTA SGRL method SUREL while maintaining comparable or even better prediction accuracy. SUREL+ also achieves ~20× speedups with substantial accuracy improvements over other SGRL baselines.

2. Preliminaries

2.1. Notations and Relevant Definitions in SGRL

Let $\mathcal{G}(\mathcal{V}, \mathcal{E}, X)$ be an attributed graph with node set $\mathcal{V} = \{1, 2, \dots, n\}$ and edge set $\mathcal{E}$, where $X \in \mathbb{R}^{n \times d}$ denotes the $d$-dimensional node attributes. A query $Q \subset \mathcal{V}$ is a node set of interest for a certain type of task. We denote the subgraph induced by query $Q$ as $\mathcal{G}_Q$ and the subgraph induced by a single node $u$ as $\mathcal{G}_u$, where induced subgraphs are typically within a small number of hops.

Definition 2.0 (Subgraph-based Graph Representation Learning (SGRL)).

Given a query $Q$ of node set over graph $\mathcal{G}$, SGRL aims to learn a representation of the query-induced subgraph $\mathcal{G}_Q$ to make prediction $f(\mathcal{G}_Q)$. $f(\cdot)$ is usually a neural network. SGRL tasks come with some labeled queries $\{(Q_i, y_i)\}_{i=1}^{L}$ for supervision (positive samples) and other unlabeled queries $\{Q_i\}_{i=L+1}^{L+N}$ for inference.

Examples of SGRL Tasks. Link prediction seeks to estimate the likelihood of a link between two nodes in a given graph, where a query $Q$ corresponds to a node pair. It can be further generalized to predict links with types over heterogeneous graphs (Teru et al., 2020) or to predict blood vessels (Paetzold et al., 2021) and chemical bonds (Jumper et al., 2021) in domain-specific graphs. Tasks beyond pairwise relations are named higher-order pattern prediction, where a query $Q$ consists of three or more nodes. In this work, we consider the following setting: given partially observed pairwise relations among the queried nodes in $Q$, predict whether these nodes will establish a certain full higher-order relation of interest (Srinivasan et al., 2021; Liu et al., 2022).

Review of SGRL Methods. The current SGRL pipeline mainly has three parts, as shown in the Algorithm Design section of Fig. 2: subgraph preparation, structural feature construction, and neural encoder to obtain the readout of subgraphs. Classical SGRL models often group query-dependent parts together (e.g., SEAL (Zhang and Chen, 2018; Zhang et al., 2021) couples subgraph extraction with labeling trick (Zhang et al., 2021)), and then apply GNNs on extracted and labeled subgraphs for prediction. However, such coupling is expensive and makes the computed intermediate results (e.g., labeled subgraphs) not reusable across queries, which motivates recent SGRL methods to decouple them. SUREL (Yin et al., 2022) substitutes explicit subgraph extraction with online joining of multiple pre-sampled walks attached with positional encoding defined on walk landing as structural features, both of which are node-level and thus can be reused to serve multiple queries. Lastly, it applies neural networks to encode joint walks and aggregate their embeddings for prediction.

2.2. Related Works

Scalable SGRL Design. Recent works on SGRL models have primarily focused on efficient subgraph extraction. Various techniques have been proposed, including PPR-based (Bojchevski et al., 2020; Zeng et al., 2021) and random walk-based (Yin et al., 2022) subgraph samplers, and node neighborhood sampling through CUDA kernels (DGL, 2022) and tensor operations (PyG, 2022). Some frameworks have customized data structures to better support subgraph operations and gain higher throughput, such as associative arrays in SUREL (Yin et al., 2022), temporal-CSR in TGL (Zhou et al., 2022), and a GPU-oriented dictionary in NAT (Luo and Li, 2022). To achieve scalable modeling design, GDGNN (Kong et al., 2022) utilizes node representations along the geodesic path between queried nodes for prediction, partially decoupling structural feature construction from subgraph extraction. BUDDY (Chamberlain et al., 2023) employs subgraph sketches to avoid explicitly constructing subgraphs for link prediction. However, these works either focus on specific aspects of computational issues in SGRL, namely bottlenecks of extraction, storage, and feature construction, or are limited to one type of SGRL task. In contrast, SUREL+ provides a comprehensive co-design approach in scalable sampling, efficient storage, and expressive modeling, offering a general and scalable framework for various SGRL tasks.

3. The framework of SUREL+

This section introduces SUREL+, whose key concept is to sample node sets and encode structural features offline and then join them online as a proxy of query-induced subgraph for representation learning. This approach only keeps distinct nodes in the sampled set for reuse in different queries, effectively addressing memory and computation concerns of node duplication in the walk-based representation adopted by SUREL (Yin et al., 2022). SUREL+ features a modular design that supports various set samplers, structure encoders, and set neural encoders to provide a trade-off between complexity and expressiveness after reducing subgraphs to node sets. Furthermore, SUREL+ introduces a customized sparse data structure SpG and an arithmetic operator SpJoin to store node sets and perform their online joins efficiently. Fig. 2 summarizes and compares SUREL+ and current SGRL models. The following subsections describe these modules in detail.

3.1. Set Samplers and Structure Encoders

SUREL+ uses set samplers to sample a set of nodes from each node’s neighborhood and calls structure encoders to construct the corresponding structural features. Both operations are executed offline: the former is primarily for computational benefits, while the latter is to offset the structure loss incurred by reducing subgraphs (adopted by SEAL (Zhang and Chen, 2018; Zhang et al., 2021)) or walks (adopted by SUREL (Yin et al., 2022)) to node sets. Conceptually, SUREL+ represents the node-induced subgraph $\mathcal{G}_u$ via a combination of (1) a node set $\mathcal{S}_u$ comprising unique nodes sampled from the neighborhood of node $u$ and (2) the associated structural features $\mathcal{Z}_u$ that reflect the position of sampled nodes in $\mathcal{G}_u$.

Set Samplers. Two types of set samplers are adopted. The first type, named Walk-based Sampler, samples short random walks and then reduces the sampled walks into a set of unique nodes. The second type, named Metric-based Sampler, is based on graph metrics that measure the proximity between neighboring nodes and the seed, such as personalized PageRank (PPR) scores (Jeh and Widom, 2003) or shortest path distances. Specifically, the walk-based sampler runs $M$-many $m$-step random walks, starting from each seed $u$ in parallel on the graph $\mathcal{G}$, and then only puts distinct nodes of sampled walks into the set $\mathcal{S}_u$. The metric-based sampler, taking the PPR-based one (Bojchevski et al., 2020) as an example, first runs the push-flow algorithm (Andersen et al., 2006) to obtain an approximation of the PPR vector for each seed $u$, and then selects the top-$K$ nodes with the highest PPR scores into the set $\mathcal{S}_u$. Mathematically, PPR scores are the convergent landing probabilities of seeded random walks as the number of steps goes to infinity. Therefore, these two samplers complement each other by leveraging either more local or more global structures of the graph. We use hyper-parameters $M$, $m$ to control random walks, and $K$ to control metric-based samplers, which are all set as constants in practice. The complexity of the above offline sampling procedures is $O(|\mathcal{V}|)$.
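The walk-based sampler described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the SUREL+ implementation: the function name `walk_based_set_sampler`, the adjacency-list input, and the toy graph are hypothetical, and including the seed itself in its own set is an assumption.

```python
import random

def walk_based_set_sampler(adj, seed, num_walks, num_steps, rng):
    """Run `num_walks` random walks of `num_steps` steps from `seed` and
    return the distinct visited nodes in sorted order (a sketch of S_u).
    Assumption: the seed itself is kept in its own set."""
    visited = {seed}
    for _ in range(num_walks):
        node = seed
        for _ in range(num_steps):
            neighbors = adj[node]
            if not neighbors:  # dead end: stop this walk early
                break
            node = rng.choice(neighbors)
            visited.add(node)  # a set keeps each node only once
    return sorted(visited)

# Hypothetical toy graph as an undirected adjacency list.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
rng = random.Random(0)
# M = 4 walks of m = 2 steps per seed, one node set per seed.
node_sets = {
    u: walk_based_set_sampler(adj, u, num_walks=4, num_steps=2, rng=rng)
    for u in adj
}
```

Each set holds at most $Mm + 1$ nodes, but its actual size depends on how often the walks revisit nodes, which is the source of the irregular sizes that the storage format has to handle.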

Structure Encoders. The structure encoder constructs structural features $\mathcal{Z}_{u,x} \in \mathbb{R}^k$ for each node $x$ in the sampled node set $\mathcal{S}_u$. These features prove to be crucial for inference tasks on graphs involving multiple nodes (Zhang et al., 2021) and can be conceptually understood as anchoring the sampled node $x$ in the seed $u$’s neighborhood. One possible choice is the landing probabilities of random walks (Li et al., 2019, 2020; Yin et al., 2022): each element $\mathcal{Z}_{u,x}[i]$ stores the number of times node $x$ is visited at step $i$ over all walks rooted at the seed $u$, divided by the number of walks performed by the sampler. Landing probabilities (LPs) can be computed along with walk sampling. Another option is the shortest path distance (SPD) between $x$ and $u$ (Zhang and Chen, 2018; Li et al., 2020; Zhang et al., 2021), which records their relative distance in terms of reachability. PPR scores (Jeh and Widom, 2003) are also a popular structural feature and can be naturally obtained by running a PPR-based sampler. Later, we denote the collection of structural features for all nodes in $\mathcal{S}_u$ as $\mathcal{Z}_u = \{\mathcal{Z}_{u,x} \mid x \in \mathcal{S}_u\}$.
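As a rough illustration of the landing-probability feature, the sketch below counts, for each visited node, how many walks from the seed are at that node at each step, and then divides by the number of walks. The function name `landing_probabilities` and the toy graph are hypothetical, and recording step 0 (where every walk sits at the seed) is an assumption, which makes the feature dimension $m + 1$ here.

```python
import random
from collections import defaultdict

def landing_probabilities(adj, seed, num_walks, num_steps, rng):
    """Return {x: [p_0, ..., p_m]}, where p_i is the fraction of walks
    rooted at `seed` that are at node x at step i.
    Assumption: step 0 (the seed itself) is recorded as well."""
    counts = defaultdict(lambda: [0] * (num_steps + 1))
    for _ in range(num_walks):
        node = seed
        counts[node][0] += 1
        for step in range(1, num_steps + 1):
            neighbors = adj[node]
            if not neighbors:  # dead end: stop this walk early
                break
            node = rng.choice(neighbors)
            counts[node][step] += 1
    # Normalize landing counts by the number of walks.
    return {x: [c / num_walks for c in row] for x, row in counts.items()}

# Hypothetical toy graph (no dead ends) and M = 8 walks of m = 2 steps.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
Z = landing_probabilities(adj, seed=2, num_walks=8, num_steps=2,
                          rng=random.Random(0))
```

Because the counts are collected during the walks themselves, the keys of `Z` are exactly the distinct nodes a walk-based sampler would place in the set of the seed, so features and node set come out of a single pass.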

Figure 4. Node set $\mathcal{S}_u$ and its associated structural features $\mathcal{Z}_u$ stored in SpG. It contains three arrays to store the size of node sets, sampled node indices, and associated structural features (with optional two-level indexing). Here, $D_{\text{SF}}$ shows the landing counts of nodes at different steps in sampled walks as an example, which can be normalized later to landing probabilities as structural features.

3.2. Set-based Storage - SpG

Set-based subgraph representation has advantages in terms of flexibility and compactness. However, the uneven sizes of sampled node sets pose great challenges to their storage and fast access. Note that these node sets need to be frequently visited in subsequent online phases. To overcome these obstacles, SUREL+ designs a customized compressed sparse row (CSR) format called SpG, which reorganizes the storage of node sets and their structural features in a memory-efficient manner, as depicted in Fig. 4. Specifically, the node set $\mathcal{S}_u$ and its structural features $\mathcal{Z}_u$ are stored as a row of SpG, denoted as $\texttt{SpG}[u, :]$. Multiple node sets and their associated structural features are consolidated into three contiguous arrays:

  • indptr $\delta \in \mathbb{Z}^{n+1}$, an integer array that tracks the starting index of each stored node set (row). It records the cumulative sum of the sizes of all node sets $\mathcal{S}_u, \forall u \in \mathcal{V}$, i.e., $\delta_{u+1} = \delta_u + |\mathcal{S}_u|$, where $|\mathcal{S}_u|$ represents the size of the set $\mathcal{S}_u$. The total number of sampled nodes stored in SpG is $\delta_{n+1}$.

  • indices $I \in \mathbb{Z}^{\delta_{n+1}}$, a coalesced array of all node sets $\mathcal{S}_u, \forall u \in \mathcal{V}$. The segment $I[\delta_u : \delta_{u+1}]$ corresponds to the node indices of the set $\mathcal{S}_u$ in sorted order. This ordering is particularly useful for speeding up the join operation discussed in Sec. 3.3.

  • SFptr $D \in \mathbb{R}^{\delta_{n+1}}$, an array that contains either the values of the structural features $\mathcal{Z}_u$ or the indices of encodings stored in the array $D_{\text{SF}}$. $D_{\text{SF}}$ is introduced to eliminate duplicate encoding of structural features, which typically reside in GPU memory. This two-level indexing can further reduce memory needs when LPs/SPDs are used, as they are likely to have many repeated values, but it is not necessary when using PPR scores since their values tend to be distinct.
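The three arrays listed above, together with the optional two-level indexing, can be illustrated with a small pure-Python sketch. It is a simplified stand-in for SpG rather than the actual implementation: `build_spg`, `spg_row`, and the toy inputs are hypothetical names and data, plain lists replace contiguous arrays, and rows are 0-indexed here whereas the text indexes nodes from 1.

```python
from itertools import accumulate

def build_spg(node_sets, features):
    """Pack per-seed node sets and features into CSR-like arrays.
    node_sets[u] lists the nodes sampled for seed u; features[u][x] is the
    structural feature (a tuple) of node x relative to seed u."""
    rows = [sorted(set(s)) for s in node_sets]  # sorted, duplicate-free rows
    indptr = [0] + list(accumulate(len(r) for r in rows))
    indices, sf_ptr = [], []
    table = {}  # distinct feature tuple -> its row id in d_sf
    for u, row in enumerate(rows):
        for x in row:
            indices.append(x)
            key = tuple(features[u][x])
            # Two-level indexing: store only a pointer into d_sf.
            sf_ptr.append(table.setdefault(key, len(table)))
    d_sf = list(table)  # insertion order matches the assigned row ids
    return indptr, indices, sf_ptr, d_sf

def spg_row(spg, u):
    """Fetch the node set of seed u and its structural features."""
    indptr, indices, sf_ptr, d_sf = spg
    lo, hi = indptr[u], indptr[u + 1]
    return indices[lo:hi], [d_sf[p] for p in sf_ptr[lo:hi]]

# Hypothetical toy input: 3 seeds with 2-dimensional features.
node_sets = [[0, 1, 2], [0, 1], [0, 1, 2]]
features = [
    {0: (1.0, 0.0), 1: (0.0, 0.5), 2: (0.0, 0.5)},
    {0: (0.0, 1.0), 1: (1.0, 0.0)},
    {0: (0.0, 0.5), 1: (0.0, 0.5), 2: (1.0, 0.0)},
]
spg = build_spg(node_sets, features)
nodes, feats = spg_row(spg, 1)
```

In this toy input, eight stored entries share only three distinct feature vectors, so `d_sf` has three rows while `sf_ptr` holds eight small integers. Fetching a row costs two lookups in `indptr` plus one contiguous slice, regardless of how uneven the set sizes are.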

Table 1. Complexity comparison of GRL models. Suppose using $O(|\mathcal{E}|)$-many queries, SGRLs use partial edges ($q \ll |\mathcal{E}|$) for training. $S$ and $K$ denote the average size of extracted subgraphs and sampled node sets, respectively. $L$ is the number of layers. $d$ and $k$ are respective dimensions of node and structural features. Assume $d$ is fixed for all layers. Both SUREL and SUREL+ use the walk-based sampler for $M$-many $m$-step walks. $c$ is the number of distinct $k$-dim structural features. $\delta_{n+1}$ is the size sum of all node sets.
Methods | GNN (Kipf and Welling, 2017) | SEAL (Zhang and Chen, 2018; Zhang et al., 2021) | SUREL (Yin et al., 2022) | SUREL+
Structure | $O(|\mathcal{V}| + |\mathcal{E}|)$ | $O(S|\mathcal{E}|)$ | $O(mM|\mathcal{V}|)$ | $O(\delta_{n+1})$
Feature | $O(d|\mathcal{V}|)$ | $O(kS|\mathcal{E}|)$ | $O(\delta_{n+1} * k)$ | $O(\delta_{n+1} + c*k)$
Time | $O(|\mathcal{E}|Ld + |\mathcal{E}|Ld^2)$ | $O(qS^{L}d^2)$ | $O(qmMd^2)$ | $O(qKd^2)$

Regarding the cost of SpG, the indptr array is of size $|\mathcal{V}|+1$, and the size of both the indices and SFptr arrays is $\delta_{n+1}$. The compressed encoding array $D_{\text{SF}}$ has a size of $c*k$, where $c$ is the number of distinct structural features and $k$ denotes the feature dimension. The overall complexity of SpG is $O(|\mathcal{V}|+\delta_{n+1}+c*k)$.
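
As an illustration of the array sizes stated above, the following is a minimal sketch of a CSR-like layout with two-level feature indexing. It is not the SUREL+ implementation: the function names `build_spg` and `lookup`, the dictionary input format, and the use of `np.unique` to deduplicate feature vectors are all assumptions made for this example.

```python
import numpy as np


def build_spg(node_sets, num_nodes):
    """Pack per-seed node sets into CSR-like arrays with two-level feature indexing.

    node_sets maps a seed u to {node x: structural feature tuple of length k}.
    (Hypothetical input format, chosen only for this sketch.)
    """
    indptr = np.zeros(num_nodes + 1, dtype=np.int64)  # size |V| + 1
    indices, feats = [], []
    for u in range(num_nodes):
        members = sorted(node_sets.get(u, {}).items())  # unique, sorted node ids
        indptr[u + 1] = indptr[u] + len(members)
        for x, z in members:
            indices.append(x)
            feats.append(z)
    feats = np.asarray(feats, dtype=np.float32)
    # Store each distinct feature vector once (D_SF) and keep a pointer per entry (SFptr).
    d_sf, sf_ptr = np.unique(feats, axis=0, return_inverse=True)
    return indptr, np.asarray(indices, dtype=np.int64), sf_ptr.reshape(-1), d_sf


def lookup(spg, u):
    """Return the node ids and structural features stored for seed u."""
    indptr, indices, sf_ptr, d_sf = spg
    lo, hi = indptr[u], indptr[u + 1]
    return indices[lo:hi], d_sf[sf_ptr[lo:hi]]


# Toy data: 3 nodes, k = 2, only two distinct feature vectors (c = 2).
toy_sets = {
    0: {0: (1.0, 0.0), 1: (0.5, 0.5)},
    1: {2: (0.5, 0.5), 0: (0.5, 0.5), 1: (1.0, 0.0)},
}
spg = build_spg(toy_sets, num_nodes=3)
```

In this toy case the indices and pointer arrays each hold five entries (the size sum of all node sets), while the feature array holds only the two distinct vectors.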

Comparison with Other Methods. Table 1 summarizes the space and time complexity of GRL methods. By adopting the walk-based sampler (sampling $M$-many $m$-step walks), $\delta_{n+1}$ amounts to around one-fifth of the $O(mM|\mathcal{V}|)$ space used by SUREL. The metric-based sampler (sampling top-$K$ PPR scores) results in $\delta_{n+1}=K|\mathcal{V}|$ and $K<mM$ in general. Both values are substantially lower than the $O(S|\mathcal{E}|)$ used by SEAL, where $S$ is the average size of extracted subgraphs. SUREL+ further reduces the memory footprint when the two-level indexing is employed for hosting structural features and only distinct values are stored in $D_{\text{SF}}$. In practice, $c$ typically remains independent of $|\mathcal{V}|$. SpG enables SUREL+ to handle SGRL tasks more efficiently on large-scale graph data.

3.3. Joining Node Sets via Sparse Operations

The goal of joining node sets is to connect queried nodes and construct query-level subgraphs from pre-sampled node sets to make predictions. For a given query $Q$, we merge the relevant node sets $\mathcal{S}_u, \forall u \in Q$ into $\mathcal{S}_Q = \bigcup_{u \in Q} \mathcal{S}_u$ and join their node-level structural features $\mathcal{Z}_u$ to the query level. In essence, query-level structural features $\mathcal{Z}_Q$ record the relative position of each node $x \in \mathcal{S}_Q$ with respect to the set of queried nodes $Q$ (equivalently labeling the query-induced subgraph $\mathcal{G}_Q$). Specifically, for a node $x$ in $\mathcal{S}_Q$, the query-level structural feature $\mathcal{Z}_{Q,x}$ is obtained by merging its node-level ones $\mathcal{Z}_{u,x}$ for all queried nodes $u$ in $Q$ as

$$\mathcal{Z}_{Q,x} = ||_{u \in Q}\, \mathcal{Z}_{u,x} = [\dots \mathcal{Z}_{u,x} \dots] \in \mathbb{R}^{|Q| \times k}, \tag{1}$$

where $||$ denotes concatenation. In cases where $\mathcal{Z}_{u,x}$ does not exist because node $x \notin \mathcal{S}_u$, it is set to all zeros. For instance, in Fig. 5, node $b$ is in $\mathcal{S}_v$ but not in $\mathcal{S}_u$, hence $\mathcal{Z}_{u,b}$ is set to zero. $\mathcal{Z}_Q$ is the collection of $\mathcal{Z}_{Q,x}, \forall x \in \mathcal{S}_Q$. Together, $\mathcal{S}_Q$ and $\mathcal{Z}_Q$ function as the query-induced subgraph $\mathcal{G}_Q$, which is later fed into the neural encoder to obtain the query-level readout for prediction.
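
The concatenation with zero filling in Eq. (1) can be made concrete with a small reference sketch. The function name `naive_join`, the dictionary-of-dictionaries input, and the toy features are assumptions for illustration only; this is a straightforward per-node lookup, not the sparse operator used by SUREL+.

```python
import numpy as np


def naive_join(query, node_feats, k):
    """Reference version of Eq. (1): stack Z_{u,x} over u in Q, zero-filled when x is not in S_u.

    node_feats maps each queried node u to {x: Z_{u,x}} (hypothetical input format).
    Returns the sorted union S_Q and an array of shape (|S_Q|, |Q|, k).
    """
    union = sorted(set().union(*(node_feats[u] for u in query)))
    z_q = np.zeros((len(union), len(query), k), dtype=np.float32)
    for i, x in enumerate(union):
        for j, u in enumerate(query):
            if x in node_feats[u]:  # otherwise the slot keeps its all-zero default
                z_q[i, j] = node_feats[u][x]
    return union, z_q


# Toy query Q = {u, v}: node b is sampled from v but not from u.
node_feats = {
    "u": {"a": [1.0, 0.0], "c": [0.5, 0.5]},
    "v": {"a": [0.5, 0.5], "b": [1.0, 0.0]},
}
union, z_q = naive_join(("u", "v"), node_feats, k=2)
```

Here `z_q[1]` holds the feature of node `b`: its first slot (for seed `u`) stays all zeros, and flattening the last two axes gives the concatenated vector of length $|Q| \times k$.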

The JOIN operator in databases is used to merge tables and establish connections. Concatenation in Eq. (1) requires matching among different node sets with varying sizes and arbitrary node orders, a task for which an outer JOIN is well suited. In this case, the JOIN operator returns associated values from target sets based on node indices as the specified common field, regardless of their existence. To obtain $\mathcal{Z}_{Q,x}$, node sets $\{\mathcal{S}_u\}_{u \in Q}$ are treated as tables: for each $u$ in $Q$, if the index of node $x$ matches one of the node indices in $\mathcal{S}_u$, then the associated structural feature $\mathcal{Z}_{u,x}$ is appended; otherwise, the field is filled with zeros. However, iterating over all $\mathcal{S}_u$'s to retrieve $\mathcal{Z}_{u,x}$ for each node $x \in \mathcal{S}_Q$ is highly inefficient, as its complexity can be $O(|Q| * |\mathcal{S}_Q|^2)$ per query. This becomes even more challenging when performing these operations for massive queries with varying sizes of $\mathcal{S}_u$ and $\mathcal{S}_Q$.

To tackle this issue, we design an efficient arithmetic operator SpJoin to perform joins on sparse data objects of SpG in parallel. This operator reduces the per-query time complexity to $O(|Q| * |\mathcal{S}_Q|)$ by exploiting the fact that the node indices of $\mathcal{S}_u$ stored in SpG are unique and sorted. The following demonstrates the use of SpJoin for a query $Q = \{u, v\}$.

Sparse Join Operator. The operator SpJoin performs an outer JOIN for query $Q$ on the sampled node sets from seeds $u$ and $v$ stored in SpG as $\texttt{SpG}[u,:]$ and $\texttt{SpG}[v,:]$ through

$$\texttt{SpJoin}(\texttt{SpG}[u,:], \texttt{SpG}[v,:]) = \texttt{SpAdd}(\texttt{mask}, \texttt{SpG}[u,:]) - 1 ~||~ \texttt{SpAdd}(\texttt{mask}, \texttt{SpG}[v,:]) - 1,$$

where $\texttt{mask} = \texttt{bool}(\texttt{SpAdd}(\texttt{SpG}[u,:], \texttt{SpG}[v,:]))$.

As illustrated in Fig. 5, SpJoin consists of three steps:

  1. It utilizes sparse arithmetic operations from SciPy (Virtanen et al., 2020): SpAdd performs an element-wise addition ($X \oplus Y$) of the non-zero elements in $X$ and $Y$; the resulting values are converted to binary via the bool operator and saved in the mask, which corresponds to the node indices of the union set $\mathcal{S}_Q$.

  2. SpAdd is applied between mask and each $\texttt{SpG}[u,:], \forall u \in Q$, followed by the reduction '-1', which explicitly adds missing values (all zeros by default) to structural features $\mathcal{Z}_{u,x}$ for every $x$ with $x \notin \mathcal{S}_u$ but $x \in \mathcal{S}_Q$.

  3. When the two-level indexing is enabled, the results of SpJoin are pointers saved in SFptr, which can be used to gather the values of structural features $\mathcal{Z}_Q$ from the array $D_{\text{SF}}$.
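
The three steps can be sketched with SciPy's CSR matrices. This is an illustration under stated assumptions rather than the authors' operator: ordinary `csr_matrix` addition stands in for SpAdd, the stored values are assumed to be pointers that start at 1, and row 0 of the feature array `d_sf` is assumed to be reserved for the all-zero vector so that a pointer of 0 denotes a missing entry. The name `sp_join` and the toy data are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix


def sp_join(row_u, row_v):
    """Outer-join two 1 x n CSR rows whose stored values are pointers >= 1."""
    # Step 1: the union pattern of both rows, with value 1 at every node of S_Q.
    mask = (row_u + row_v).astype(bool).astype(row_u.dtype)
    joined = []
    for row in (row_u, row_v):
        padded = mask + row      # present entries: ptr + 1, missing entries: 1
        padded.sort_indices()    # make sure both outputs share the same node order
        padded.data -= 1         # present entries: ptr, missing entries: explicit 0
        joined.append(padded)
    nodes = joined[0].indices
    assert np.array_equal(nodes, joined[1].indices)
    # Step 2 result: one pointer column per queried node, shape (|S_Q|, |Q|).
    ptrs = np.stack([p.data for p in joined], axis=1)
    return nodes, ptrs


n = 8  # number of nodes in the toy graph
row_u = csr_matrix((np.array([1, 2, 3], dtype=np.int64),
                    np.array([0, 2, 5]), np.array([0, 3])), shape=(1, n))
row_v = csr_matrix((np.array([2, 4, 1, 3], dtype=np.int64),
                    np.array([2, 3, 5, 7]), np.array([0, 4])), shape=(1, n))
# Row 0 is the reserved zero vector; rows 1..4 are distinct features with k = 2.
d_sf = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.25, 0.75]])

nodes, ptrs = sp_join(row_u, row_v)
z_q = d_sf[ptrs]  # Step 3: gather features, shape (|S_Q|, |Q|, k)
```

Because both padded rows share the sparsity pattern of `mask`, concatenation reduces to stacking their data arrays side by side, which is what keeps the per-query work linear in $|\mathcal{S}_Q|$ in this sketch.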

Multithreading is employed to leverage the single-program-multiple-data pattern in the arithmetic operations of SpJoin. Since the processing time of each query depends linearly on the size of $\mathcal{S}_Q$, we further divide the queries of each training batch into groups with nearly balanced sums of $|\mathcal{S}_Q|$'s, and assign one thread per group to mitigate potential delays caused by uneven workloads.
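
One simple way to obtain groups with nearly balanced sums is a greedy longest-first assignment, sketched below. The text does not state which partitioning rule SUREL+ uses, so this heuristic, the function name `balance_queries`, and the toy sizes are assumptions for illustration.

```python
import heapq


def balance_queries(sizes, num_groups):
    """Greedily assign queries (largest |S_Q| first) to the currently lightest group."""
    heap = [(0, g) for g in range(num_groups)]  # (current load, group id)
    groups = [[] for _ in range(num_groups)]
    for q in sorted(range(len(sizes)), key=lambda i: -sizes[i]):
        load, g = heapq.heappop(heap)           # lightest group so far
        groups[g].append(q)
        heapq.heappush(heap, (load + sizes[q], g))
    return groups


sizes = [9, 7, 6, 5, 4, 3]                      # toy |S_Q| values for six queries
groups = balance_queries(sizes, num_groups=2)
loads = [sum(sizes[q] for q in grp) for grp in groups]
```

For these toy sizes both groups end up with a total of 17, so two threads would finish at about the same time if runtime is proportional to $|\mathcal{S}_Q|$.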

Figure 5. An illustration of joining structural features from node level to query level (Eq. (1)) via the SpJoin operator on node sets stored in SpG. Note that SpG objects do not physically store the $0$ entries shown above in $\texttt{SpG}[u,:]$ and $\texttt{SpG}[v,:]$. Only the non-zero elements in grey blocks are stored in SpG and take part in the arithmetic operations of SpJoin. The half-grey blocks correspond to added missing values.

Comparison with SUREL (Yin et al., 2022). SUREL adopts a hash-based join operator to construct query-level structural features, but its overall computation and memory cost is much higher than that of SUREL+. This is due to the presence of numerous repeated nodes in walks, depicted in Fig. 3 (b). The set-based input of SUREL+ substantially reduces the workload of transferring data from CPU to GPU and also requires fewer per-query operations on GPU to process the transmitted $\mathcal{Z}_Q$'s. As Table 1 shows, SUREL+ reduces time complexity from $O(mM)$ to $O(K)$, where $K<mM$ is the average size of sampled node sets. These advantages ultimately enable SUREL+ to achieve superior performance in terms of efficiency and scalability.

3.4. Set Neural Encoders

After joining node sets for each query $Q$, the resulting $(\mathcal{S}_Q, \mathcal{Z}_Q)$ acts as the query-induced subgraph $\mathcal{G}_Q$ and is then fed into a neural encoder for prediction. The mini-batch training procedure of multiple queries is summarized in Algorithm 1. Next, we introduce neural encoders supported by SUREL+.

The adopted neural encoders are simple. For each $(\mathcal{S}_Q, \mathcal{Z}_Q)$,

$$h_Q = \texttt{AGGR}\left(\{enc(\mathcal{Z}_{Q,x}) \mid x \in \mathcal{S}_Q\}\right) \in \mathbb{R}^{d}. \tag{2}$$

Here, $enc(\cdot)$ encodes query-level structural features $\mathcal{Z}_{Q,x}$ using a multi-layer perceptron (MLP). If node attributes are present, they can be appended after structural features as $\mathcal{Z}_{Q,u} || X_u$. AGGR is used to aggregate the encoded features, which can be any neural encoder applicable to sets such as mean/sum/max pooling or set transformers. Currently, SUREL+ supports implementations of AGGR with mean pooling, LSTM (Hamilton et al., 2017a), and attention (Veličković et al., 2018). Note that the LSTM applies random permutations to the elements in the set before encoding them as a sequence, while the attention first computes soft attention scores based on the output of $enc(\cdot)$ for each set element and then performs attention-score-weighted pooling. Sec. 4.4 empirically demonstrates that the choice of AGGR has non-trivial effects on prediction performance. Lastly, a fully connected layer takes the readout $h_Q$ as input to make the final prediction $\hat{y}_Q$. In our experiments, all SGRL tasks are formulated as binary classification, and thus Binary Cross Entropy is used as the loss function $\mathcal{L}$.
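
A forward pass of Eq. (2) can be sketched in NumPy for the mean-pooling and attention variants. This is a toy illustration, not the SUREL+ model: the 2-layer ReLU MLP, the single scoring vector `attn_vec` used to produce attention scores, and all shapes and random weights are assumptions.

```python
import numpy as np


def mlp(x, w1, b1, w2, b2):
    """A 2-layer MLP with ReLU, playing the role of enc(.) in Eq. (2)."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2


def readout(z_q, params, attn_vec=None):
    """Aggregate encoded set elements into h_Q by mean or attention-weighted pooling."""
    h = mlp(z_q, *params)                    # shape (|S_Q|, d)
    if attn_vec is None:
        return h.mean(axis=0)                # mean pooling
    scores = h @ attn_vec                    # one soft attention score per set element
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ h                       # attention-score-weighted pooling


rng = np.random.default_rng(0)
# Toy sizes: |Q| * k = 4 input features, hidden width 8, readout dimension d = 3.
params = (rng.normal(size=(4, 8)), np.zeros(8), rng.normal(size=(8, 3)), np.zeros(3))
z_q = rng.normal(size=(5, 4))                # five nodes in S_Q
h_mean = readout(z_q, params)
h_attn = readout(z_q, params, attn_vec=np.ones(3))
```

Both variants are invariant to the order of the rows of `z_q`, which is the property a set encoder needs; an LSTM aggregator lacks it, which is why the text mentions random permutations.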

Input: Given a graph $\mathcal{G}(\mathcal{V}, \mathcal{E}, X)$, a group of queries $\{(Q, y_Q)\}$ for training, batch size $B$, a set SAMPLER, a structure ENCODER, and a set AGGR
Output: A neural network for encoding subgraphs $enc(\cdot)$
Preprocessing: SAMPLER and ENCODER $\to (\mathcal{S}_u, \mathcal{Z}_u)$ for all $u \in \mathcal{V}$; convert and save $(\mathcal{S}_u, \mathcal{Z}_u)$'s as SpG objects.
for each mini-batch $\mathcal{Q}_B = \{\dots, Q, \dots\}$ do
      Generate negative training queries (if not given) $\{\dots, \bar{Q}, \dots\}$ by random sampling and put them into $\mathcal{Q}_B$;
      Call SpJoin operator to perform joining on SpG objects $\{(\mathcal{S}_u, \mathcal{Z}_u) \mid u \in Q\}$ for all queries $Q \in \mathcal{Q}_B$ in parallel;
      Encode the joined results $(\mathcal{S}_Q, \mathcal{Z}_Q)$ as proxy of subgraphs via Eq. (2) with specified AGGR and get the prediction $\hat{y}_Q$ from readout $h_Q$ by multithreads;
      Backward propagation based on the loss $\mathcal{L}(\hat{y}_Q, y_Q)$.
end for
Algorithm 1 The mini-batch training pipeline of SUREL+

4. Evaluation

In this section, we aim to answer the following questions:

  • Regarding space and time complexity, how much improvement can SUREL+ achieve by adopting node sets instead of walks compared to the SOTA framework SUREL?

  • Can SUREL+ provide comparable prediction performance to all baselines using or not using subgraph-based methods?

  • How sensitive is SUREL+ to choices of different set samplers, structure encoders, and set neural encoders?

  • How do the sparse storage SpG and the parallelism in the SpJoin operator perform, and how do they benefit the overall performance of SUREL+?

4.1. Experiment Setup

Extensive experiments have been performed to evaluate SUREL+ using nine homogeneous, heterogeneous, and higher-order homogeneous graphs on three types of tasks: link prediction, relation type prediction, and higher-order pattern prediction. A homogeneous graph is a graph that does not contain node/link types, while a heterogeneous graph includes various node/link types. In our setting, higher-order graphs are hypergraphs consisting of hyperedges connecting two or more nodes.

Datasets. Table 2 summarizes the statistics of the datasets used to benchmark SGRL methods. Five datasets are selected from the Open Graph Benchmark (OGB; Hu et al., 2020) for link and relation type prediction, including social networks of citation (citation2) and collaboration (collab); biological networks of protein interaction (ppa) and blood vessels (vessel); and one heterogeneous academic network, ogb-mag, which contains node types of paper (P) and author (A) and their extracted relations. The vessel dataset is a large ($>$3M nodes), sparse, biological graph recently constructed from mouse brains (Paetzold et al., 2021), and has unique significance for examining GRL in scientific discovery. The structure of vessels illustrates the spatial organization of the brain's microvasculature, which can be used for early detection of neurological disorders, e.g., Alzheimer's and stroke. Two hypergraph datasets collected by Benson et al. (2018) are used for higher-order pattern prediction: DBLP-coauthor is a temporal hypergraph, where each hyperedge denotes a time-stamped paper connecting all its authors; tags-math contains groups of tags applied to questions on the website math.stackexchange.com as hyperedges. For higher-order pattern prediction tasks, the main computation bottleneck is the number of hyperedges, each of which may connect more than two nodes. Two industry-level graphs, criteo-click with 16.5M records of online banner ad clicks (Diemert et al., 2017) and twitter-2010 with 1.5B user following relations (Kwak et al., 2010), are used to examine model scalability for real-world applications.

Settings. For link prediction, OGB's standard data split is used to isolate validation and test links from the input graph. For prediction tasks of relation type and higher-order pattern, the same procedure to prepare graph data is adopted as in (Yin et al., 2022): the relations of paper-author (P-A, "written by") and paper-paper (P-P, "cited by") are selected; higher-order queries in hypergraph datasets are node triplets, where the goal is to predict whether the triplet will form a hyperedge given that two of them have observed pairwise connections; to learn the representation on hypergraphs, we project hyperedges into cliques and treat the projection results as ordinary graphs. All experiments are run 10 times independently, and we report the mean performance and standard deviation.
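
The projection of hyperedges into cliques mentioned above can be illustrated as follows. The function name `clique_expand` and the toy hyperedges are hypothetical; the sketch only shows the standard clique expansion, in which every pair of nodes within a hyperedge becomes an ordinary undirected edge.

```python
from itertools import combinations


def clique_expand(hyperedges):
    """Project each hyperedge onto all pairwise edges among its nodes (deduplicated)."""
    edges = set()
    for hyperedge in hyperedges:
        for a, b in combinations(sorted(set(hyperedge)), 2):
            edges.add((a, b))  # stored once with a < b, i.e. as an undirected edge
    return sorted(edges)


# Toy hypergraph: one 3-node hyperedge and one 2-node hyperedge sharing node 3.
projected = clique_expand([(1, 2, 3), (3, 4)])
```

A hyperedge with $n$ nodes contributes up to $n(n-1)/2$ projected edges, which is consistent with the projected edge counts in Table 2 differing from the hyperedge counts.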

Table 2. Summary Statistics for Evaluation Datasets.
| Dataset | Type | #Nodes | #Edges | Split(%) |
| --- | --- | --- | --- | --- |
| criteo-click | Homo./Bipartite | Campaign: 675; User: 6,142,256 | 16,468,027 | 97/1.5/1.5 |
| twitter-2010 | Homo./Social. | 41,652,230 | 1,468,364,884 | 99.98/0.01/0.01 |
| citation2 | Homo./Social. | 2,927,963 | 30,561,187 | 98/1/1 |
| collab | Homo./Social. | 235,868 | 1,285,465 | 92/4/4 |
| ppa | Homo./Bio. | 576,289 | 30,326,273 | 70/20/10 |
| vessel | Homo./Bio. | 3,538,495 | 5,345,897 | 80/10/10 |
| ogb-mag | Hetero. | (P): 736,389; (A): 1,134,649 | P-A: 7,145,660; P-P: 5,416,271 | 99/0.5/0.5 |
| tags-math | Higher. | 1,629 | projected: 91,685; hyperedges: 822,059 | 60/20/20 |
| DBLP-coauthor | Higher. | 1,924,991 | projected: 7,904,336; hyperedges: 3,700,067 | 60/20/20 |

Table 3. Prediction Performance for Links, Relation Types and Higher-Order Patterns: the best (bold) and the second best (underlined).
| Models | citation2 | click | twitter | collab | ppa | vessel |
| --- | --- | --- | --- | --- | --- | --- |
| Metric | MRR (%) | MRR (%) | MRR (%) | Hits@50 (%) | Hits@100 (%) | ROC-AUC |
| GCN | 84.74±0.21 | 5.31±0.17 | OOM | 44.75±1.07 | 18.67±1.32 | 43.53±9.61 |
| GraphSAINT | 79.85±0.40 | 2.86±0.63 | 4.12±0.73 | 53.12±0.52 | 3.83±1.33 | 47.14±6.83 |
| GDGNN | 86.96±0.28 | 13.30±0.45 | 49.86±0.39 | 54.74±0.48 | 45.92±2.14 | 75.84±0.08 |
| SEAL | 87.67±0.32 | OOM | OOM | 63.64±0.71 | 48.80±3.16 | 80.50±0.21 |
| SUREL | 89.74±0.18 | 40.39±0.61 | OOM | 63.34±0.52 | 53.23±1.03 | 86.16±0.39 |
| SUREL+ | 88.90±0.06 | 60.87±0.15 | 55.67±0.67 | 64.10±1.06 | 54.32±0.44 | 85.73±0.88 |

| Models | MAG(P-A) | MAG(P-P) | tags-math | DBLP-coauthor |
| --- | --- | --- | --- | --- |
| Metric | MRR (%) | MRR (%) | MRR (%) | MRR (%) |
| H*GCN | 39.43±0.29 | 57.43±0.30 | 51.64±0.27 | 37.95±2.59 |
| H*SAGE | 25.35±1.49 | 60.54±1.60 | 54.68±2.03 | 22.91±0.94 |
| R-GCN | 37.10±1.05 | 56.82±4.71 | - | - |
| SUREL | 45.33±2.94 | 82.47±0.26 | 71.86±2.15 | 97.66±2.89 |
| SUREL+ | 58.81±0.42 | 80.45±0.13 | 77.73±0.16 | 99.83±0.02 |

Table 4. Breakdown of Runtime, Memory Consumption for Different Models on Prediction of Link, Relation Type, and Higher-order Pattern. The column Train records the runtime per 10K queries.
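| Dataset | Models | Prep. (s) | Train (s) | Inf. (s) | RAM (GB) | SDRAM (GB) |
| --- | --- | --- | --- | --- | --- | --- |
| criteo-click | GCN | 3 | 0.085 | 8 | 3.1 | 62.74 |
| criteo-click | GraphSAINT | 1 | 0.012 | 20 | 13.1 | 8.79 |
| criteo-click | GDGNN | 215 | 1.43 | 2,928 | 16.2 | 23.77 |
| criteo-click | SEAL | - | - | - | OOM | - |
| criteo-click | SUREL | 2 | 1.59 | 2,307 | 11.7 | 16.25 |
| criteo-click | SUREL+ | 22 | 0.23 | 502 | 10.4 | 11.93 |
| twitter-2010 | GCN | - | - | - | - | OOM |
| twitter-2010 | GraphSAINT | 111 | 0.009 | 920 | 253 | 76.60 |
| twitter-2010 | GDGNN | 1204 | 1.84 | 9,744 | 188 | 79.34 |
| twitter-2010 | SEAL | - | - | - | OOM | - |
| twitter-2010 | SUREL | - | - | - | OOM | - |
| twitter-2010 | SUREL+ | 327 | 0.26 | 3,779 | 210 | 49.44 |
| citation2 | GCN | 17 | 21.74 | 105 | 9.3 | 36.84 |
| citation2 | GraphSAINT | 151 | 1.79 | 107 | 9.6 | 9.78 |
| citation2 | GDGNN | 338 | 2.26 | 5,460 | 40.6 | 16.96 |
| citation2 | SEAL | 46 | 3.52 | 24,626 | 35.4 | 5.71 |
| citation2 | SUREL | 151 | 4.14 | 6,081 | 25.1 | 9.68 |
| citation2 | SUREL+ | 130 | 0.35 | 1,389 | 16.7 | 4.75 |
| ppa | GCN | 2 | 0.026 | 1.2 | 4.6 | 11.35 |
| ppa | GraphSAINT | 10 | 0.003 | 1.5 | 4.9 | 23.06 |
| ppa | GDGNN | 127 | 1.77 | 902 | 21.1 | 10.27 |
| ppa | SEAL | 46 | 10.57 | 3,988 | 9.5 | 12.13 |
| ppa | SUREL | 31 | 2.68 | 1,429 | 13.6 | 31.01 |
| ppa | SUREL+ | 69 | 0.72 | 201 | 9.8 | 19.02 |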

Baselines. We consider two classes of baselines. Canonical GNNs: GCN (Kipf and Welling, 2017), GraphSAGE (Hamilton et al., 2017a), GraphSAINT (Zeng et al., 2020) and their variants with the prefix ‘H*’ that are directly applied to heterogeneous graphs with node types and to hypergraphs through clique expansion. R-GCN (Schlichtkrull et al., 2018) performs relational message passing on heterogeneous graphs. SGRL Models: SEAL (Zhang and Chen, 2018; Zhang et al., 2021), GDGNN (Kong et al., 2022), and SUREL (Yin et al., 2022). SEAL adopts online subgraph sampling due to its intractable space needs for offline extraction. Fig. 3 (a) compares the time cost for subgraph sampling across different SGRL methods. We use all baselines’ official implementations with tuned hyperparameters to match their reported results.

Hyperparameters. By default, SUREL+ uses the walk-based sampler, the structural encoder LP, and the better set neural encoder tuned between mean pooling and attention. SUREL+ adopts a 2-layer MLP as $enc(\cdot)$ in Eq. (2), followed by a 2-layer MLP classifier that maps set-aggregated readouts to final predictions. Default training hyperparameters: learning rate lr=1e-3 with early stopping of 5 epochs, dropout p=0.1, Adam (Kingma and Ba, 2015) as the optimizer. The parameters $M$ and $m$ that control the walk-based sampler, the parameter $K$ that controls the metric-based sampler, and the selection of structure encoders and set neural encoders are studied in Sec. 4.4.

Evaluation Metrics. The evaluation metrics include Hits@P, Mean Reciprocal Rank (MRR), and Area Under Curve (ROC-AUC). Hits@P counts the ratio of positive samples ranked in the top-P places against negative ones. MRR first computes the inverse of the rank of the first correct prediction and then takes the average of the obtained reciprocal ranks over a sample of queries. For all datasets adopting MRR, each positive query is paired with 1000 randomly sampled negative test queries, except tags-math using 100 and criteo-click using 650. ROC-AUC follows the standard definition to measure the model’s performance in binary classification.
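
The two ranking metrics can be sketched in NumPy as follows. This is an illustrative reading of the definitions above, not the official evaluator: the function names, the strict-inequality handling of ties, and the convention that Hits@P uses a shared pool of negative scores are assumptions.

```python
import numpy as np


def mrr(pos_scores, neg_scores):
    """Mean reciprocal rank; neg_scores[i] holds the negatives paired with positive i."""
    # Rank of each positive = 1 + number of its negatives scored strictly higher.
    ranks = 1 + (neg_scores > pos_scores[:, None]).sum(axis=1)
    return float((1.0 / ranks).mean())


def hits_at(pos_scores, neg_scores, p):
    """Fraction of positives scored above the p-th highest negative (shared negatives)."""
    if len(neg_scores) < p:
        return 1.0
    kth_best_negative = np.sort(neg_scores)[-p]
    return float((pos_scores > kth_best_negative).mean())


# Toy scores: the first positive outranks all of its negatives, the second ranks third.
toy_mrr = mrr(np.array([0.9, 0.4]),
              np.array([[0.1, 0.5, 0.3], [0.6, 0.5, 0.2]]))
# Two of three positives beat the 2nd-highest negative (0.6).
toy_hits = hits_at(np.array([0.9, 0.4, 0.7]), np.array([0.8, 0.6, 0.5, 0.1]), p=2)
```

In the toy example the reciprocal ranks are 1 and 1/3, and two of the three positives count as hits, so both metrics evaluate to 2/3.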

Environment. We use a server with two Intel Xeon Gold 6248R CPUs, 512GB DRAM, and an NVIDIA A100 (80GB) GPU. SUREL+ is built on PyTorch 1.12 and PyG 2.2. Set samplers are implemented in C, OpenMP, NumPy, Numba, and uhash, integrated into Python scripts; SpG is customized based on the CSR format of SciPy (Virtanen et al., 2020).

4.2. Prediction Accuracy Comparison

Table 3 shows the prediction performance of different methods. SGRL models significantly outperform canonical GNNs on these six link prediction benchmarks, especially on two challenging biological datasets ppa and vessel. Predicting links in biological datasets requires richer structural information that canonical GNNs have limited expressive power to capture. Within SGRL models, SUREL+ achieves comparable performance to SUREL and outperforms SEAL, which validates the effectiveness of the proposed set-based representation for subgraphs. For predictions of relation type and higher-order pattern, we observe additional performance gains (+2$\sim$13%) from SUREL+ compared to SUREL on three of the four datasets. A large performance gap exists between canonical GNNs and SGRL models, particularly in the higher-order case. This demonstrates the inherent limitations of canonical GNNs to make predictions of complex relations involving multiple nodes.

4.3. Efficiency and Scalability Analysis

Improved Efficiency in Training and Inference.

Table 4 compares model runtime and memory usage on the four largest benchmarks. SUREL+ offers a reasonable training time compared with canonical GNNs. It shows clear improvement in inference compared to the current SOTA framework SUREL (3-11$\times$ speedups across all datasets) and its predecessor SEAL ($\sim$20$\times$ speedups). SUREL+ achieves comparable and even lower RAM usage than canonical GNNs. Compared to other SGRL models, it can save up to half of RAM with lower usage of GPU SDRAM. This is attributed to set-based subgraphs eliminating node duplicates with structural features, which is further echoed by the analysis in Table 1 and the empirical results in Table 4. The key factor that scales SUREL+ to billion-size graphs is its set-based subgraph with the sparse design, while GCN (full adjacency matrix), SEAL (complex subgraph extraction), and SUREL (dense walks with duplicate nodes) all run out of memory (OOM) on twitter-2010.

Profiling Different Strategies for Offline Processing

Fig. 6(a) reports the time cost of different samplers with multithreading on citation2. Fig. 6(b) shows memory consumption to store different types of sampled data (walks in SUREL (Yin et al., 2022) or sets in SUREL+) and associated structural features (LPs, SPDs, PPR scores). Compared to the SUREL sampler, the walk-based sampler in SUREL+ is more efficient and only adds one extra minute for encoding and converting data to SpG format (slash/dash marked in Fig. 6(a)), while achieving $6.94\times$, $3.63\times$ and $4.12\times$ memory savings on three OGB datasets, respectively. Those savings are crucial for model scalability as they reduce data transfer from CPU to GPU and reduce GPU operations on duplicate nodes. These two factors dominate the online stage and thus lead to improved memory usage and runtime of SUREL+ in Table 4. In addition, the PPR-based sampler has better scaling performance with more threads. When PPR scores or SPDs are used as structural features, SUREL+ further reduces the memory footprint, though they often slightly harm prediction performance.

Note that, in the above comparison of memory cost, techniques of compressing structural features are adopted both in SUREL (locally) and SUREL+ (globally). When LPs are used as structural features, the two-level indexing in SpG achieves compression of $493\times$, $11318\times$, $19527\times$ on three datasets listed in Fig. 6(b).

Scaling Analysis for SpJoin

Fig. 7 shows the speedups and throughput of the SpJoin operator for constructing query-level structural features via multithreading, where the walk join operation of SUREL is used for comparison. SUREL employs a hash-based search for joining walks, which has unfavorable memory access patterns and suffers from imbalanced workloads due to inconsistent searching times across different threads. SUREL+ gains more benefits from multithreading, thanks to sparse arithmetic operations and batch-wise load balancing used in SpJoin.

Figure 6. Comparison of Runtime, Memory Consumption across Different Offline Processing Strategies (the walk-based sampler: $m=4, M=200$; the metric-based sampler: $K=150$). Panels: (a) Runtime; (b) Memory. The highlighted areas break down the total consumption w.r.t. (a) sampling, structure encoding, sparse object construction; (b) structural features, node indices/pointers, and sampled walks (SUREL sampler only).
Figure 7. Scaling Performance Comparison of SpJoin in SUREL+ (with average set size $\bar{|S_{u}|}=351$) and Join Walks in SUREL (with walk size $m=4$, $M=200$) against Different Numbers of Threads. Panels: (a) Speedup; (b) Throughput.

4.4. Comparison between Different Set Samplers, Structural Features and Set Neural Encoders

SUREL+ is a modularized framework that supports different set samplers (walk- and metric-based), structural features (LP, SPD, PPR), and set neural encoders AGGR (mean pooling, LSTM, attention).

Table 5 shows the prediction performance and inference runtime for different combinations of structure encoders and set neural encoders. Landing probabilities (LPs) as structural features perform the best on all three OGB datasets while being the slowest for inference. By recording the landing probabilities over different steps of walks, LPs provide structural information at a finer granularity than the scalar values of SPDs and PPR scores. Furthermore, the adopted link prediction task might favor the more local information held by LPs and SPDs over the global information carried by PPR scores. The authors conjecture that other tasks that rely on more global information may favor PPR scores. Among set neural encoders, no single choice always wins. Attention seems to perform the best on average but is slower than mean pooling, and LSTM is the slowest. On the two social networks (citation2 and collab), mean pooling provides comparable prediction results with far fewer parameters. However, prediction on the biological network (ppa) requires more expressive and complicated encoders, where LSTM and attention are favored as they can model more complex interactions between sampled nodes in the union set $\mathcal{S}_Q$.
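As a rough illustration of why LPs carry finer-grained information than a single scalar, the sketch below estimates them empirically from sampled walks: for a seed node, entry (x, s) is the fraction of walks located at node x after s steps. The function name, the toy graph, the inclusion of step 0, and the use of uniform random walks are assumptions made for this example, not details taken from the SUREL+ implementation.

```python
import random
from collections import defaultdict

def landing_probabilities(adj, seed, num_walks, num_steps, rng):
    """Estimate landing probabilities of `seed` from uniform random walks.

    adj: dict mapping node -> list of neighbors (no isolated nodes)
    Returns {node: [p_0, ..., p_m]}, where p_s is the fraction of walks
    located at `node` after s steps. The keys form the sampled node set.
    """
    counts = defaultdict(lambda: [0] * (num_steps + 1))
    for _ in range(num_walks):
        node = seed
        counts[node][0] += 1
        for step in range(1, num_steps + 1):
            node = rng.choice(adj[node])
            counts[node][step] += 1
    return {x: [c / num_walks for c in row] for x, row in counts.items()}

# Toy graph: 200 walks of 4 steps visit 1000 positions in total,
# but the resulting set holds at most 4 unique nodes.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
lp = landing_probabilities(adj, seed=0, num_walks=200, num_steps=4,
                           rng=random.Random(0))
```

Each node ends up with a vector of length m + 1 instead of one number, and each node appears once no matter how many walks pass through it.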

Fig. 8 compares prediction results and inference time under different hyperparameters $m$, $M$, and $K$ of the set samplers, which heavily affect the coverage of sampled neighborhoods and the computation overhead. Performance consistently increases when the walk-based sampler uses a larger $M$, but an improvement is not guaranteed for a larger $m$ (broader exploration). Better coverage with a larger $K$ is usually beneficial for the metric-based sampler on citation2 but not on collab, which is due to the different characteristics of these two datasets and is also observed by Yin et al. (2022). In general, small sampling parameters $m$ (2∼4), $M$ (100∼400), and $K$ (50∼200) yield satisfactory performance with fast inference, achieving a good trade-off between accuracy and efficiency.

Table 5. Prediction Performance and Inference Time of SUREL+ with Different Combinations of Structure Features (LP, SPD, PPR) and Set Neural Encoders (Mean, LSTM, Attn.). The best and the second best are highlighted in bold and underlined accordingly.

| Dataset | Measure | PPR+Mean | SPD+Mean | LP+Mean | LP+LSTM | LP+Attention |
|---|---|---|---|---|---|---|
| citation2 | Performance | 78.59±0.38 | 87.99±1.07 | 88.55±0.15 | 88.46±0.34 | 88.90±0.06 |
| | Inference time | 834s | 1057s | 1389s | 3678s | 2171s |
| collab | Performance | 47.15±0.21 | 62.11±0.13 | 64.10±1.06 | 61.31±1.37 | 62.85±1.19 |
| | Inference time | 1.4s | 1.7s | 2.0s | 3.5s | 2.3s |
| ppa | Performance | 13.28±1.20 | 41.06±1.70 | 46.41±1.65 | 54.45±1.35 | 54.32±0.44 |
| | Inference time | 63s | 126s | 165s | 322s | 201s |
Refer to caption
Refer to caption
Refer to caption
Figure 8. Hyperparameter Analysis of Set Samplers (Prediction Performance v.s. Time Cost). Walk-based: the number M𝑀Mitalic_M and the step m𝑚mitalic_m of walks, LPs as structural features; Metric-based: the set size K𝐾Kitalic_K, PPR scores as structural features.

5. Conclusion

This work proposes a novel framework SUREL+ for scalable subgraph-based graph representation learning. SUREL+ avoids costly subgraph extraction by decoupling it into sampled node sets with structural features, whose join can function as query-induced subgraphs for prediction. SUREL+ benefits from the reusability and compactness of pre-sampled node sets across different queries. Compared to the SOTA framework SUREL, the set-based subgraph of SUREL+ substantially reduces space and time complexity by avoiding heavy node duplication in sampled walks. To handle irregularly sized node sets, SUREL+ designs a customized sparse storage SpG and a sparse join operator SpJoin, providing memory-efficient storage with fast access. In addition, SUREL+ adopts a modular design, enabling users to choose different set samplers, structure encoders, and set neural encoders flexibly based on the nature of their SGRL tasks. Extensive experiments on three types of prediction tasks over nine real-world graph benchmarks show that SUREL+ significantly improves scalability, memory efficiency, and prediction accuracy compared to current SGRL methods and canonical GNNs.

Acknowledgements.
The authors would like to thank Rongzhe Wei and Yanbang Wang for their helpful discussions and valuable feedback. Haoteng Yin and Pan Li are supported by the 2021 JPMorgan Faculty Award, NSF awards OAC-2117997, IIS-2239565.

References

  • Alsentzer et al. (2020) Emily Alsentzer, Samuel Finlayson, Michelle Li, and Marinka Zitnik. 2020. Subgraph neural networks. Advances in Neural Information Processing Systems 33 (2020), 8017–8029.
  • Andersen et al. (2006) Reid Andersen, Fan Chung, and Kevin Lang. 2006. Local graph partitioning using pagerank vectors. In The 47th Annual IEEE Symposium on Foundations of Computer Science. IEEE, 475–486.
  • Benson et al. (2018) Austin R Benson, Rediet Abebe, Michael T Schaub, Ali Jadbabaie, and Jon Kleinberg. 2018. Simplicial closure and higher-order link prediction. Proceedings of the National Academy of Sciences 115, 48 (2018), E11221–E11230.
  • Bojchevski et al. (2020) Aleksandar Bojchevski, Johannes Klicpera, Bryan Perozzi, Amol Kapoor, Martin Blais, Benedek Rózemberczki, Michal Lukasik, and Stephan Günnemann. 2020. Scaling graph neural networks with approximate pagerank. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2464–2473.
  • Bouritsas et al. (2022) Giorgos Bouritsas, Fabrizio Frasca, Stefanos P Zafeiriou, and Michael Bronstein. 2022. Improving graph neural network expressivity via subgraph isomorphism counting. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
  • Cai et al. (2021) Lei Cai, Zhengzhang Chen, Chen Luo, Jiaping Gui, Jingchao Ni, Ding Li, and Haifeng Chen. 2021. Structural temporal graph neural networks for anomaly detection in dynamic graphs. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 3747–3756.
  • Chamberlain et al. (2023) Benjamin Paul Chamberlain, Sergey Shirobokov, Emanuele Rossi, Fabrizio Frasca, Thomas Markovich, Nils Hammerla, Michael M Bronstein, and Max Hansmire. 2023. Graph Neural Networks for Link Prediction with Subgraph Sketching. In International Conference on Learning Representations.
  • Chen et al. (2018) Jie Chen, Tengfei Ma, and Cao Xiao. 2018. Fastgcn: fast learning with graph convolutional networks via importance sampling. In International Conference on Learning Representations.
  • Chen et al. (2020) Zhengdao Chen, Lei Chen, Soledad Villar, and Joan Bruna. 2020. Can graph neural networks count substructures? Advances in Neural Information Processing Systems 33 (2020), 10383–10395.
  • Chiang et al. (2019) Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. 2019. Cluster-gcn: An efficient algorithm for training deep and large graph convolutional networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 257–266.
  • DGL (2022) DGL. 2022. 6.7 Using GPU for Neighborhood Sampling — DGL 0.9.1post1 documentation. https://docs.dgl.ai/guide/minibatch-gpu-sampling.html
  • Diemert et al. (2017) Eustache Diemert, Julien Meynet, Pierre Galland, and Damien Lefortier. 2017. Attribution modeling increases efficiency of bidding in display advertising. In Proceedings of the AdKDD and TargetAd Workshop. ACM, 1–6.
  • Frasca et al. (2022) Fabrizio Frasca, Beatrice Bevilacqua, Michael M Bronstein, and Haggai Maron. 2022. Understanding and Extending Subgraph GNNs by Rethinking Their Symmetries. Advances in Neural Information Processing Systems 35 (2022).
  • Garg et al. (2020) Vikas Garg, Stefanie Jegelka, and Tommi Jaakkola. 2020. Generalization and representational limits of graph neural networks. In International Conference on Machine Learning. PMLR, 3419–3430.
  • Hamilton et al. (2017a) Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017a. Inductive representation learning on large graphs. Advances in Neural Information Processing Systems 30 (2017), 1025–1035.
  • Hamilton (2020) William L Hamilton. 2020. Graph representation learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 14, 3 (2020), 1–159.
  • Hamilton et al. (2017b) William L. Hamilton, Rex Ying, and Jure Leskovec. 2017b. Representation Learning on Graphs: Methods and Applications. IEEE Data Eng. Bull. 40, 3 (2017), 52–74.
  • Hu et al. (2020) Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. 2020. Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems 33 (2020), 22118–22133.
  • Huang and Zitnik (2020) Kexin Huang and Marinka Zitnik. 2020. Graph meta learning via local subgraphs. Advances in Neural Information Processing Systems 33 (2020), 5862–5874.
  • Jeh and Widom (2003) Glen Jeh and Jennifer Widom. 2003. Scaling personalized web search. In Proceedings of the 12th International Conference on World Wide Web. 271–279.
  • Jumper et al. (2021) John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. 2021. Highly accurate protein structure prediction with AlphaFold. Nature 596, 7873 (2021), 583–589.
  • Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
  • Kipf and Welling (2017) Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations.
  • Koller et al. (2007) Daphne Koller, Nir Friedman, Sašo Džeroski, Charles Sutton, Andrew McCallum, Avi Pfeffer, Pieter Abbeel, Ming-Fai Wong, Chris Meek, Jennifer Neville, et al. 2007. Introduction to statistical relational learning. MIT press.
  • Kong et al. (2022) Lecheng Kong, Yixin Chen, and Muhan Zhang. 2022. Geodesic Graph Neural Network for Efficient Graph Representation Learning. Advances in Neural Information Processing Systems 35 (2022).
  • Kwak et al. (2010) Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. 2010. What is Twitter, a social network or a news media?. In Proceedings of the 19th International Conference on World Wide Web. 591–600.
  • Li et al. (2019) Pan Li, I Chien, and Olgica Milenkovic. 2019. Optimizing generalized pagerank methods for seed-expansion community detection. Advances in Neural Information Processing Systems 32 (2019), 11710–11721.
  • Li et al. (2020) Pan Li, Yanbang Wang, Hongwei Wang, and Jure Leskovec. 2020. Distance Encoding: Design Provably More Powerful Neural Networks for Graph Representation Learning. Advances in Neural Information Processing Systems 33 (2020), 4465–4478.
  • Liu et al. (2020) Xin Liu, Haojie Pan, Mutian He, Yangqiu Song, Xin Jiang, and Lifeng Shang. 2020. Neural subgraph isomorphism counting. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1959–1969.
  • Liu et al. (2022) Yunyu Liu, Jianzhu Ma, and Pan Li. 2022. Neural Predicting Higher-Order Patterns in Temporal Networks. In Proceedings of the Web Conference 2022. ACM, 1340–1351.
  • Lou et al. (2020) Zhaoyu Lou, Jiaxuan You, Chengtao Wen, Arquimedes Canedo, Jure Leskovec, et al. 2020. Neural Subgraph Matching. arXiv preprint arXiv:2007.03092 (2020).
  • Luo and Li (2022) Yuhong Luo and Pan Li. 2022. Neighborhood-aware Scalable Temporal Network Representation Learning. Learning on Graphs Conference (2022).
  • Meng et al. (2018) Changping Meng, S Chandra Mouli, Bruno Ribeiro, and Jennifer Neville. 2018. Subgraph pattern neural networks for high-order graph evolution prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
  • Paetzold et al. (2021) Johannes C Paetzold, Julian McGinnis, Suprosanna Shit, Ivan Ezhov, Paul Büschl, Chinmay Prabhakar, Anjany Sekuboyina, Mihail Todorov, Georgios Kaissis, Ali Ertürk, et al. 2021. Whole Brain Vessel Graphs: A Dataset and Benchmark for Graph Learning and Neuroscience. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
  • Peng et al. (2022) Jingshu Peng, Zhao Chen, Yingxia Shao, Yanyan Shen, Lei Chen, and Jiannong Cao. 2022. Sancus: staleness-aware communication-avoiding full-graph decentralized training in large-scale graph neural networks. Proceedings of the VLDB Endowment 15, 9 (2022), 1937–1950.
  • PyG (2022) PyG. 2022. Accelerating PyG on NVIDIA GPUs. https://www.pyg.org//ns-newsarticle-accelerating-pyg-on-nvidia-gpus
  • Schlichtkrull et al. (2018) Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In European semantic web conference. Springer, 593–607.
  • Srinivasan and Ribeiro (2020) Balasubramaniam Srinivasan and Bruno Ribeiro. 2020. On the equivalence between positional node embeddings and structural graph representations. In International Conference on Learning Representations.
  • Srinivasan et al. (2021) Balasubramaniam Srinivasan, Da Zheng, and George Karypis. 2021. Learning over Families of Sets-Hypergraph Representation Learning for Higher Order Tasks. In Proceedings of the 2021 SIAM International Conference on Data Mining (SDM). SIAM, 756–764.
  • Teru et al. (2020) Komal Teru, Etienne Denis, and Will Hamilton. 2020. Inductive relation prediction by subgraph reasoning. In International Conference on Machine Learning. PMLR, 9448–9457.
  • Veličković et al. (2018) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. In International Conference on Learning Representations.
  • Virtanen et al. (2020) Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. 2020. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods 17 (2020), 261–272.
  • Wan et al. (2022a) Cheng Wan, Youjie Li, Ang Li, Nam Sung Kim, and Yingyan Lin. 2022a. BNS-GCN: Efficient full-graph training of graph convolutional networks with partition-parallelism and random boundary node sampling. Proceedings of Machine Learning and Systems 4 (2022), 673–693.
  • Wan et al. (2022b) Cheng Wan, Youjie Li, Cameron R Wolfe, Anastasios Kyrillidis, Nam Sung Kim, and Yingyan Lin. 2022b. Pipegcn: Efficient full-graph training of graph convolutional networks with pipelined feature communication. In International Conference on Learning Representations.
  • Wang and Zhang (2021) Xiyuan Wang and Muhan Zhang. 2021. GLASS: GNN with Labeling Tricks for Subgraph Representation Learning. In International Conference on Learning Representations.
  • Wang et al. (2021) Yanbang Wang, Yen-Yu Chang, Yunyu Liu, Jure Leskovec, and Pan Li. 2021. Inductive Representation Learning in Temporal Networks via Causal Anonymous Walks. In International Conference on Learning Representations.
  • Wu et al. (2022) Lingfei Wu, Peng Cui, Jian Pei, Liang Zhao, and Xiaojie Guo. 2022. Graph neural networks: foundation, frontiers and applications. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4840–4841.
  • Xu et al. (2019) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2019. How Powerful are Graph Neural Networks?. In International Conference on Learning Representations.
  • Yin et al. (2022) Haoteng Yin, Muhan Zhang, Yanbang Wang, Jianguo Wang, and Pan Li. 2022. Algorithm and System Co-design for Efficient Subgraph-based Graph Representation Learning. Proceedings of the VLDB Endowment 15, 11 (2022), 2788–2796.
  • Zeng et al. (2021) Hanqing Zeng, Muhan Zhang, Yinglong Xia, Ajitesh Srivastava, Andrey Malevich, Rajgopal Kannan, Viktor Prasanna, Long Jin, and Ren Chen. 2021. Decoupling the depth and scope of graph neural networks. Advances in Neural Information Processing Systems 34 (2021), 19665–19679.
  • Zeng et al. (2020) Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, and Viktor Prasanna. 2020. Graphsaint: Graph sampling based inductive learning method. In International Conference on Learning Representations.
  • Zhang and Chen (2018) Muhan Zhang and Yixin Chen. 2018. Link prediction based on graph neural networks. Advances in Neural Information Processing Systems 31 (2018), 5165–5175.
  • Zhang and Chen (2020) Muhan Zhang and Yixin Chen. 2020. Inductive Matrix Completion Based on Graph Neural Networks. In International Conference on Learning Representations.
  • Zhang et al. (2021) Muhan Zhang, Pan Li, Yinglong Xia, Kai Wang, and Long Jin. 2021. Labeling Trick: A Theory of Using Graph Neural Networks for Multi-Node Representation Learning. Advances in Neural Information Processing Systems 34 (2021), 9061–9073.
  • Zhou et al. (2022) Hongkuan Zhou, Da Zheng, Israt Nisa, Vasileios Ioannidis, Xiang Song, and George Karypis. 2022. TGL: A General Framework for Temporal GNN Training on Billion-Scale Graphs. Proceedings of the VLDB Endowment 15, 8 (2022), 1572–1580.

Appendix A Notations

Frequently used symbols are summarized in Table 6.

Appendix B More Details

Table 6. Summary of Frequently Used Notations.

| Symbol | Meaning |
|---|---|
| $Q$ | a query (set of nodes), i.e. $Q=\{u,v,w\}$ |
| $\mathcal{Q}$ | a set of queries, i.e. $Q\in\mathcal{Q}$ |
| $\mathcal{G}_{u}$ | a subgraph induced by node $u$ |
| $\mathcal{G}_{Q}$ | a subgraph induced by query $Q$ |
| $\mathcal{S}_{u}$ | a set of unique nodes sampled from the neighborhood of the seed node $u$ |
| $\mathcal{Z}_{u,x}$ | structural features of node $x$ regarding the seed node $u$ (all zeros if $x\notin\mathcal{S}_{u}$) |
| $\mathcal{Z}_{u}$ | collection of structural features for all nodes in $\mathcal{S}_{u}$ as $\mathcal{Z}_{u}=\{\mathcal{Z}_{u,x} \mid x\in\mathcal{S}_{u}\}$ |
| $\Vert$ | the concatenation that joins node-level structural features, i.e. join $\mathcal{Z}_{\cdot,x}$ for a query $Q=\{u,v,w\}$ as $[\mathcal{Z}_{u,x},\mathcal{Z}_{v,x},\mathcal{Z}_{w,x}]$ |
| $\mathcal{Z}_{Q,x}$ | query-level structural features for node $x$ regarding the query $Q$, $\mathcal{Z}_{Q,x}=\Vert_{u\in Q}\,\mathcal{Z}_{u,x}$ |
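As a small worked example of this notation (the node names $a$, $b$, $c$ and the two set memberships are made up for illustration), consider a link query $Q=\{u,v\}$ with $\mathcal{S}_{u}=\{a,b\}$ and $\mathcal{S}_{v}=\{b,c\}$. Joining the two node sets gives one concatenated feature vector per node in the union, with zeros filling the slots of seeds whose set does not contain that node:

```latex
\begin{align*}
\mathcal{S}_{Q} &= \mathcal{S}_{u} \cup \mathcal{S}_{v} = \{a, b, c\},\\
\mathcal{Z}_{Q,a} &= [\mathcal{Z}_{u,a},\, \mathcal{Z}_{v,a}] = [\mathcal{Z}_{u,a},\, \mathbf{0}],\\
\mathcal{Z}_{Q,b} &= [\mathcal{Z}_{u,b},\, \mathcal{Z}_{v,b}],\\
\mathcal{Z}_{Q,c} &= [\mathcal{Z}_{u,c},\, \mathcal{Z}_{v,c}] = [\mathbf{0},\, \mathcal{Z}_{v,c}].
\end{align*}
```

Only node $b$, which lies in both sets, carries non-zero features with respect to both seeds.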

B.1. Other Related Works

Scalable GNN Design. GNNs are the most widely used toolbox for graph representation learning nowadays, although they face certain challenges when directly applied to subgraph-based methods. To address the scalability of GNNs, current studies focus on improving graph subsampling and mini-batch training techniques (Chiang et al., 2019; Zeng et al., 2020). However, graph subsampling used in GNNs fundamentally differs from subgraph extraction in SGRL. The goal of subsampling is to handle GPU memory overflow during full-batch training of GNN models. For SGRL, subgraphs sampled around a query serve as features for making predictions. Consequently, the scaling techniques developed for GNNs cannot be directly applied to SGRL. Another direction is to deploy distributed GNN systems for industry-level graphs. Unfortunately, these specialized techniques, including pipelining (Wan et al., 2022b), partitioned parallelism (Wan et al., 2022a), and updates with staleness (Peng et al., 2022), do not address the main bottleneck of subgraph extraction for SGRL methods.

B.2. Model Design

Benefits of Subgraph-based Graph Representation Learning. First, subgraph-based representation is versatile across different types of tasks, especially when the queries of a task involve multiple nodes and relations, e.g., the existence of a link, the property of a motif, or the development of higher-order patterns, whereas canonical GNNs are limited in handling such polyadic dynamics via node-wise representations (Srinivasan and Ribeiro, 2020; Wang and Zhang, 2021). Second, subgraph-based models are more expressive: paired with structural features, they can obtain the most expressive structural representations (Srinivasan and Ribeiro, 2020; Li et al., 2020; Wang and Zhang, 2021; Bouritsas et al., 2022). In contrast, canonical GNNs cannot capture intra-distance information and joint relations over multiple nodes, which are critical to distinguishing nodes in structural symmetry and making predictions over them (see also the example in Fig. 2). Lastly, subgraph-based methods decouple the model depth from the receptive field since extracted subgraphs are localized to certain hops: adding more layers for non-linearity neither contaminates embeddings with irrelevant nodes nor causes over-smoothing, as it does in canonical GNNs. This results in a more robust representation and is particularly beneficial for modeling relations beyond singletons.

Table 7. Summary Statistics and Experimental Setup for Evaluation Datasets.

| Dataset | Type | #Nodes | #Edges | Avg. Node Deg. | Density | Split Ratio | Split Type | Metric |
|---|---|---|---|---|---|---|---|---|
| criteo-click | Homo./Bipartite | Campaign(C): 675; User(U): 6,142,256 | 16,468,027 | 2.68 | N/A | 97/1.5/1.5 | Time | MRR |
| twitter-2010 | Homo./Social. | 41,652,230 | 1,468,364,884 | 35.25 | 0.00017% | 99.98/0.01/0.01 | Random | MRR |
| citation2 | Homo./Social. | 2,927,963 | 30,561,187 | 20.7 | 0.00036% | 98/1/1 | Time | MRR |
| collab | Homo./Social. | 235,868 | 1,285,465 | 8.2 | 0.0046% | 92/4/4 | Time | Hits@50 |
| ppa | Homo./Bio. | 576,289 | 30,326,273 | 73.7 | 0.018% | 70/20/10 | Throughput | Hits@100 |
| vessel | Homo./Bio. | 3,538,495 | 5,345,897 | 3.02 | 0.000085% | 80/10/10 | Random | AUC-ROC |
| ogb-mag | Hetero. | Paper(P): 736,389; Author(A): 1,134,649 | P-A: 7,145,660; P-P: 5,416,271 | 21.7 | N/A | 99/0.5/0.5 | Time | MRR |
| tags-math | Higher. | 1,629 | 91,685 (projected); 822,059 (hyperedges) | N/A | N/A | 60/20/20 | Time | MRR |
| DBLP-coauthor | Higher. | 1,924,991 | 7,904,336 (projected); 3,700,067 (hyperedges) | N/A | N/A | 60/20/20 | Time | MRR |
Table 8. [Extended] Breakdown of Runtime, Memory Consumption for Different Models on Prediction of Link, Relation Type, and Higher-order Pattern. The column Train records the runtime per 10K queries. Each cell lists, in order, Runtime (s): Prep. / Train / Inf. and Memory (GB): RAM / SDRAM.

| Models | citation2 | ppa | collab | vessel |
|---|---|---|---|---|
| GCN | 17 / 21.74 / 105 / 9.3 / 36.84 | 2 / 0.026 / 1.2 / 4.6 / 11.35 | 2 / 0.005 / 0.05 / 2.5 / 5.50 | 5 / 0.076 / 0.3 / 2.8 / 36.98 |
| GraphSAINT | 151 / 1.79 / 107 / 9.6 / 9.78 | 10 / 0.003 / 1.5 / 4.9 / 23.06 | 1 / 0.004 / 0.08 / 2.5 / 8.11 | 5 / 0.008 / 15 / 6.9 / 10.21 |
| GDGNN | 338 / 2.26 / 5,460 / 40.6 / 16.96 | 127 / 1.77 / 902 / 21.1 / 10.27 | 14 / 0.74 / 15 / 4.3 / 1.08 | 25 / 0.85 / 84 / 7.2 / 8.03 |
| SEAL | 46 / 3.52 / 24,626 / 35.4 / 5.71 | 46 / 10.57 / 3,988 / 9.5 / 12.13 | 5 / 4.05 / 37 / 4.0 / 6.20 | 6 / 10.69 / 998 / 6.2 / 2.46 |
| SUREL | 151 / 4.14 / 6,081 / 25.1 / 9.68 | 31 / 2.68 / 1,429 / 13.6 / 31.01 | 1 / 2.13 / 17 / 3.4 / 9.86 | 5 / 1.57 / 32 / 5.8 / 5.18 |
| SUREL+ | 130 / 0.35 / 1,389 / 16.7 / 4.75 | 69 / 0.72 / 201 / 9.8 / 19.02 | 7 / 0.27 / 2 / 2.8 / 3.37 | 3 / 0.31 / 3 / 3.3 / 1.25 |

| Models | MAG(P-A) | MAG(P-P) | tags-math | DBLP-coauthor |
|---|---|---|---|---|
| H*GCN | 3 / 0.03 / 9 / 5.0 / 21.56 | 4 / 0.03 / 13 / 5.5 / 21.66 | 2 / 0.004 / 1.3 / 2.4 / 3.10 | - / 0.58 / 95 / 8.0 / 25.80 |
| H*SAGE | 3 / 0.03 / 10 / 5.0 / 20.29 | 4 / 0.03 / 13 / 5.5 / 20.28 | 1 / 0.003 / 1.3 / 2.4 / 3.10 | - / 0.32 / 77 / 7.5 / 24.70 |
| R-GCN | 1 / 0.52 / 5 / 5.3 / 26.34 | 1 / 0.52 / 4 / 5.1 / 31.41 | - / - / - / - / - | - / - / - / - / - |
| SUREL | 10 / 3.20 / 1,998 / 7.3 / 7.18 | 15 / 0.99 / 1924 / 8.1 / 16.66 | - / 2.13 / 341 / 3.0 / 5.95 | 11 / 1.29 / 949 / 9.7 / 7.79 |
| SUREL+ | 58 / 0.33 / 101 / 7.2 / 2.95 | 77 / 0.13 / 168 / 8.1 / 13.49 | 1 / 0.67 / 116 / 2.4 / 5.70 | 8 / 0.24 / 315 / 3.8 / 3.16 |

B.3. Datasets

The full statistics of benchmark datasets are summarized in Table 7. OGB datasets (https://ogb.stanford.edu/docs/dataset_overview/) are selected to benchmark our proposed framework and other baselines. The benchmark contains large-scale graphs (millions of nodes/edges) for real-world applications (e.g., academic and biological networks) and provides standard, open-sourced evaluation metrics and toolkits. Note that vessel is a newly added benchmark of a biological graph, with more than 3M nodes and sparse vessel structures extracted from the whole mouse brain (Paetzold et al., 2021), where nodes represent bifurcation points and edges represent the blood vessels. Each node is associated with features of its physical location in the coordinate space $(x,y,z)$. The introduction of vessel provides a unique opportunity to examine graph representation learning approaches in neuroscience, especially in scaling subgraph-based methods to handle sparse and spatial graphs with millions of nodes and edges for scientific discovery.

criteo-click contains a sample of 30 days of Criteo live traffic data, where each record corresponds to one impression (a banner) displayed to a user and whether it was clicked (Diemert et al., 2017). Each record has 9 contextual features that are aggregated into a 270-dimensional edge feature. There are 675 unique campaign banners and 6.1M users, forming a bipartite graph of 16.5M edges: 97% is used for training, and the rest is evenly split for validation and testing based on temporal order. The task is to predict which campaign the user is most likely to click among 651 candidates. twitter-2010 is an industry-level social network with 1.5B user following relations (Kwak et al., 2010). An edge $(i,j)$ of this network indicates that user $i$ is followed by user $j$. 1% of Twitter users who follow 10 to 1000 accounts are randomly sampled for evaluation. The task is to recommend which account they will most likely follow among 1001 candidates. The OGB formatted files of these two datasets are accessible via Box at https://purdue.box.com/v/SGRL-LSC-dataset.
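Both tasks above are ranking problems evaluated with MRR (Table 7): one true candidate is scored against a list of negative candidates. The sketch below shows one generic way to compute the metric. It is an illustration only, not the official OGB evaluator, and the function name and the pessimistic handling of ties are assumptions.

```python
import numpy as np

def mean_reciprocal_rank(pos_scores, neg_scores):
    """Mean reciprocal rank of the positive candidate of each query.

    pos_scores: shape (n,), score of the true candidate per query
    neg_scores: shape (n, c), scores of c negative candidates per query
    """
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)
    # Rank of the positive = 1 + number of negatives scoring at least
    # as high (ties count against the positive; this is an assumption).
    ranks = 1 + (neg >= pos).sum(axis=1)
    return float((1.0 / ranks).mean())

# Query 1 ranks its positive first; query 2 ranks it third.
mrr = mean_reciprocal_rank([0.9, 0.2], [[0.1, 0.5, 0.3], [0.4, 0.1, 0.3]])
```

Here the two reciprocal ranks are 1 and 1/3, so the mean is 2/3.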

B.4. Baselines

For link prediction and relation type prediction, baseline models are selected based on their scalability and prediction performance from the current OGB leaderboard (https://ogb.stanford.edu/docs/leader_linkprop/). All models listed on the leaderboard are publicly accessible. We adopt their reported numbers on the leaderboard with verification. For the rest of the baselines, we benchmark these models using their official implementations with tuned hyperparameters as listed below.

  • Canonical GNNs: a graph auto-encoder model that uses graph convolution layers to learn node-wise representations, including GCN (Kipf and Welling, 2017), GraphSAGE (Hamilton et al., 2017a), and their more scalable variants by employing graph subsampling, such as GraphSAINT (Zeng et al., 2020).

  • R-GCN (https://github.com/pyg-team/pytorch_geometric/blob/master/examples) (Schlichtkrull et al., 2018): a relational GCN that models heterogeneous graphs with different types of node/link.

  • SEAL (https://github.com/facebookresearch/SEAL_OGB) (Zhang and Chen, 2018): applies GCN on query-induced subgraphs attached with double radius node labeling to obtain a subgraph-level readout for link prediction. SEAL shows great empirical performance on multiple graph machine learning benchmarks and promotes the deployment of subgraph-based models for scientific discovery. The implementation we tested is specialized for OGB datasets and is provided in (Zhang et al., 2021).

  • GDGNN (https://github.com/woodcutter1998/gdgnn) (Kong et al., 2022): a subgraph-based model that aggregates node representations generated by GNNs along geodesic paths between queried nodes for fast inference.

  • SUREL (https://github.com/Graph-COM/SUREL) (Yin et al., 2022): a walk-based computation framework to accelerate subgraph-based methods, where subgraphs are decomposed into pre-sampled walks that are then joined online to substitute the query-induced subgraph for prediction. By adopting the walk-based representation, SUREL achieves state-of-the-art scalability and prediction accuracy on SGRL tasks.

All canonical GNN baselines (https://github.com/snap-stanford/ogb/tree/master/examples/linkproppred) come with three GCNConv/SAGEConv layers of 256 hidden dimensions and a dropout ratio tuned in $\{0, 0.5\}$ for full-batch training. Canonical GNNs aggregate all node embeddings involved in a query as the representation of the link/hyperedge, which is later fed into an MLP classifier for final prediction. In addition, all GNN models need to use the full training data (edges/triplets) to generate robust node representations. The hypergraph datasets do not come with raw node features, and thus GNN baselines use randomly initialized features as input, which are trained along with other model parameters. R-GCN uses RGCNConv layers that support message passing with multiple relation types between different types of nodes, where the edge types (relations) are used as input besides node features.

Subgraph-based models only use partial edges/triplets for training. For SEAL, 1-hop enclosing subgraphs are extracted online during the training and inference. Then, it applies three GCN layers of 32 hidden dimensions plus a sort pooling and several 1D convolution layers to generate a readout of the target subgraph for prediction. SUREL consists of a 2-layer MLP for query-level relative position encoding (RPE) and a 2-layer RNN to encode joined walks with attached RPEs. The hidden dimension of both networks is set to 64. The obtained readout of joined walks is aggregated and fed into a 2-layer MLP classifier to make predictions. GDGNN employs GINLayer as its backbone to obtain node embeddings. The horizontal geodesic representation is used for predictions, which finds the shortest path between two nodes in a query and aggregates node representations generated by GNNs along the found geodesic path. The max search distance for geodesic is the same as the number of GNN layers. For collab, ppa, citation2 and vessel, the threshold of distance is set to 4, 4, 3, and 2, respectively. The hidden dimension of all fully connected layers is set to 32.

Appendix C Architecture and Hyperparameter

SUREL+ uses a 2-layer MLP with ReLU activation for encoding structural features and supports three set neural encoders, including mean pooling, LSTM, and attention. LSTM interprets elements to be aggregated in a set as a sequence (Hamilton et al., 2017a); attention first calculates soft attention scores for elements in a set and then performs attention-score-weighted average pooling. The hidden dimension of all parameterized layers is set to 96. Lastly, hidden representations of query-level joined node sets are fed into a 2-layer MLP classifier for final predictions.
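As a rough illustration of two of the set encoders named above, the sketch below implements mean pooling and attention-score-weighted average pooling over a node set in plain Python. It is not the SUREL+ code: the function names are hypothetical, and the `query` vector is a stand-in for learned attention parameters, with dot-product scoring chosen as an assumption.

```python
import math

def mean_pool(elements):
    """Average a non-empty list of equal-length feature vectors."""
    n = len(elements)
    dim = len(elements[0])
    return [sum(vec[d] for vec in elements) / n for d in range(dim)]

def attention_pool(elements, query):
    """Softmax over dot-product scores, then a score-weighted average.

    `query` is a hypothetical scoring vector standing in for learned parameters.
    """
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in elements]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(elements[0])
    pooled = [sum(w * vec[d] for w, vec in zip(weights, elements)) for d in range(dim)]
    return pooled, weights

# A node set with three elements, each carrying a 2-d hidden feature.
node_set = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mean_repr = mean_pool(node_set)
attn_repr, attn_weights = attention_pool(node_set, query=[2.0, 0.0])
```

Both encoders are invariant to the order of elements in the set, whereas an LSTM encoder depends on the order in which the elements are presented. With an all-zero `query`, the attention weights are uniform and the result coincides with mean pooling.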

The walk-based sampler builds on the sampling function from the SubGAcc library (https://github.com/VeritasYin/subg_acc) developed by the authors, which also supports efficient structural feature compression and index remapping. The metric-based sampler is adopted from the fast PPR approximation in (Bojchevski et al., 2020).

Table 9. Hyperparameters Used for Benchmark SUREL+.

| Dataset | #steps $m$ | #walks $M$ | #negative samples $k$ | Structural Feature | Set Neural Encoder |
|---|---|---|---|---|---|
| criteo-click | 4 | 200 | 10 | LP | Mean |
| twitter-2010 | 4 | 100 | 25 | LP | Mean |
| citation2 | 4 | 100 | 10 | LP | Mean |
| collab | 3 | 200 | 10 | LP | Mean |
| ppa | 4 | 200 | 20 | LP | Attn. |
| vessel | 2 | 50 | 5 | LP | Mean |
| MAG (P-A) | 3 | 200 | 10 | LP | Mean |
| MAG (P-P) | 4 | 100 | 10 | LP | Mean |
| tags-math | 4 | 200 | 10 | LP | Mean |
| DBLP-coauthor | 3 | 100 | 10 | LP | Mean |

We follow the inductive setting for link and relation prediction: only a subset of the samples is used for training. Over the training graph, we randomly select 5% of the links as positive training queries, each paired with $k$ negative samples ($k = 10$ by default). We mask these links and use the remaining 95% of the links to compute each node's structural features in the split training set via the structure encoder. For vessel, as the input graph is very sparse, we first sort the nodes in the training set by degree and then randomly pick 5% of the nodes; the edges of their 2-hop induced subgraphs are used for training, and the rest are reserved for structural feature construction. For higher-order pattern prediction, we use the given graph before timestamp $t$ to sample node sets and encode their structural features. The model parameters are optimized with the triplets provided in the training set. No node features are used in SUREL+, except for vessel, where the normalized physical location of each node is appended to its structural features, and similarly for the contextual features in click.
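The link-prediction split described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `split_links` is hypothetical, and negatives are drawn by uniformly corrupting one endpoint of each positive link (with replacement), which is an assumption because the text does not state the negative sampling scheme.

```python
import random

def split_links(edges, num_nodes, query_ratio=0.05, k=10, seed=0):
    """Hold out a fraction of links as positive queries; keep the rest observed.

    Returns (queries, observed): queries is a list of (positive, negatives)
    pairs, and observed holds the unmasked links left for structural features.
    """
    rng = random.Random(seed)
    edges = list(edges)
    rng.shuffle(edges)
    num_query = max(1, round(len(edges) * query_ratio))
    positives, observed = edges[:num_query], edges[num_query:]
    existing = {frozenset(e) for e in edges}  # never sample a true link as negative
    queries = []
    for u, v in positives:
        negatives = []
        while len(negatives) < k:
            w = rng.randrange(num_nodes)  # assumed scheme: corrupt the second endpoint
            if w != u and frozenset((u, w)) not in existing:
                negatives.append((u, w))
        queries.append(((u, v), negatives))
    return queries, observed

# Toy ring graph with 200 nodes and 200 undirected links.
ring = [(i, (i + 1) % 200) for i in range(200)]
queries, observed = split_links(ring, num_nodes=200)
print(len(queries), len(observed))  # 10 190
```

The rejection loop assumes a sparse graph in which every node has enough non-neighbors; on a dense graph it would need a cap on retries.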

Table 8 presents the extended version of Table 4. The results reported in Table 3 and the profiling of SUREL+ in Tables 4 and 8 are obtained with the hyperparameter combinations listed in Table 9. The dropout rate on vessel is set to $p = 0.2$. The metric-based sampler is used to obtain the results with PPR and SPD as structural features in Table 5. Its sampling size $K$ is set to 50, 50, and 150 for collab, ppa, and citation2, respectively. The walk-based sampler is used for the results with LP as the structural feature, and its sampling parameters are listed in Table 9. The rest of the hyperparameters remain the same as reported in Sec. 4.1. The SUREL+ framework, including the SubGAcc library, is open-source and free for academic use under the BSD-2-Clause license.