1 Introduction
A complex system is a group of many parts that interact with each other. These systems are everywhere in our world. For instance, think about how different parts of our body work together, how animals and plants rely on each other in nature, or how we connect with friends and family on social media. All of these are examples of complex systems.
Graphs, which consist of nodes and edges, are widely used to model such complex systems. In these graphs, nodes represent entities and edges connect nodes that interact with each other. The expansion of the internet and the advancement of data digitization have led to the emergence of large-scale complex systems like e-mail networks, social media, and financial transactions. Hence, there is a growing demand for the efficient analysis of such large-scale graphs.
Given the significant challenge of collecting and analyzing every entity in such large-scale graphs, a common approach involves sampling a smaller graph that retains properties similar to the original. This sampling strategy is widely employed in various tasks, including:
—
Simulation: In the context of internet topology, where nodes represent hosts or routers and edges correspond to communication links, simulations, especially packet-level ones, are notably time-intensive owing to the vast scale of the internet. They often require multiple runs to ensure the reliability of the protocols under examination, so reducing simulation time is essential. To address this computational challenge, small sampled graphs that resemble the internet topology have been used [36, 37].
—
Visualization: Visualizing a large-scale graph is essential for a thorough human interpretation, yet it presents challenges due to the vast number of components (i.e., nodes and edges), the lack of screen space, and the complexity of layout algorithms. A small representative subgraph can be used to mitigate these difficulties [7, 17, 23, 38].
—
Stream Processing: A dynamic graph that grows indefinitely is naturally treated as a stream of edges whose number can potentially be infinite. In dealing with such graphs, it becomes impractical to store every edge for analysis due to the vast and ever-expanding nature of the data. Consequently, several studies have shifted focus toward maintaining a subgraph that reflects the current state of the entire graph. This method is especially prevalent in various graph-related tasks, including outlier detection [1, 21], edge prediction [83], and triangle counting [40, 51, 62].
—
Crawling: Online social networks (e.g., Facebook and X (formerly known as Twitter)) provide information on connections mainly through API queries. Rate limits on API requests make it unavoidable to work with a subgraph instead of the entire graph [39, 50, 69].
—
Graph Representation Learning: Despite their wide usage, graph neural networks (GNNs) often suffer from scalability issues due to the recursive expansion of neighborhoods across layers. Sampling has been employed to accelerate training by limiting the size of the neighborhoods [9, 10, 11, 25, 75, 85].
Ordinary graphs are suitable for modeling connections between two entities, known as pairwise interactions. However, in many complex systems, group interactions are prevalent, where more than two entities interact with each other simultaneously. Such interactions are commonly seen in various contexts, including multiple researchers collaborating on a manuscript, users engaging in a group discussion on online Q&A platforms, and the dynamic interplay of ingredients in a recipe.
Consequently, these complex systems are more aptly depicted using hypergraphs rather than traditional graphs. A hypergraph consists of nodes and hyperedges, with each hyperedge capable of including any number of nodes, thus effectively capturing the essence of group interactions. This approach is visually demonstrated in Figure 1, where nodes represent tags and hyperedges correspond to multi-tagged questions on an online Q&A platform. Modeling complex systems as hypergraphs, rather than graphs, can help capture domain-specific structural patterns [43], predict interactions [73], cluster nodes [68], and measure node importance [12]. Since real-world hypergraphs are comparable in size to real-world graphs and structurally more complex, sampling from hypergraphs provides substantial benefits, including those listed above.
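As a minimal illustration of the hypergraph model described above, the following Python sketch stores a hypergraph as a list of node sets and computes node degrees and hyperedge sizes. The tag names and the helper names `node_degrees` and `hyperedge_sizes` are hypothetical and chosen only for this example; they are not taken from the article.

```python
from collections import Counter

# Toy hypergraph (hypothetical data): each hyperedge is the set of tags
# attached to one question on a Q&A platform.
hyperedges = [
    frozenset({"python", "pandas", "numpy"}),
    frozenset({"python", "regex"}),
    frozenset({"pandas", "numpy"}),
    frozenset({"java"}),
]


def node_degrees(hyperedges):
    """Degree of a node = number of hyperedges that contain it."""
    degrees = Counter()
    for edge in hyperedges:
        degrees.update(edge)
    return degrees


def hyperedge_sizes(hyperedges):
    """Size of a hyperedge = number of nodes it contains."""
    return [len(edge) for edge in hyperedges]


degrees = node_degrees(hyperedges)
sizes = hyperedge_sizes(hyperedges)
```

Unlike an ordinary edge, a hyperedge here may contain one, two, or more nodes, which is why hyperedge size becomes a structural property in its own right.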
In this article, our primary focus is on the challenge of identifying a “good” sub-hypergraph sample within a given hypergraph. The definition of “good” can vary based on specific applications, prompting us to consider general tasks and assess whether the sample adequately preserves the structural properties of the target. The target, in this context, can take one of two forms: either the input hypergraph itself or a past snapshot of the time-evolving hypergraph. For instance, when sampling a sub-hypergraph half the size of the input hypergraph, a crucial question arises: should the sample preserve the structural properties of the input hypergraph, or should it mimic the past version of the entire hypergraph when it was half its current size? Given the validity of both perspectives, we approach the hypergraph sampling problem with two main objectives: (a) representative hypergraph sampling and (b) back-in-time hypergraph sampling. Furthermore, to the best of our knowledge, our work represents the first attempt to address the challenge of sampling from real-world hypergraphs. Consequently, we conduct an analysis of simple and intuitive sampling approaches, e.g., random sampling of hyperedges. Drawing insights from the properties of these straightforward approaches, we develop our algorithm to overcome their inherent weaknesses. To guide our investigation, we aim to answer the following questions for each problem:
—
Q1. How can we measure the quality of a sub-hypergraph as a good sample?
—
Q2. What are the benefits and limitations of simple and intuitive approaches for hypergraph sampling?
—
Q3. How can we find a high-quality sample sub-hypergraph rapidly without extensively exploring the search space?
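The random sampling of hyperedges mentioned above as an example of a simple approach can be sketched as follows. This is only an illustrative baseline under stated assumptions: the function name `random_hyperedge_sample`, the `fraction` and `seed` parameters, and the toy data are hypothetical and not taken from the article.

```python
import random


def random_hyperedge_sample(hyperedges, fraction, seed=None):
    """Uniformly sample a fraction of the hyperedges without replacement.

    The sampled sub-hypergraph consists of the chosen hyperedges and the
    nodes they contain.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    hyperedges = list(hyperedges)
    k = max(1, round(fraction * len(hyperedges)))
    return rng.sample(hyperedges, k)


# Toy input (hypothetical data): six distinct hyperedges over nodes a..f.
toy = [
    frozenset("ab"), frozenset("abc"), frozenset("bc"),
    frozenset("abcd"), frozenset("cdef"), frozenset("ae"),
]
sample = random_hyperedge_sample(toy, fraction=0.5, seed=0)
sample_nodes = set().union(*sample)
```

Because every hyperedge is equally likely to be chosen, this baseline applies no bias toward any node or hyperedge.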
In addressing the first problem, representative hypergraph sampling, the objective is to capture the characteristics of the input hypergraph within the sampled sub-hypergraph. Regarding Q1, we measure the difference between the input hypergraph and a sample sub-hypergraph using ten distinct statistics related to the unique structural properties of real-world hypergraphs [41]. These statistics include both node-level and hyperedge-level analyses, comparing the distributions of node degrees, hyperedge sizes, intersection sizes [35], and node-pair degrees [42] in the sampled and entire hypergraphs. Additionally, we assess their average clustering coefficient [19], density [26], overlapness [42], and effective diameter [35, 47] as graph-level statistics. Concerning Q2, since we are the first, to our knowledge, to tackle this problem, we apply six simple and intuitive sampling approaches to 11 real-world hypergraphs and analyze their benefits and limitations. While some approaches preserve certain structural properties well, none of them succeeds in preserving all ten properties, demonstrating the difficulty of the considered problem. With respect to Q3, leveraging insights from our previous analyses, we propose Minimum Degree Biased Sampling of Hyperedges (MiDaS) for representative hypergraph sampling. MiDaS is inspired by two facts: (a) all the simple approaches fail to preserve degree distributions well and (b) the ability to preserve degree distributions is strongly correlated with the ability to preserve other properties. Utilizing these facts, MiDaS is designed to draw hyperedges with a sampling bias (i.e., the statistical prioritization of specific nodes or hyperedges during sampling) toward those with high-degree nodes, while automatically adjusting the degree of bias to align with the degree distribution of the input hypergraph. Through extensive experiments, we show that MiDaS performs best overall among 14 competitors in 11 real-world hypergraphs, as shown in Figure 2.
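To make the idea of a minimum-degree bias concrete, the sketch below weights each hyperedge by the smallest degree among its nodes, raised to a fixed exponent. It is not the MiDaS algorithm itself: the exponent `alpha`, the weighting formula, and the function names are assumptions made for illustration, and the automatic adjustment of the bias described above is deliberately left out.

```python
import random
from collections import Counter


def min_degree_weights(hyperedges, alpha):
    """Weight of a hyperedge = (minimum degree of its nodes) ** alpha.

    alpha is a hypothetical bias-strength parameter: alpha = 0 gives
    uniform sampling, larger alpha favors hyperedges whose nodes all
    have high degree. Hyperedges are assumed to be non-empty.
    """
    degrees = Counter(node for edge in hyperedges for node in edge)
    return [min(degrees[node] for node in edge) ** alpha for edge in hyperedges]


def min_degree_biased_sample(hyperedges, k, alpha=1.0, seed=None):
    """Draw k distinct hyperedges, one at a time, proportionally to weight."""
    hyperedges = list(hyperedges)
    if not 0 <= k <= len(hyperedges):
        raise ValueError("k must be between 0 and the number of hyperedges")
    rng = random.Random(seed)
    weights = min_degree_weights(hyperedges, alpha)
    remaining = list(range(len(hyperedges)))  # indices not yet chosen
    chosen = []
    for _ in range(k):
        position = rng.choices(
            range(len(remaining)),
            weights=[weights[i] for i in remaining],
        )[0]
        chosen.append(hyperedges[remaining.pop(position)])
    return chosen


# Toy input (hypothetical data).
toy = [frozenset("ab"), frozenset("abc"), frozenset("ad"), frozenset("e")]
toy_weights = min_degree_weights(toy, alpha=2)
biased_sample = min_degree_biased_sample(toy, k=2, alpha=2, seed=1)
```

In this toy input only the hyperedge {a, b} has a minimum degree above 1, so it receives the largest weight; in a full method the strength of the bias would be tuned rather than fixed.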
The second problem, back-in-time hypergraph sampling, is defined as follows: Given a snapshot of a time-evolving hypergraph and a target size, the objective is to construct a sub-hypergraph that closely approximates the past snapshot of the hypergraph at the target size. Note that the target (i.e., the past snapshot of the hypergraph) is not provided, unlike in representative sampling, where the given hypergraph itself is the target. It is also important to note that both representative sampling and back-in-time sampling share the overarching goal of obtaining a structurally similar but smaller sub-hypergraph from the input hypergraph, making both suitable for the applications mentioned. Regarding Q1, we assess the quality of a sub-hypergraph by comparing it with a past snapshot of the same size using the aforementioned ten statistics. Concerning Q2, we analyze eight sampling methods, including the aforementioned six straightforward methods and MiDaS, across 11 real-world hypergraphs. They exhibit distinct characteristics, which also differ from those observed in the previous problem. Notably, while MiDaS, designed for representative sampling, outperforms the simple sampling methods, it struggles to preserve hyperedge sizes in this back-in-time hypergraph sampling problem. Therefore, in response to Q3, we introduce MiDaS-B, an extension of MiDaS specifically tailored for back-in-time hypergraph sampling. MiDaS-B additionally incorporates a hyperedge-size-related term into the hyperedge sampling probabilities, controlling biases toward both degrees and sizes to closely match the degree and size distributions of the target hypergraph. Note that, since the target hypergraph is unavailable, adjusting the hyperparameters of MiDaS-B to minimize the difference from it is not straightforward. To address this challenge, we use the replication of evolutionary patterns that are commonly observed in real-world hypergraphs as a substitute objective for tuning hyperparameters. Experimental results demonstrate that MiDaS-B significantly outperforms 10 competing methods across 11 real-world hypergraphs.
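The evaluation setup for back-in-time sampling (comparing a sample with a past snapshot of the same size) can be sketched as follows. Several details are assumptions made for illustration only: size is counted in hyperedges, the past snapshot is taken to be the earliest hyperedges by timestamp, and the distance between two distributions is the largest gap between their empirical cumulative distributions. The function names and toy data are hypothetical. The snapshot is used here only to score a sample after the fact, since it is not available during sampling.

```python
def past_snapshot(timestamped_hyperedges, k):
    """Return the k earliest hyperedges, i.e. the hypergraph at size k."""
    ordered = sorted(timestamped_hyperedges, key=lambda pair: pair[0])
    return [edge for _, edge in ordered[:k]]


def d_statistic(xs, ys):
    """Largest absolute gap between the empirical CDFs of xs and ys."""
    def cdf(data, value):
        return sum(1 for d in data if d <= value) / len(data)

    values = sorted(set(xs) | set(ys))
    return max(abs(cdf(xs, v) - cdf(ys, v)) for v in values)


# Toy time-evolving hypergraph (hypothetical data): (timestamp, hyperedge).
timeline = [
    (1, frozenset("ab")), (2, frozenset("abc")), (3, frozenset("bc")),
    (4, frozenset("abcd")), (5, frozenset("cdef")), (6, frozenset("ae")),
]

target = past_snapshot(timeline, k=3)
# A deliberately poor candidate sample: the three most recent hyperedges.
candidate = [edge for _, edge in timeline[3:]]

target_sizes = [len(edge) for edge in target]
candidate_sizes = [len(edge) for edge in candidate]
size_gap = d_statistic(target_sizes, candidate_sizes)
```

A value of 0 would mean the two hyperedge-size distributions coincide; the same comparison can be repeated for node degrees or any other distributional statistic.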
Our contributions are summarized as follows:
—
New Problem: To the best of our knowledge, our work is the first to tackle the challenging task of sampling sub-hypergraphs from real-world hypergraphs. We aim to obtain structurally similar yet smaller sub-hypergraphs while pursuing two distinct objectives (representative sampling and back-in-time sampling).
—
Findings: We conduct a comprehensive analysis of a wide array of intuitive sampling approaches in the context of these new problems. Our examination, conducted on 11 datasets, focuses on uncovering the limitations of these approaches in preserving 10 essential properties of the target hypergraph.
—
Algorithm: We propose MiDaS, which rapidly finds the most representative sample overall among 15 methods, i.e., a sub-hypergraph that is structurally similar to but smaller than the input hypergraph (see Figure 2). Additionally, we present MiDaS-B, an extension of MiDaS, capable of accurately approximating the past snapshot of the input hypergraph without relying on the ground-truth past snapshot information.
The rest of the article is organized as follows. In Section 2, we establish the necessary notations and preliminaries, including structural statistics of hypergraphs, simple and intuitive sampling approaches, datasets used in the article, and evaluation criteria for assessing the quality of sub-hypergraphs. In Section 3, we focus on representative hypergraph sampling, introducing and evaluating our proposed algorithm, MiDaS. In Section 4, we shift our focus to back-in-time hypergraph sampling, where we introduce and evaluate our proposed algorithm, MiDaS-B. In Section 5, we discuss related works. In Section 6, we offer conclusions along with future research directions.