Representative and Back-In-Time Sampling from Real-world Hypergraphs

Authors: Minyoung Choe, Jaemin Yoo, Geon Lee, Woonsung Baek, U Kang, Kijung Shin

ACM Transactions on Knowledge Discovery from Data, Volume 18, Issue 6

Article No.: 156, Pages 1 - 48

https://doi.org/10.1145/3653306

Published: 26 April 2024

Abstract

Graphs are widely used for representing pairwise interactions in complex systems. Since such real-world graphs are large and often evergrowing, sampling subgraphs is useful for various purposes, including simulation, visualization, stream processing, representation learning, and crawling. However, many complex systems consist of group interactions (e.g., collaborations of researchers and discussions on online Q&A platforms) and thus are represented more naturally and accurately by hypergraphs than by ordinary graphs. Motivated by the prevalence of large-scale hypergraphs, we study the problem of sampling from real-world hypergraphs, aiming at answering (Q1) how can we measure the goodness of sub-hypergraphs, and (Q2) how can we efficiently find a “good” sub-hypergraph. Regarding Q1, we distinguish between two goals: (a) representative sampling, which aims at capturing the characteristics of the input hypergraph, and (b) back-in-time sampling, which aims at closely approximating a past snapshot of the input time-evolving hypergraph. To evaluate the similarity of the sampled sub-hypergraph to the target (i.e., the input hypergraph or its past snapshot), we consider 10 graph-level, hyperedge-level, and node-level statistics. Regarding Q2, we first conduct a thorough analysis of various intuitive approaches using 11 real-world hypergraphs. Then, based on this analysis, we propose MiDaS and MiDaS-B, designed for representative sampling and back-in-time sampling, respectively. Regarding representative sampling, we demonstrate through extensive experiments that MiDaS, which employs a sampling bias toward high-degree nodes in hyperedge selection, is (a) Representative: finding overall the most representative samples among 15 considered approaches, (b) Fast: several orders of magnitude faster than the strongest competitors, and (c) Automatic: automatically tuning the degree of sampling bias. 
Regarding back-in-time sampling, we demonstrate that MiDaS-B inherits the strengths of MiDaS despite an additional challenge—the unavailability of the target (i.e., past snapshot). It effectively handles this challenge by focusing on replicating universal evolutionary patterns, rather than directly replicating the target.

1 Introduction

A complex system is a group of many parts that interact with each other. These systems are everywhere in our world. For instance, think about how different parts of our body work together, how animals and plants rely on each other in nature, or how we connect with friends and family on social media. All of these are examples of complex systems.

Graphs, which consist of nodes and edges, are extensively used to model such complex systems. In these graphs, nodes represent entities and edges connect nodes that interact with each other. The expansion of the internet and the advancement of data digitization have led to the emergence of large-scale complex systems like e-mail networks, social media, and financial transactions. Hence, there is a growing demand for the efficient analysis of such large-scale graphs.

Given the significant challenge of collecting and analyzing every entity in such large-scale graphs, a common approach involves sampling a smaller graph that retains properties similar to the original. This sampling strategy is widely employed in various tasks, including:

—

Simulation: In the context of internet topology, where nodes represent hosts or routers and edges correspond to communication links, conducting simulations, especially packet-level ones, is notably time-intensive owing to the vast scale of the internet. These simulations often require multiple runs to ensure the reliability of the protocols under examination, so reducing simulation time is essential. To address these computational challenges, sampling small graphs that resemble the internet topology has been utilized [36, 37].

—

Visualization: Visualizing a large-scale graph is essential for a thorough human interpretation, yet it presents challenges due to the vast number of components (i.e., nodes and edges), the lack of screen space, and the complexity of layout algorithms. A small representative subgraph can be used to mitigate these difficulties [7, 17, 23, 38].

—

Stream Processing: A dynamic graph that grows indefinitely is naturally treated as a stream of edges whose number can potentially be infinite. In dealing with such graphs, it becomes impractical to store every edge for analysis due to the vast and ever-expanding nature of the data. Consequently, several studies have shifted focus toward maintaining a subgraph that reflects the current state of the entire graph. This method is especially prevalent in various graph-related tasks, including outlier detection [1, 21], edge prediction [83], and triangle counting [40, 51, 62].

—

Crawling: Online social networks (e.g., Facebook and X (formerly known as Twitter)) provide information on connections mainly by API queries. Limitations on API request rates make it inevitable to deal with a subgraph instead of the entire graph [39, 50, 69].

—

Graph Representation Learning: Despite their wide usage, graph neural networks (GNNs) often suffer from scalability issues due to the recursive expansion of neighborhoods across layers. Sampling has been employed to accelerate training by limiting the size of the neighborhoods [9, 10, 11, 25, 75, 85].

Ordinary graphs are suitable for modeling connections between two entities, known as pairwise interactions. However, in many complex systems, group interactions are prevalent, where more than two entities interact with each other simultaneously. Such interactions are commonly seen in various contexts, including multiple researchers collaborating on a manuscript, users engaging in a group discussion on online Q&A platforms, and the dynamic interplay of ingredients in a recipe.

Consequently, these complex systems are more aptly depicted using hypergraphs rather than traditional graphs. A hypergraph consists of nodes and hyperedges, with each hyperedge capable of including any number of nodes, thus effectively capturing the essence of group interactions. This approach is visually demonstrated in Figure 1, where nodes represent tags and hyperedges correspond to multi-tagged questions on an online Q&A platform. Modeling complex systems as hypergraphs, rather than graphs, can help capture domain-specific structural patterns [43], predict interactions [73], cluster nodes [68], and measure node importance [12]. Since real-world hypergraphs are similar in size to and more complex than real-world graphs, sampling from hypergraphs provides substantial benefits, including those listed above.

Fig. 1.

In this article, our primary focus is on the challenge of identifying a “good” sub-hypergraph sample within a given hypergraph. The definition of “good” can vary based on specific applications, so we consider general tasks and assess whether the sample adequately preserves the structural properties of the target. The target, in this context, can take one of two forms: either the input hypergraph itself or a past snapshot of the time-evolving hypergraph. For instance, when sampling a sub-hypergraph that is half the size of the input hypergraph, a crucial question arises: should the sample preserve the structural properties of the input hypergraph, or should it mimic the past snapshot of the hypergraph taken when it was half its current size? Given the validity of both perspectives, we approach the hypergraph sampling problem with two main objectives: (a) representative hypergraph sampling and (b) back-in-time hypergraph sampling. Furthermore, to the best of our knowledge, our work represents the first attempt to address the challenge of sampling from real-world hypergraphs. Consequently, we conduct an analysis of simple and intuitive sampling approaches, e.g., random sampling of hyperedges. Drawing insights from the properties of these straightforward approaches, we develop our algorithm to overcome their inherent weaknesses. To guide our investigation, we aim at answering the following questions for each problem:

—

Q1. How can we measure the quality of a sub-hypergraph as a good sample?

—

Q2. What are the benefits and limitations of simple and intuitive approaches for hypergraph sampling?

—

Q3. How can we find a high-quality sample sub-hypergraph rapidly without extensively exploring the search space?

In addressing the first problem, representative hypergraph sampling, the objective is to capture the characteristics of the input hypergraph within the sampled sub-hypergraph. Regarding Q1, we measure the difference between the input hypergraph and a sample sub-hypergraph using ten distinct statistics related to the unique structural properties of real-world hypergraphs [41]. These statistics include both node-level and hyperedge-level analyses, comparing the distributions of node degrees, hyperedge sizes, intersection sizes [35], and node-pair degrees [42] in both sampled and entire hypergraphs. Additionally, we assess their average clustering coefficient [19], density [26], overlapness [42], and effective diameter [35, 47] as graph-level statistics. Concerning Q2, we try six simple and intuitive sampling approaches on 11 real-world hypergraphs, as we are the first, to our knowledge, to tackle this problem. Then, we analyze their benefits and limitations. While some approaches preserve certain structural properties well, none of them succeeds in preserving all ten properties, demonstrating the difficulty of the considered problem. With respect to Q3, leveraging insights from our previous analyses, we propose Minimum Degree Biased Sampling of Hyperedges (MiDaS) for representative hypergraph sampling. MiDaS is inspired by two facts: (a) all the simple approaches fail to preserve degree distributions well and (b) the ability to preserve degree distributions is strongly correlated to the ability to preserve other properties. Utilizing these facts, MiDaS is designed to be able to draw hyperedges with a sampling bias (i.e., the statistical prioritization of specific nodes or hyperedges during sampling) toward those with high-degree nodes, while automatically adjusting the degree of bias to align with the degree distribution of the input hypergraph. Through extensive experiments, we show that MiDaS performs best overall among 14 competitors in 11 real-world hypergraphs, as shown in Figure 2.

Fig. 2.

The second problem, back-in-time hypergraph sampling, is defined as follows: Given a snapshot of a time-evolving hypergraph and a target size, the objective is to construct a sub-hypergraph that closely approximates the past snapshot of the hypergraph at the target size. Note that the target (i.e., the past snapshot of the hypergraph) is not provided, unlike in representative sampling, where the given hypergraph itself is the target. It is also important to note that both representative sampling and back-in-time sampling share the overarching goal of obtaining a structurally similar but smaller sub-hypergraph from the input hypergraph, making both suitable for the applications mentioned. Regarding Q1, we assess the quality of a sub-hypergraph by comparing it with a past snapshot of the same size by employing the aforementioned ten statistics. Concerning Q2, we analyze eight sampling methods, including the aforementioned six straightforward methods and MiDaS, across 11 real-world hypergraphs. They exhibit distinct characteristics, which also differ from those observed in the previous problem. Notably, while MiDaS, designed for representative sampling, exhibits superior performance compared to other simple sampling methods, it encounters challenges in effectively preserving hyperedge sizes in this back-in-time hypergraph sampling problem. Therefore, in response to Q3, we introduce MiDaS-B, an extension of MiDaS specifically tailored for back-in-time hypergraph sampling. MiDaS-B additionally incorporates a hyperedge-size-related term into the hyperedge sampling probabilities, effectively controlling biases toward both degrees and sizes to closely match the degree and size distribution of the target hypergraph. Note that, since the target hypergraph is unavailable, adjusting the hyperparameters of MiDaS-B to minimize the difference from it is not straightforward. 
In order to address this challenge, we leverage the replication of evolutionary patterns that are commonly observed in real-world hypergraphs as a substitute objective for tuning hyperparameters. Experimental results demonstrate that MiDaS-B significantly outperforms 10 competing methods across 11 real-world hypergraphs.

Our contributions are summarized as follows¹:

—

New Problem: To the best of our knowledge, our work is the first to tackle the challenging task of sampling sub-hypergraphs from real-world hypergraphs. We aim at obtaining structurally similar yet smaller sub-hypergraphs while pursuing two distinct objectives (representative sampling and back-in-time sampling).

—

Findings: We conduct a comprehensive analysis of a wide array of intuitive sampling approaches in the context of these new problems. Our examination, conducted on 11 datasets, focuses on uncovering the limitations of these approaches in preserving 10 essential properties of the target hypergraph.

—

Algorithm: We propose MiDaS, which rapidly finds overall the most representative sample—a sub-hypergraph sharing structural similarities but smaller than the input hypergraph—among 15 methods (see Figure 2). Additionally, we present MiDaS-B, an extension of MiDaS, capable of accurately approximating the past snapshot of the input hypergraph, without relying on the ground-truth past snapshot information.

For reproducibility, we make our code and datasets available at https://github.com/young917/MiDaS.

The rest of the article is organized as follows. In Section 2, we establish the necessary notations and preliminaries, including structural statistics of hypergraphs, simple and intuitive sampling approaches, datasets used in the article, and evaluation criteria for assessing the quality of sub-hypergraphs. In Section 3, we focus on representative hypergraph sampling, introducing and evaluating our proposed algorithm, MiDaS. In Section 4, we shift our focus to back-in-time hypergraph sampling, where we introduce and evaluate our proposed algorithm, MiDaS-B. In Section 5, we discuss related works. In Section 6, we offer conclusions along with future research directions.

2 Preliminaries and Datasets

In this section, we provide an overview of the basic concepts related to hypergraphs. We then discuss the ten statistics that we use to measure the performance of hypergraph sampling. We also introduce simple and intuitive sampling approaches that serve as baselines for comparison. Additionally, we describe the datasets we use in our evaluation and provide an overview of the numerical evaluation process.

2.1 Notations

A hypergraph \(\mathcal {G}=(\mathcal {V},\mathcal {E})\) consists of a set of nodes \(\mathcal {V}\) and a set of hyperedges \(\mathcal {E}\subseteq 2^{\mathcal {V}}\) . Each hyperedge \(e\in \mathcal {E}\) is a non-empty subset of \(\mathcal {V}\) . The degree of a node \(v\) is the number of hyperedges containing \(v\) , i.e., \(d_{v}:=|\lbrace e \in \mathcal {E}: v \in e \rbrace |\) . A sub-hypergraph \(\mathcal {\hat{G}}=(\mathcal {\hat{V}},\mathcal {\hat{E}})\) of \(\mathcal {G}=(\mathcal {V},\mathcal {E})\) is a hypergraph consisting of nodes \(\mathcal {\hat{V}}\subseteq \mathcal {V}\) and hyperedges \(\mathcal {\hat{E}}\subseteq \mathcal {E}\) .
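The notation above can be illustrated with a minimal sketch, assuming a hypergraph is stored as a list of frozensets of node IDs. This representation, the toy data, and the helper name `node_degrees` are illustrative assumptions, not taken from the authors' code.

```python
from collections import Counter

# Toy hypergraph (illustrative data): each hyperedge is a non-empty set of node IDs.
hyperedges = [frozenset({1, 2, 3}), frozenset({2, 3}), frozenset({3, 4, 5, 6})]

def node_degrees(hyperedges):
    """Return d_v, the number of hyperedges containing each node v."""
    degrees = Counter()
    for e in hyperedges:
        degrees.update(e)  # every node in e gains one incident hyperedge
    return degrees

# Node set, here taken as the nodes covered by at least one hyperedge.
nodes = set().union(*hyperedges)
degrees = node_degrees(hyperedges)
```

In this toy example, node 3 belongs to all three hyperedges, so its degree is 3.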

2.2 Statistics for Structure of Hypergraphs

We introduce ten node-level (P1, P3), hyperedge-level (P2, P4), and graph-level (P5–P10) statistics that have been used extensively for structure analysis of real-world graphs [46, 47] and hypergraphs [19, 35, 42]. Refer to a survey [41] for the structure analyses. They are used throughout this article to measure the structural similarity of hypergraphs.

—

P1. Degree: We consider the degree distribution of nodes. The distribution tends to be heavy-tailed in real-world hypergraphs but not in uniform random hypergraphs [19, 35].

—

P2. Size: We consider the size distribution of hyperedges, which is shown to be heavy-tailed in real-world hypergraphs [35].

—

P3. Pair Degree: We consider the pair degree distribution of neighboring node pairs. The pair degree of two nodes is defined as the number of hyperedges containing both. The distribution reveals structural similarity between nodes, and it tends to have a heavier tail in real-world hypergraphs than in randomized ones [42].

—

P4. Intersection Size (Int. Size): We consider the intersection-size (i.e., count of common nodes) distribution of overlapping hyperedge pairs. The distribution from pairwise connections between hyperedges is heavy-tailed in many real-world hypergraphs [35].

—

P5. Singular Values (SV): We consider the relative variance explained by singular vectors of the incidence matrix. In detail, for each \(i\in \lbrace 1,\ldots ,R\rbrace\) , we compute \(s_{i}^{2}\) / \(\sum _{k=1}^{R} s_{k}^{2}\) where \(s_i\) is the \(i\) th largest singular value and \(R\) is the rank of the incidence matrix. Singular values indicate the variance explained by the corresponding singular vectors [66], and they are highly skewed in many real-world hypergraphs [35]. They are also equal to the square root of eigenvalues of the weighted adjacency matrix. For the large datasets from the threads and co-authorship domains, we use 300 instead of \(R\) , and for a sample from them, we use \(300/R\) of the rank of its incidence matrix.

—

P6. Connected Component Size (CC): We consider the portion of nodes in each \(i\) th largest connected component in the clique expansion. The clique expansion of a hypergraph \(\mathcal {G}=(\mathcal {V},\mathcal {E})\) is the undirected graph obtained by replacing each hyperedge \(e\in \mathcal {E}\) with the clique with the nodes in \(e\) . In many real-world hypergraphs, a majority of nodes belong to a few connected components [19].

—

P7. Global Clustering Coefficient (GCC): We estimate the average of the clustering coefficients of all nodes in the clique expansion (defined in P6) using [59]. This statistic measures the cohesiveness of connections, and it tends to be larger in real-world hypergraphs than in uniform random hypergraphs [19].

—

P8. Density: The density is defined as the ratio of the hyperedge count over the node count (i.e., \(|\mathcal {E}|/|\mathcal {V}|\) ) [26]. Hypergraphs from the same domain tend to share a similar significance of density [42].

—

P9. Overlapness: The overlapness of a hypergraph \(\mathcal {G}=(\mathcal {V},\mathcal {E})\) is defined as \(\sum _{e \in \mathcal {E}} |e| / |\mathcal {V}|\) . It measures the degree of hyperedge overlaps, satisfying desirable axioms [42]. Hypergraphs from the same domain tend to share a similar significance of overlapness [42].

—

P10. Diameter: The effective diameter is defined as the smallest \(d\) such that the paths of length at most \(d\) in the clique expansion (defined in P6) connect 90% of reachable pairs of nodes [47]. It measures how closely nodes are connected. The effective diameter tends to be small in real-world hypergraphs [35].
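Several of the statistics above follow directly from their definitions. The sketch below, which reuses the assumed list-of-frozensets representation, computes the raw values behind P1–P4, the scalars P8 and P9, and the component fractions of P6 via a union-find over the clique expansion. The function names are hypothetical, the pairwise intersection loop is quadratic and only suited to toy inputs, and P5, P7, and P10 are omitted.

```python
from collections import Counter
from itertools import combinations

def local_statistics(hyperedges):
    """Raw values behind P1-P4, plus density (P8) and overlapness (P9)."""
    degree = Counter()       # P1: node -> number of hyperedges containing it
    pair_degree = Counter()  # P3: node pair -> number of hyperedges containing both
    for e in hyperedges:
        degree.update(e)
        pair_degree.update(combinations(sorted(e), 2))
    sizes = [len(e) for e in hyperedges]  # P2: hyperedge sizes
    # P4: intersection sizes of overlapping hyperedge pairs
    int_sizes = [len(a & b) for a, b in combinations(hyperedges, 2) if a & b]
    n_nodes = len(degree)
    return {
        "degree": degree,
        "size": sizes,
        "pair_degree": pair_degree,
        "int_size": int_sizes,
        "density": len(hyperedges) / n_nodes,  # |E| / |V|
        "overlapness": sum(sizes) / n_nodes,   # sum of |e| over |V|
    }

def component_fractions(hyperedges):
    """P6: fraction of nodes in each connected component, largest first."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for e in hyperedges:
        members = list(e)
        root = find(members[0])
        for v in members[1:]:
            parent[find(v)] = root  # nodes sharing a hyperedge are connected
    sizes = Counter(find(v) for v in list(parent))
    return sorted((s / len(parent) for s in sizes.values()), reverse=True)

stats = local_statistics([frozenset({1, 2, 3}), frozenset({2, 3}), frozenset({3, 4})])
fractions = component_fractions([frozenset({1, 2}), frozenset({2, 3}), frozenset({4})])
```

Merging the nodes of each hyperedge yields the same components as materializing the clique expansion, which avoids building the cliques explicitly.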

2.3 Simple and Intuitive Sampling Approaches

We describe the six intuitive approaches, which are categorized into node-selection methods and hyperedge-selection methods.

2.3.1 Node Selection (NS).

In node-selection methods, we choose a subset \(\mathcal {\hat{V}}\) of nodes and return the induced sub-hypergraph \(\mathcal {\hat{G}}=(\mathcal {\hat{V}},\mathcal {E}({\mathcal {\hat{V}}}))\) where \(\mathcal {E}({\mathcal {\hat{V}}}):=\lbrace e \in \mathcal {E}: \forall v \in e, v \in \mathcal {\hat{V}}\rbrace\) denotes the set of hyperedges composed only of nodes in \(\mathcal {\hat{V}}\) . Each process below is repeated until \(\mathcal {\hat{G}}\) has the desired size (i.e., \(|\mathcal {E}({\mathcal {\hat{V}}})|=\lfloor |\mathcal {E}| \cdot p \rfloor\) with \(p\) representing the proportion of sampling).

—

Random Node Sampling (RNS): We repeat drawing a node uniformly at random and adding it to \(\mathcal {\hat{V}}\) .

—

Random Degree Node (RDN): We repeat drawing a node with probabilities proportional to node degrees and adding it to \(\mathcal {\hat{V}}\) as in [46].

—

Random Walk (RW): We perform random walk with restart [64], setting the restart probability \(c=0.15\) on the clique expansion (defined in Section 2.2), and add each visited node to \(\mathcal {\hat{V}}\) in turn. We select a new seed node, where random walks restart, after reaching the maximum number of steps, which is set to the number of nodes.

—

Forest Fire (FF): We simulate forest fire in hypergraphs as in [35]. First, we choose a random node \(w\) as an ambassador and burn it. Then, we burn \(n\) neighbors of \(w\) where \(n\) is sampled from a geometric distribution with mean \(p/(1-p)\) . We recursively apply the previous step to each burned neighbor by considering it as a new ambassador, but the number of neighbors to be burned is sampled from a different geometric distribution with mean \(q/(1-q)\) . Each burned node is added to \(\mathcal {\hat{V}}\) in turn. If there is no new burned node, we choose a new ambassador uniformly at random. We set \(p\) to 0.51 and \(q\) to 0.2 as in [35]. This method extends a successful representative sampling method for graphs [46] to hypergraphs.
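The two simplest node-selection methods (RNS and RDN) can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes nodes are drawn without replacement, recomputes the induced hyperedges naively after every draw, and stops as soon as at least the target number of hyperedges is induced, so the result may slightly overshoot the target. The function names and toy data are hypothetical.

```python
import math
import random

def induced_hyperedges(hyperedges, kept_nodes):
    """Hyperedges composed only of nodes in kept_nodes."""
    return [e for e in hyperedges if e <= kept_nodes]

def node_selection_sample(hyperedges, p, degree_biased=False, seed=0):
    """RNS (uniform) or RDN (degree-proportional) node selection."""
    rng = random.Random(seed)
    target = math.floor(len(hyperedges) * p)
    degree = {}
    for e in hyperedges:
        for v in e:
            degree[v] = degree.get(v, 0) + 1
    remaining = sorted(degree)  # nodes not yet selected
    kept, sampled = set(), []
    while remaining and len(sampled) < target:
        # RDN weights each node by its degree; RNS draws uniformly.
        weights = [degree[v] for v in remaining] if degree_biased else None
        v = rng.choices(remaining, weights=weights, k=1)[0]
        remaining.remove(v)
        kept.add(v)
        sampled = induced_hyperedges(hyperedges, kept)
    return kept, sampled

toy = [frozenset({1, 2}), frozenset({2, 3}), frozenset({1, 2, 3}),
       frozenset({4, 5}), frozenset({5, 6}), frozenset({1, 6})]
kept, sampled = node_selection_sample(toy, p=0.5, degree_biased=True)
```

Because a hyperedge is induced only once all of its nodes have been selected, a single draw can add several hyperedges at once or none at all.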

2.3.2 Hyperedge Selection (HS).

In hyperedge-selection methods, we draw a subset \(\mathcal {\hat{E}}\) of hyperedges and return \(\mathcal {\hat{G}}=(\mathcal {V}({\mathcal {\hat{E}}}),\mathcal {\hat{E}})\) , where \(\mathcal {V}({\mathcal {\hat{E}}}):=\bigcup _{e\in \mathcal {\hat{E}}}e\) is the set of nodes in any hyperedge in \(\mathcal {\hat{E}}\) .

—

Random Hyperedge Sampling (RHS): We draw a target number (i.e., \(\lfloor |\mathcal {E}| \cdot p \rfloor\) with \(p\) representing the proportion of sampling) of hyperedges uniformly at random.

—

Totally-Induced Hyperedge Sampling (TIHS): We extend totally-induced edge sampling [2] to hypergraphs. We repeat (a) adding a hyperedge uniformly at random to \(\mathcal {\hat{E}}\) and (b) adding all hyperedges induced by \(\mathcal {V}({\mathcal {\hat{E}}})\) (i.e., \(\lbrace e \in \mathcal {E}: \forall v \in e, v \in \mathcal {V}({\mathcal {\hat{E}}})\rbrace\) ) to \(\mathcal {\hat{E}}\) .
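Both hyperedge-selection methods admit a short sketch. The version below is an illustration under stated assumptions rather than the authors' code: TIHS draws not-yet-selected hyperedges in a random order, and it stops once at least the target number of hyperedges has been collected, so the induction step may overshoot the target. The names `rhs` and `tihs` and the toy data are hypothetical.

```python
import math
import random

def rhs(hyperedges, p, seed=0):
    """Random Hyperedge Sampling: floor(|E| * p) hyperedges, uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(hyperedges, math.floor(len(hyperedges) * p))

def tihs(hyperedges, p, seed=0):
    """Totally-Induced Hyperedge Sampling (stopping rule is an assumption)."""
    rng = random.Random(seed)
    target = math.floor(len(hyperedges) * p)
    order = list(range(len(hyperedges)))
    rng.shuffle(order)
    chosen, covered = set(), set()  # selected hyperedge indices and their nodes
    for i in order:
        if len(chosen) >= target:
            break
        if i in chosen:
            continue
        covered |= hyperedges[i]  # step (a): add a random hyperedge
        # step (b): add every hyperedge induced by the covered nodes
        chosen.update(j for j, e in enumerate(hyperedges) if e <= covered)
    return [hyperedges[j] for j in sorted(chosen)]

toy = [frozenset({1, 2}), frozenset({2, 3}), frozenset({1, 2, 3}),
       frozenset({4, 5}), frozenset({5, 6}), frozenset({1, 6})]
rhs_sample = rhs(toy, p=0.5)
tihs_sample = tihs(toy, p=0.5)
```

The induction step is what distinguishes TIHS from RHS: once the nodes of a sampled hyperedge are covered, every other hyperedge contained in the covered node set is pulled in as well.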

2.4 Datasets

Throughout the article, we use 11 datasets summarized in Table 1 after removing all duplicated hyperedges. Their domains are:

Table 1.

Dataset	\(|\mathcal{V}|\)	\(|\mathcal{E}|\)	Avg. \(d_v\)	Avg. \(|e|\)	No. of CCs	Largest CC	GCC	Density	Diameter
email-Enron	143	1,514	32.3	3.05	1	143	0.66	10.59	2.38
email-Eu	1,005	25,148	88.9	3.56	20	986	0.57	25.02	2.78
contact-primary	242	12,704	126.9	2.42	1	242	0.53	52.50	1.88
contact-high	327	7,818	55.6	2.33	1	327	0.50	23.91	2.63
NDC-classes	1,161	1,090	5.6	5.97	183	628	0.83	0.94	4.65
NDC-substances	5,556	10,273	12.2	6.62	1,888	3,414	0.72	1.85	3.04
tags-ubuntu	3,029	147 K	164.8	3.39	9	3,021	0.61	48.60	2.41
tags-math	1,629	170 K	364.1	3.48	3	1,627	0.63	104.65	2.13
threads-ubuntu	125 K	166 K	2.5	1.91	39 K	82 K	0.55	1.33	4.73
coauth-geology	1.2 M	1.2 M	3.0	3.17	230 K	903 K	0.76	0.96	7.04
coauth-history	1 M	896 K	1.3	1.57	617 K	242 K	0.82	0.87	11.28

Table 1. Overview of the Real-world Hypergraph Datasets Used in the Article

The 11 datasets originate from six distinct domains, and they exhibit distinct structural properties.

—

e-mail (email-Enron [34] and email-Eu [47, 71]): Each hyperedge represents an e-mail. It consists of the sender and receivers.

—

contact (contact-primary [63] and contact-high [54]): Each hyperedge represents a group interaction. It consists of individuals.

—

drugs (NDC-classes and NDC-substances): Each hyperedge represents an NDC code for a drug. It consists of classes or substances.

—

tags (tags-ubuntu and tags-math): Each hyperedge represents a post. It consists of tags.

—

threads (threads-ubuntu): Each hyperedge represents a question. It consists of a questioner and responders.

—

co-authorship (coauth-geology [60] and coauth-history [60]): Each hyperedge represents a publication. It consists of co-authors.

2.5 Evaluation

In this work, our focus is on general-purpose sub-hypergraph sampling. Rather than assuming specific use cases of sampled sub-hypergraphs, we aim at preserving a wide range of structural properties that are identified as unique characteristics of real-world hypergraphs. Specifically, we evaluate the goodness of a sub-hypergraph \(\mathcal {\hat{G}}\) based on its ability to accurately preserve the structural properties of the target hypergraph in ten aspects P1–P10. The target hypergraph can be either the input hypergraph itself or a past snapshot of the input time-evolving hypergraph, depending on whether we are dealing with representative sampling (Section 3) or back-in-time hypergraph sampling (Section 4), respectively. For each of P1–P6, which are (probability density) functions, we measure the Kolmogorov-Smirnov D-statistic. That is, for functions \(f\) from \(\mathcal {G}\) and \(\hat{f}\) from \(\mathcal {\hat{G}}\) , if we let their cumulative sums be \(F\) and \(\hat{F}\) ,² the D-statistic is defined as follows:

\begin{equation} \text{D-statistic} (f, \hat{f}) = \max _{x\in \mathcal {D}} \lbrace | \hat{F}(x) - F(x) | \rbrace , \tag{1} \end{equation}

where \(\mathcal {D}\) is the domain of \(f\) and \(\hat{f}\) . For each of P7–P10, which are scalars, we measure the relative difference. Specifically, for scalars \(y\) from \(\mathcal {G}\) and \(\hat{y}\) from \(\mathcal {\hat{G}}\) , the relative difference is defined as follows:

\begin{equation} \text{Relative Difference}(y, \hat{y}) = \frac{|y-\hat{y}|}{|y|}. \tag{2} \end{equation}
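The two distance measures can be sketched in a few lines. The sketch assumes each distribution is supplied as a list of observed values (for example, one degree per node), from which empirical cumulative distributions are built. The function names are hypothetical, and any normalization detail specified in the article's footnote is not reproduced here.

```python
from collections import Counter

def d_statistic(values, values_hat):
    """Largest absolute gap between the empirical CDFs of two value lists."""
    counts, counts_hat = Counter(values), Counter(values_hat)
    n, n_hat = len(values), len(values_hat)
    cum = cum_hat = 0
    worst = 0.0
    # Sweep the shared domain in increasing order, accumulating both CDFs.
    for x in sorted(set(counts) | set(counts_hat)):
        cum += counts[x]
        cum_hat += counts_hat[x]
        worst = max(worst, abs(cum_hat / n_hat - cum / n))
    return worst

def relative_difference(y, y_hat):
    """|y - y_hat| / |y| for scalar statistics such as density."""
    return abs(y - y_hat) / abs(y)

d = d_statistic([1, 1, 2, 3], [1, 2, 2, 3])
r = relative_difference(2.0, 1.5)
```

In the toy call, the two CDFs differ most at value 1 (0.5 versus 0.25), so the D-statistic is 0.25.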

In order to compare the qualities of sub-hypergraphs sampled by different methods, we aggregate the ten distances described in Section 2.2. Since the scales of the distances may differ, we compute rankings and Z-Scores to make it possible to directly compare and average them, as follows:

—

Ranking: With respect to each of P1–P10, we rank all sub-hypergraphs using their distances.

—

Z-Score: With respect to each of P1–P10, we standardize the distance of sub-hypergraphs by subtracting the mean and dividing the difference by the standard deviation.

When comparing sampling methods in multiple settings (e.g., sampling portions and datasets), we compute the above rankings (or Z-Scores) of their samples in each setting and average the rankings (or Z-Scores) of samples from each method. Note that both metrics are determined based on the specific methods being compared. Therefore, the same sampling method may yield different metric values depending on the methods used for comparison.
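The aggregation for a single property in a single setting can be sketched as below. The input distances are made-up numbers, ties in the ranking are broken arbitrarily, and the population standard deviation is used. The last two choices are assumptions, since the text does not specify them.

```python
from statistics import mean, pstdev

def rank_and_zscore(distances):
    """distances: {method: distance} for one property; smaller is better."""
    ordered = sorted(distances, key=distances.get)
    ranks = {m: i + 1 for i, m in enumerate(ordered)}  # rank 1 = closest to target
    values = list(distances.values())
    mu, sigma = mean(values), pstdev(values)
    # Standardize each distance; fall back to 0 when all distances coincide.
    z = {m: (d - mu) / sigma if sigma else 0.0 for m, d in distances.items()}
    return ranks, z

# Hypothetical distances of three methods with respect to one property.
ranks, z = rank_and_zscore({"A": 0.1, "B": 0.2, "C": 0.3})
```

Averaging these per-property ranks (or Z-Scores) over properties and settings then gives one score per method. Because the mean and standard deviation are computed over the compared methods, adding or removing a method changes every Z-Score.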

3 Representative Hypergraph Sampling

In this section, we focus on representative hypergraph sampling. In Section 3.1, we provide a formal problem definition. In Section 3.2, given that we are, to the best of our knowledge, the first to explore this problem, we analyze the advantages and drawbacks of six simple and intuitive sampling approaches (described in Section 2.3). Leveraging insights from these investigations, in Section 3.3, we propose our approach, MiDaS. In Section 3.4, we demonstrate the effectiveness of MiDaS through experiments, where we follow the evaluation methodology outlined in Section 2.5.

3.1 Problem Definition

Based on the statistics (defined in Section 2.2), we formulate the representative hypergraph sampling problem in Problem 1.

Problem 1 (Representative Hypergraph Sampling).

—

Given: a large hypergraph \(\mathcal {G}= (\mathcal {V}, \mathcal {E})\) and a sampling portion \(p\in (0,1)\)

—

Find: a sub-hypergraph \(\mathcal {\hat{G}}=(\mathcal {\hat{V}}, \mathcal {\hat{E}})\) where \(\mathcal {\hat{V}}\subseteq \mathcal {V}\) and \(\mathcal {\hat{E}}\subseteq \mathcal {E}\)

—

To Preserve: ten structural properties of \(\mathcal {G}\) measured by P1–P10

—

Subject to: \(|\mathcal {\hat{E}}| = \lfloor |\mathcal {E}| \cdot p \rfloor\)

In Problem 1, the objective is to find the most representative sub-hypergraph composed of a given portion of hyperedges. However, achieving optimality is challenging as the ten structural properties need to be considered simultaneously. In this article, we focus on developing heuristics that work well in practice. We measure the performance of sampling algorithms by evaluating the sub-hypergraph \(\mathcal {\hat{G}}\) sampled by each algorithm as described in Section 2.5.

3.2 Observations

We evaluate the six intuitive approaches using the 11 datasets under five different sampling portions, as described in Section 2. The results are summarized in Table 2 and Table 3. Below, we describe the characteristics of each approach. We use \(\mathcal {\hat{G}}_{ALG}\) to denote a sub-hypergraph obtained by each approach \(ALG\) .

Table 2.

Table 3.

		RNS	RDN [46]	RW [64]	FF [35]	RHS	TIHS [2]
Degree	Dstat.	0.29	0.29	0.32	0.30	0.30	0.28
Degree	Rank (Z-Score)	3.51 (-0.11)	3.16 (-0.14)	3.96 (0.29)	3.89 (0.24)	3.24 (-0.13)	3.24 (-0.15)
Int. Size	Dstat.	0.09	0.03	0.04	0.04	0.01	0.03
Int. Size	Rank (Z-Score)	4.42 (0.69)	3.49 (-0.06)	4.00 (0.15)	4.29 (0.41)	1.07 (-1.14)	3.73 (-0.04)
Pair Degree	Dstat.	0.13	0.11	0.09	0.11	0.11	0.09
Pair Degree	Rank (Z-Score)	3.95 (0.26)	4.11 (0.30)	2.69 (-0.28)	3.93 (0.13)	3.64 (0.01)	2.69 (-0.43)
Size	Dstat.	0.23	0.11	0.12	0.06	0.01	0.09
Size	Rank (Z-Score)	5.85 (1.54)	4.20 (0.08)	4.11 (0.20)	2.45 (-0.57)	1.00 (-1.11)	3.38 (-0.14)
SV	Dstat.	0.12	0.16	0.15	0.16	0.08	0.15
SV	Rank (Z-Score)	2.89 (-0.24)	4.07 (0.30)	3.64 (0.21)	4.56 (0.59)	1.78 (-0.96)	3.45 (0.10)
CC	Dstat.	0.16	0.13	0.18	0.14	0.10	0.14
CC	Rank (Z-Score)	3.71 (0.61)	2.25 (-0.17)	3.73 (0.46)	3.09 (-0.04)	1.87 (-0.71)	2.65 (-0.15)
GCC	Diff.	0.12	0.15	0.09	0.12	0.10	0.08
GCC	Rank (Z-Score)	3.31 (0.06)	4.60 (0.53)	3.15 (-0.25)	4.16 (0.16)	2.84 (-0.30)	2.95 (-0.21)
Density	Diff.	0.37	0.54	0.49	0.52	0.52	0.43
Density	Rank (Z-Score)	3.07 (-0.38)	4.20 (0.35)	3.24 (-0.08)	3.42 (-0.00)	4.18 (0.44)	2.89 (-0.33)
Overlapness	Diff.	0.55	0.55	0.62	0.53	0.52	0.46
Overlapness	Rank (Z-Score)	3.98 (0.16)	3.49 (-0.09)	3.80 (0.29)	3.24 (-0.20)	3.71 (0.20)	2.78 (-0.38)
Diameter	Diff.	0.34	0.14	0.12	0.11	0.20	0.12
Diameter	Rank (Z-Score)	4.64 (0.72)	3.02 (-0.21)	3.42 (-0.15)	3.20 (-0.26)	3.98 (0.29)	2.75 (-0.39)
Average	Rank (Z-Score)	3.93 (0.33)	3.66 (0.09)	3.57 (0.08)	3.62 (0.05)	2.73 (-0.34)	3.05 (-0.21)

Table 3. Six Intuitive Sampling Methods are Compared as Described in Section 2

D-statistics (Dstat.) and relative differences (Diff.) are computed for each property. As D-statistics and relative differences have varying scales for different properties, rankings and Z-Scores are calculated to facilitate comparisons among the six methods. Reported results are the averages over five sampling portions ( \(10\%, \ldots , 50\%\) ) across 11 datasets. The bold text highlights the best results in terms of each property. Notably, RHS provides the most representative sub-hypergraphs overall.

3.2.1 Random Node Sampling (RNS).

Small hyperedges: In \(\mathcal {\hat{G}}_{RNS}\) , large hyperedges are rarely sampled because all nodes in a hyperedge must be sampled for the hyperedge to be sampled, which is unlikely.
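A short calculation makes this concrete. As a simplifying assumption that ignores the stopping rule of Section 2.3.1, suppose \(\mathcal {\hat{V}}\) is a uniformly random subset of \(k\) of the \(n=|\mathcal {V}|\) nodes. A hyperedge \(e\) of size \(s\le k\) is then sampled with probability

\[ \Pr [e\subseteq \mathcal {\hat{V}}] = \binom{n-s}{k-s} \Big / \binom{n}{k} = \prod _{i=0}^{s-1} \frac{k-i}{n-i} \le \left(\frac{k}{n}\right)^{s}, \]

which decays exponentially in \(s\) . For example, with \(k/n=0.5\) , a hyperedge of size 2 is sampled with probability at most \(0.25\) , whereas one of size 6 is sampled with probability at most \(0.5^{6}\approx 0.016\) .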

Weak connectivity: As large hyperedges are rare, the local and global connectivity is weak. Locally, node degrees, node-pair degrees, hyperedge sizes, and intersection sizes tend to be low in \(\mathcal {\hat{G}}_{RNS}\) . Globally, \(\mathcal {\hat{G}}_{RNS}\) tends to have low density, especially low overlapness, and large diameter. It also tends to have many connected components with small portions of nodes.

Precise preservation of relative singular values: Relative singular values (see P5) are preserved best in \(\mathcal {\hat{G}}_{RNS}\) among the sub-hypergraphs obtained by node-selection methods.
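As a rough illustration of the "small hyperedges" observation above, suppose (as a simplifying assumption that is not the exact RNS procedure) that each node is retained independently with probability \(q\). A hyperedge \(e\) is retained only if all \(|e|\) of its nodes are retained, so, writing \(\mathcal {\hat{E}}_{RNS}\) for the hyperedge set of \(\mathcal {\hat{G}}_{RNS}\) (notation introduced here only for illustration):

\[ \Pr [e \in \mathcal {\hat{E}}_{RNS}] = q^{|e|}. \]

For example, with \(q = 0.5\), a hyperedge of size 2 is retained with probability \(0.25\), whereas a hyperedge of size 10 is retained with probability \(0.5^{10} \approx 0.001\).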

3.2.2 Random Degree Node (RDN), Random Walk (RW), and Forest-Fire (FF).

More high-degree nodes than RNS: RDN, RW, and FF lead to a larger portion of high-degree nodes than RNS since they prioritize high-degree nodes. Thus, they preserve degree distributions better than RNS in some datasets where RNS significantly increases the fraction of low-degree nodes. In particular, degree distributions are preserved best by RDN in terms of ranking.

Stronger connectivity than RNS: High-degree nodes strengthen connectivity. Thus, sub-hypergraphs obtained by non-uniform node-selection methods tend to have higher density, higher overlapness, and smaller diameter than \(\mathcal {\hat{G}}_{RNS}\), and sometimes even than the original hypergraph \(\mathcal {G}\). Notably, in these sub-hypergraphs, a larger fraction of nodes belongs to the largest connected component than in \(\mathcal {G}\), which reduces the number of connected components.

3.2.3 Random Hyperedge Sampling (RHS).

Best preservation of many properties: RHS preserves hyperedge-level statistics (i.e., hyperedge sizes and intersection sizes) nearly perfectly. It is also best at preserving connected-component sizes, relative singular values, and global clustering coefficients.

Weak connectivity: As RHS is equivalent to uniform hypergraph sparsification, \(\mathcal {\hat{G}}_{RHS}\) suffers from weak connectivity. Locally, node degrees and pair degrees tend to be low in \(\mathcal {\hat{G}}_{RHS}\) . Globally, in \(\mathcal {\hat{G}}_{RHS}\) , density and overlapness are low, and diameter is large.

3.2.4 Totally-Induced Hyperedge Sampling (TIHS).

Complementarity to RHS: TIHS is best at preserving node degrees, node-pair degrees, density, overlapness, and diameter, which are the properties that RHS preserves poorly.

Strong connectivity: Still, node degrees, density, and overlapness tend to be higher, and diameter tends to be smaller in \(\mathcal {\hat{G}}_{TIHS}\) than in the original hypergraph \(\mathcal {G}\) . That is, the connectivity tends to be a bit stronger in \(\mathcal {\hat{G}}_{TIHS}\) than in \(\mathcal {G}\) . Thus, \(\mathcal {\hat{G}}_{TIHS}\) tends to have fewer but larger connected components than \(\mathcal {G}\) .

3.2.5 Summary of Observations.

As summarized in Table 3, when considering all settings, RHS provides overall the best representative sub-hypergraphs. While RHS produces sub-hypergraphs with weaker connectivity, RHS is by far the best method in preserving hyperedge sizes, intersection sizes, relative singular values, connected-component sizes, and global clustering coefficients.

3.3 Proposed Approach: MiDaS

In this section, we propose MiDaS, our sampling method for Problem 1. We first discuss the motivations behind it. Then, we describe MiDaS-Basic, a preliminary version of MiDaS. Lastly, we present the full-fledged version of MiDaS.

3.3.1 Intuitions Behind MiDaS.

Analyzing the simple approaches in Section 3.2 motivates us to come up with MiDaS. Especially, we focus on the following findings:

Observation 1.

RHS performs best, but its samples suffer from weak connectivity, including the lack of high-degree nodes.

Specifically, when designing MiDaS, we aim at overcoming the limitations of RHS while maintaining its strengths. Especially, based on the above findings, our focus is on better preservation of node degrees by increasing the fraction of high-degree nodes while expecting that this also helps preserve other properties. Our expectation is also supported by the strong correlation between (a) the average degree in sub-hypergraphs and (b) their overlapness and density, which tend to be low in sub-hypergraphs sampled by RHS. This correlation, which is shown in Table 4, is naturally expected from the fact that high-degree nodes increase the number of hyperedges per node and the definitions of density and overlapness (see Section 2.2).

Table 4.

3.3.2 MiDaS-Basic: Preliminary Version.

How can we better preserve node degrees, which seem to be a decisive property, while maintaining the advantages of RHS? Towards this goal, we first present MiDaS-Basic, a preliminary sampling method that determines the amount of sampling bias (i.e., the statistical prioritization of specific nodes or hyperedges during sampling) toward high-degree nodes by a single hyperparameter, for Problem 1.

Description: The pseudocode of MiDaS-Basic is provided in Algorithm 1. Given a hypergraph \(\mathcal {G}=(\mathcal {V},\mathcal {E})\) and a sampling portion \(p\), it returns a sub-hypergraph \(\mathcal {\hat{G}}=(\mathcal {\hat{V}},\mathcal {\hat{E}})\) of \(\mathcal {G}\) where the number of hyperedges in \(\mathcal {\hat{G}}\) is a fraction \(p\) of that in \(\mathcal {G}\). Starting from an empty hypergraph, MiDaS-Basic repeatedly draws a hyperedge as in RHS. However, unlike RHS, the probability of each hyperedge \(e\) being drawn at each step is proportional to \(\omega (e)^{\alpha }\), where \(\omega (\cdot)\) is a hyperedge weight function and the exponent \(\alpha\) ( \(\ge 0\) ) is a given constant. Note that, if \(\alpha\) is zero, MiDaS-Basic is equivalent to RHS.

Based on the intuitions, MiDaS-Basic prioritizes hyperedges with high-degree nodes to increase the fraction of such nodes. In order to prioritize especially hyperedges composed only of high-degree nodes, it uses \(\omega (e):=\min _{v \in e} d_{v}\) , where \(d_v\) is the degree of \(v\) in \(\mathcal {G}\) , as the hyperedge weight function.
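Algorithm 1 is not reproduced in this text, so the following is only a minimal Python sketch of the description above. It assumes that hyperedges are given as tuples of node identifiers, that they are drawn without replacement, and that the target count is \(\lfloor p \cdot |\mathcal {E}| \rfloor\); the function names `node_degrees` and `midas_basic` are hypothetical. For readability it redraws with a linear scan, so it does not attain the complexity stated in Theorem 1.

```python
import random
from collections import Counter


def node_degrees(hyperedges):
    """Degree of a node = number of hyperedges that contain it."""
    deg = Counter()
    for e in hyperedges:
        deg.update(set(e))
    return deg


def midas_basic(hyperedges, p, alpha, seed=None):
    """Draw floor(p * |E|) hyperedges, each step picking a remaining
    hyperedge e with probability proportional to (min_{v in e} d_v) ** alpha."""
    rng = random.Random(seed)
    deg = node_degrees(hyperedges)
    # Degrees are taken in the input hypergraph, not in the growing sample.
    weights = [min(deg[v] for v in e) ** alpha for e in hyperedges]
    remaining = list(range(len(hyperedges)))
    target = int(len(hyperedges) * p)
    sampled = []
    while len(sampled) < target:
        current = [weights[i] for i in remaining]
        pos = rng.choices(range(len(remaining)), weights=current, k=1)[0]
        sampled.append(remaining.pop(pos))  # assumed: no replacement
    return [hyperedges[i] for i in sampled]
```

With `alpha=0`, every weight equals 1 and the sketch reduces to uniform hyperedge sampling, matching the remark that MiDaS-Basic is then equivalent to RHS.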

Empirical properties: The value of \(\alpha\) affects the amount of bias toward high-degree nodes in MiDaS-Basic. Below, we analyze how \(\alpha\) affects samples (i.e., sub-hypergraphs) in practice.

The degrees of nodes within samples obtained with different \(\alpha\) values are shown in Figure 4, from which we make Observation 3. This result is promising, showing that the bias in degree distributions can be directly controlled by \(\alpha\) .

Fig. 4.

Observation 3.

As \(\alpha\) increases, the degree distributions in samples tend to be more biased toward high-degree nodes.

We additionally explore how the best-performing \(\alpha\) values,³ which lead to the best preservation of degree distributions in terms of D-statistics (see Section 2.5), are related to the skewness of degree distributions.⁴ A strong positive correlation is observed, as summarized in Observation 4 and shown in Figure 5.

Fig. 5.

Observation 4.

As degree distributions in original hypergraphs are more skewed, larger \(\alpha\) values are required (i.e., high-degree nodes need to be prioritized more) to preserve the distributions.

We also find a strong negative correlation between best-performing \(\alpha\) values and sampling portions, as shown in Figure 6 and summarized in Observation 5.

Fig. 6.

Observation 5.

As we sample fewer hyperedges, larger \(\alpha\) values are required (i.e., high-degree nodes need to be prioritized more) to preserve degree distributions.

Theoretical analysis: We analyze the time complexity of Algorithm 1. We also theoretically analyze Observation 3 in Appendix A.1. Specifically, we provide a sufficient condition for bias toward high-degree nodes to grow as \(\alpha\) increases, and we confirm that only \(\min _{v \in e} d_{v}\) satisfies this condition, while \(\max _{v \in e} d_{v}\) and \(\mathrm{avg}_{v \in e} d_{v}\) do not.

Theorem 1 (Time Complexity).

The time complexity of Algorithm 1 is \(O(p \cdot |\mathcal {E}| \cdot \log (\max _{v\in \mathcal {V}} d_{v})+ \sum _{e\in \mathcal {E}} |e|)\) .

Proof.

It takes \(O(\sum _{e\in \mathcal {E}} |e|)\) time to compute \(\omega (e)\) for every hyperedge \(e\), and it takes \(O(|\mathcal {E}|)\) time to build a balanced binary tree with \(\max _{e \in \mathcal {E}} \omega (e)\) leaf nodes where the \(k\)th leaf node points to the list of all hyperedges whose weight is \(k^{\alpha }\). Then, it takes \(O(\max _{e\in \mathcal {E}} \omega (e))=O(|\mathcal {E}|)\) time in total to store in each node \(i\) the sum of the weights of the hyperedges pointed to by any leaf in the sub-tree rooted at \(i\), if we store them from the leaf nodes up to the root. The height of the tree is \(O(\log (\max _{e\in \mathcal {E}} \omega (e)))=O(\log (\max _{v\in \mathcal {V}} d_{v}))\), and thus drawing each hyperedge (i.e., from the root, repeatedly choosing a child with probability proportional to its stored weight until reaching a leaf, and then drawing a hyperedge that the leaf points to) and updating the weights accordingly takes \(O(\log (\max _{v\in \mathcal {V}} d_{v}))\) time. Drawing \(p \cdot |\mathcal {E}|\) hyperedges takes \(O(p \cdot |\mathcal {E}| \cdot \log (\max _{v\in \mathcal {V}} d_{v}))\) time, and since \(|\mathcal {E}|=O(\sum _{e\in \mathcal {E}} |e|)\), the total time complexity is \(O(p \cdot |\mathcal {E}| \cdot \log (\max _{v\in \mathcal {V}} d_{v})+\sum _{e\in \mathcal {E}} |e|)\). □
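The tree used in the proof can be sketched as follows. This is one possible realization under stated assumptions, not the authors' implementation: the class name `BucketSumTree` and its layout (an array-based complete binary tree padded to a power of two, with bucket \(k\) holding the hyperedges whose \(\omega (e)\) equals \(k\)) are hypothetical, and drawn hyperedges are assumed to be removed.

```python
import random


class BucketSumTree:
    """Sum tree over weight buckets k = 1..K, where bucket k lists the
    hyperedges with omega(e) = k and carries total weight count * k**alpha."""

    def __init__(self, omegas, alpha):
        self.alpha = alpha
        self.size = 1
        while self.size < max(omegas):
            self.size *= 2  # pad the number of leaves to a power of two
        self.buckets = [[] for _ in range(self.size)]
        for edge_id, k in enumerate(omegas):
            self.buckets[k - 1].append(edge_id)
        self.tree = [0.0] * (2 * self.size)
        for b in range(self.size):
            self.tree[self.size + b] = self._leaf_weight(b)
        for i in range(self.size - 1, 0, -1):  # fill from leaves to root
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def _leaf_weight(self, b):
        return len(self.buckets[b]) * float(b + 1) ** self.alpha

    def draw(self, rng):
        """Draw and remove one hyperedge id in O(log K) time."""
        if self.tree[1] <= 0.0:
            raise ValueError("no hyperedges left to draw")
        i, r = 1, rng.random() * self.tree[1]
        while i < self.size:  # descend from the root to a leaf
            left, right = self.tree[2 * i], self.tree[2 * i + 1]
            if r < left or right <= 0.0:
                i = 2 * i
            else:
                r -= left
                i = 2 * i + 1
        b = i - self.size
        bucket = self.buckets[b]
        j = rng.randrange(len(bucket))  # uniform choice inside the bucket
        bucket[j], bucket[-1] = bucket[-1], bucket[j]
        edge_id = bucket.pop()
        i = self.size + b  # refresh the sums on the leaf-to-root path
        self.tree[i] = self._leaf_weight(b)
        i //= 2
        while i >= 1:
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
            i //= 2
        return edge_id
```

All hyperedges in a bucket share the same weight, so picking a bucket by total weight and then a member uniformly gives each hyperedge a probability proportional to \(\omega (e)^{\alpha }\). Recomputing the sums along the path, rather than subtracting, keeps emptied buckets at exactly zero.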

3.3.3 MiDaS: Full-Fledged Version.

As suggested by Observation 3, the hyperparameter \(\alpha\) in MiDaS-Basic should be tuned carefully. We propose MiDaS, the full-fledged version of our sampling method that automatically tunes \(\alpha\) .

Based on the strong correlations in Observations 4 and 5, MiDaS tunes \(\alpha\) using a linear regressor \(\mathcal {M}\) that maps (a) the skewness of the degree distribution in the input hypergraph \(\mathcal {G}\) and (b) the sampling portion to (c) a best-performing \(\alpha\) value. In our experiments in Section 3.4, \(\mathcal {M}\) was fitted using the best-performing \(\alpha\) values⁵ on the considered datasets with five different sampling portions.⁶ For a fair comparison, when evaluating MiDaS on a dataset, we used only the remaining datasets for fitting \(\mathcal {M}\) .

The \(\alpha\) value obtained by the linear regression model \(\mathcal {M}\) is further tuned using hill climbing [58]. As the objective function, \(\mathcal {L}(\mathcal {G}, \mathcal {\hat{G}})\), MiDaS uses the D-statistic (see Section 2.5) between the degree distributions in the input hypergraph \(\mathcal {G}\) and a sample \(\mathcal {\hat{G}}\). For speed, we search for \(\alpha\) within a given discrete search space \(\mathcal {S}\),⁷ aiming at minimizing \(\mathcal {L^{\prime }}(\alpha):=\mathcal {L}(\mathcal {G},\text {MiDaS-Basic} (\mathcal {G}, p, \alpha))\). A search ends when it (a) finds a local minimum of \(\mathcal {L^{\prime }}\) within \(\mathcal {S}\) or (b) reaches an end of \(\mathcal {S}\). Algorithm 2 describes MiDaS.
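The tuning step can be sketched as below. Algorithm 2 and the search space \(\mathcal {S}\) are not reproduced here, so this is an illustration under assumptions: `d_statistic` takes the D-statistic to be the maximum gap between two empirical cumulative distributions (the Kolmogorov-Smirnov statistic), `hill_climb` is a generic neighbor-to-neighbor search, and both names are hypothetical. The starting value would come from the regressor \(\mathcal {M}\), which is not modeled.

```python
from bisect import bisect_right


def d_statistic(xs, ys):
    """Largest absolute gap between the empirical CDFs of two samples
    (assumed here to be the D-statistic referred to in Section 2.5)."""
    xs, ys = sorted(xs), sorted(ys)
    support = sorted(set(xs) | set(ys))
    return max(
        abs(bisect_right(xs, v) / len(xs) - bisect_right(ys, v) / len(ys))
        for v in support
    )


def hill_climb(space, start, loss):
    """Search a sorted, discrete space for a local minimum of loss,
    starting from the grid point closest to start."""
    i = min(range(len(space)), key=lambda j: abs(space[j] - start))
    cache = {}

    def value(j):
        if j not in cache:  # each candidate alpha is evaluated once
            cache[j] = loss(space[j])
        return cache[j]

    while True:
        best = i
        for j in (i - 1, i + 1):
            if 0 <= j < len(space) and value(j) < value(best):
                best = j
        if best == i:  # no neighbor improves: local minimum or end of space
            return space[i]
        i = best
```

In the setting described above, `loss(alpha)` would run MiDaS-Basic with the given \(\alpha\) and return the D-statistic between the degree distributions of the input and the sample. Caching matters because each evaluation requires drawing a full sample.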

3.4 Evaluation

We review our experiments designed to answer the following questions:

Q1.

Quality: How well does MiDaS preserve the ten structural properties (P1–P10) of real-world hypergraphs?

Q2.

Consistency: Does MiDaS perform well regardless of the sampling portions?

Q3.

Speed: How fast is MiDaS compared to the competitors?

3.4.1 Experimental Settings.

We use the 11 datasets described in Section 2.4. We compare MiDaS with the simple methods described in Section 2.3 and two more sophisticated approaches: hybrid random walk (HRW) [78] and Metropolis graph sampling (MGS), which are described below. We use a machine with an i9-10900K CPU and 64 GB RAM in all cases except one. When running MGS-Avg-Del on the tags-math dataset, we use a machine with an AMD Ryzen 9 3900X CPU and 128 GB RAM due to its large memory requirement. The sample quality in each method is averaged over three trials.

Hybrid random walk (HRW) [78]: HRW is a recent approach for efficient hypergraph sampling, utilizing a random walk that alternates between nodes and hyperedges. When transitioning from a node, a random walker moves to a neighboring hyperedge selected with a probability proportional to its size; and when transitioning from a hyperedge, it selects a node within it uniformly at random. We limit the maximum length of each walk to twice the number of nodes. The resulting sub-hypergraph consists of all hyperedges visited during the random walk and all nodes contained in them, potentially including unvisited nodes. If the target number of hyperedges is not reached, the walk restarts from unvisited nodes until the goal is met. In our experiments, we use two advanced versions of HRW, HRW-NB and HRW-SK, which are specifically designed to prevent backtracking and the revisiting of nodes, respectively (refer to [78] for details).
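The basic walk described above can be sketched as follows. This is a simplified illustration and not the implementation from [78]: the HRW-NB and HRW-SK refinements are omitted, the way restarts are chosen is an assumption, and the names `build_incidence` and `hrw_sample` are hypothetical.

```python
import random
from collections import defaultdict


def build_incidence(hyperedges):
    """Map each node to the indices of the hyperedges that contain it."""
    inc = defaultdict(list)
    for idx, e in enumerate(hyperedges):
        for v in set(e):
            inc[v].append(idx)
    return inc


def hrw_sample(hyperedges, target, seed=None):
    """Collect hyperedges visited by a walk alternating nodes and hyperedges."""
    if target > len(hyperedges):
        raise ValueError("target exceeds the number of hyperedges")
    rng = random.Random(seed)
    inc = build_incidence(hyperedges)
    nodes = list(inc)
    max_len = 2 * len(nodes)  # cap on the length of a single walk
    visited, sampled = set(), set()
    while len(sampled) < target:
        unvisited = [v for v in nodes if v not in visited]
        v = rng.choice(unvisited if unvisited else nodes)  # (re)start
        for _ in range(max_len):
            visited.add(v)
            # Node -> hyperedge: probability proportional to hyperedge size.
            cand = inc[v]
            sizes = [len(hyperedges[i]) for i in cand]
            e = rng.choices(cand, weights=sizes, k=1)[0]
            sampled.add(e)
            if len(sampled) >= target:
                break
            # Hyperedge -> node: uniform over the nodes it contains.
            v = rng.choice(list(hyperedges[e]))
    return [hyperedges[i] for i in sorted(sampled)]
```

The node set of the resulting sub-hypergraph is the union of the returned hyperedges, which may include nodes the walker never stood on.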

Metropolis graph sampling (MGS): We also adapt MGS [27] for Problem 1. For a sub-hypergraph \(\mathcal {\hat{G}}\), we define \(\varrho ^{*}(\mathcal {\hat{G}}) := \frac{1}{\exp (k \cdot \Delta _{\mathcal {G}}(\mathcal {\hat{G}}))}\). Then, the acceptance probability of a move from a state \(\mathcal {\hat{G}}\) to a state \({\mathcal {\hat{G}}}^{\prime }\) is \(\min (1, \frac{\varrho ^{*}({\mathcal {\hat{G}}}^{\prime })}{\varrho ^{*}(\mathcal {\hat{G}})}) = \min (1, \exp (k \cdot (\Delta _{\mathcal {G}}(\mathcal {\hat{G}}) - \Delta _{\mathcal {G}}({\mathcal {\hat{G}}}^{\prime }))))\). That is, the algorithm favors moves that decrease a predefined objective function \(\Delta _{\mathcal {G}}(\mathcal {\hat{G}})\): such moves are always accepted, while moves that increase it are accepted with a probability that decays exponentially with the increase. In our experiments, we use the \(k\) values in the search space \(\lbrace 1, 10, 100, 10000\rbrace\). Depending on \(\Delta _{\mathcal {G}}(\mathcal {\hat{G}})\), we divide MGS into MGS-Deg and MGS-Avg. The former aims at preserving node degrees by setting \(\Delta _{\mathcal {G}}(\mathcal {\hat{G}})\) to the D-statistic between degree distributions in \(\mathcal {G}\) and \(\mathcal {\hat{G}}\). The latter aims at preserving node degrees, hyperedge sizes, node-pair degrees, and hyperedge intersection sizes at the same time by setting \(\Delta _{\mathcal {G}}(\mathcal {\hat{G}})\) to the average of the D-statistics from their distributions. The four statistics are chosen since they are cheap to compute at every step. We further divide each of MGS-Deg and MGS-Avg into three versions depending on how to move between states as follows:

— Add: MGS-Deg-Add (MGS-DA) and MGS-Avg-Add (MGS-AA) start from \(\mathcal {\hat{E}}=\emptyset\) and repeatedly propose to add a hyperedge to \(\mathcal {\hat{E}}\) until \(|\mathcal {\hat{E}}| = \lfloor |\mathcal {E}| \cdot p \rfloor\) holds.

— Replace: MGS-Deg-Rep (MGS-DR) and MGS-Avg-Rep (MGS-AR) initialize \(\mathcal {\hat{E}}\) so that \(|\mathcal {\hat{E}}| = \lfloor |\mathcal {E}| \cdot p \rfloor\) by using RHS. They then propose, 3,000 times, to replace a hyperedge in \(\mathcal {\hat{E}}\) with one outside \(\mathcal {\hat{E}}\).

— Delete: MGS-Deg-Del (MGS-DD) and MGS-Avg-Del (MGS-AD) start from \(\mathcal {\hat{E}}= \mathcal {E}\) and repeatedly propose to remove a hyperedge from \(\mathcal {\hat{E}}\) until \(|\mathcal {\hat{E}}| = \lfloor |\mathcal {E}| \cdot p \rfloor\) holds.
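The acceptance rule shared by all six MGS variants can be written as a small helper. This is a sketch of the formula given above, with hypothetical function names; how proposals are generated and how \(\Delta _{\mathcal {G}}\) is computed are left out.

```python
import math
import random


def acceptance_probability(k, delta_current, delta_proposed):
    """min(1, exp(k * (delta_current - delta_proposed)))."""
    improvement = delta_current - delta_proposed
    if improvement >= 0:
        # Short-circuit: exp(k * improvement) >= 1 here, and evaluating it
        # directly could overflow for large k (e.g., k = 10000).
        return 1.0
    return math.exp(k * improvement)


def metropolis_accept(k, delta_current, delta_proposed, rng):
    """Decide whether to move to the proposed state."""
    return rng.random() < acceptance_probability(k, delta_current, delta_proposed)
```

Larger values of \(k\) make the rule stricter: with `k=10`, a proposal that worsens the objective by 0.1 is accepted with probability about `exp(-1)`, whereas with `k=10000` the same proposal is essentially never accepted.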

3.4.2 Quality: How Well Does MiDaS Preserve the Ten Structural Properties of Real-world Hypergraphs?

We compare all 15 considered sampling methods numerically using distances (D-statistics or relative differences), rankings, and Z-Scores, as described in Section 2.5. The results under five sampling portions (10%–50%) are averaged in Table 5, and some are visualized in Table 6. MiDaS provides overall the most representative samples among the 15 considered methods, in terms of both average rankings and average Z-Scores. In particular, MiDaS best preserves node degrees, density, overlapness, and diameter. Compared to RHS, the D-statistics in degree distributions drop significantly, and the differences in density, overlapness, and diameter drop even more significantly. That is, better preserving node degrees in MiDaS helps resolve the weaknesses of RHS. While MiDaS is outperformed by RHS in preserving hyperedge sizes, intersection sizes, relative singular values, and connected-component sizes, the gaps between their D-statistics or differences are less than 0.05.

Table 5.

		RNS	RDN [46]	RW [64]	FF [35]	RHS	TIHS [2]	HRW-NB [78]	HRW-SK [78]	MGS-Deg-Add [27]	MGS-Deg-Rep [27]	MGS-Deg-Del [27]	MGS-Avg-Add [27]	MGS-Avg-Rep [27]	MGS-Avg-Del [27]	MiDaS
Degree	Dstat.	0.291	0.285	0.317	0.302	0.302	0.283	0.282	0.269	0.241	0.257	0.217	0.285	0.270	0.259	0.133
	Rank	9.309	8.582	10.018	10.236	11.545	8.709	7.964	6.509	6.600	7.182	4.127	9.909	9.073	7.291	2.909
	Z-Score	0.253	0.202	0.717	0.598	0.261	0.199	0.015	-0.127	-0.317	-0.109	-0.487	0.113	-0.004	-0.082	-1.234
Int. Size	Dstat.	0.093	0.033	0.038	0.035	0.007	0.033	0.083	0.080	0.014	0.024	0.053	0.002	0.002	0.008	0.024
	Rank	10.600	9.273	9.727	10.673	3.491	9.764	13.673	13.491	4.764	5.927	8.855	2.545	3.200	5.036	8.964
	Z-Score	0.627	-0.026	0.097	0.230	-0.767	-0.001	1.632	1.573	-0.583	-0.554	-0.004	-0.818	-0.792	-0.595	-0.019
Pair Degree	Dstat.	0.132	0.111	0.089	0.112	0.112	0.090	0.045	0.045	0.092	0.089	0.064	0.089	0.075	0.063	0.094
	Rank	9.764	10.891	8.291	10.673	11.527	8.109	3.345	3.618	8.800	9.382	7.273	7.927	7.327	5.709	7.364
	Z-Score	0.675	0.693	0.067	0.570	0.449	-0.014	-0.977	-0.948	0.132	0.133	-0.272	0.082	-0.101	-0.341	-0.148
Size	Dstat.	0.227	0.105	0.121	0.057	0.009	0.085	0.292	0.295	0.020	0.034	0.099	0.007	0.003	0.023	0.051
	Rank	13.109	10.691	10.691	8.164	3.582	9.691	14.073	14.109	4.927	5.418	8.800	2.418	1.673	4.600	8.055
	Z-Score	1.148	0.062	0.305	-0.353	-0.770	-0.108	1.829	1.836	-0.699	-0.591	-0.049	-0.796	-0.818	-0.603	-0.393
SV	Dstat.	0.122	0.158	0.154	0.164	0.084	0.154	0.104	0.105	0.101	0.087	0.115	0.085	0.085	0.096	0.125
	Rank	7.945	10.491	9.691	11.673	4.236	10.000	6.018	5.691	6.382	5.091	7.527	5.000	4.727	5.491	8.455
	Z-Score	0.241	0.794	0.557	0.873	-0.633	0.532	-0.151	-0.140	-0.413	-0.508	0.162	-0.465	-0.551	-0.338	0.038
CC	Dstat.	0.160	0.132	0.175	0.142	0.097	0.135	0.145	0.143	0.104	0.102	0.103	0.101	0.100	0.101	0.115
	Rank	9.345	7.382	9.800	8.618	4.055	7.782	7.327	6.655	5.364	5.000	5.873	4.327	3.545	5.200	6.982
	Z-Score	1.124	0.153	0.933	0.325	-0.462	0.167	0.136	0.072	-0.364	-0.470	-0.337	-0.354	-0.517	-0.375	-0.029
GCC	Diff.	0.120	0.153	0.085	0.119	0.100	0.083	0.081	0.092	0.097	0.096	0.093	0.106	0.099	0.075	0.081
	Rank	8.836	11.273	8.145	9.891	8.109	8.182	5.727	6.636	6.509	7.600	9.018	7.673	7.764	7.636	7.000
	Z-Score	0.517	0.962	-0.032	0.401	-0.116	-0.019	-0.360	-0.248	-0.238	-0.203	0.014	-0.163	-0.208	-0.275	-0.033
Density	Diff.	0.374	0.540	0.488	0.516	0.523	0.426	0.508	0.511	0.400	0.500	0.520	0.492	0.501	0.509	0.202
	Rank	4.818	9.364	7.491	8.145	10.927	6.836	8.909	8.891	5.655	8.491	9.473	7.564	8.818	9.782	2.473
	Z-Score	-0.490	0.412	-0.012	0.045	0.313	-0.317	0.326	0.347	-0.282	0.182	0.307	0.123	0.189	0.236	-1.379
Overlapness	Diff.	0.546	0.550	0.616	0.531	0.523	0.460	0.433	0.446	0.404	0.472	0.451	0.501	0.500	0.489	0.202
	Rank	9.891	7.909	8.818	8.455	11.982	6.727	5.473	6.236	6.836	8.691	7.218	10.073	10.145	8.964	2.582
	Z-Score	0.389	0.017	0.516	-0.060	0.353	-0.296	-0.236	-0.163	-0.097	0.143	0.032	0.264	0.251	0.213	-1.326
Diameter	Diff.	0.344	0.139	0.117	0.109	0.195	0.117	0.122	0.132	0.157	0.162	0.158	0.204	0.182	0.182	0.079
	Rank	10.945	7.582	8.273	6.582	9.382	6.582	6.818	6.855	6.273	8.655	8.564	9.600	9.782	9.618	4.491
	Z-Score	1.169	-0.105	-0.020	-0.246	0.258	-0.318	-0.268	-0.211	-0.317	0.024	0.048	0.316	0.224	0.214	-0.768
Average	Rank	9.456	9.344	9.095	9.311	7.884	8.238	7.933	7.869	6.211	7.144	7.673	6.704	6.605	6.933	5.927
Average	Z-Score	0.565	0.317	0.313	0.238	-0.111	-0.018	0.195	0.199	-0.318	-0.195	-0.059	-0.170	-0.233	-0.194	-0.529

Table 5. MiDaS Yields Overall the Most Representative Sub-hypergraphs

We compare \({\bf 15}\) sampling methods on \({\bf 11}\) real-world hypergraphs with five different sampling portions. We report their distances (D-statistics or relative differences), Z-Scores, and rankings, as described in Section 2.5. The smaller the measures are, the more representative the samples are.

Table 6.

3.4.3 Consistency: Does MiDaS Perform Well Regardless of the Sampling Portions?

We demonstrate the robustness of MiDaS to sampling portions. In Figure 7, we show how D-statistics in degree distributions, average Z-Scores, and average rankings change depending on sampling portions. MiDaS is consistently best regardless of sampling portions, with few exceptions. MGS methods preserve intersection sizes, node-pair degrees, hyperedge sizes, relative singular values, and connected-component sizes better than MiDaS by small margins, and as a result, MiDaS is outperformed by some MGS methods in terms of average ranking in a few settings.

Fig. 7.

3.4.4 Speed: How Fast is MiDaS Compared to the Competitors?

We measure the running times of all considered sampling methods in each dataset with five sampling portions. We compare the sum of running times in Figure 8. Despite its additional overhead for automatic hyperparameter tuning, MiDaS significantly outperforms both MGS and MiDaS-Basic with grid search in speed. Notably, it does so without compromising sample quality when compared to the grid search, particularly when the search space for \(\alpha\) is fixed. The speed and sample quality are plotted together in Figure 2. Additionally, in Appendix A.3, we examine the running time with respect to the input hypergraph size.

Fig. 8.

4 Back-In-Time Hypergraph Sampling

In this section, our focus shifts to back-in-time hypergraph sampling. In Section 4.1, we distinguish this concept from representative sampling by establishing a formal problem definition. Similar to the previous problem, in Section 4.2, we examine the characteristics of both intuitive sampling approaches (outlined in Section 2.3) and MiDaS, which is designed for representative sampling. Addressing the limitations of MiDaS in the context of back-in-time sampling, we introduce our approach, MiDaS-B, in Section 4.3. This method is an adaptation of MiDaS, specifically tailored for back-in-time sampling. In Section 4.4, we demonstrate the efficacy of MiDaS-B through experiments, where we follow the evaluation methodology detailed in Section 2.5.

4.1 Problem Definition

In back-in-time sampling, we consider a time-evolving hypergraph, whose snapshot is provided as input. Our objective is to construct a sub-hypergraph that closely approximates its past snapshot of the same size. It is important to note that the ground-truth past snapshots of the given hypergraph are not provided, meaning that the target of the sampling is not directly observable. This reflects the challenge in real-world situations. This problem is formally defined in Problem 2.

Problem 2 (Back-In-Time Hypergraph Sampling).

— Given: a snapshot \(\mathcal {G}= (\mathcal {V}, \mathcal {E})\) at time \(T\) of a time-evolving hypergraph \(\mathcal {\tilde{G}}\) and a sampling portion \(p\in (0,1)\)

— Find: a sub-hypergraph \(\mathcal {\hat{G}}=(\mathcal {\hat{V}}, \mathcal {\hat{E}})\) where \(\mathcal {\hat{V}}\subseteq \mathcal {V}\) and \(\mathcal {\hat{E}}\subseteq \mathcal {E}\)

— to Match: ten structural properties P1–P10 of the snapshot \(\mathcal {\bar{G}}= (\mathcal {\bar{V}}, \mathcal {\bar{E}})\) at time \(\bar{T}\) ( \(\lt T\) ) of \(\mathcal {\tilde{G}}\), where \(|\mathcal {\bar{E}}|=\lfloor |\mathcal {E}| \cdot p \rfloor\).

Similar to the representative hypergraph sampling problem, finding the optimal sub-hypergraph that fulfills the objective of Problem 2 is challenging. Therefore, we develop an effective heuristic to address this challenge. To evaluate the performance of these sampling algorithms, we compare the sub-hypergraph sampled by each algorithm with the target past snapshot of the input hypergraph and quantify their similarity, as described in Section 2.5.
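For evaluation purposes, the target snapshot of Problem 2 can be reconstructed when hyperedge timestamps are available. The sketch below assumes a hypothetical input format, a list of `(timestamp, hyperedge)` pairs, and takes the \(\lfloor |\mathcal {E}| \cdot p \rfloor\) earliest hyperedges together with the nodes they contain; ties between equal timestamps keep their input order, which is also an assumption. The sampling algorithms themselves never see this output.

```python
import math


def past_snapshot(timed_hyperedges, p):
    """Return (nodes, hyperedges) of the earliest floor(|E| * p) hyperedges."""
    ordered = sorted(timed_hyperedges, key=lambda pair: pair[0])  # stable sort
    m = math.floor(len(ordered) * p)
    target_edges = [e for _, e in ordered[:m]]
    target_nodes = set()
    for e in target_edges:
        target_nodes.update(e)
    return target_nodes, target_edges
```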

Comparison with Representative Sampling: Both representative sampling and back-in-time sampling share the high-level goal of obtaining a structurally similar but smaller sub-hypergraph from the input hypergraph. Consequently, both approaches can be considered for various applications, including those discussed in Section 1. However, they differ in their specific objectives. Representative sampling aims at replicating the structural properties of the input hypergraph while accounting for the difference in scale between the sub-hypergraphs and the input hypergraph. In contrast, back-in-time sampling targets the past snapshot of the input hypergraph as the reference, taking into consideration the scale difference between the sub-hypergraphs and the input hypergraph.

4.2 Observations

As the initial step in tackling Problem 2, we analyze the strengths and weaknesses of the six intuitive approaches described in Section 2.3. We also examine the following two sampling algorithms:

— Ordered Node Sampling (ONS): We add nodes one by one in the order of their appearance in the time-evolving hypergraph \(\mathcal {\tilde{G}}\) and return the induced sub-hypergraph once its hyperedge count reaches the target count. It is important to note that this method relies on the ground-truth node appearance order, which is not provided according to the problem definition.

— MiDaS (w. Oracle): This method is an adaptation of MiDaS-Basic to the problem of back-in-time sampling. As in MiDaS-Basic, which is originally designed for representative sampling, each hyperedge \(e\) is sampled with a probability proportional to \(\omega (e)^{\alpha }\), where \(\omega (e) := \min _{v \in e} d_{v}\). The value of \(\alpha\) is determined through a grid search for each dataset based on the comparison (with respect to all ten properties) with the ground-truth past snapshot of each dataset.⁸ It is important to note that this method relies on the ground-truth past snapshot, which is not provided according to the problem definition, and that is why we include “oracle” in its name. Note that this method also differs from MiDaS, which automatically tunes \(\alpha\) for representative sampling, not for back-in-time sampling.
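The ONS baseline described above can be sketched as follows, assuming the node appearance order is supplied as a list and taking the induced sub-hypergraph to contain every hyperedge whose nodes have all been added. The function name `ordered_node_sampling` is hypothetical. Because adding a single node can complete several hyperedges at once, this sketch may return slightly more hyperedges than the target; how such overshoot is handled is not specified in the text.

```python
from collections import defaultdict


def ordered_node_sampling(hyperedges, node_order, target):
    """Add nodes in appearance order until the induced sub-hypergraph
    has at least `target` hyperedges."""
    rank = {v: i for i, v in enumerate(node_order)}
    # A hyperedge becomes induced when its latest-appearing node is added.
    completes_at = defaultdict(list)
    for e in hyperedges:
        completes_at[max(rank[v] for v in e)].append(e)
    nodes, sampled = [], []
    for i, v in enumerate(node_order):
        if len(sampled) >= target:
            break
        nodes.append(v)
        sampled.extend(completes_at.get(i, []))
    return nodes, sampled
```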

We evaluate these eight algorithms on 11 datasets under five different sampling portions ( \(10\%,30\%,50\%,70\%,90\%\) ) and summarize the results in Table 7 and Table 8. Below, we provide a detailed analysis of each algorithm.

Table 7.

		RNS	RDN [46]	RW [64]	FF [35]	ONS	RHS	TIHS [2]	MiDaS (w. Oracle)
Degree	Dstat.	0.11	0.28	0.31	0.33	0.22	0.06	0.28	0.05
Degree	Rank (Z-Score)	3.11 (-0.59)	5.60 (0.55)	6.49 (0.99)	6.93 (0.76)	4.15 (-0.02)	2.05 (-1.06)	5.78 (0.52)	1.53 (-1.16)
Int. Size	Dstat.	0.07	0.04	0.04	0.03	0.02	0.03	0.04	0.03
Int. Size	Rank (Z-Score)	5.84 (0.79)	5.49 (0.26)	4.91 (0.15)	4.38 (-0.05)	3.18 (-0.50)	3.36 (-0.35)	4.98 (0.05)	3.49 (-0.34)
Pair Degree	Dstat.	0.05	0.10	0.10	0.13	0.07	0.04	0.09	0.03
Pair Degree	Rank (Z-Score)	3.62 (-0.30)	5.80 (0.62)	4.98 (0.27)	6.49 (0.75)	3.75 (-0.23)	3.47 (-0.52)	5.22 (0.23)	2.31 (-0.82)
Size	Dstat.	0.10	0.09	0.12	0.09	0.04	0.09	0.09	0.09
Size	Rank (Z-Score)	4.25 (-0.00)	4.42 (-0.07)	5.16 (0.43)	4.89 (0.14)	2.33 (-0.90)	5.15 (0.21)	4.38 (-0.05)	5.05 (0.24)
SV	Dstat.	0.08	0.11	0.10	0.11	0.09	0.05	0.10	0.05
SV	Rank (Z-Score)	4.69 (0.23)	5.29 (0.41)	4.47 (0.20)	5.25 (0.36)	4.49 (0.05)	2.20 (-0.87)	5.04 (0.32)	2.84 (-0.70)
CC	Dstat.	0.26	0.32	0.34	0.33	0.31	0.29	0.31	0.29
CC	Rank (Z-Score)	1.49 (-0.75)	4.49 (0.10)	4.96 (0.39)	6.11 (1.09)	4.36 (-0.01)	2.22 (-0.42)	4.36 (-0.11)	2.58 (-0.30)
GCC	Diff.	0.10	0.07	0.10	0.08	0.05	0.07	0.08	0.06
GCC	Rank (Z-Score)	5.15 (0.34)	4.62 (0.04)	5.04 (0.35)	4.91 (0.08)	3.42 (-0.41)	4.09 (-0.12)	4.75 (0.05)	3.56 (-0.34)
Density	Diff.	1.78	17.81	16.89	20.85	12.96	1.13	18.24	0.89
Density	Rank (Z-Score)	2.62 (-0.73)	6.65 (1.01)	5.78 (0.61)	6.56 (0.68)	4.15 (0.02)	2.60 (-0.91)	5.67 (0.47)	1.55 (-1.15)
Overlapness	Diff.	4.16	58.55	56.98	71.39	41.01	2.71	60.24	2.00
Overlapness	Rank (Z-Score)	3.31 (-0.54)	5.49 (0.49)	6.56 (1.01)	6.76 (0.77)	4.13 (-0.07)	2.33 (-0.92)	5.60 (0.39)	1.45 (-1.13)
Diameter	Diff.	1.24	1.16	0.89	0.81	0.65	0.69	0.87	0.64
Diameter	Rank (Z-Score)	4.02 (-0.12)	5.36 (0.36)	5.58 (0.50)	5.71 (0.36)	4.38 (0.09)	2.91 (-0.63)	4.93 (0.19)	2.75 (-0.75)
Average	Rank (Z-Score)	3.81 (-0.17)	5.32 (0.38)	5.39 (0.49)	5.80 (0.49)	3.83 (-0.20)	3.04 (-0.56)	5.07 (0.21)	2.71 (-0.64)

Table 7. Back-in-time Sampling Performances of Six Intuitive Sampling Methods and Two Additional Methods

Rankings and Z-Scores (in parentheses) are averaged over five sampling portions (specifically, \(10\%\), \(30\%\), \(50\%\), \(70\%\), and \(90\%\)) and across the 11 datasets. The best results with respect to each property are highlighted in bold. Note that, while MiDaS (w. Oracle) performs the best overall, it does have limitations in preserving hyperedge sizes.

Table 8.

4.2.1 Random Node Sampling (RNS).

Capture the bias toward small hyperedges: In comparison to the input hypergraph, RNS exhibits a bias toward smaller hyperedges, implying that smaller hyperedges are more likely to be sampled than larger ones. This bias arises from the fact that sampling a large hyperedge necessitates the sampling of a large number of nodes (i.e., all nodes within it). Interestingly, this bias aligns well with the target snapshot, positioning RNS as the second-best algorithm in the preservation of hyperedge-size distribution, surpassing RHS. This outcome contrasts with the findings in representative sampling, where RHS effectively maintains the distribution of hyperedge sizes.

Preference for dense sub-hypergraphs: RNS demonstrates a preference for sampling dense sub-hypergraphs, often surpassing the density of the target snapshot. This preference differs from the observation that RNS tends to result in weaker connectivity in representative hypergraph sampling.

4.2.2 Random Degree Node (RDN), Random Walk (RW), and Forest-Fire (FF).

Excessive high-degree nodes: RDN, RW, and FF result in the sampling of sub-hypergraphs that are denser than those sampled by RNS. Given the existing bias of RNS toward dense sub-hypergraphs, these approaches lead to too high density. Additionally, they have an excessive number of nodes with high degrees and pair degrees.

Larger hyperedges than RNS: RDN, RW, and FF tend to sample larger hyperedges than RNS, leading to a greater disparity in size distribution from the target snapshot.

4.2.3 Ordered Node Sampling (ONS).

Too many high-degree nodes compared to RNS: Although ONS leverages the actual appearance order of nodes in the input hypergraph, the induced sub-hypergraph it generates exhibits significant differences, such as higher degrees, density, and overlapness, when compared to the target snapshot.

Accurate preservation of hyperedge sizes: ONS performs the best in terms of preserving hyperedge sizes among the baseline methods.

4.2.4 Random Hyperedge Sampling (RHS).

Poor at preserving hyperedge sizes: As previously discussed, the target snapshot exhibits a preference for smaller hyperedges when compared to the input hypergraph. However, RHS samples hyperedges uniformly at random from the input hypergraph regardless of their sizes, and consequently, RHS has limitations in preserving the hyperedge size distribution of the target snapshot.

Tendency toward sparse sub-hypergraphs: RHS has a tendency to sample sparse sub-hypergraphs that have lower density and more nodes with lower degrees.

Bias toward larger connected components: Even though RHS generates sparse sub-hypergraphs, it tends to include more large hyperedges compared to the target snapshot. As a result, the generated hypergraphs tend to have larger connected components.

4.2.5 Totally-Induced Hyperedge Sampling (TIHS).

Smaller hyperedges than RHS: TIHS is capable of sampling smaller hyperedges when compared to RHS. However, in comparison to the target snapshot, TIHS still exhibits a tendency to sample larger hyperedges.

Strong connectivity: Because adding induced hyperedges strengthens the connectivity, TIHS tends to generate hypergraphs with greater density and overlapness compared to the target snapshot.

4.2.6 MiDaS (w. Oracle).

Best preservation of multiple properties: MiDaS (w. Oracle), an adaptation of MiDaS-Basic for back-in-time sampling with access to the target snapshot, achieves the best performance in preserving multiple properties overall.

Continued weakness in hyperedge size preservation: Despite its overall effectiveness, MiDaS (w. Oracle) has limitations in preserving the distribution of hyperedge sizes in the target snapshot.

4.2.7 Summary.

To summarize, the node selection methods commonly encounter the problem of sampling dense sub-hypergraphs. Although MiDaS (w. Oracle) performs the best overall among the baselines, it shares with the other hyperedge selection methods a limitation in accurately preserving hyperedge sizes: these methods tend to sample more large hyperedges than the target snapshot contains.

4.3 Proposed Approach: MiDaS-B

Our analysis of MiDaS (w. Oracle) in Section 4.2 and its dependency on the ground-truth past snapshot for hyperparameter tuning raises the following two questions: (a) how can we effectively preserve hyperedge sizes while maintaining the overall structural properties? (b) how can we perform hyperparameter tuning when the target snapshot is unavailable? In response to these questions, we propose MiDaS-B, an algorithm designed for back-in-time hypergraph sampling. MiDaS-B addresses these challenges by incorporating a hyperedge-size-related term into the hyperedge sampling probabilities, effectively controlling associated biases. Furthermore, it leverages evolutionary characteristics that are prevalent in real-world hypergraphs as an alternative objective for hyperparameter tuning, eliminating the need for reliance on the ground-truth target snapshot.

4.3.1 MiDaS-B-Basic: Preliminary Version.

Our analysis in Section 4.2 reveals that, while MiDaS (w. Oracle) is the most effective algorithm, it lacks the ability to accurately reproduce the preference for small hyperedges seen in target snapshots. To address this limitation, we enhance MiDaS by introducing a hyperedge-size-related bias into its sampling process. Specifically, we integrate a new hyperparameter \(\beta\) into the hyperedge weight function as follows:

\begin{equation} \omega (e ; \alpha , \beta) = \frac{(\min_{v \in e} d_{v})^{\alpha }}{|e|^{\beta }}, \tag{3} \end{equation}

where the numerator \((\min_{v \in e} d_{v})^{\alpha }\) corresponds to the hyperedge weight function used in MiDaS. Each hyperedge \(e\) is sampled with a probability proportional to its weight \(\omega (e ; \alpha , \beta)\). We refer to this variant, which uses \(\omega (e ; \alpha , \beta)\) but lacks automatic tuning of \(\alpha\) and \(\beta\), as MiDaS-B-Basic.
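As an illustration of Equation (3), the following Python sketch computes the weights and draws hyperedges without replacement. It is not the authors' implementation: the function names, the representation of hyperedges as sets of node identifiers, and the choice to compute degrees once on the input hypergraph are assumptions made for this example. The naive draw shown here scans all remaining hyperedges for each sample, unlike the tree-based procedure analyzed in Lemma 1.

```python
import random
from collections import Counter


def hyperedge_weights(hyperedges, alpha, beta):
    """Weight of each hyperedge: (min node degree)^alpha / |e|^beta."""
    # Assumption: d_v is the degree of v in the input hypergraph.
    degree = Counter(v for e in hyperedges for v in e)
    return [min(degree[v] for v in e) ** alpha / len(e) ** beta
            for e in hyperedges]


def sample_hyperedges(hyperedges, k, alpha, beta, seed=0):
    """Draw k distinct hyperedges, each draw proportional to its weight."""
    rng = random.Random(seed)
    weights = hyperedge_weights(hyperedges, alpha, beta)
    remaining = list(range(len(hyperedges)))
    chosen = []
    for _ in range(min(k, len(remaining))):
        pick = rng.choices(range(len(remaining)),
                           weights=[weights[i] for i in remaining])[0]
        chosen.append(remaining.pop(pick))
    return [hyperedges[i] for i in chosen]
```

Setting both hyperparameters to 0 makes all weights equal, which reduces the procedure to uniform hyperedge sampling.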

Effects of \(\alpha\) and \(\beta\) : When sampling hyperedges, the weight function in Equation (3) allows us to control the biases regarding node degrees and hyperedge sizes through the hyperparameters \(\alpha\) and \(\beta\) , respectively. To better understand the relationship between the hyperparameter values and the structural properties (i.e., P1–P10), we compute the correlation coefficient⁹ between the values of each hyperparameter¹⁰ and the differences¹¹ w.r.t. each property between the sampled sub-hypergraphs and the ground-truth past snapshot. The summarized results can be found in Table 9, and detailed visual results for the Coauth-Geology dataset are presented in Figure 9. Increasing \(\alpha\) prioritizes hyperedges with high-degree nodes, leading to denser sub-hypergraphs. This results in higher values for the degrees of nodes, global clustering coefficients, density, and overlapness; and results in smaller effective diameters. However, its effect on hyperedge size is limited, as indicated by the value close to zero (specifically, 0.0493) in Table 9. On the other hand, increasing \(\beta\) introduces a bias toward smaller hyperedges. This leads to a reduction in the degrees of nodes, global clustering coefficients, and overlapness; and leads to an increase in effective diameters. In conclusion, the two hyperparameters, \(\alpha\) and \(\beta\) , have distinct effects on the properties. Thus, when reproducing the preference for small hyperedges in target snapshots by increasing \(\beta\) , the tendency of \(\alpha\) toward higher node degrees, overlapness, and smaller effective diameters can offset the corresponding tendency of \(\beta\) toward lower node degrees, overlapness, and larger effective diameters.

Parameter	Degree	Int. Size	Pair Degree	Size	SV	CC	GCC	Density	Overlapness	Diameter
\(\alpha\)	0.3887	0.0036	0.1141	0.0493	0.1683	0.0445	0.0859	0.4596	0.4473	-0.0580
\(\beta\)	-0.1098	0.0222	0.0842	-0.2748	0.0093	-0.0474	-0.3435	-0.0013	-0.2466	0.1636

Table 9. Correlation Coefficients between Hyperparameter Values and the Structural Properties of Sub-hypergraphs Obtained by MiDaS-B-Basic

The two hyperparameters \(\alpha\) and \(\beta\) exhibit distinctly different effects in terms of direction and magnitude. In particular, \(\beta\) exhibits a strong negative correlation with sampled hyperedge sizes, while \(\alpha\) demonstrates a weak positive correlation with these sizes.

Fig. 9.

4.3.2 Hyperparameter Tuning without Oracle.

To enhance the practicality and applicability of our algorithm to real-world hypergraphs, we aim at developing a hyperparameter tuning method that does not rely on the availability of past snapshots of the input hypergraph (i.e., oracle). To achieve this, we take into account the evolutionary characteristics of real-world hypergraphs and seek hyperparameter values that accurately replicate realistic hypergraph evolution. The hyperparameter tuning method is described in Algorithm 3.

Power-law patterns in real-world hypergraph evolution: As visualized in Figure 10, there are two key observations characterizing the evolution of real-world hypergraphs: (1) the fraction of intersecting hyperedge pairs exhibits power-law patterns over time [35] and (2) the average hyperedge size exhibits power-law patterns over time. Leveraging these pervasive patterns, we tune the hyperparameters \(\alpha\) and \(\beta\) of our sampling algorithm by aiming at maximizing the power-law fitness in these two features. Specifically, we arrange hyperedges based on their sampling order and treat this sequence as the chronological arrival order of hyperedges in a time-evolving hypergraph (lines 4–7). Then, we measure the mean absolute error of a linear regression model fitted to the log-log scale representations of these two properties over time (line 10). To determine the hyperparameter values, we search for the combination that minimizes the mean absolute error within a designated search space (lines 2–12).
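The fitness measure described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the function name is hypothetical, the time axis is taken to be the index t = 1, 2, ... of the recorded measurements, and all recorded values are assumed to be strictly positive so that their logarithms exist.

```python
import math


def power_law_mae(values):
    """Mean absolute error of a least-squares line fitted in log-log scale.

    values[t-1] is the property measured at time step t (t = 1, 2, ...).
    A value near 0 means the series closely follows a power law.
    """
    xs = [math.log(t) for t in range(1, len(values) + 1)]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)) / n
```

The function would be applied separately to the fraction of intersecting hyperedge pairs and to the average hyperedge size; how the two errors are combined into one objective is not specified in this section, so any combination rule (for example, a sum) is an additional assumption.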

Fig. 10.

Rejection conditions: To further refine the search for hyperparameter values, we establish rejection conditions of hyperparameters based on observations from RHS. In particular, RHS consistently demonstrates lower density and larger connected components when compared to the target snapshot. These rejection conditions are designed to filter out hyperparameter values that produce clearly undesirable sub-hypergraphs (lines 8–9), which can be identified through comparisons with RHS. That is, the rejection conditions are as follows:

— Density Condition: Sampled sub-hypergraphs should not be sparser than those obtained by RHS.

— Connected Component (CC) Condition: Sampled sub-hypergraphs should not have larger connected components than those obtained by RHS.

Specifically, these rejection conditions are applied by comparing the average density and the average size of the largest connected component over time¹² from our sampling algorithm with the corresponding values from RHS.
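The rejection conditions and the surrounding grid search can be sketched as below. Algorithm 3 itself is not reproduced here, so this is only an outline under assumptions: the dictionary keys, the function names, and the `evaluate` callback (which is expected to sample with a given pair of hyperparameters and return the recorded metrics together with the power-law fitting error) are hypothetical.

```python
from itertools import product


def mean(xs):
    return sum(xs) / len(xs)


def passes_rejection(candidate, rhs):
    """candidate / rhs: dicts holding 'density' and 'lcc_size' series over time."""
    not_sparser = mean(candidate["density"]) >= mean(rhs["density"])
    not_larger_cc = mean(candidate["lcc_size"]) <= mean(rhs["lcc_size"])
    return not_sparser and not_larger_cc


def tune(search_alpha, search_beta, evaluate, rhs):
    """Return the (alpha, beta) pair with the lowest error among accepted pairs.

    evaluate(alpha, beta) -> (metrics dict, fitting error); lower is better.
    Returns None if every pair is rejected.
    """
    best, best_err = None, float("inf")
    for alpha, beta in product(search_alpha, search_beta):
        metrics, err = evaluate(alpha, beta)
        if passes_rejection(metrics, rhs) and err < best_err:
            best, best_err = (alpha, beta), err
    return best
```

How the case in which all candidate pairs are rejected should be handled is not stated in this section; returning `None` is a placeholder choice.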

4.3.3 MiDaS-B: Full-fledged Version.

We introduce \(\text {MiDaS-B}\), the full-fledged version of our back-in-time sampling algorithm that combines the hyperedge weight function in Equation (3) and the evolutionary-pattern-based automatic hyperparameter tuning method. The pseudocode for MiDaS-B is given in Algorithm 4. Given (a) a hypergraph \(\mathcal {G}=(\mathcal {V},\mathcal {E})\) (i.e., a snapshot of the input hypergraph), (b) a sampling portion \(p\), and (c) hyperparameter search spaces \(\mathcal {S}_{\alpha }\) and \(\mathcal {S}_{\beta }\), MiDaS-B returns a sub-hypergraph \(\mathcal {\hat{G}}=(\mathcal {\hat{V}},\mathcal {\hat{E}})\) of \(\mathcal {G}\). Specifically, after tuning its hyperparameters \(\alpha\) and \(\beta\) (line 1), MiDaS-B samples a target number of hyperedges with the sampling probability proportional to Equation (3) (lines 2–6), and then it returns the sub-hypergraph consisting of the sampled hyperedges (line 7).¹³

Theoretical analysis: We analyze the time complexities of Algorithms 3 and 4.

Lemma 1 (Time Complexity of Hyperedge Sampling).

The time complexity of sampling hyperedges for a pair of \(\alpha ^{\star }\) and \(\beta ^{\star }\) values (i.e., lines 2–6 of Algorithm 4) is \(O(p \cdot |\mathcal {E}| \cdot \log C + \sum _{e \in \mathcal {E}} |e|)\), where \(C\) is the size of the set \(\lbrace (\min _{v \in e}d_{v}, |e|) : e \in \mathcal {E}\rbrace\).

Proof.

It takes \(O(\sum _{e\in \mathcal {E}} |e|)\) time to compute \(\omega (e)\) (i.e., \(\min_{v \in e} d_{v}\)) and \(|e|\) for every hyperedge \(e\). It takes \(O(|\mathcal {E}|)\) time to build a balanced binary tree with \(C\) leaf nodes where each leaf node points to a list of all hyperedges that share the same \((\min _{v \in e} d_{v}, |e|)\). Then, it takes \(O(C)=O(|\mathcal {E}|)\) time in total to store in each node \(i\) the sum of the weights of the hyperedges pointed to by any node in the sub-tree rooted at \(i\) if we store them from leaf nodes to the root. The height of the tree is \(O(\log C)\); thus, drawing each hyperedge (which involves starting from the root, iteratively selecting a child based on weights until a leaf is reached, and then selecting a hyperedge associated with that leaf) and subsequently updating the weights accordingly take \(O(\log C)\) time. Drawing \(p \cdot |\mathcal {E}|\) hyperedges takes \(O(p \cdot |\mathcal {E}| \cdot \log C)\) time, and since \(O(|\mathcal {E}|)=O(\sum _{e\in \mathcal {E}} |e|)\), the total time complexity is \(O(p \cdot |\mathcal {E}| \cdot \log C+\sum _{e\in \mathcal {E}} |e|)\). □
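The tree-based procedure in the proof can be sketched as follows. This is an illustration, not the authors' implementation: the class name is hypothetical, and an array-backed complete binary tree (padded to a power of two) is used in place of a pointer-based balanced tree, which keeps the height at \(O(\log C)\). Degrees are assumed to be computed once on the input hypergraph.

```python
import random
from collections import Counter, defaultdict


class GroupedSampler:
    """Weighted sampling without replacement over groups of equal-weight hyperedges."""

    def __init__(self, hyperedges, alpha, beta, seed=0):
        self.rng = random.Random(seed)
        degree = Counter(v for e in hyperedges for v in e)
        groups = defaultdict(list)  # (min degree, size) -> hyperedge indices
        for idx, e in enumerate(hyperedges):
            groups[(min(degree[v] for v in e), len(e))].append(idx)
        self.members = list(groups.values())
        self.unit = [d ** alpha / s ** beta for d, s in groups]
        self.size = 1
        while self.size < len(self.members):
            self.size *= 2
        self.tree = [0.0] * (2 * self.size)  # tree[i] = weight sum below node i
        for i, m in enumerate(self.members):
            self.tree[self.size + i] = self.unit[i] * len(m)
        for i in range(self.size - 1, 0, -1):
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def draw(self):
        """Return the index of one hyperedge and remove it from the pool."""
        if self.tree[1] <= 0:
            raise IndexError("no hyperedges left")
        node, r = 1, self.rng.random() * self.tree[1]
        while node < self.size:  # descend from the root to a leaf
            left = 2 * node
            if r < self.tree[left] or self.tree[left + 1] <= 0:
                node = left
            else:
                r -= self.tree[left]
                node = left + 1
        g = node - self.size
        bucket = self.members[g]
        j = self.rng.randrange(len(bucket))  # equal weights inside a group
        bucket[j], bucket[-1] = bucket[-1], bucket[j]
        idx = bucket.pop()
        self.tree[node] = self.unit[g] * len(bucket)
        node //= 2
        while node >= 1:  # refresh the sums on the path back to the root
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]
            node //= 2
        return idx
```

Leaf sums are recomputed from the group size rather than decremented, so an exhausted group has a sum of exactly zero and is never selected.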

Theorem 2 (Time Complexity of Automatic Hyperparameter Tuning).

The time complexity of Algorithm 3 is \(O(|\mathcal {S}_{\alpha }|\cdot |\mathcal {S}_{\beta }| \cdot (|\mathcal {E}|\cdot \log C + \sum _{e \in \mathcal {E}} \sum _{v \in e} d_{v} + \sum _{e \in \mathcal {E}} |e|^{2} \alpha (|\mathcal {V}|))).\)

Proof.

We first present the time complexity for a specific pair \(\alpha ^{\star }\) and \(\beta ^{\star }\) (i.e., lines 4–7 of Algorithm 3). By Lemma 1, hyperedge sampling takes \(O(|\mathcal {E}| \cdot \log C + \sum _{e \in \mathcal {E}} |e|)\) since the sampling portion \(p\) is set to 1.0, which corresponds to the sampling of the entire set of hyperedges. In addition, in order to check the rejection conditions and measure power-law fitness, we compute the following metrics over time: (1) the fraction of intersecting hyperedge pairs, (2) the average hyperedge size, (3) the average density, and (4) the size of the largest connected component. (1) Computing the fraction of intersecting hyperedge pairs takes \(O(\sum _{e \in \mathcal {E}} \sum _{v \in e} d_{v})\) time: for each newly added hyperedge \(e\), we examine at most \(\sum _{v \in e} d_{v}\) intersecting hyperedges to calculate their unique count. (2) Computing the average hyperedge size and (3) the average density take \(O(\sum _{e \in \mathcal {E}} |e|)\) time. (4) Computing the size of the largest connected component takes \(O(\sum _{e \in \mathcal {E}} |e|^{2} \alpha (|\mathcal {V}|))\) time. When a new hyperedge is added, we examine all node pairs within the hyperedge, identify the components they belong to, and merge them if they belong to different components. We employ a disjoint-set data structure for this, and \(\alpha (n)\) represents the extremely slow-growing inverse Ackermann function. Lastly, given the metrics over time, evaluating rejection conditions and power-law fitness, which requires linear regression, takes \(O(|\mathcal {E}|)\) time, and if we record the metrics only a constant number of times (refer to Footnote 13), it takes \(O(1)\) time. Therefore, the total time complexity is \(O(|\mathcal {E}|\log C + \sum _{e \in \mathcal {E}} \sum _{v \in e} d_{v} + \sum _{e \in \mathcal {E}} |e|^{2} \alpha (|\mathcal {V}|))\).

Since \(|\mathcal {S}_{\alpha }|\cdot |\mathcal {S}_{\beta }|\) pairs of \(\alpha ^{\star }\) and \(\beta ^{\star }\) values are considered, the total time complexity becomes \(O(|\mathcal {S}_{\alpha }|\cdot |\mathcal {S}_{\beta }| \cdot (|\mathcal {E}|\cdot \log C + \sum _{e \in \mathcal {E}} \sum _{v \in e} d_{v} + \sum _{e \in \mathcal {E}} |e|^{2} \alpha (|\mathcal {V}|)))\) . □
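The incremental bookkeeping used in the proof can be sketched as follows. This is an illustration under assumptions, not the authors' code: the names are hypothetical, density is taken to be the number of hyperedges divided by the number of nodes, the fraction of intersecting pairs is reported as 0 while only one hyperedge is present, and hyperedges are assumed to be non-empty. Unlike the pairwise examination described in the proof, the sketch merges every node with the first node of its hyperedge, which yields the same components with \(|e|\) union operations per hyperedge.

```python
from collections import defaultdict


class DisjointSet:
    def __init__(self):
        self.parent, self.size, self.largest = {}, {}, 0

    def find(self, v):
        if v not in self.parent:
            self.parent[v], self.size[v] = v, 1
            self.largest = max(self.largest, 1)
        root = v
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[v] != root:  # path compression
            self.parent[v], v = root, self.parent[v]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:  # union by size
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        self.largest = max(self.largest, self.size[ra])


def track_over_time(ordered_hyperedges):
    """Record the four metrics after each hyperedge arrives."""
    dsu, incident = DisjointSet(), defaultdict(list)  # node -> arrival times
    total_size, intersecting, history = 0, 0, []
    for t, e in enumerate(ordered_hyperedges, start=1):
        members = list(e)
        # Unique earlier hyperedges that share at least one node with e.
        earlier = {j for v in members for j in incident[v]}
        intersecting += len(earlier)
        for v in members:
            dsu.union(members[0], v)
            incident[v].append(t)
        total_size += len(members)
        pairs = t * (t - 1) // 2
        history.append({
            "intersect_frac": intersecting / pairs if pairs else 0.0,
            "avg_size": total_size / t,
            "density": t / len(incident),
            "lcc_size": dsu.largest,
        })
    return history
```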

Theorem 3 (Time Complexity of MiDaS-B).

The time complexity of Algorithm 4 is \(O(|\mathcal {S}_{\alpha }|\cdot |\mathcal {S}_{\beta }| \cdot (|\mathcal {E}|\cdot \log C + \sum _{e \in \mathcal {E}} \sum _{v \in e} d_{v} + \sum _{e \in \mathcal {E}} |e|^{2} \alpha (|\mathcal {V}|))).\)

Proof.

From Lemma 1 and Theorem 2, it is clear that the time complexity of Algorithm 3, a subroutine of Algorithm 4, is the dominant factor in the time complexity of Algorithm 4. Therefore, the Big-O notation for the time complexity of Algorithm 4 is the same as that of Algorithm 3. □

4.4 Evaluation

In this section, we review our experiments for evaluating the output quality, consistency, and speed of MiDaS-B, our proposed algorithm for back-in-time hypergraph sampling.

4.4.1 Experimental Settings.

In our experiments, we examine the performance of 11 back-in-time sampling algorithms on 11 datasets with five different sampling portions (specifically, \(10\%, 30\%, 50\%, 70\%, 90\%\)). We conduct all experiments on a machine equipped with an i9-10900K CPU and 64 GB RAM. The sample quality of each algorithm is averaged over three independent trials.

The competing algorithms include the six intuitive algorithms (Section 2.3), ONS (Section 4.2), HRW (Section 3.4.1), and MiDaS-B (Section 4.3.3). We exclude MiDaS (w. Oracle), which requires the ground-truth past snapshots as input, and instead, we include MiDaS (w/o. Oracle). MiDaS (w/o. Oracle) is a variant of MiDaS-Basic that tunes its hyperparameter \(\alpha\) using the same rejection conditions and power-law fitness as MiDaS-B. For both MiDaS (w/o. Oracle) and MiDaS-B, we use \(\mathcal {S}_{\alpha } = \lbrace 0, 2^{-2}, 2^{-1}, 2^{0}, 2^{1}\rbrace\) and \(\mathcal {S}_{\beta } = \lbrace -2^{0},-2^{-1}, -2^{-2}, 0, 2^{-2}, 2^{-1}, 2^{0} \rbrace\) as the search space for \(\alpha\) and \(\beta\) , respectively.
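For concreteness, the search spaces above can be written out as follows; the variable names are illustrative only. Their Cartesian product contains 5 × 7 = 35 candidate pairs.

```python
from itertools import product

# Search spaces for alpha and beta as stated in the experimental settings.
S_ALPHA = [0.0] + [2.0 ** k for k in (-2, -1, 0, 1)]
S_BETA = ([-(2.0 ** k) for k in (0, -1, -2)]
          + [0.0]
          + [2.0 ** k for k in (-2, -1, 0)])

# Every (alpha, beta) pair examined during tuning.
GRID = list(product(S_ALPHA, S_BETA))
```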

4.4.2 Quality of MiDaS-B.

We compare the performances of MiDaS-B and ten competing methods across various datasets and sampling portions. The results in Table 10 are based on relative differences, rankings, and Z-Scores, as described in Section 2.5. These results are averaged over five sampling portions (10%, 30%, 50%, 70%, and 90%) and 11 real-world hypergraph datasets. In terms of average rankings and average Z-Scores, MiDaS-B consistently outperforms all competing methods, performing especially well in preserving degree, density, and overlapness. Notably, MiDaS-B maintains hyperedge size distributions better than \(\text {MiDaS (w/o. Oracle)}\), demonstrating the benefit of incorporating hyperedge sizes into the sampling process of MiDaS-B. Refer to Table 11 for a visual presentation of detailed results on each dataset. Collectively, our findings strongly support the effectiveness of MiDaS-B compared to competing methods, particularly in preserving multiple properties and hyperedge size distributions.

		RNS	RDN [46]	RW [64]	FF [35]	ONS	RHS	TIHS [2]	HRW [78] (NB)	HRW [78] (SK)	MiDaS (w/o. Oracle)	MiDaS-B
Degree	Dstat.	0.113	0.280	0.311	0.326	0.222	0.059	0.283	0.111	0.107	0.076	0.062
	Rank	4.800	8.200	9.455	9.855	6.255	2.964	8.491	4.400	4.218	3.982	3.109
	Z-Score	-0.410	0.772	1.255	1.006	0.175	-0.871	0.739	-0.558	-0.570	-0.659	-0.880
Int. Size	Dstat.	0.068	0.039	0.038	0.033	0.023	0.032	0.035	0.066	0.065	0.032	0.031
	Rank	6.964	6.455	5.964	5.127	3.727	4.236	5.982	9.036	8.909	4.636	4.691
	Z-Score	0.400	-0.055	-0.160	-0.364	-0.670	-0.489	-0.174	1.152	1.153	-0.393	-0.401
Pair Degree	Dstat.	0.050	0.098	0.097	0.125	0.070	0.039	0.091	0.054	0.054	0.039	0.039
	Rank	5.073	8.018	7.018	8.855	5.236	5.291	7.309	5.036	5.291	4.036	4.564
	Z-Score	-0.170	0.772	0.373	0.861	-0.187	-0.399	0.342	-0.289	-0.273	-0.565	-0.466
Size	Dstat.	0.100	0.086	0.119	0.091	0.044	0.094	0.086	0.279	0.282	0.101	0.085
	Rank	5.091	5.236	6.182	5.545	2.509	5.873	5.182	9.709	9.909	5.964	4.527
	Z-Score	-0.143	-0.340	0.073	-0.405	-0.785	-0.357	-0.386	1.530	1.551	-0.312	-0.426
SV	Dstat.	0.079	0.107	0.101	0.109	0.091	0.052	0.102	0.075	0.078	0.055	0.055
	Rank	6.455	7.455	6.509	7.455	6.564	3.145	7.273	5.400	5.673	3.982	3.818
	Z-Score	0.328	0.521	0.342	0.490	0.152	-0.728	0.454	-0.192	-0.156	-0.617	-0.595
CC	Dstat.	0.264	0.319	0.337	0.331	0.309	0.288	0.306	0.322	0.319	0.303	0.292
	Rank	1.818	6.018	6.818	8.182	5.727	2.982	5.782	5.836	5.127	3.964	2.655
	Z-Score	-0.829	0.114	0.437	1.320	-0.011	-0.458	-0.132	0.120	0.053	-0.206	-0.409
GCC	Diff.	0.097	0.074	0.099	0.078	0.050	0.067	0.080	0.119	0.119	0.070	0.065
	Rank	6.509	5.764	6.327	6.091	4.364	5.073	5.836	7.782	7.727	5.000	5.055
	Z-Score	0.211	-0.099	0.218	-0.048	-0.538	-0.299	-0.080	0.629	0.581	-0.283	-0.291
Density	Diff.	1.780	17.810	16.894	20.853	12.965	1.128	18.241	2.330	2.337	0.818	0.790
	Rank	3.818	9.382	8.255	8.945	6.236	3.618	8.036	5.582	6.109	3.145	2.109
	Z-Score	-0.643	1.140	0.793	0.841	0.169	-0.807	0.635	-0.260	-0.179	-0.697	-0.992
Diameter	Diff.	1.240	1.160	0.894	0.806	0.649	0.691	0.870	0.607	0.613	0.613	0.815
	Rank	5.945	7.636	7.964	7.891	6.509	4.527	7.055	4.618	5.036	3.891	4.655
	Z-Score	0.065	0.499	0.678	0.503	0.241	-0.458	0.331	-0.412	-0.364	-0.604	-0.480
Overlapness	Diff.	4.157	58.550	56.985	71.393	41.007	2.708	60.242	4.933	5.000	2.229	2.416
	Rank	4.909	8.018	9.455	9.418	6.036	3.745	8.018	4.491	4.709	3.709	3.218
	Z-Score	-0.391	0.682	1.264	0.967	0.076	-0.783	0.567	-0.511	-0.397	-0.669	-0.805
Average	Rank	5.138	7.218	7.395	7.736	5.316	4.145	6.896	6.189	6.271	4.231	3.840
Average	Z-Score	-0.158	0.401	0.527	0.517	-0.138	-0.565	0.230	0.121	0.140	-0.500	-0.574

Table 10. Among the 11 Sampling Methods Evaluated on 11 Real-world Hypergraphs with Five Different Sampling Portions (Specifically, 10%, 30%, 50%, 70%, and 90%), MiDaS-B Demonstrates Overall the Best Performance in Back-in-time Sampling

To assess the effectiveness of each method, we report their distances (D-statistics or relative differences), Z-Scores, and rankings. A smaller value indicates better performance, reflecting the capability of a method to closely reproduce the target snapshots of the input hypergraphs.
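As a rough illustration of how rankings and Z-Scores can be derived from distances, consider the sketch below. The exact procedure is defined in Section 2.5, which is not reproduced here, so the details are assumptions: the function name is hypothetical, ties are broken by sort order, and the population standard deviation is used for standardization.

```python
from statistics import mean, pstdev


def ranks_and_zscores(distances):
    """distances: {method: distance} for one (dataset, portion, property) setting.

    Smaller distances are better, so rank 1 and the lowest Z-Score go to the
    method closest to the target.
    """
    ordered = sorted(distances, key=distances.get)
    ranks = {m: i + 1 for i, m in enumerate(ordered)}
    mu, sigma = mean(distances.values()), pstdev(distances.values())
    z = {m: (d - mu) / sigma if sigma > 0 else 0.0
         for m, d in distances.items()}
    return ranks, z
```

Averaging the per-setting ranks and Z-Scores over datasets and sampling portions would then give aggregate values of the kind reported in the table.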

Table 11.

4.4.3 Consistency of MiDaS-B.

Figure 11 presents the overall performances (in terms of average rankings and average Z-Scores) of different algorithms across various sampling portions. As depicted in Figure 11(a)–(b), MiDaS-B consistently outperforms all competitors. Furthermore, as shown in Figure 11(c), where we report the D-statistics for hyperedge size distributions, MiDaS-B preserves hyperedge sizes consistently better than MiDaS (w/o. Oracle). These results highlight the consistently superior performance of MiDaS-B across diverse sampling scenarios.

Fig. 11.

4.4.4 Speed of MiDaS-B.

In Figure 12, we compare the running times of all competing algorithms in each dataset with a fixed sampling portion of 0.9. MiDaS-B exhibits relatively longer computational times due to its extensive search across all candidate pairs of \(\alpha\) and \(\beta\) values (specifically, 35 pairs in our experimental setup). For each pair, MiDaS-B conducts hyperedge sampling and concurrently tracks the evolution of structural properties (specifically, the fraction of intersecting hyperedge pairs, the average hyperedge size, the average density, and the average size of the largest connected component) over time for the computation of fitness and the application of the rejection conditions. Nevertheless, MiDaS-B terminates within a reasonable period (specifically, less than 10,000 seconds) for all datasets considered, demonstrating its practical efficiency for real-world scenarios. Additionally, for reference, if we take into account the running time for sampling while excluding the computation of the four properties, it amounts to 344 seconds on the coauth-Geology dataset. Furthermore, in Appendix A.3, we analyze the running time of MiDaS-B with respect to the input hypergraph size.

Fig. 12.

4.4.5 Effectiveness of Each Component of MiDaS-B.

We demonstrate the effectiveness of each component of MiDaS-B: (a) bias associated with hyperedge sizes and (b) automatic hyperparameter tuning. To this end, we compare MiDaS-B with the following variants:

— MiDaS (w. Oracle) (Section 4.2): The hyperparameter \(\alpha\) of MiDaS-Basic is tuned through a grid search for each dataset based on the comparison (with respect to all ten properties) with the ground-truth past snapshot of each dataset. It is important to note that this method relies on the ground-truth past snapshot, which is not provided according to the problem definition, and that is why we include “oracle” in its name.

— MiDaS-B (w. Oracle): The hyperparameters \(\alpha\) and \(\beta\) of MiDaS-B-Basic are tuned through a grid search for each dataset based on the comparison (with respect to all ten properties) with the ground-truth past snapshot of each dataset. Note that this method also relies on the ground-truth past snapshot.

— Top k: MiDaS-B-Basic with the \(k\)th best combination of \(\alpha\) and \(\beta\) values, which remains fixed across all datasets. The rankings of the combinations are determined based on the similarity (with respect to all ten properties) between the resulting sub-hypergraphs and the ground-truth past snapshots. Specifically, the rankings are computed based on the average of min-max normalized rankings and Z-Scores across all datasets and sampling portions.

Bias Associated with Hyperedge Sizes: As shown in Table 12, MiDaS-B (w. Oracle) demonstrates superior performance over MiDaS (w. Oracle), indicating that considering the hyperedge size in the hyperedge sampling probability (i.e., the introduction of \(\beta\) ) effectively mitigates the limitations of MiDaS (w. Oracle) while preserving its strengths.

		MiDaS (w. Oracle)	MiDaS-B (w. Oracle)	Top 1 (Out of 35)	Top 5 (Out of 35)	Top 10 (Out of 35)	Top 35 (Out of 35)	MiDaS-B
Degree	Dstat.	0.053	0.050	0.059	0.059	0.067	0.157	0.062
	Rank	2.491	2.982	3.291	3.582	4.200	6.655	3.527
	Z-Score	-0.484	-0.502	-0.226	-0.284	-0.137	1.829	-0.196
Int. Size	Dstat.	0.032	0.031	0.032	0.032	0.033	0.044	0.031
	Rank	3.400	3.564	4.073	3.109	3.909	5.182	3.491
	Z-Score	-0.230	-0.034	0.017	-0.281	-0.143	0.776	-0.105
Pair Degree	Dstat.	0.032	0.040	0.043	0.039	0.041	0.065	0.039
	Rank	2.473	3.545	3.945	4.455	3.418	5.091	3.800
	Z-Score	-0.532	-0.151	0.093	0.129	-0.253	0.796	-0.082
Size	Dstat.	0.093	0.062	0.081	0.094	0.119	0.178	0.085
	Rank	3.473	1.982	2.636	3.655	5.636	6.436	2.909
	Z-Score	-0.170	-0.809	-0.455	-0.133	0.456	1.548	-0.438
SV	Dstat.	0.052	0.049	0.053	0.052	0.052	0.088	0.055
	Rank	3.782	3.036	3.545	3.127	3.655	5.600	3.673
	Z-Score	-0.197	-0.272	-0.047	-0.429	-0.180	1.103	0.022
CC	Dstat.	0.294	0.289	0.289	0.288	0.296	0.339	0.292
	Rank	3.382	2.491	2.455	2.691	4.036	5.527	2.509
	Z-Score	-0.070	-0.367	-0.317	-0.243	-0.029	1.283	-0.369
GCC	Diff.	0.062	0.048	0.064	0.067	0.083	0.112	0.065
	Rank	2.909	2.345	3.600	3.709	5.255	4.873	3.873
	Z-Score	-0.304	-0.608	0.005	-0.065	0.345	0.620	0.007
Density	Diff.	0.892	0.893	0.771	1.128	1.134	4.311	0.790
	Rank	2.764	2.855	2.509	4.327	4.582	5.745	2.800
	Z-Score	-0.420	-0.483	-0.388	0.130	0.245	1.251	-0.334
Overlapness	Diff.	2.002	2.474	2.061	2.708	2.815	18.976	2.416
	Rank	2.636	2.964	2.800	4.073	4.309	6.818	3.127
	Z-Score	-0.548	-0.567	-0.262	-0.175	-0.178	1.996	-0.266
Diameter	Diff.	0.639	0.539	0.808	0.691	0.689	0.623	0.815
	Rank	2.982	2.891	3.709	4.036	4.509	4.618	3.982
	Z-Score	-0.304	-0.475	0.103	-0.004	0.059	0.421	0.200
Average	Rank	3.029	2.865	3.256	3.676	4.351	5.655	3.369
Average	Z-Score	-0.326	-0.427	-0.148	-0.136	0.018	1.162	-0.156

Table 12. Effectiveness of the Key Components of MiDaS-B

The results are averaged and compared across 11 real-world hypergraphs with five different sampling portions (i.e., 10%, 30%, 50%, 70%, and 90%). The overall superiority of MiDaS-B (w. Oracle) over MiDaS (w. Oracle) highlights the effectiveness of the hyperedge-related bias in MiDaS-B (w. Oracle). Furthermore, the comparable overall performance of MiDaS-B, which does not require ground-truth past snapshots, to that of Top 1, which relies on such snapshots, demonstrates the effectiveness of the automatic hyperparameter tuning by MiDaS-B.

Automatic Hyperparameter Tuning: Recall that MiDaS-B tunes its hyperparameters \(\alpha\) and \(\beta\) without access to the ground-truth past snapshot of the given hypergraph. Nevertheless, MiDaS-B exhibits a competitive performance, comparable even to the Top 1 (out of 35), as shown in Table 12. Specifically, MiDaS-B outperforms the Top 1 in terms of average Z-Scores but underperforms it in terms of average rankings. Recall that the Top 1 employs the best combination of \(\alpha\) and \(\beta\) values across all datasets, relying on access to the ground-truth past snapshot. MiDaS-B outperforms the Top 5 (out of 35) in terms of both average Z-Scores and average rankings. Notably, MiDaS-B preserves hyperedge sizes and connected component sizes even better than MiDaS (w. Oracle), which tunes hyperparameters for each dataset using the ground-truth past snapshots.

5 Related Work

In this section, we conduct a review of relevant studies categorized into five subsections: (a) graph simplification, (b) graph sampling, (c) hypergraph sampling, (d) structural properties of real-world hypergraphs, and (e) others.

5.1 Graph Simplification

The ubiquity of large-scale graphs in real-world applications, often comprising millions of nodes and edges, presents formidable computational challenges. Consequently, various works have emerged to simplify graphs, each with distinct objectives. Graph simplification may involve reducing graph size while preserving specific graph properties, such as community structures [53], pairwise distances [56], cuts [31], or eigenvalues [61]. Note that, in general, simplified graphs may not always be subgraphs of the original graphs. Spectral sparsifiers [61], for instance, approximate the Laplacian quadratic form of the original graph with a subgraph. Another significant direction involves graph condensation [30], where the goal is to learn a simplified graph structure, along with node attributes, to minimize the performance gap between machine learning models (e.g., graph neural networks) trained on the simplified graph and the original graph. In a related context, complex but implicit relationships between entities can be summarized in the form of a simple graph (or a similarity matrix) to be leveraged by classification methods, such as \(k\)-nearest neighbors [79, 80, 81, 82], with further improvement using quantum computing [49].

5.2 Graph Sampling

Graph sampling, also known as subgraph sampling, represents a specific approach to graph simplification where simplified graphs are selected among the subgraphs of the original graphs. This process enhances the interpretability of the simplified graphs and strengthens the connections (i.e., correspondences) between them and the original graphs. Graph sampling has also been explored with diverse objectives, such as graphical inference [72], graph visualization [23, 28, 57], online-social-network crawling [24, 39, 50, 69], and triangle-count estimation [40, 51, 62].

Our work is most closely related to representative sampling and back-in-time sampling [46]. Representative sampling aims at finding a subgraph that accurately represents the structural characteristics of the original graph [2, 27, 46, 65]. Back-in-time sampling aims at finding a subgraph that closely approximates a past snapshot of a given graph with a specified size [46]. Note that, in both tasks, the objective is to obtain general-purpose subgraphs, without presuming specific use cases for the sampled subgraphs.

In terms of methodologies, most graph-sampling methods can be categorized into two main types: (a) node selection methods and (b) edge selection methods. Node-selection methods [27, 46, 53, 55] involve selecting a subset of nodes and obtaining the induced subgraph. Edge-selection methods [2, 40, 45, 46, 51, 62, 74] select a subset of edges and construct the subgraph using these edges and their incident nodes.
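The distinction between the two families can be made concrete with a small sketch for ordinary graphs. The function names and the edge-list representation are assumptions made for this example; how the nodes or edges are chosen is left to the specific method.

```python
def node_selection(edges, chosen_nodes):
    """Induced subgraph: keep an edge only if both endpoints were chosen."""
    chosen = set(chosen_nodes)
    kept = [(u, v) for u, v in edges if u in chosen and v in chosen]
    return chosen, kept


def edge_selection(edges, chosen_edge_indices):
    """Keep the chosen edges together with all nodes incident to them."""
    kept = [edges[i] for i in chosen_edge_indices]
    nodes = {x for e in kept for x in e}
    return nodes, kept
```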

In addition, research on graph evolution, especially those aiming at identifying the temporal order of nodes or edges in a graph, is closely related to back-in-time sampling. To this end, they exploit evolutionary patterns of real-world graphs and network growth models inspired by them [45, 55, 74], which are also leveraged for back-in-time sampling [46].¹⁴

5.3 Hypergraph Sampling

Hypergraphs have recently gained significant attention in various domains, including recommendation [67], entity ranking [12], misinformation detection [48], node classification [22, 29], and clustering [84]. These domains leverage the high-order relationships between nodes embedded in hyperedges to achieve better performance in various tasks. For the use of hypergraphs in machine learning, refer to an extensive survey [3].

Despite the increasing interest in hypergraphs, hypergraph sampling remains a relatively unexplored area. Yang et al. [70] proposed a novel sampling method for hypergraphs, but it is specifically tailored for a specific task, namely node embedding. In addition, there have been several studies attempting to generate hypergraphs that reproduce the properties of real-world hypergraphs [4, 13, 18, 19, 20, 33, 35, 42]; refer to a survey [41] on this topic for details. However, these generation processes involve the creation of new nodes and hyperedges, whereas our focus is on sampling from existing ones.

Particularly related to back-in-time sampling, Comrie and Kleinberg [16] introduced the problem of identifying the temporal order of hyperedges. However, their approach is customized for ordering hyperedges within a hypergraph ego-network, typically containing a limited number (specifically, at most 20) of hyperedges, rather than considering the entire hypergraph. Specifically, they employed a classifier trained on features derived from hypergraph ego-networks to predict the temporal order of hyperedges in a supervised manner.

5.4 Structural Properties of Real-world Hypergraphs

With respect to hypergraph sampling, it is important to decide which non-trivial structural properties of hypergraphs should be preserved. Kook et al. [35] discovered unique patterns in real-world hypergraphs regarding (a) hyperedge sizes, (b) intersection sizes, (c) the singular values of the incidence matrix, (d) edge density, and (e) diameter. Lee et al. [42] reported a number of properties regarding the overlaps of hyperedges by which real-world hypergraphs are clearly distinguished from random hypergraphs. In addition, Do et al. [19] uncovered several patterns regarding the connections between subsets of a fixed number of nodes (e.g., connected component sizes) in real-world hypergraphs; and Kim et al. [32] explored those related to transitivity (i.e., the propensity to form clusters). For directed hypergraphs, where nodes in each hyperedge are divided into heads and tails, Kim et al. [33] explored various empirical patterns regarding reciprocity (i.e., the inclination to form mutual connections).

Dynamic changes in the structural properties of time-evolving hypergraphs have been analyzed from various perspectives. At the node level, the same subsets of nodes tend to appear repeatedly [6]. Moreover, this tendency becomes stronger as these subsets appear in a greater number of hyperedges spanning diverse sizes [15]. At the hyperedge level, the repetition [6], recency [6], burstiness [8], and persistency [8] in the appearance of hyperedges have been investigated. At the hypergraph level, studies have revealed trends such as diminishing overlaps, densification, and shrinking diameter [35]. Furthermore, certain studies have explored properties related to sub-hypergraphs, including triads of nodes [5], triads of hyperedges [44], and ego networks [16]. For more static and dynamic structural patterns in real-world hypergraphs, refer to a survey [41].

5.5 Others

Throughout the article, we use the term “bias” to refer to the statistical prioritization of specific nodes or hyperedges during the sampling process. This concept is related to but distinct from the concept of bias associated with sensitive node attributes, which is typically minimized to ensure fairness in graph representation learning [52, 76, 77]. It is important to note that our objective is not to minimize the sampling bias; rather, our goal is to adjust the sampling bias to align the sampled distribution more closely with the target distribution. To achieve this objective, our methods may even increase the bias toward high-degree nodes during sampling, especially when the target distribution consists of more high-degree nodes.

6 Conclusions and Future Directions

In this section, we provide conclusions and outline future research directions.

6.1 Conclusions

In this work, we tackle two hypergraph sampling problems: representative sampling and back-in-time sampling. For representative sampling, we propose MiDaS, a fast and effective algorithm designed to overcome the limitations of RHS by automatically adjusting the amount of bias toward high-degree nodes. For back-in-time sampling, we propose MiDaS-B, which is built upon the mechanism of MiDaS but integrates a bias related to hyperedge size to overcome the limitations of MiDaS. MiDaS-B is also equipped with an automatic hyperparameter tuning method that leverages the evolutionary patterns of real-world hypergraphs without requiring the ground-truth past snapshot. Our extensive experiments on 11 real-world hypergraphs with five different sampling portions demonstrate the superiority of MiDaS (or MiDaS-B) for representative sampling (respectively, back-in-time sampling) compared to 14 (respectively, 10) competing methods.

Our contributions are summarized as follows:

—

Problem Formulation: To the best of our knowledge, we are the first to formulate the problems of representative sampling and back-in-time sampling from real-world hypergraphs. Our formulation is based on ten pervasive structural properties of real-world hypergraphs.

—

Observations: We examine the characteristics of a number of intuitive sampling approaches on 11 datasets, and our findings guide the development of our more effective algorithms.

—

Algorithm Design: We propose MiDaS and MiDaS-B for representative sampling and back-in-time sampling, respectively. Their superiority is validated through extensive experiments conducted across 11 datasets and five different sampling portions.

For reproducibility, we make our code and datasets available at https://github.com/young917/MiDaS.

6.2 Future Directions

Applications: Recall that our methods are designed for sampling general-purpose sub-hypergraphs, with the primary goal of preserving a wide range of hypergraph structural properties. While they can exhibit versatility, they may not be optimal for specific objectives or applications. We plan to explore the effectiveness of our methods in diverse applications and further investigate sub-hypergraph sampling techniques tailored to specific applications.

Diverse types of hypergraphs: Furthermore, our current focus is primarily on homogeneous hypergraphs, without explicit consideration of the potential presence of various node types and/or hyperedge types. In our future research, we plan to expand our scope to include heterogeneous hypergraphs.

Scalability: Considering the vast size of real-world hypergraphs, we acknowledge the necessity for more efficient sampling strategies, particularly for back-in-time sampling. In addition, the dynamic nature of real-world hypergraphs requires efficient updates of the sampled sub-hypergraph over time. We plan to address these challenges through technological advancements (e.g., quantum computing) as well as algorithmic innovation.

Footnotes

This article extends our previous work [14] on representative hypergraph sampling. In this extended version, we formulate a novel problem of back-in-time hypergraph sampling, whose goal is to accurately approximate a past snapshot of a given size of the input hypergraph (Section 4.1). Unlike representative sampling, we do not have access to the target (i.e., the past snapshot of a given size) in back-in-time sampling, and this unique challenge necessitates the development of a new algorithm. Thus, for the new problem, we examine a number of intuitive approaches (Section 4.2), and based on the examination, we design MiDaS-B, a novel algorithm for back-in-time sampling (Section 4.3). Finally, we demonstrate the empirical superiority of MiDaS-B (Section 4.4).

That is, \(F(x):=\sum _{i=1}^{x}f(i)\) and \(\hat{F}(x):=\sum _{i=1}^{x}\hat{f}(i)\) .

They are chosen among 0, \(2^{-3}\), \(2^{-2.5}\), \(2^{-2}\), \(\ldots\), \(2^{5.5}\), and \(2^{6}\).

⁴

The skewness is defined as \(\frac{ \mathrm{E}_{v} [ (d_{v} - \mathrm{E}_{v}[d_{v}])^{3} ]}{ \mathrm{E}_{v}[ (d_{v} - \mathrm{E}_{v} \left[ d_{v} \right])^{2}]^{3/2} }\) .

⁵

We chose \(\alpha\) minimizing \(\mathcal {L}(\mathcal {G}, \mathcal {\hat{G}})\) among \(\mathcal {S}=\lbrace 0, 2^{-3}, 2^{-2.5}, \ldots , 2^{5.5}, 2^6\rbrace\) .

⁶

We set \(p\) to \(10\%\) , \(20\%\) , \(30\%\) , \(40\%\) , or \(50\%.\)

⁷

We set \(\mathcal {S}=\lbrace 0, 2^{-3}, 2^{-2.5}, \ldots , 2^{5.5}, 2^6\rbrace\) throughout the article.

⁸

We use the search space \(\mathcal {S}=\lbrace 0, 2^{-2}, 2^{-1}, 2^{0}, 2^{1}\rbrace\) for \(\alpha\) in the back-in-time hypergraph sampling problem.

⁹

The correlation coefficients for a hyperparameter are averaged over the datasets, the sampling portions, and the values of the other hyperparameter.

¹⁰

We vary \(\alpha\) within \(\lbrace 0, 2^{-2}, 2^{-1}, 2^{0}, 2^{1}\rbrace\) and \(\beta\) within \(\lbrace -2^{0},-2^{-1}, -2^{-2}, 0, 2^{-2}, 2^{-1}, 2^{0} \rbrace\) .

¹¹

We measure signed differences. Specifically, for each of P1–P6, which are probability density functions, we measure \(\hat{F}(x^*) - F(x^*)\) where \(x^* = \text{arg\,max}_{x\in \mathcal {D}} \lbrace | \hat{F}(x) - F(x) |\rbrace\) , where \(F\) (or \(\hat{F}\) ) is the cumulative sum of the function \(f\) (respectively, \(\hat{f}\) ) for \(\mathcal {\bar{G}}\) (respectively, \(\mathcal {\hat{G}}\) ), and \(\mathcal {D}\) is the domain of \(f\) and \(\hat{f}\) . For each of P7–P10, which are scalars, we subtract the corresponding values of the ground-truth past snapshot from those of the sampled sub-hypergraph.

¹²

Similar to when computing the power-law fitness, we consider each hyperedge sampling order as a chronological arrival order of hyperedges.

¹³

In our implementation, in Algorithm 3, evolutionary properties critical for evaluating rejection conditions and calculating power-law fitness (e.g., the fraction of intersecting hyperedge pairs) are dynamically updated with each new hyperedge addition. For computational efficiency, these property values are recorded at regular intervals, specifically after every \(\lceil \frac{|\mathcal {E}|}{500} \rceil\) hyperedges are added, for subsequent utilization (i.e., rejection conditions and power-law fitness functions). Additionally, we maintain the sequence \(\mathcal {E}^{\prime }\) of hyperedges obtained with \(\alpha ^{\star }\) and \(\beta ^{\star }\) (i.e., those leading to the best fitness) in Algorithm 3 and reuse the first \(\lfloor |\mathcal {E}| \cdot p \rfloor\) hyperedges in the sequence as \(\mathcal {\hat{E}}\) in Algorithm 4. It is important to note that this modification enhances the efficiency of Algorithm 4 while maintaining its original semantics.

¹⁴

Adapting these methods to hypergraphs is non-trivial. Leskovec et al. [45], for instance, concentrated on the temporal order of nodes, but determining the order of hyperedges is crucial in the context of hypergraph sampling. Specifically, in Section 4.2, we demonstrate that selecting induced hyperedges based solely on the ground-truth order of node appearance (i.e., ONS) does not yield satisfactory results. Additionally, likelihood-based frameworks built on graph growth models [55, 74] are not easily applicable to hypergraphs due to the flexibility in hyperedge sizes.

¹⁵

These hypergraphs are characterized by node degrees that follow a power-law distribution with an exponent of \(-1\) ; hyperedge sizes following a power-law distribution with an exponent of \(-4\) and a maximum size at 100; and a node count of 1,000.

¹⁶

For HyperFF, we use \((p, q) \in [(0.5, 0.2), (0.55, 0.15), (0.55, 0.2), (0.55, 0.25), (0.55, 0.3)]\). For HyperLap, we use \((d_{\alpha }, e_{\beta }, L) \in [(-0.75, -5.0, 4), (-1.0, -4.0, 3), (-1.25, -4.0, 3), (-1.25, -7.0, 3), (-1.25, -7.0, 4), (-1.5, -5.0, 3)]\). For THera, we use \((d_{\alpha }, e_{\beta }, c, p, \alpha) \in [(-0.75, -4.0, 8, 0.75, 9.0), (-0.75, -4.0, 8, 0.5, 9.0), (-1.25, -4.0, 8, 0.75, 9.0), (-1.25, -4.0, 8, 0.5, 9.0), (-1.0, -4.0, 8, 0.5, 9.0)]\). For HyperPA, we use \((d_{\alpha }, e_{\beta }) \in [(-0.5, -4.0), (-0.75, -2.0), (-0.75, -6.0), (-1.0, -5.0), (-1.25, -5.0)]\).

A Appendix

A.1 Theoretical analysis of Observation 3

We theoretically analyze Observation 3 by analyzing the relation between hyperedge weighting and biases of degree distributions in samples.

Definition 1.

Let \(S\) be a hyperedge sampling algorithm and \(\phi _S(e) \ge 0\) be the weight of a hyperedge \(e\) for being selected by \(S\) . Then, we define the probability \(p_S(e)\) of \(e\) being selected by \(S\) as

\begin{equation*} p_S(e) = \frac{\phi _S(e)^\alpha }{\sum _{e^{\prime } \in \mathcal {E}}\phi _S(e^{\prime })^\alpha } = \frac{1}{Z_S(\alpha)} \phi _S(e)^\alpha , \end{equation*}

where \(Z_S(\alpha)\) is the normalization constant, and \(\alpha\) is a parameter.
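As a minimal sketch of Definition 1, the snippet below turns hyperedge weights into selection probabilities. The function name `selection_probabilities` and the toy weights are assumptions made for illustration; they are not taken from the released code.

```python
def selection_probabilities(weights, alpha):
    """Return p_S(e) = phi_S(e)^alpha / Z_S(alpha) for each hyperedge weight."""
    powered = [w ** alpha for w in weights]  # phi_S(e)^alpha
    z = sum(powered)                         # normalization constant Z_S(alpha)
    return [x / z for x in powered]


# Hypothetical positive weights of three hyperedges (made up for illustration).
phi = [1.0, 2.0, 4.0]
print(selection_probabilities(phi, 0.0))  # alpha = 0 gives a uniform distribution
print(selection_probabilities(phi, 1.0))  # alpha = 1 is proportional to the weights
```

Larger values of \(\alpha\) concentrate the probability mass on the hyperedges with the largest weights.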

Definition 2.

Given a sampling algorithm \(S\) , we denote by \(l_S(k)\) the probability of sampling a hyperedge that contains a node whose degree is lower than or equal to \(k\) from \(\mathcal {G}=(\mathcal {V},\mathcal {E})\) . That is,

\begin{align*} l_S(k) = \sum \nolimits _{e \in \mathcal {E}(\mathcal {V}_{k})} p_S(e) \end{align*}

where \(\mathcal {V}_{k} = \lbrace v \in \mathcal {V}: d(v) \le k \rbrace\) and \(\mathcal {E}({A}) = \lbrace e \in \mathcal {E}: e \cap A \ne \emptyset \rbrace\) .

Using \(l_S(k)\), we can define the probability of sampling a hyperedge that contains only nodes whose degrees are higher than \(k\), which we denote by \(h_S(k) = 1 - l_S(k)\). We use \(h_S(k)\) to compare the bias of sampling algorithms as follows:

Definition 3.

For any \(k\ge 0\) , if \(h_{A}(k) \lt h_{B}(k)\) , we say Algorithm \(B\) is more biased toward nodes with degree higher than \(k\) , compared to Algorithm \(A\) .
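The quantities \(l_S(k)\) and \(h_S(k)\) can be computed directly from their definitions. The sketch below does so on a toy hypergraph; the helper names and the example hyperedges are assumptions for illustration, and the selection probabilities are taken as given.

```python
from collections import Counter


def node_degrees(hyperedges):
    """Degree d(v): the number of hyperedges containing node v."""
    return Counter(v for e in hyperedges for v in set(e))


def l_prob(hyperedges, probs, k):
    """l_S(k): probability of sampling a hyperedge with some node of degree <= k."""
    deg = node_degrees(hyperedges)
    return sum(p for e, p in zip(hyperedges, probs)
               if any(deg[v] <= k for v in e))


def h_prob(hyperedges, probs, k):
    """h_S(k) = 1 - l_S(k): all nodes of the sampled hyperedge have degree > k."""
    return 1.0 - l_prob(hyperedges, probs, k)


# Toy hypergraph (made up): degrees are 1:3, 2:3, 3:2, 4:1.
H = [(1, 2, 3), (1, 2), (1, 4), (2, 3)]
uniform = [0.25] * len(H)
for k in (1, 2, 3):
    print(k, l_prob(H, uniform, k), h_prob(H, uniform, k))
```

Under Definition 3, an algorithm whose `h_prob` is larger at a given \(k\) is the one more biased toward nodes with degree higher than \(k\).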

Selecting a node with a degree higher than \(k\) can be divided into two cases: (a) selecting a hyperedge where at least one node has a degree higher than \(k\) but not all (i.e., \(\lbrace e \in \mathcal {E}: \exists v \in e \text{ such that } d(v) \le k \text{ and } \exists v^{\prime } \in e \text{ such that } d(v^{\prime }) \gt k \rbrace\) ) and (b) selecting a hyperedge where all nodes have degrees higher than \(k\) (i.e., \(\lbrace e \in \mathcal {E}: \forall v \in e, d(v) \gt k \rbrace\) ). Because the former case increases the probability of sampling a node with a degree less than or equal to \(k\) , the latter contributes more to being strongly biased toward nodes with a degree more than \(k\) .

Below, we are going to use \(M_{\omega }(\alpha)\) to refer to random hyperedge sampling with \(\omega (e)^{\alpha }\) as the hyperedge weight function. Note that the case of \(\alpha = 0\) (i.e., \(M_{\omega }(0)\) ) corresponds to RHS; and the case of \(\omega (e)=\min _{v \in e} d_{v}\) corresponds to MiDaS-Basic.

Theorem 4.

Given \(k\) , \(h_{M_{\omega }(\alpha)}(k)\) is an increasing function of \(\alpha\) , if \(k\) satisfies

\begin{equation} \mathbb {MAX}_{e \in \mathcal {E}(\mathcal {V}_{k})} \ln \omega (e) \lt \mathbb {AVG}_{e \in \mathcal {E}} \ln \omega (e). \tag{4} \end{equation}

Proof.

\(h_{M_{\omega }(\alpha)}(k)\) is an increasing function of \(\alpha\) if \(\frac{\partial l_{M_{\omega }(\alpha)}(k)}{\partial \alpha }\lt 0\) holds for any \(\alpha\) .

The condition \(\frac{\partial l_{M_{\omega }(\alpha)}(k)}{\partial \alpha }\lt 0\) can be rearranged as

\begin{equation} \sum _{e \in \mathcal {E}(\mathcal {V}_{k})} \frac{\omega (e)^{\alpha } \ln \omega (e)}{ \sum _{e^{\prime } \in \mathcal {E}(\mathcal {V}_{k})} \omega (e^{\prime })^{\alpha }} \lt \sum _{e \in \mathcal {E}} \frac{\omega (e)^{\alpha } \ln \omega (e)}{\sum _{e^{\prime } \in \mathcal {E}} \omega (e^{\prime })^{\alpha }}. \tag{5} \end{equation}

For any non-empty set of hyperedges \(\mathcal {\hat{E}}\) and any \(\alpha \ge 0\), the following inequality holds, since the middle term is a weighted average of \(\ln \omega (e)\) that equals the unweighted average at \(\alpha = 0\), is non-decreasing in \(\alpha\), and approaches the maximum as \(\alpha \rightarrow \infty\):

\begin{equation} \mathbb {AVG}_{e \in \mathcal {\hat{E}}} \ln \omega (e) \le \sum _{e \in \mathcal {\hat{E}}} \frac{\omega (e)^{\alpha } \ln \omega (e)}{ \sum _{e^{\prime } \in \mathcal {\hat{E}}} \omega (e^{\prime })^{\alpha }} \le \mathbb {MAX}_{e \in \mathcal {\hat{E}}} \ln \omega (e). \tag{6} \end{equation}

Based on Equation (6), we can get the upper bound of the left-hand side and the lower bound of the right-hand side of Equation (5).

Thus, if Equation (4) is satisfied for given \(k\) , \(\frac{\partial l_{M_{\omega }(\alpha)}(k)}{\partial \alpha }\lt 0\) holds for any \(\alpha\) . That is, \(h_{M_{\omega }(\alpha)}(k)\) is an increasing function of \(\alpha\) . □
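For readers who want the intermediate steps, the argument of the proof can be expanded as sketched below. The shorthands \(A(\alpha)\) and \(Z(\alpha)\) are introduced here for illustration only, and we assume \(\omega (e) \gt 0\) for every hyperedge, \(\mathcal {E}(\mathcal {V}_{k}) \ne \emptyset\), and \(\alpha \ge 0\).

\begin{align*} A(\alpha) &:= \sum _{e \in \mathcal {E}(\mathcal {V}_{k})} \omega (e)^{\alpha }, \qquad Z(\alpha) := \sum _{e \in \mathcal {E}} \omega (e)^{\alpha }, \qquad l_{M_{\omega }(\alpha)}(k) = \frac{A(\alpha)}{Z(\alpha)}, \\ \frac{\partial l_{M_{\omega }(\alpha)}(k)}{\partial \alpha } &= \frac{A^{\prime }(\alpha) Z(\alpha) - A(\alpha) Z^{\prime }(\alpha)}{Z(\alpha)^{2}} \lt 0 \iff \frac{A^{\prime }(\alpha)}{A(\alpha)} \lt \frac{Z^{\prime }(\alpha)}{Z(\alpha)}, \\ \frac{A^{\prime }(\alpha)}{A(\alpha)} &= \sum _{e \in \mathcal {E}(\mathcal {V}_{k})} \frac{\omega (e)^{\alpha } \ln \omega (e)}{\sum _{e^{\prime } \in \mathcal {E}(\mathcal {V}_{k})} \omega (e^{\prime })^{\alpha }} \le \mathbb {MAX}_{e \in \mathcal {E}(\mathcal {V}_{k})} \ln \omega (e) \lt \mathbb {AVG}_{e \in \mathcal {E}} \ln \omega (e) \le \sum _{e \in \mathcal {E}} \frac{\omega (e)^{\alpha } \ln \omega (e)}{\sum _{e^{\prime } \in \mathcal {E}} \omega (e^{\prime })^{\alpha }} = \frac{Z^{\prime }(\alpha)}{Z(\alpha)}. \end{align*}

The second line uses the quotient rule together with \(A(\alpha), Z(\alpha) \gt 0\). In the third line, the two non-strict inequalities are instances of Equation (6), and the strict inequality is Equation (4).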

We have the following corollary and lemma from Theorem 4.

Corollary 1.

For all \(k\) that satisfies Equation (4), MiDaS-Basic is more biased toward nodes with degrees larger than \(k\) as \(\alpha\) increases.

Lemma 2.

Given any sampling algorithm, if \(k=k^{\prime }\) satisfies the condition of Equation (4), then every \(k \lt k^{\prime }\) also satisfies the condition.

Proof.

The proof is straightforward, as \(\mathcal {E}(\mathcal {V}_{k})\) grows with \(k\) and thus the left-hand side of Equation (4) is a non-decreasing function of \(k\). □

We analyze MiDaS-Basic (i.e., \(\omega (e)=\min _{v \in e} d_{v}\)), MiDaS-Basic-Max (i.e., \(\omega (e)=\max _{v \in e} d_{v}\)), and MiDaS-Basic-Avg (i.e., \(\omega (e)=\mathrm{avg}_{v \in e} d_{v}\)), which are described in Appendix A.2. Specifically, we examine whether they have \(k\) satisfying Equation (4) in the 11 real-world hypergraphs. Based on Lemma 2, we examine \(k^{*}\), i.e., the maximum \(k\) value satisfying Equation (4). We find that MiDaS-Basic has some \(k\) satisfying Equation (4) in all datasets; the results in three datasets are shown in Figure 13. However, neither MiDaS-Basic-Max nor MiDaS-Basic-Avg has any \(k\) satisfying Equation (4) in most datasets. Even when they do in some datasets, their \(k^{*}\) values are smaller than those of MiDaS-Basic, as summarized below.

Fig. 13.

Observation 6.

On all eleven considered real-world hypergraphs, Equation (4) is satisfied for a wider range of \(k\) values in MiDaS-Basic than in its variants. Recall that Equation (4) is a sufficient condition for the bias toward high-degree nodes to grow as \(\alpha\) increases.
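The check behind this analysis can be sketched as follows: for a given weight function, scan \(k\) upward and keep the largest value for which Equation (4) holds. The function names and the toy hypergraph are assumptions for illustration; the early exit relies on Lemma 2.

```python
import math
from collections import Counter


def k_star(hyperedges, weight):
    """Largest k with MAX_{e in E(V_k)} ln w(e) < AVG_{e in E} ln w(e), or None."""
    deg = Counter(v for e in hyperedges for v in set(e))
    log_w = [math.log(weight(e, deg)) for e in hyperedges]
    avg_all = sum(log_w) / len(log_w)
    # A hyperedge belongs to E(V_k) exactly when its minimum node degree is <= k.
    min_deg = [min(deg[v] for v in e) for e in hyperedges]
    best = None
    for k in range(1, max(deg.values()) + 1):
        touched = [lw for lw, m in zip(log_w, min_deg) if m <= k]
        if not touched:
            continue  # E(V_k) is empty, so the maximum is undefined for this k
        if max(touched) < avg_all:
            best = k
        else:
            break  # by Lemma 2, no larger k can satisfy the condition
    return best


def w_min(e, deg):
    return min(deg[v] for v in e)   # weight used by MiDaS-Basic


def w_max(e, deg):
    return max(deg[v] for v in e)   # weight used by MiDaS-Basic-Max


# Toy hypergraph (made up): degrees are 1:4, 2:2, 3:2, 4:1.
H = [(1, 2), (1, 3), (1, 4), (1, 2, 3)]
print(k_star(H, w_min), k_star(H, w_max))
```

On this toy input the maximum-degree weight is identical for all hyperedges, so the strict inequality in Equation (4) can never hold and no \(k\) qualifies.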

A.2 Ablation Study of MiDaS-Basic

Below, in order to justify the design choices that we make when designing MiDaS-Basic, we compare it with its three variants:

—

MiDaS-Basic-Max: This variant uses \((\max _{v \in e}d_{v})^\alpha\) for hyperedge weighting.

—

MiDaS-Basic-Avg: This variant uses \((\text{avg}_{v \in e} d_{v})^\alpha\) for hyperedge weighting.

—

MiDaS-Basic-NS: This variant draws nodes with probability proportional to \(\lbrace d_{v}\rbrace ^{\alpha }\) and returns the induced sub-hypergraph.
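The hyperedge-weighting variants above differ only in how node degrees are aggregated within a hyperedge. The sketch below illustrates this under two assumptions that are ours rather than statements about the released code: degrees are computed once on the input hypergraph, and hyperedges are drawn one at a time without replacement. All function names are hypothetical.

```python
import random
from collections import Counter


def node_degrees(hyperedges):
    return Counter(v for e in hyperedges for v in set(e))


def hyperedge_weight(e, deg, variant):
    """Aggregate node degrees by 'min' (MiDaS-Basic), 'max', or 'avg'."""
    ds = [deg[v] for v in e]
    if variant == "min":
        return min(ds)
    if variant == "max":
        return max(ds)
    if variant == "avg":
        return sum(ds) / len(ds)
    raise ValueError(f"unknown variant: {variant}")


def sample_hyperedges(hyperedges, variant, alpha, n, rng):
    """Draw n distinct hyperedges, each time proportionally to weight^alpha."""
    deg = node_degrees(hyperedges)
    pool = list(hyperedges)
    weights = [hyperedge_weight(e, deg, variant) ** alpha for e in pool]
    chosen = []
    for _ in range(n):
        i = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(i))
        weights.pop(i)
    return chosen


def induced_hyperedges(hyperedges, nodes):
    """Hyperedges fully contained in the node set (used by node-selection variants)."""
    nodes = set(nodes)
    return [e for e in hyperedges if set(e) <= nodes]


H = [(1, 2), (1, 3), (1, 4), (1, 2, 3)]
print(sample_hyperedges(H, "min", 1.0, 2, random.Random(0)))
print(induced_hyperedges(H, {1, 2, 3}))
```

For the node-selection variant, the same pattern applies with nodes drawn proportionally to \(d_{v}^{\alpha }\) and `induced_hyperedges` applied to the drawn node set.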

Examination of Observation 3: We check whether the variants also show the tendency that biases in the degree distributions of samples are controlled by \(\alpha\) (Observation 3) when considering all settings. As shown in Table 13, neither MiDaS-Basic-Max nor MiDaS-Basic-Avg fully shows this tendency; among the variants, only MiDaS-Basic-NS exhibits Observation 3.

Table 13.

Preservation of Degree Distribution: In Figure 14, we visually compare the degree distributions obtained with the best-performing \(\alpha\) value of each variant, i.e., the value that minimizes the degree D-statistic. As mentioned above, only MiDaS-Basic and MiDaS-Basic-NS exhibit Observation 3, and accordingly these two methods yield degree distributions quite close to those of the original hypergraphs.

Fig. 14.

Evaluation Table: Even though MiDaS-Basic-NS preserves degree distributions about as well as MiDaS-Basic, all three variants (MiDaS-Basic-Max, MiDaS-Basic-Avg, and MiDaS-Basic-NS) are outperformed by MiDaS-Basic in most cases, as seen in Table 14. MiDaS-Basic consistently outperforms its variants regardless of sampling portions, as shown in Figure 15.

Table 14.

	MiDaS-Basic-Max	MiDaS-Basic-Avg	MiDaS-Basic-NS	MiDaS-Basic
Degree	2.85 (0.40)	2.62 (0.24)	2.15 (-0.20)	1.91 (-0.45)
Int. Size	1.91 (-0.34)	2.07 (-0.18)	3.27 (0.57)	2.27 (-0.06)
Pair Degree	2.07 (-0.13)	2.04 (-0.32)	2.85 (0.36)	2.56 (0.09)
Size	2.96 (0.53)	2.33 (0.06)	2.55 (0.14)	1.69 (-0.73)
SV	2.04 (-0.05)	1.75 (-0.24)	2.82 (0.41)	2.16 (-0.07)
CC	2.47 (0.39)	2.04 (0.12)	1.80 (-0.19)	1.65 (-0.32)
GCC	2.55 (0.01)	2.24 (0.05)	2.87 (0.34)	1.87 (-0.41)
Density	3.20 (0.65)	2.69 (0.39)	2.00 (-0.42)	1.62 (-0.62)
Overlapness	2.98 (0.46)	2.55 (0.28)	2.09 (-0.31)	1.91 (-0.43)
Diameter	2.64 (0.25)	2.27 (-0.03)	2.80 (0.24)	1.82 (-0.46)
AVG	2.57 (0.22)	2.26 (0.04)	2.52 (0.09)	1.95 (-0.35)

Table 14. MiDaS-Basic Gives Overall More Representative Samples than its Three Variants

We report rankings and Z-Scores (in parentheses) averaged over all 11 datasets and five different sampling portions (10%, 20%, 30%, 40%, and 50%).
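As a rough sketch of how per-setting rankings and Z-Scores such as those in Table 14 can be obtained, the snippet below ranks methods by their distance to the target (smaller is better) and standardizes the distances across methods. The use of the population standard deviation and the tie handling are our assumptions, and the distances are made up.

```python
from statistics import mean, pstdev


def ranks_and_z_scores(distances):
    """distances: method name -> distance to the target for one setting."""
    ordered = sorted(distances, key=distances.get)
    ranks = {m: i + 1 for i, m in enumerate(ordered)}  # ties keep insertion order
    mu = mean(distances.values())
    sigma = pstdev(distances.values())  # assumption: population standard deviation
    z = {m: ((d - mu) / sigma if sigma > 0 else 0.0)
         for m, d in distances.items()}
    return ranks, z


# Made-up distances of three hypothetical methods for a single property.
ranks, z = ranks_and_z_scores({"A": 0.1, "B": 0.2, "C": 0.3})
print(ranks)
print(z)
```

The reported numbers would then be averages of these per-setting values over datasets and sampling portions.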

Fig. 15.

A.3 Scalability of MiDaS and MiDaS-B

We analyze the scalability of MiDaS and MiDaS-B with respect to the input hypergraph size (i.e., the number of hyperedges in the input hypergraph). To this end, we generate synthetic hypergraphs of varying sizes, with up to \(10^9\) hyperedges, using HyperCL [42].¹⁵ We report the running time of MiDaS aggregated across five distinct sampling portions ( \(10\%, \ldots , 50\%\) ) for each dataset as in Section 3.4.4; and as in Section 4.4.4, we measure the running time of MiDaS-B with a fixed sampling portion of 0.9. As shown by a regression line slope close to 1 in the log-log scale in Figure 16(a), MiDaS exhibits linear scalability with respect to the number of hyperedges. In contrast, as shown in Figure 16(b), MiDaS-B demonstrates super-linear scalability, lying between linear and quadratic growth rates (i.e., the slope is between 1 and 2 in the log-log scale). It successfully handles hypergraphs with up to \(10^7\) hyperedges. Recall that the super-linear complexity of MiDaS-B arises from computing the structural evolution over time. For detailed mathematical analysis, refer to Theorem 3 and its proof.
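The slope in the log-log scale can be estimated with an ordinary least-squares fit on the logarithms of the input sizes and running times. The sketch below uses synthetic timings constructed to have known growth rates; they are not measurements from our experiments, and the function name is hypothetical.

```python
import math


def loglog_slope(sizes, times):
    """Least-squares slope of log10(time) against log10(size)."""
    xs = [math.log10(s) for s in sizes]
    ys = [math.log10(t) for t in times]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


sizes = [10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6]
linear_times = [2e-6 * n for n in sizes]             # synthetic linear growth
superlinear_times = [1e-7 * n ** 1.5 for n in sizes]  # synthetic n^1.5 growth
print(loglog_slope(sizes, linear_times))
print(loglog_slope(sizes, superlinear_times))
```

A slope near 1 indicates linear scaling, and a slope between 1 and 2 indicates growth between linear and quadratic.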

Fig. 16.

A.4 Parameter Sensitivity of MiDaS-Basic and MiDaS-B-Basic

The impact of \(\alpha\) of MiDaS-Basic: We analyze the sensitivity of MiDaS-Basic to the parameter \(\alpha\). We specifically quantify the impact of varying \(\alpha\) on both average node degrees and hyperedge sizes within the sampled sub-hypergraph. As seen in Figure 17(a)–(b), in line with Observation 3, there is a noticeable increase in average node degrees within the sub-hypergraph as \(\alpha\) increases. Conversely, average hyperedge sizes exhibit minimal change, since MiDaS-Basic samples hyperedge \(e\) with a probability proportional to \((\min _{v \in e} d_{v})^{\alpha }\) without explicit consideration of hyperedge size. Nonetheless, the sampled hyperedge size tends to decrease with increasing \(\alpha\), which we attribute to larger \(\min _{v \in e} d_{v}\) values being more probable for smaller hyperedges.

Fig. 17.

Further, we assess the degree D-statistic and size D-statistic w.r.t. \(\alpha\). As seen in Figure 17(c)–(d), the best performing \(\alpha\), resulting in the smallest D-statistic, varies for each property. However, in line with Observation 2, Figure 17(e)–(f) show that the optimal \(\alpha\) for achieving the smallest degree D-statistic closely aligns with that for achieving favorable average rankings and average Z-Scores. Additionally, note that the optimal \(\alpha\) varies for each dataset.

The impact of \(\alpha\) and \(\beta\) of MiDaS-B-Basic: We further explore the sensitivity of MiDaS-B-Basic to parameters \(\alpha\) and \(\beta\). As seen in Equation (3), \(\alpha\) influences the inclination toward sampling nodes with higher degrees, while \(\beta\) influences the inclination toward sampling hyperedges with smaller sizes. This tendency is clearly observed in Figure 18(a)–(b). However, due to the higher probability of obtaining larger \(\min _{v \in e} d_{v}\) values for smaller hyperedge sizes, the average node degrees in the sub-hypergraph are influenced by both \(\alpha\) and \(\beta\).

Fig. 18.

In addition, we analyze how the sampling quality is influenced by these parameters, focusing on degree D-statistics, size D-statistics, average rankings, and average Z-Scores. Unlike the earlier findings related to MiDaS-Basic, parameter values exhibiting low degree D-statistics do not necessarily align with those associated with favorable average rankings and average Z-Scores. The best-performing parameter values vary for each property. Therefore, determining the optimal parameter values requires considering degrees, sizes, and other properties collectively. Moreover, it is noteworthy that the best-performing parameter values vary also for each dataset.
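The D-statistics discussed in this subsection measure the largest gap between the cumulative distributions of a property in the target and in the sample. The following sketch computes this quantity for two lists of observed values and picks the candidate \(\alpha\) with the smallest degree D-statistic. The function names and the toy values are assumptions for illustration, not part of the released code.

```python
from collections import Counter


def d_statistic(target_values, sample_values):
    """Largest absolute gap between the two empirical cumulative distributions."""
    ct, cs = Counter(target_values), Counter(sample_values)
    nt, ns = len(target_values), len(sample_values)
    f_t = f_s = gap = 0.0
    for x in sorted(set(ct) | set(cs)):
        f_t += ct[x] / nt   # cumulative sum F(x) for the target
        f_s += cs[x] / ns   # cumulative sum F-hat(x) for the sample
        gap = max(gap, abs(f_s - f_t))
    return gap


def best_alpha(target_values, samples_by_alpha):
    """Candidate alpha whose sample has the smallest D-statistic."""
    return min(samples_by_alpha,
               key=lambda a: d_statistic(target_values, samples_by_alpha[a]))


# Made-up node degrees of a target and of samples drawn with two alpha values.
target_degrees = [1, 1, 2, 3]
samples = {0.0: [1, 1, 1, 1], 1.0: [1, 2, 2, 3]}
print(d_statistic(target_degrees, samples[0.0]))
print(best_alpha(target_degrees, samples))
```

Selecting parameters on a single property in this way matches the MiDaS-Basic findings above, whereas for MiDaS-B-Basic several properties would need to be considered jointly.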

A.5 Evaluation of MiDaS and MiDaS-B on Random Hypergraphs with Diverse Structures

To further examine the performance and versatility of MiDaS and MiDaS-B, we conduct evaluations on random hypergraphs. We generate 20 random hypergraphs using four hypergraph generators: HyperFF [35], HyperLap [42], THera [33], and HyperPA [19]. As their inputs, we use (a) node degrees following a power-law distribution with an exponent \(d_{\alpha }\) and (b) hyperedge sizes following a power-law distribution with an exponent \(e_{\beta }\). Specifically, we generate hypergraphs using HyperCL [42] with (a) and (b) as the inputs and extract all required inputs from the generated hypergraphs. We consider five distinct parameter (i.e., exponent) pairs for each hypergraph generator,¹⁶ resulting in 20 random hypergraphs with diverse structures.

Performance of MiDaS: In the representative hypergraph sampling problem, we compare MiDaS with six simple methods, HRW-based sampling, and MGS. We report their average performance across five distinct sampling portions (i.e., \(10\%, 20\%, 30\%, 40\%, 50\%\)) in Table 15. Notably, MiDaS demonstrates superior performance even on these random datasets, outperforming 14 competitors in terms of both average rankings and average Z-Scores. Consistent with the results from real-world hypergraphs, MiDaS performs particularly well in preserving degrees, density, overlapness, and effective diameter.

Table 15.

		RNS	RDN [46]	RW [64]	FF [35]	RHS	TIHS [2]	HRW-NB [78]	HRW-SK [78]	MGS-Deg-Add [27]	MGS-Deg-Rep [27]	MGS-Deg-Del [27]	MGS-Avg-Add [27]	MGS-Avg-Rep [27]	MGS-Avg-Del [27]	MiDaS
Degree	Dstat.	0.380	0.406	0.382	0.374	0.467	0.389	0.494	0.497	0.344	0.296	0.358	0.373	0.322	0.350	0.262
	Rank	8.200	8.810	7.886	7.467	13.352	8.010	9.990	10.048	7.790	4.324	7.905	8.533	6.219	6.962	4.219
	Z-Score	0.177	0.195	-0.031	-0.128	1.055	0.034	0.561	0.598	-0.198	-0.781	-0.022	0.061	-0.506	-0.121	-0.893
Int. Size	Dstat.	0.033	0.017	0.016	0.018	0.009	0.014	0.224	0.225	0.015	0.034	0.022	0.005	0.007	0.020	0.013
	Rank	10.533	8.171	8.162	7.990	4.362	7.590	14.190	14.019	5.743	8.819	8.371	2.867	3.667	7.752	7.581
	Z-Score	0.038	-0.314	-0.303	-0.290	-0.588	-0.372	2.024	2.028	-0.493	0.282	-0.183	-0.706	-0.609	-0.088	-0.427
Pair Degree	Dstat.	0.173	0.130	0.115	0.095	0.188	0.117	0.227	0.228	0.153	0.115	0.146	0.126	0.095	0.112	0.108
	Rank	11.905	7.943	6.105	5.667	13.457	6.390	8.600	8.600	9.943	7.667	8.924	7.610	4.876	6.257	5.914
	Z-Score	0.906	-0.039	-0.332	-0.451	1.210	-0.286	0.229	0.247	0.331	-0.244	0.277	-0.182	-0.752	-0.391	-0.523
Size	Dstat.	0.145	0.109	0.085	0.056	0.021	0.097	0.362	0.365	0.033	0.057	0.070	0.008	0.006	0.031	0.066
	Rank	12.581	11.029	9.095	6.810	3.867	10.333	14.257	14.400	5.352	7.124	8.124	1.733	1.762	5.429	7.733
	Z-Score	0.585	0.198	-0.103	-0.404	-0.720	0.072	2.033	2.081	-0.594	-0.264	-0.137	-0.898	-0.921	-0.615	-0.313
SV	Dstat.	0.095	0.252	0.252	0.268	0.089	0.252	0.107	0.104	0.131	0.122	0.122	0.120	0.118	0.120	0.211
	Rank	4.181	10.705	10.638	12.190	2.981	10.838	6.381	6.362	6.933	7.333	5.857	6.829	6.733	6.600	10.305
	Z-Score	-0.738	0.805	0.835	1.111	-0.950	0.828	-0.367	-0.416	-0.352	-0.191	-0.400	-0.349	-0.191	-0.194	0.568
CC	Dstat.	0.007	0.000	0.001	0.002	0.017	0.001	0.000	0.001	0.006	0.003	0.008	0.008	0.003	0.005	0.006
	Rank	4.886	1.210	1.181	1.400	6.400	2.038	1.190	1.724	4.067	3.295	5.124	4.600	3.457	4.790	3.429
	Z-Score	0.298	-0.356	-0.361	-0.335	0.616	-0.195	-0.372	-0.279	0.094	0.015	0.325	0.186	0.021	0.285	0.061
GCC	Diff.	0.172	0.120	0.090	0.073	0.196	0.098	0.142	0.158	0.175	0.167	0.193	0.169	0.154	0.177	0.105
	Rank	9.819	7.676	6.429	6.038	9.667	7.133	7.248	7.505	9.295	8.600	9.333	8.771	7.695	8.495	6.295
	Z-Score	0.550	-0.113	-0.365	-0.443	0.366	-0.251	-0.157	-0.039	0.203	0.053	0.450	0.114	-0.121	0.145	-0.391
Density	Diff.	0.477	0.350	0.341	0.317	0.654	0.341	0.593	0.594	0.507	0.550	0.604	0.561	0.558	0.616	0.229
	Rank	6.552	4.552	4.590	3.933	13.562	4.400	10.343	10.295	8.057	8.657	11.210	9.124	9.457	11.257	2.610
	Z-Score	-0.082	-0.738	-0.814	-1.002	0.981	-0.796	0.604	0.610	0.145	0.327	0.681	0.445	0.362	0.758	-1.482
Overlapness	Diff.	0.563	0.382	0.378	0.348	0.668	0.372	0.527	0.528	0.536	0.550	0.591	0.583	0.579	0.616	0.296
	Rank	8.981	4.619	4.524	3.829	14.305	4.362	7.524	7.619	9.571	8.657	10.800	10.638	10.200	11.467	2.895
	Z-Score	0.381	-0.684	-0.780	-1.086	1.125	-0.768	0.130	0.139	0.341	0.192	0.640	0.606	0.410	0.735	-1.381
Diameter	Diff.	0.217	0.204	0.193	0.190	0.343	0.198	0.113	0.120	0.238	0.136	0.220	0.289	0.197	0.202	0.108
	Rank	7.771	9.114	9.152	9.171	11.943	8.676	5.057	5.086	7.714	6.876	7.829	9.067	8.819	9.200	4.524
	Z-Score	-0.070	0.282	0.261	0.260	0.890	0.218	-0.726	-0.671	-0.095	-0.235	0.011	0.283	0.176	0.201	-0.785
Average	Rank	8.541	7.383	6.776	6.450	9.390	6.977	8.478	8.566	7.447	7.135	8.348	6.977	6.289	7.821	5.550
Average	Z-Score	0.204	-0.076	-0.199	-0.277	0.398	-0.152	0.396	0.430	-0.062	-0.085	0.164	-0.044	-0.213	0.071	-0.557

Table 15. Representative Sampling from Random Hypergraphs

MiDaS yields overall the most representative sub-hypergraphs. We compare 15 sampling methods on 20 random hypergraphs with five different sampling portions.

Performance of MiDaS-B : In the back-in-time hypergraph sampling problem, we conduct a comparison of MiDaS-B with ten other back-in-time sampling algorithms. The performances reported in Table 16 are averages across five distinct sampling portions (i.e., \(10\%, 30\%, 50\%, 70\%, 90\%\) ). MiDaS-B achieves the second-best performance, with ONS being the only algorithm outperforming it in terms of average rankings and Z-Scores. The outstanding performance of ONS can be attributed to the nature of hyperedges generated by HyperFF and HyperPA. Whenever a new node is introduced, these generators subsequently produce one or more hyperedges that include the new node. ONS, designed to sample induced hyperedges by adhering to the ground-truth generating order of nodes, performs well under these specific conditions. However, in practice, ground-truth node orders are seldom available, and MiDaS-B exhibits exceptional performance even without access to this information.
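To illustrate why a known node order helps on such generators, the sketch below adds nodes in their generating order and collects each hyperedge as soon as all of its nodes have been added. This is a simplified reading of ONS; the stopping rule, the truncation to the target size, and all names are our assumptions for illustration.

```python
from collections import defaultdict


def ordered_node_sampling(hyperedges, node_order, target):
    """Collect induced hyperedges while adding nodes in the given order."""
    missing = [len(set(e)) for e in hyperedges]  # nodes of e not yet added
    incident = defaultdict(list)
    for i, e in enumerate(hyperedges):
        for v in set(e):
            incident[v].append(i)
    sampled = []
    for v in node_order:  # assumes each node appears once in node_order
        for i in incident[v]:
            missing[i] -= 1
            if missing[i] == 0:  # all nodes of the hyperedge are now present
                sampled.append(hyperedges[i])
        if len(sampled) >= target:
            break
    return sampled[:target]


# Toy input (made up): nodes are generated in the order 1, 2, 3, 4.
H = [(1, 2), (2, 3), (1, 2, 3), (3, 4)]
print(ordered_node_sampling(H, [1, 2, 3, 4], 2))
```

When every new node arrives together with hyperedges that contain it, the hyperedges completed early in this procedure are those generated early, which is consistent with the strong results of ONS on these synthetic hypergraphs.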

Table 16.

		RNS	RDN [46]	RW [64]	FF [35]	ONS	RHS	TIHS [2]	HRW-NB [78]	HRW-SK [78]	MiDaS (w/o. Oracle)	MiDaS-B
Degree	Dstat.	0.195	0.329	0.333	0.346	0.206	0.222	0.326	0.211	0.207	0.199	0.189
	Rank	4.505	7.419	7.905	7.248	5.267	5.886	7.533	5.257	5.143	4.981	4.838
	Z-Score	-0.425	0.573	0.664	0.508	-0.224	-0.123	0.602	-0.211	-0.255	-0.496	-0.614
Int. Size	Dstat.	0.028	0.014	0.019	0.019	0.016	0.021	0.014	0.064	0.061	0.015	0.013
	Rank	7.438	5.248	5.962	5.086	3.505	6.000	5.286	9.067	8.619	4.676	5.114
	Z-Score	0.334	-0.291	-0.146	-0.378	-0.835	-0.016	-0.273	1.161	1.069	-0.375	-0.249
Pair Degree	Dstat.	0.106	0.099	0.104	0.103	0.076	0.110	0.092	0.084	0.085	0.103	0.109
	Rank	6.219	6.124	6.238	6.305	4.714	6.600	5.581	5.752	5.933	6.114	6.419
	Z-Score	0.093	-0.000	0.058	0.158	-0.353	0.168	-0.122	-0.064	-0.010	-0.017	0.089
Size	Dstat.	0.075	0.056	0.053	0.044	0.033	0.055	0.053	0.193	0.192	0.042	0.039
	Rank	6.705	5.676	5.190	4.324	3.181	6.095	5.486	9.981	9.943	4.562	4.657
	Z-Score	0.075	-0.258	-0.390	-0.516	-0.676	-0.041	-0.276	1.563	1.554	-0.575	-0.459
SV	Dstat.	0.094	0.167	0.167	0.176	0.076	0.097	0.166	0.103	0.104	0.089	0.085
	Rank	5.257	6.695	6.629	6.962	4.648	4.524	6.943	5.352	5.600	4.781	4.057
	Z-Score	-0.216	0.447	0.451	0.458	-0.247	-0.153	0.459	-0.109	-0.114	-0.480	-0.589
CC	Dstat.	0.009	0.006	0.006	0.007	0.006	0.010	0.006	0.006	0.006	0.012	0.014
	Rank	3.248	1.362	1.286	1.486	1.143	3.590	1.648	1.295	1.724	4.686	4.705
	Z-Score	0.143	-0.249	-0.250	-0.213	-0.266	0.254	-0.202	-0.251	-0.190	0.559	0.565
GCC	Diff.	0.068	0.071	0.080	0.090	0.070	0.092	0.071	0.126	0.122	0.098	0.096
	Rank	5.162	5.733	6.333	6.619	4.152	5.505	5.657	7.676	7.619	5.771	5.771
	Z-Score	-0.322	-0.130	0.049	0.052	-0.372	-0.176	-0.144	0.613	0.583	-0.059	-0.094
Density	Diff.	1.239	12.929	12.735	15.717	0.981	4.019	12.859	4.879	4.840	3.469	3.393
	Rank	3.648	7.524	7.571	7.600	5.219	5.590	7.514	6.190	6.067	4.705	4.019
	Z-Score	-0.669	0.509	0.549	0.563	-0.243	-0.169	0.495	-0.060	-0.053	-0.415	-0.505
Diameter	Diff.	0.310	0.609	0.647	0.612	0.398	0.424	0.594	0.320	0.334	0.403	0.407
	Rank	4.400	7.543	8.162	7.800	4.895	5.867	7.410	4.543	4.810	4.962	5.610
	Z-Score	-0.504	0.562	0.700	0.515	-0.298	-0.113	0.557	-0.462	-0.422	-0.289	-0.246
Overlapness	Diff.	5.721	27.916	27.924	35.639	2.655	9.603	27.828	8.391	8.341	8.656	8.822
	Rank	4.390	7.581	8.200	8.267	5.524	5.610	7.590	4.410	4.333	5.114	4.943
	Z-Score	-0.471	0.544	0.705	0.797	-0.063	-0.273	0.539	-0.571	-0.541	-0.275	-0.390
Average	Rank	5.097	6.090	6.348	6.170	4.225	5.527	6.065	5.952	5.979	5.035	5.013
Average	Z-Score	-0.196	0.171	0.239	0.194	-0.358	-0.064	0.163	0.161	0.162	-0.242	-0.249

Table 16. Back-in-time Sampling from Random Hypergraphs

Among the 11 sampling methods evaluated on 20 random hypergraphs with five different sampling portions, MiDaS-B achieves the second-best performance, with ONS being the only algorithm outperforming it in terms of average rankings and Z-Scores. Note that MiDaS-B exhibits exceptional performance even without access to ground-truth node orders, while ONS requires them.

References

[1]

Charu C. Aggarwal, Yuchen Zhao, and Philip S. Yu. 2011. Outlier detection in graph streams. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering.
