
Representative and Back-In-Time Sampling from Real-world Hypergraphs

Published: 26 April 2024
Abstract

    Graphs are widely used for representing pairwise interactions in complex systems. Since such real-world graphs are large and often ever-growing, sampling subgraphs is useful for various purposes, including simulation, visualization, stream processing, representation learning, and crawling. However, many complex systems consist of group interactions (e.g., collaborations of researchers and discussions on online Q&A platforms) and thus are represented more naturally and accurately by hypergraphs than by ordinary graphs. Motivated by the prevalence of large-scale hypergraphs, we study the problem of sampling from real-world hypergraphs, aiming at answering (Q1) how we can measure the goodness of sub-hypergraphs, and (Q2) how we can efficiently find a “good” sub-hypergraph. Regarding Q1, we distinguish between two goals: (a) representative sampling, which aims at capturing the characteristics of the input hypergraph, and (b) back-in-time sampling, which aims at closely approximating a past snapshot of the input time-evolving hypergraph. To evaluate the similarity of the sampled sub-hypergraph to the target (i.e., the input hypergraph or its past snapshot), we consider 10 graph-level, hyperedge-level, and node-level statistics. Regarding Q2, we first conduct a thorough analysis of various intuitive approaches using 11 real-world hypergraphs. Then, based on this analysis, we propose MiDaS and MiDaS-B, designed for representative sampling and back-in-time sampling, respectively. Regarding representative sampling, we demonstrate through extensive experiments that MiDaS, which employs a sampling bias toward high-degree nodes in hyperedge selection, is (a) Representative: finding overall the most representative samples among 15 considered approaches, (b) Fast: several orders of magnitude faster than the strongest competitors, and (c) Automatic: automatically tuning the degree of sampling bias. Regarding back-in-time sampling, we demonstrate that MiDaS-B inherits the strengths of MiDaS despite an additional challenge: the unavailability of the target (i.e., past snapshot). It effectively handles this challenge by focusing on replicating universal evolutionary patterns, rather than directly replicating the target.

    1 Introduction

    A complex system is a group of many parts that interact with each other. These systems are everywhere in our world. For instance, think about how different parts of our body work together, how animals and plants rely on each other in nature, or how we connect with friends and family on social media. All of these are examples of complex systems.
    Graphs, which consist of nodes and edges, are extensively utilized to model such complex systems. In these graphs, nodes represent entities and edges connect nodes that interact with each other. The expansion of the internet and the advancement of data digitization have led to the emergence of large-scale complex systems like e-mail networks, social media, and financial transactions. Hence, there is a growing demand for the efficient analysis of such large-scale graphs.
    Given the significant challenge of collecting and analyzing every entity in such large-scale graphs, a common approach involves sampling a smaller graph that retains properties similar to the original. This sampling strategy is widely employed in various tasks, including:
    Simulation: In the context of internet topology, where nodes represent hosts or routers and edges correspond to communication links, conducting simulations, especially packet-level ones, is notably time-intensive owing to the vast scale of the internet. These simulations often require multiple runs to ensure the reliability of the protocols under examination, which makes reducing simulation time a pressing need in this field. To address the computational challenges, sampling small graphs that resemble the internet topology has been utilized [36, 37].
    Visualization: Visualizing a large-scale graph is essential for a thorough human interpretation, yet it presents challenges due to the vast number of components (i.e., nodes and edges), the lack of screen space, and the complexity of layout algorithms. A small representative subgraph can be used to mitigate these difficulties [7, 17, 23, 38].
    Stream Processing: A dynamic graph that grows indefinitely is naturally treated as a stream of edges whose number can potentially be infinite. In dealing with such graphs, it becomes impractical to store every edge for analysis due to the vast and ever-expanding nature of the data. Consequently, several studies have shifted focus toward maintaining a subgraph that reflects the current state of the entire graph. This method is especially prevalent in various graph-related tasks, including outlier detection [1, 21], edge prediction [83], and triangle counting [40, 51, 62].
    Crawling: Online social networks (e.g., Facebook and X (formerly known as Twitter)) provide information on connections mainly by API queries. Limitations on API request rates make it inevitable to deal with a subgraph instead of the entire graph [39, 50, 69].
    Graph Representation Learning: Despite their wide usage, graph neural networks (GNNs) often suffer from scalability issues due to the recursive expansion of neighborhoods across layers. Sampling has been employed to accelerate training by limiting the size of the neighborhoods [9, 10, 11, 25, 75, 85].
    Ordinary graphs are suitable for modeling connections between two entities, known as pairwise interactions. However, in many complex systems, group interactions are prevalent, where more than two entities interact with each other simultaneously. Such interactions are commonly seen in various contexts, including multiple researchers collaborating on a manuscript, users engaging in a group discussion on online Q&A platforms, and the dynamic interplay of ingredients in a recipe.
    Consequently, these complex systems are more aptly depicted using hypergraphs rather than traditional graphs. A hypergraph consists of nodes and hyperedges, with each hyperedge capable of including any number of nodes, thus effectively capturing the essence of group interactions. This approach is visually demonstrated in Figure 1, where nodes represent tags and hyperedges correspond to multi-tagged questions on an online Q&A platform. Modeling complex systems as hypergraphs, rather than graphs, can help capture domain-specific structural patterns [43], predict interactions [73], cluster nodes [68], and measure node importance [12]. Since real-world hypergraphs are similar in size to and more complex than real-world graphs, sampling from hypergraphs provides substantial benefits, including those listed above.
    Fig. 1. Illustration of a hypergraph derived from an online Q&A platform: Nodes represent tags, such as RAM (R) and Windows (W), which can be attached to questions. Hyperedges, such as \(e_1\) to \(e_4\) , correspond to individual questions associated with sets of these tags. Specifically, hyperedge \(e_1\) is a question about the change from Windows to Linux, tagged with Windows (W), RAM (R), CPU (C), and GPU (G). This hypergraph, which is composed of six nodes and four hyperedges, effectively illustrates the multi-tag framework for categorizing questions on the platform.
    In this article, our primary focus is on the challenge of identifying a “good” sub-hypergraph sample within a given hypergraph. The definition of “good” can vary based on specific applications, prompting us to delve into general tasks and assess whether the sample adequately preserves the structural properties of the target. The target, in this context, can take one of two forms: either the input hypergraph itself or a past snapshot of the time-evolving hypergraph. For instance, when contemplating sampling a sub-hypergraph that represents half of the input hypergraph, a crucial question arises—should the goal be to preserve similar structural properties as the input hypergraph, or should it mimic the past version of the entire hypergraph when their size is halved? Given the validity of both perspectives, we approach the hypergraph sampling problem with two main objectives: (a) representative hypergraph sampling and (b) back-in-time hypergraph sampling. Furthermore, to the best of our knowledge, our work represents the first attempt to address the challenge of sampling from real-world hypergraphs. Consequently, we conduct an analysis of simple and intuitive sampling approaches, e.g., random sampling of hyperedges. Drawing insights from the properties of these straightforward approaches, we develop our algorithm to overcome their inherent weaknesses. To guide our investigation, we aim at answering the following questions for each problem:
    Q1. How can we measure the quality of a sub-hypergraph as a good sample?
    Q2. What are the benefits and limitations of simple and intuitive approaches for hypergraph sampling?
    Q3. How can we find a high-quality sample sub-hypergraph rapidly without extensively exploring the search space?
    In addressing the first problem, representative hypergraph sampling, the objective is to capture the characteristics of the input hypergraph within the sampled sub-hypergraph. Regarding Q1, we measure the difference between the input hypergraph and a sample sub-hypergraph using ten distinct statistics related to the unique structural properties of real-world hypergraphs [41]. These statistics include both node-level and hyperedge-level analyses, comparing the distributions of node degrees, hyperedge sizes, intersection sizes [35], and node-pair degrees [42] in both sampled and entire hypergraphs. Additionally, we assess their average clustering coefficient [19], density [26], overlapness [42], and effective diameter [35, 47] as graph-level statistics. Concerning Q2, we try six simple and intuitive sampling approaches from 11 real-world hypergraphs, as we are the first, to our knowledge, to tackle this problem. Then, we analyze their benefits and limitations. While some approaches preserve certain structural properties well, none of them succeeds in preserving all ten properties, demonstrating the difficulty of the considered problem. With respect to Q3, leveraging insights from our previous analyses, we propose Minimum Degree Biased Sampling of Hyperedges (MiDaS) for representative hypergraph sampling. MiDaS is inspired by two facts: (a) all the simple approaches fail to preserve degree distributions well and (b) the ability to preserve degree distributions is strongly correlated to the ability to preserve other properties. Utilizing these facts, MiDaS is designed to be able to draw hyperedges with a sampling bias (i.e., the statistical prioritization of specific nodes or hyperedges during sampling) toward those with high-degree nodes, while automatically adjusting the degree of bias to align with the degree distribution of the input hypergraph. Through extensive experiments, we show that MiDaS performs best overall among 14 competitors in 11 real-world hypergraphs, as shown in Figure 2.
    Fig. 2. Strengths of MiDaS. (a) It rapidly finds overall the most representative sub-hypergraphs among \({\bf 15}\) approaches from \({\bf 11}\) real-world hypergraphs. (b) Especially, it accurately preserves the degree distribution. See Section 3.4 for details.
    The second problem, back-in-time hypergraph sampling, is defined as follows: Given a snapshot of a time-evolving hypergraph and a target size, the objective is to construct a sub-hypergraph that closely approximates the past snapshot of the hypergraph at the target size. Note that the target (i.e., the past snapshot of the hypergraph) is not provided, unlike in representative sampling, where the given hypergraph itself is the target. It is also important to note that both representative sampling and back-in-time sampling share the overarching goal of obtaining a structurally similar but smaller sub-hypergraph from the input hypergraph, making both suitable for the applications mentioned. Regarding Q1, we assess the quality of a sub-hypergraph by comparing it, using the aforementioned ten statistics, with a past snapshot of the same size. Concerning Q2, we analyze eight sampling methods, including the aforementioned six straightforward methods and MiDaS, across 11 real-world hypergraphs. They exhibit distinct characteristics, which also differ from those observed in the previous problem. Notably, while MiDaS, designed for representative sampling, exhibits superior performance compared to other simple sampling methods, it encounters challenges in effectively preserving hyperedge sizes in this back-in-time hypergraph sampling problem. Therefore, in response to Q3, we introduce MiDaS-B, an extension of MiDaS specifically tailored for back-in-time hypergraph sampling. MiDaS-B additionally incorporates a hyperedge-size-related term into the hyperedge sampling probabilities, effectively controlling biases toward both degrees and sizes to closely match the degree and size distribution of the target hypergraph. Note that, since the target hypergraph is unavailable, adjusting the hyperparameters of MiDaS-B to minimize the difference from it is not straightforward. In order to address this challenge, we leverage the replication of evolutionary patterns that are commonly observed in real-world hypergraphs as a substitute objective for tuning hyperparameters. Experimental results demonstrate that MiDaS-B significantly outperforms 10 competing methods across 11 real-world hypergraphs.
    Our contributions are summarized as follows1:
    New Problem: To the best of our knowledge, our work is the first to tackle the challenging task of sampling sub-hypergraphs from real-world hypergraphs. We aim at obtaining structurally similar yet smaller sub-hypergraphs while pursuing two distinct objectives (representative sampling and back-in-time sampling).
    Findings: We conduct a comprehensive analysis of a wide array of intuitive sampling approaches in the context of these new problems. Our examination, conducted on 11 datasets, focuses on uncovering the limitations of these approaches in preserving 10 essential properties of the target hypergraph.
    Algorithm: We propose MiDaS, which rapidly finds overall the most representative sample—a sub-hypergraph sharing structural similarities but smaller than the input hypergraph—among 15 methods (see Figure 2). Additionally, we present MiDaS-B, an extension of MiDaS, capable of accurately approximating the past snapshot of the input hypergraph, without relying on the ground-truth past snapshot information.
    For reproducibility, we make our code and datasets available at https://github.com/young917/MiDaS.
    The rest of the article is organized as follows. In Section 2, we establish the necessary notations and preliminaries, including structural statistics of hypergraphs, simple and intuitive sampling approaches, datasets used in the article, and evaluation criteria for assessing the quality of sub-hypergraphs. In Section 3, we focus on representative hypergraph sampling, introducing and evaluating our proposed algorithm, MiDaS. In Section 4, we shift our focus to back-in-time hypergraph sampling, where we introduce and evaluate our proposed algorithm, MiDaS-B. In Section 5, we discuss related works. In Section 6, we offer conclusions along with future research directions.

    2 Preliminaries and Datasets

    In this section, we provide an overview of the basic concepts related to hypergraphs. We then discuss the ten statistics that we use to measure the performance of hypergraph sampling. We also introduce simple and intuitive sampling approaches that serve as baselines for comparison. Additionally, we describe the datasets we use in our evaluation and provide an overview of the numerical evaluation process.

    2.1 Notations

    A hypergraph \(\mathcal {G}=(\mathcal {V},\mathcal {E})\) consists of a set of nodes \(\mathcal {V}\) and a set of hyperedges \(\mathcal {E}\subseteq 2^{\mathcal {V}}\) . Each hyperedge \(e\in \mathcal {E}\) is a non-empty subset of \(\mathcal {V}\) . The degree of a node \(v\) is the number of hyperedges containing \(v\) , i.e., \(d_{v}:=|\lbrace e \in \mathcal {E}: v \in e \rbrace |\) . A sub-hypergraph \(\mathcal {\hat{G}}=(\mathcal {\hat{V}},\mathcal {\hat{E}})\) of \(\mathcal {G}=(\mathcal {V},\mathcal {E})\) is a hypergraph consisting of nodes \(\mathcal {\hat{V}}\subseteq \mathcal {V}\) and hyperedges \(\mathcal {\hat{E}}\subseteq \mathcal {E}\) .
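    To make the notation concrete, below is a minimal sketch in Python that stores a hypergraph as a list of hyperedges (frozensets of node ids) and computes node degrees; the names are illustrative and not the interface of the released code.

```python
# A minimal sketch of the notation in Section 2.1: a hypergraph as a
# list of hyperedges (frozensets of node ids). Names are illustrative.
from collections import Counter

def node_degrees(hyperedges):
    """d_v := |{e in E : v in e}| for every node v."""
    deg = Counter()
    for e in hyperedges:
        deg.update(e)
    return deg

# Toy hypergraph with 5 nodes and 4 hyperedges.
E = [frozenset({0, 1, 2, 3}), frozenset({1, 2}),
     frozenset({2, 3, 4}), frozenset({0, 4})]
print(node_degrees(E))  # e.g., node 2 is contained in 3 hyperedges
```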

    2.2 Statistics for Structure of Hypergraphs

    We introduce ten node-level (P1, P3), hyperedge-level (P2, P4), and graph-level (P5–P10) statistics that have been used extensively for structure analysis of real-world graphs [46, 47] and hypergraphs [19, 35, 42]. Refer to a survey [41] for the structure analyses. They are used throughout this article to measure the structural similarity of hypergraphs.
    P1. Degree: We consider the degree distribution of nodes. The distribution tends to be heavy-tailed in real-world hypergraphs but not in uniform random hypergraphs [19, 35].
    P2. Size: We consider the size distribution of hyperedges, which is shown to be heavy-tailed in real-world hypergraphs [35].
    P3. Pair Degree: We consider the pair degree distribution of neighboring node pairs. The pair degree of two nodes is defined as the number of hyperedges containing both. The distribution reveals structural similarity between nodes, and it tends to have a heavier tail in real-world hypergraphs than in randomized ones [42].
    P4. Intersection Size (Int. Size): We consider the intersection-size (i.e., count of common nodes) distribution of overlapping hyperedge pairs. The distribution from pairwise connections between hyperedges is heavy-tailed in many real-world hypergraphs [35].
    P5. Singular Values (SV): We consider the relative variance explained by singular vectors of the incidence matrix. In detail, for each \(i\in \lbrace 1,\ldots ,R\rbrace\) , we compute \(s_{i}^{2}\) / \(\sum _{k=1}^{R} s_{k}^{2}\) where \(s_i\) is the \(i\) th largest singular value and \(R\) is the rank of the incidence matrix. Singular values indicate the variance explained by the corresponding singular vectors [66], and they are highly skewed in many real-world hypergraphs [35]. They are also equal to the square root of eigenvalues of the weighted adjacency matrix. For the large datasets from the threads and co-authorship domains, we use 300 instead of \(R\) , and for a sample from them, we use \(300/R\) of the rank of its incidence matrix.
    P6. Connected Component Size (CC): We consider the portion of nodes in each \(i\) th largest connected component in the clique expansion. The clique expansion of a hypergraph \(\mathcal {G}=(\mathcal {V},\mathcal {E})\) is the undirected graph obtained by replacing each hyperedge \(e\in \mathcal {E}\) with the clique with the nodes in \(e\) . In many real-world hypergraphs, a majority of nodes belong to a few connected components [19].
    P7. Global Clustering Coefficient (GCC): We estimate the average of the clustering coefficients of all nodes in the clique expansion (defined in P6) using [59]. This statistic measures the cohesiveness of connections, and it tends to be larger in real-world hypergraphs than in uniform random hypergraphs [19].
    P8. Density: The density is defined as the ratio of the hyperedge count over the node count (i.e., \(|\mathcal {E}|/|\mathcal {V}|\) ) [26]. Hypergraphs from the same domain tend to share a similar significance of density [42].
    P9. Overlapness: The overlapness of a hypergraph \(\mathcal {G}=(\mathcal {V},\mathcal {E})\) is defined as \(\sum _{e \in \mathcal {E}} |e| / |\mathcal {V}|\) . It measures the degree of hyperedge overlaps, satisfying desirable axioms [42]. Hypergraphs from the same domain tend to share a similar significance of overlapness [42].
    P10. Diameter: The effective diameter is defined as the smallest \(d\) such that the paths of length at most \(d\) in the clique expansion (defined in P6) connect 90% of reachable pairs of nodes [47]. It measures how closely nodes are connected. The effective diameter tends to be small in real-world hypergraphs [35].
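    To make these statistics concrete, the following sketch computes the scalar statistics P8 and P9 and the distributions underlying P1 and P2 directly from their definitions above; helper names are illustrative, and this is a sketch rather than the evaluation code used in our experiments.

```python
# Sketch of the scalar statistics P8-P9 and the distributions behind
# P1-P2, following the definitions in Section 2.2.
from collections import Counter

def density(hyperedges, nodes):
    # P8: |E| / |V|
    return len(hyperedges) / len(nodes)

def overlapness(hyperedges, nodes):
    # P9: sum of hyperedge sizes over the node count
    return sum(len(e) for e in hyperedges) / len(nodes)

def degree_distribution(hyperedges):
    # P1: how many nodes have each degree
    deg = Counter()
    for e in hyperedges:
        deg.update(e)
    return Counter(deg.values())

def size_distribution(hyperedges):
    # P2: how many hyperedges have each size
    return Counter(len(e) for e in hyperedges)
```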

    2.3 Simple and Intuitive Sampling Approaches

    We describe the six intuitive approaches, which are categorized into node-selection methods and hyperedge-selection methods.

    2.3.1 Node Selection (NS).

    In node-selection methods, we choose a subset \(\mathcal {\hat{V}}\) of nodes and return the induced sub-hypergraph \(\mathcal {\hat{G}}=(\mathcal {\hat{V}},\mathcal {E}({\mathcal {\hat{V}}}))\) where \(\mathcal {E}({\mathcal {\hat{V}}}):=\lbrace e \in \mathcal {E}: \forall v \in e, v \in \mathcal {\hat{V}}\rbrace\) denotes the set of hyperedges composed only of nodes in \(\mathcal {\hat{V}}\) . Each process below is repeated until \(\mathcal {\hat{G}}\) has the desired size (i.e., \(|\mathcal {E}({\mathcal {\hat{V}}})|=\lfloor |\mathcal {E}| \cdot p \rfloor\) with \(p\) representing the proportion of sampling).
    Random Node Sampling (RNS): We repeat drawing a node uniformly at random and adding it to \(\mathcal {\hat{V}}\) .
    Random Degree Node (RDN): We repeat drawing a node with probabilities proportional to node degrees and adding it to \(\mathcal {\hat{V}}\) as in [46].
    Random Walk (RW): We perform random walk with restart [64], setting the restart probability \(c=0.15\) on the clique expansion (defined in Section 2.2), and add each visited node to \(\mathcal {\hat{V}}\) in turn. We select a new seed node, where random walks restart, after reaching the maximum number of steps, which is set to the number of nodes.
    Forest Fire (FF): We simulate forest fire in hypergraphs as in [35]. First, we choose a random node \(w\) as an ambassador and burn it. Then, we burn \(n\) neighbors of \(w\) where \(n\) is sampled from a geometric distribution with mean \(p/(1-p)\) . We recursively apply the previous step to each burned neighbor by considering it as a new ambassador, but the number of neighbors to be burned is sampled from a different geometric distribution with mean \(q/(1-q)\) . Each burned node is added to \(\mathcal {\hat{V}}\) in turn. If there is no new burned node, we choose a new ambassador uniformly at random. We set \(p\) to 0.51 and \(q\) to 0.2 as in [35]. This method extends a successful representative sampling method for graphs [46] to hypergraphs.
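    For concreteness, below is a minimal sketch of the node-selection template with RNS as the selection rule, following the definition of the induced hyperedge set \(\mathcal {E}(\mathcal {\hat{V}})\) above. The names are illustrative, and the induced set is recomputed from scratch at every step for clarity rather than efficiency.

```python
import random

def induced_hyperedges(hyperedges, sampled_nodes):
    # E(V-hat) := hyperedges composed only of nodes in V-hat
    return [e for e in hyperedges if e <= sampled_nodes]

def random_node_sampling(nodes, hyperedges, p, seed=0):
    """RNS sketch: grow V-hat one uniformly random node at a time until
    the induced sub-hypergraph reaches floor(|E| * p) hyperedges."""
    rng = random.Random(seed)
    target = int(len(hyperedges) * p)
    order = list(nodes)
    rng.shuffle(order)
    sampled_nodes, induced = set(), []
    for v in order:
        sampled_nodes.add(v)
        induced = induced_hyperedges(hyperedges, sampled_nodes)
        if len(induced) >= target:  # may slightly overshoot the target
            break
    return sampled_nodes, induced
```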

    2.3.2 Hyperedge Selection (HS).

    In hyperedge-selection methods, we draw a subset \(\mathcal {\hat{E}}\) of hyperedges and return \(\mathcal {\hat{G}}=(\mathcal {V}({\mathcal {\hat{E}}}),\mathcal {\hat{E}})\) , where \(\mathcal {V}({\mathcal {\hat{E}}}):=\bigcup _{e\in \mathcal {\hat{E}}}e\) is the set of nodes in any hyperedge in \(\mathcal {\hat{E}}\) .
    Random Hyperedge Sampling (RHS): We draw a target number (i.e., \(\lfloor |\mathcal {E}| \cdot p \rfloor\) with \(p\) representing the proportion of sampling) of hyperedges uniformly at random.
    Totally-Induced Hyperedge Sampling (TIHS): We extend totally-induced edge sampling [2] to hypergraphs. We repeat (a) adding a hyperedge uniformly at random to \(\mathcal {\hat{E}}\) and (b) adding all hyperedges induced by \(\mathcal {V}({\mathcal {\hat{E}}})\) (i.e., \(\lbrace e \in \mathcal {E}: \forall v \in e, v \in \mathcal {V}({\mathcal {\hat{E}}})\rbrace\) ) to \(\mathcal {\hat{E}}\) .
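    A sketch of TIHS under the same conventions is given below; names are illustrative, and the final truncation to the target count is one possible way to enforce the size constraint when step (b) overshoots.

```python
import random

def tihs(hyperedges, p, seed=0):
    """TIHS sketch: alternate between (a) adding a uniformly random
    unsampled hyperedge and (b) adding every remaining hyperedge fully
    covered by the nodes sampled so far."""
    rng = random.Random(seed)
    target = int(len(hyperedges) * p)
    remaining = list(hyperedges)
    rng.shuffle(remaining)
    sampled, covered = [], set()
    while remaining and len(sampled) < target:
        e = remaining.pop()          # (a) a uniformly random hyperedge
        sampled.append(e)
        covered |= e
        induced = [f for f in remaining if f <= covered]
        sampled.extend(induced)      # (b) hyperedges induced by V(E-hat)
        remaining = [f for f in remaining if not f <= covered]
    return sampled[:target], covered
```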

    2.4 Datasets

    Throughout the article, we use 11 datasets summarized in Table 1 after removing all duplicated hyperedges. Their domains are:
    | Dataset | \(|\mathcal{V}|\) | \(|\mathcal{E}|\) | Avg. \(d_v\) | Avg. \(|e|\) | No. of CCs | Largest CC | GCC | Density | Diameter |
    |---|---|---|---|---|---|---|---|---|---|
    | email-Enron | 143 | 1,514 | 32.3 | 3.05 | 1 | 143 | 0.66 | 10.59 | 2.38 |
    | email-Eu | 1,005 | 25,148 | 88.9 | 3.56 | 20 | 986 | 0.57 | 25.02 | 2.78 |
    | contact-primary | 242 | 12,704 | 126.9 | 2.42 | 1 | 242 | 0.53 | 52.50 | 1.88 |
    | contact-high | 327 | 7,818 | 55.6 | 2.33 | 1 | 327 | 0.50 | 23.91 | 2.63 |
    | NDC-classes | 1,161 | 1,090 | 5.6 | 5.97 | 183 | 628 | 0.83 | 0.94 | 4.65 |
    | NDC-substances | 5,556 | 10,273 | 12.2 | 6.62 | 1,888 | 3,414 | 0.72 | 1.85 | 3.04 |
    | tags-ubuntu | 3,029 | 147 K | 164.8 | 3.39 | 9 | 3,021 | 0.61 | 48.60 | 2.41 |
    | tags-math | 1,629 | 170 K | 364.1 | 3.48 | 3 | 1,627 | 0.63 | 104.65 | 2.13 |
    | threads-ubuntu | 125 K | 166 K | 2.5 | 1.91 | 39 K | 82 K | 0.55 | 1.33 | 4.73 |
    | coauth-geology | 1.2 M | 1.2 M | 3.0 | 3.17 | 230 K | 903 K | 0.76 | 0.96 | 7.04 |
    | coauth-history | 1 M | 896 K | 1.3 | 1.57 | 617 K | 242 K | 0.82 | 0.87 | 11.28 |
    Table 1. Overview of the Real-world Hypergraph Datasets Used in the Article
    The 11 datasets originate from six distinct domains, and they exhibit distinct structural properties.
    e-mail (email-Enron [34] and email-Eu [47, 71]): Each hyperedge represents an e-mail. It consists of the sender and receivers.
    contact (contact-primary [63] and contact-high [54]): Each hyperedge represents a group interaction. It consists of individuals.
    drugs (NDC-classes and NDC-substances): Each hyperedge represents an NDC code for a drug. It consists of classes or substances.
    tags (tags-ubuntu and tags-math): Each hyperedge represents a post. It consists of tags.
    threads (threads-ubuntu): Each hyperedge represents a question. It consists of a questioner and responders.
    co-authorship (coauth-geology [60] and coauth-history [60]): Each hyperedge represents a publication. It consists of co-authors.

    2.5 Evaluation

    In this work, our focus is on general-purpose sub-hypergraph sampling. Rather than assuming specific use cases of sampled sub-hypergraphs, we aim at preserving a wide range of structural properties that are identified as unique characteristics of real-world hypergraphs. Specifically, we evaluate the goodness of a sub-hypergraph \(\mathcal {\hat{G}}\) based on its ability to accurately preserve the structural properties of the target hypergraph in the ten aspects P1–P10. The target hypergraph can be either the input hypergraph itself or a past snapshot of the input time-evolving hypergraph, depending on whether we are dealing with representative sampling (Section 3) or back-in-time hypergraph sampling (Section 4), respectively. For each of P1–P6, which are (probability density) functions, we measure the Kolmogorov-Smirnov D-statistic. For functions \(f\) from \(\mathcal {G}\) and \(\hat{f}\) from \(\mathcal {\hat{G}}\) with cumulative sums \(F\) and \(\hat{F}\),2 the D-statistic is defined as follows:
    \begin{equation} \text{D-statistic} (f, \hat{f}) = \max _{x\in \mathcal {D}} \lbrace | \hat{F}(x) - F(x) | \rbrace , \tag{1} \end{equation}
    where \(\mathcal {D}\) is the domain of \(f\) and \(\hat{f}\) . For each of P7–P10, which are scalars, we measure the relative difference. Specifically, for scalars \(y\) from \(\mathcal {G}\) and \(\hat{y}\) from \(\mathcal {\hat{G}}\) , the relative difference is defined as follows:
    \begin{equation} \text{Relative Difference}(y, \hat{y}) = \frac{|y-\hat{y}|}{|y|}. \tag{2} \end{equation}
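    Both measures can be computed directly from these definitions; a minimal sketch follows, where distributions are represented as dictionaries mapping values to probabilities (illustrative names, not the evaluation code itself).

```python
import numpy as np

def d_statistic(f, f_hat):
    """Kolmogorov-Smirnov D-statistic (Equation (1)) between two
    distributions given as {value: probability} dictionaries."""
    domain = sorted(set(f) | set(f_hat))
    F = np.cumsum([f.get(x, 0.0) for x in domain])
    F_hat = np.cumsum([f_hat.get(x, 0.0) for x in domain])
    return float(np.max(np.abs(F_hat - F)))

def relative_difference(y, y_hat):
    """Relative difference (Equation (2)) between two scalar statistics."""
    return abs(y - y_hat) / abs(y)
```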
    In order to compare the qualities of sub-hypergraphs sampled by different methods, we aggregate the ten distances, one for each statistic described in Section 2.2. Since the scales of the distances may differ, we compute rankings and Z-Scores to make it possible to directly compare and average them, as follows:
    Ranking: With respect to each of P1–P10, we rank all sub-hypergraphs using their distances.
    Z-Score: With respect to each of P1–P10, we standardize the distance of each sub-hypergraph by subtracting the mean and dividing the difference by the standard deviation.
    When comparing sampling methods in multiple settings (e.g., sampling portions and datasets), we compute the above rankings (or Z-Scores) of their samples in each setting and average the rankings (or Z-Scores) of samples from each method. Note that both metrics are determined based on the specific methods being compared. Therefore, the same sampling method may yield different metric values depending on the methods used for comparison.
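    The following sketch illustrates this aggregation for one setting, assuming a (methods × properties) matrix of distances; the matrix values are made up purely for illustration.

```python
import numpy as np
from scipy.stats import rankdata, zscore

# Rank and standardize each property (column) across methods (rows),
# then average across properties per method. Smaller distance = better.
distances = np.array([
    [0.29, 0.09, 0.13],   # e.g., method 1 on three properties
    [0.30, 0.01, 0.11],   # e.g., method 2
    [0.28, 0.03, 0.09],   # e.g., method 3
])
ranks = np.apply_along_axis(rankdata, 0, distances)
zscores = zscore(distances, axis=0)
print(ranks.mean(axis=1))    # average ranking per method
print(zscores.mean(axis=1))  # average Z-Score per method
```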

    3 Representative Hypergraph Sampling

    In this section, we focus on representative hypergraph sampling. In Section 3.1, we provide a formal problem definition. In Section 3.2, given that we are, to the best of our knowledge, the first to explore this problem, we analyze the advantages and drawbacks of six simple and intuitive sampling approaches (described in Section 2.3). Leveraging insights from these investigations, in Section 3.3, we propose our approach, MiDaS. In Section 3.4, we demonstrate the effectiveness of MiDaS through experiments, where we follow the evaluation methodology outlined in Section 2.5.

    3.1 Problem Definition

    Based on the statistics (defined in Section 2.2), we formulate the representative hypergraph sampling problem in Problem 1.
    Problem 1 (Representative Hypergraph Sampling).
    Given: - a large hypergraph \(\mathcal {G}= (\mathcal {V}, \mathcal {E})\)
    a sampling portion \(p\in (0,1)\)
    Find: a sub-hypergraph \(\mathcal {\hat{G}}=(\mathcal {\hat{V}}, \mathcal {\hat{E}})\) where \(\mathcal {\hat{V}}\subseteq \mathcal {V}\) and \(\mathcal {\hat{E}}\subseteq \mathcal {E}\)
    to Preserve: ten structural properties of \(\mathcal {G}\) measured by P1-P10
    Subject to: \(|\mathcal {\hat{E}}| = \lfloor |\mathcal {E}| \cdot p \rfloor\)
    In Problem 1, the objective is to find the most representative sub-hypergraph composed of a given portion of hyperedges. However, achieving optimality is challenging as the ten structural properties need to be considered simultaneously. In this article, we focus on developing heuristics that work well in practice. We measure the performance of sampling algorithms by evaluating the sub-hypergraph \(\mathcal {\hat{G}}\) sampled by each algorithm as described in Section 2.5.

    3.2 Observations

    We evaluate the six intuitive approaches using the 11 datasets under five different sampling portions, as described in Section 2. The results are summarized in Table 2 and Table 3. Below, we describe the characteristics of each approach. We use \(\mathcal {\hat{G}}_{ALG}\) to denote a sub-hypergraph obtained by each approach \(ALG\) .
    Table 2. Representative Sampling Results in Six Datasets from Different Domains when the Sampling Portion is 0.3
    | Property | Metric | RNS | RDN [46] | RW [64] | FF [35] | RHS | TIHS [2] |
    |---|---|---|---|---|---|---|---|
    | Degree | D-stat. | 0.29 | 0.29 | 0.32 | 0.30 | 0.30 | 0.28 |
    | Degree | Rank (Z-Score) | 3.51 (-0.11) | 3.16 (-0.14) | 3.96 (0.29) | 3.89 (0.24) | 3.24 (-0.13) | 3.24 (-0.15) |
    | Int. Size | D-stat. | 0.09 | 0.03 | 0.04 | 0.04 | 0.01 | 0.03 |
    | Int. Size | Rank (Z-Score) | 4.42 (0.69) | 3.49 (-0.06) | 4.00 (0.15) | 4.29 (0.41) | 1.07 (-1.14) | 3.73 (-0.04) |
    | Pair Degree | D-stat. | 0.13 | 0.11 | 0.09 | 0.11 | 0.11 | 0.09 |
    | Pair Degree | Rank (Z-Score) | 3.95 (0.26) | 4.11 (0.30) | 2.69 (-0.28) | 3.93 (0.13) | 3.64 (0.01) | 2.69 (-0.43) |
    | Size | D-stat. | 0.23 | 0.11 | 0.12 | 0.06 | 0.01 | 0.09 |
    | Size | Rank (Z-Score) | 5.85 (1.54) | 4.20 (0.08) | 4.11 (0.20) | 2.45 (-0.57) | 1.00 (-1.11) | 3.38 (-0.14) |
    | SV | D-stat. | 0.12 | 0.16 | 0.15 | 0.16 | 0.08 | 0.15 |
    | SV | Rank (Z-Score) | 2.89 (-0.24) | 4.07 (0.30) | 3.64 (0.21) | 4.56 (0.59) | 1.78 (-0.96) | 3.45 (0.10) |
    | CC | D-stat. | 0.16 | 0.13 | 0.18 | 0.14 | 0.10 | 0.14 |
    | CC | Rank (Z-Score) | 3.71 (0.61) | 2.25 (-0.17) | 3.73 (0.46) | 3.09 (-0.04) | 1.87 (-0.71) | 2.65 (-0.15) |
    | GCC | Diff. | 0.12 | 0.15 | 0.09 | 0.12 | 0.10 | 0.08 |
    | GCC | Rank (Z-Score) | 3.31 (0.06) | 4.60 (0.53) | 3.15 (-0.25) | 4.16 (0.16) | 2.84 (-0.30) | 2.95 (-0.21) |
    | Density | Diff. | 0.37 | 0.54 | 0.49 | 0.52 | 0.52 | 0.43 |
    | Density | Rank (Z-Score) | 3.07 (-0.38) | 4.20 (0.35) | 3.24 (-0.08) | 3.42 (-0.00) | 4.18 (0.44) | 2.89 (-0.33) |
    | Overlapness | Diff. | 0.55 | 0.55 | 0.62 | 0.53 | 0.52 | 0.46 |
    | Overlapness | Rank (Z-Score) | 3.98 (0.16) | 3.49 (-0.09) | 3.80 (0.29) | 3.24 (-0.20) | 3.71 (0.20) | 2.78 (-0.38) |
    | Diameter | Diff. | 0.34 | 0.14 | 0.12 | 0.11 | 0.20 | 0.12 |
    | Diameter | Rank (Z-Score) | 4.64 (0.72) | 3.02 (-0.21) | 3.42 (-0.15) | 3.20 (-0.26) | 3.98 (0.29) | 2.75 (-0.39) |
    | Average | Rank (Z-Score) | 3.93 (0.33) | 3.66 (0.09) | 3.57 (0.08) | 3.62 (0.05) | 2.73 (-0.34) | 3.05 (-0.21) |
    Table 3. Six Intuitive Sampling Methods are Compared as Described in Section 2
    D-statistics (Dstat.) and relative differences (Diff.) are computed for each property. As D-statistics and relative differences have varying scales for different properties, rankings and Z-Scores are calculated to facilitate comparisons among the six methods. Reported results are the averages over five sampling portions ( \(10\%, \ldots , 50\%\) ) across 11 datasets. The bold text highlights the best results in terms of each property. Notably, RHS provides the most representative sub-hypergraphs overall.

    3.2.1 Random Node Sampling (RNS).

    Small hyperedges: In \(\mathcal {\hat{G}}_{RNS}\) , large hyperedges are rarely sampled because all nodes in a hyperedge must be sampled for the hyperedge to be sampled, which is unlikely.
    Weak connectivity: As large hyperedges are rare, the local and global connectivity is weak. Locally, node degrees, node-pair degrees, hyperedge sizes, and intersection sizes tend to be low in \(\mathcal {\hat{G}}_{RNS}\) . Globally, \(\mathcal {\hat{G}}_{RNS}\) tends to have low density, especially low overlapness, and large diameter. It also tends to have many connected components with small portions of nodes.
    Precise preservation of relative singular values: Relative singular values (see P5) are preserved best in \(\mathcal {\hat{G}}_{RNS}\) among the sub-hypergraphs obtained by node-selection methods.

    3.2.2 Random Degree Node (RDN), Random Walk (RW), and Forest-Fire (FF).

    More high-degree nodes than RNS: RDN, RW, and FF lead to a larger portion of high-degree nodes than RNS since they prioritize high-degree nodes. Thus, they preserve degree distributions better than RNS in some datasets where RNS significantly increases the fraction of low-degree nodes. Especially, degree distributions are preserved best by RDN in terms of ranking.
    Stronger connectivity than RNS: High-degree nodes strengthen connectivity. Thus, sub-hypergraphs obtained by non-uniform node-selection methods tend to have higher density, higher overlapness, and smaller diameter not only than \(\mathcal {\hat{G}}_{RNS}\) but sometimes even than the original hypergraph \(\mathcal {G}\) . Notably, in the sub-hypergraphs, a larger fraction of nodes belong to the largest connected component, reducing the number of connected components, compared to \(\mathcal {G}\) .

    3.2.3 Random Hyperedge Sampling (RHS).

    Best preservation of many properties: RHS preserves hyperedge-level statistics (i.e., hyperedge sizes and intersection sizes) nearly perfectly. It is also best at preserving connected-component sizes, relative singular values, and global clustering coefficients.
    Weak connectivity: As RHS is equivalent to uniform hypergraph sparsification, \(\mathcal {\hat{G}}_{RHS}\) suffers from weak connectivity. Locally, node degrees and pair degrees tend to be low in \(\mathcal {\hat{G}}_{RHS}\) . Globally, in \(\mathcal {\hat{G}}_{RHS}\) , density and overlapness are low, and diameter is large.

    3.2.4 Totally-Induced Hyperedge Sampling (TIHS).

    Complementarity to RHS: TIHS preserves node degrees, node-pair degrees, density, overlapness, and diameter best, which are overlooked by RHS.
    Strong connectivity: Still, node degrees, density, and overlapness tend to be higher, and diameter tends to be smaller in \(\mathcal {\hat{G}}_{TIHS}\) than in the original hypergraph \(\mathcal {G}\) . That is, the connectivity tends to be a bit stronger in \(\mathcal {\hat{G}}_{TIHS}\) than in \(\mathcal {G}\) . Thus, \(\mathcal {\hat{G}}_{TIHS}\) tends to have fewer but larger connected components than \(\mathcal {G}\) .

    3.2.5 Summary of Observations.

    As summarized in Table 3, when considering all settings, RHS provides overall the best representative sub-hypergraphs. While RHS produces sub-hypergraphs with weaker connectivity, RHS is by far the best method in preserving hyperedge sizes, intersection sizes, relative singular values, connected-component sizes, and global clustering coefficients.

    3.3 Proposed Approach: MiDaS

    In this section, we propose MiDaS, our sampling method for Problem 1. We first discuss the motivations behind it. Then, we describe MiDaS-Basic, a preliminary version of MiDaS. Lastly, we present the full-fledged version of MiDaS.

    3.3.1 Intuitions Behind MiDaS.

    Analyzing the simple approaches in Section 3.2 motivates us to come up with MiDaS. Especially, we focus on the following findings:
    Observation 1.
    RHS performs best, but its samples suffer from weak connectivity, including the lack of high-degree nodes.
    Observation 2.
    The ability to preserve degree distributions is strongly correlated with the ability to preserve other properties and thus with the overall performance, as shown in Figure 3.
    Fig. 3. Pearson correlation coefficients between rankings (and Z-Scores) w.r.t. the P1-P10 statistics. Overall, rankings (and Z-Scores) w.r.t. node degree are most strongly correlated with those w.r.t. the other statistics.
    Specifically, when designing MiDaS, we aim at overcoming the limitations of RHS while maintaining its strengths. Especially, based on the above findings, our focus is on better preservation of node degrees by increasing the fraction of high-degree nodes while expecting that this also helps preserve other properties. Our expectation is also supported by the strong correlation between (a) the average degree in sub-hypergraphs and (b) their overlapness and density, which tend to be low in sub-hypergraphs sampled by RHS. This correlation, which is shown in Table 4, is naturally expected from the fact that high-degree nodes increase the number of hyperedges per node and the definitions of density and overlapness (see Section 2.2).
    Table 4. The Overlapness and Density of the Sampled Sub-hypergraphs Increase as the Average Node Degree in them Increases

    3.3.2 MiDaS-Basic: Preliminary Version.

    How can we better preserve node degrees, which seem to be a decisive property, while maintaining the advantages of RHS? Towards this goal, we first present MiDaS-Basic, a preliminary sampling method that determines the amount of sampling bias (i.e., the statistical prioritization of specific nodes or hyperedges during sampling) toward high-degree nodes by a single hyperparameter, for Problem 1.
    Description: The pseudocode of MiDaS-Basic is provided in Algorithm 1. Given a hypergraph \(\mathcal {G}=(\mathcal {V},\mathcal {E})\) and a sampling portion \(p\) , it returns a sub-hypergraph \(\mathcal {\hat{G}}=(\mathcal {\hat{V}},\mathcal {\hat{E}})\) of \(\mathcal {G}\) with \(\lfloor |\mathcal {E}| \cdot p \rfloor\) hyperedges. Starting from an empty hypergraph, MiDaS-Basic repeats drawing a hyperedge as in RHS. However, unlike in RHS, the probability of each hyperedge \(e\) being drawn at each step is proportional to \(\omega (e)^{\alpha }\) where \(\omega (\cdot)\) is a hyperedge weight function and the exponent \(\alpha\) ( \(\ge 0\) ) is a given constant. Note that, if \(\alpha\) is zero, MiDaS-Basic is equivalent to RHS.
    Based on the intuitions, MiDaS-Basic prioritizes hyperedges with high-degree nodes to increase the fraction of such nodes. In order to prioritize especially hyperedges composed only of high-degree nodes, it uses \(\omega (e):=\min _{v \in e} d_{v}\) , where \(d_v\) is the degree of \(v\) in \(\mathcal {G}\) , as the hyperedge weight function.
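    A minimal sketch of MiDaS-Basic following this description is given below. For clarity, it recomputes the categorical distribution with `random.choices` at every draw; the tree-based structure analyzed in Theorem 1 replaces this with an \(O(\log \max _{v\in \mathcal {V}} d_{v})\) -time draw. Names are illustrative.

```python
import random
from collections import Counter

def midas_basic(hyperedges, p, alpha, seed=0):
    """Sketch: draw floor(|E| * p) hyperedges without replacement, where
    hyperedge e is drawn with probability proportional to
    (min_{v in e} d_v) ** alpha, with degrees d_v taken in G."""
    rng = random.Random(seed)
    deg = Counter()
    for e in hyperedges:
        deg.update(e)
    pool = list(hyperedges)
    weights = [min(deg[v] for v in e) ** alpha for e in pool]
    target = int(len(pool) * p)
    sampled = []
    for _ in range(target):
        (i,) = rng.choices(range(len(pool)), weights=weights)
        sampled.append(pool.pop(i))
        weights.pop(i)  # draw without replacement
    return sampled
```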
    Empirical properties: The value of \(\alpha\) affects the amount of bias toward high-degree nodes in MiDaS-Basic. Below, we analyze how \(\alpha\) affects samples (i.e., sub-hypergraphs) in practice.
    The degrees of nodes within samples obtained with different \(\alpha\) values are shown in Figure 4, from which we make Observation 3. This result is promising, showing that the bias in degree distributions can be directly controlled by \(\alpha\) .
    Fig. 4. Observation 3: Biases in degree distributions in sub-hypergraphs sampled by MiDaS-Basic are controlled by \(\alpha\) . We show the results when the sampling portion is \({\bf 0.3}\) .
    Observation 3.
    As \(\alpha\) increases, the degree distributions in samples tend to be more biased toward high-degree nodes.
    We additionally explore how the best-performing \(\alpha\) values,3 which lead to best preservation of degree distributions in terms of D-statistics (see Section 2.5), are related to the skewness of degree distributions.4 A strong negative correlation is observed as summarized in Observation 4 and shown in Figure 5.
    Fig. 5. Observation 4: A strong negative correlation between the skewness of original degree distributions and best performing \(\alpha\) values (denoted by \(\alpha ^*\) ). Colors denote dataset domains.
    Observation 4.
    As degree distributions in original hypergraphs are more skewed, larger \(\alpha\) values are required (i.e., high-degree nodes need to be prioritized more) to preserve the distributions.
    We also find out a strong negative correlation between best-performing \(\alpha\) values and sampling portions, as shown in Figure 6 and summarized in Observation 5.
    Fig. 6. Observation 5: A strong negative correlation between sampling portions and best-performing \(\alpha\) values (denoted by \(\alpha ^*\) ).
    Observation 5.
    As we sample fewer hyperedges, larger \(\alpha\) values are required (i.e., high-degree nodes need to be prioritized more) to preserve degree distributions.
    Theoretical analysis: We analyze the time complexity of Algorithm 1. We also theoretically analyze Observation 3 in Appendix A.1. Specifically, we provide a sufficient condition for bias toward high-degree nodes to grow as \(\alpha\) increases, and we confirm that only \(\min _{v \in e} d_{v}\) satisfies this condition, while \(\max _{v \in e} d_{v}\) and \(\mathrm{avg}_{v \in e} d_{v}\) do not.
    Theorem 1 (Time Complexity).
    The time complexity of Algorithm 1 is \(O(p \cdot |\mathcal {E}| \cdot \log (\max _{v\in \mathcal {V}} d_{v})+ \sum _{e\in \mathcal {E}} |e|)\) .
    Proof.
    It takes \(O(\sum _{e\in \mathcal {E}} |e|)\) time to compute \(\omega (e)\) for every hyperedge \(e\) , and it takes \(O(|\mathcal {E}|)\) time to build a balanced binary tree with \(\max _{e \in \mathcal {E}}\) \(\omega (e)\) leaf nodes where each \(k\) th leaf node points to the list of all hyperedges whose weight is \(k^{\alpha }\) . Then, it takes \(O(\max _{e\in \mathcal {E}} \omega (e))=O(|\mathcal {E}|)\) time in total to store in each node \(i\) the sum of the weights of the hyperedges pointed by any node in the sub-tree rooted at \(i\) if we store them from leaf nodes to the root. The height of the tree is \(O(\log (\max _{e\in \mathcal {E}} \omega (e)))=O(\log (\max _{v\in \mathcal {V}} d_{v}))\) , and thus drawing each hyperedge (i.e., from the root, repeatedly choosing a child with weights until reaching a leaf; and then drawing a hyperedge that the leaf points to) and updating weights accordingly takes \(O(\log (\max _{v\in \mathcal {V}} d_{v}))\) time. Drawing \(p \cdot |\mathcal {E}|\) hyperedges takes \(O(p \cdot |\mathcal {E}| \cdot \log (\max _{v\in \mathcal {V}} d_{v}))\) time, and since \(O(|\mathcal {E}|)=O(\sum _{e\in \mathcal {E}} |e|)\) , the total time complexity is \(O(p \cdot |\mathcal {E}| \cdot \log (\max _{v\in \mathcal {V}} d_{v})+\sum _{e\in \mathcal {E}} |e|)\) . □
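    The structure described in the proof can be realized, for example, with a Fenwick (binary indexed) tree; the sketch below supports weighted draws without replacement in \(O(\log n)\) time per draw. It is an illustrative variant of the balanced binary tree in the proof, not the exact implementation.

```python
import random

class WeightedSampler:
    """Fenwick tree over item weights (e.g., w(e)^alpha), supporting one
    weighted draw without replacement in O(log n) time."""

    def __init__(self, weights):
        self.n = len(weights)
        self.w = list(weights)
        self.tree = [0.0] * (self.n + 1)
        self.total = 0.0
        for i, wi in enumerate(weights):
            self._add(i + 1, wi)
            self.total += wi

    def _add(self, i, delta):          # O(log n) point update
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)

    def draw(self, rng):
        """Sample index i with probability w[i] / total, then zero w[i]."""
        r = rng.random() * self.total
        pos, step = 0, 1 << self.n.bit_length()
        while step:                    # descend the implicit tree
            nxt = pos + step
            if nxt <= self.n and self.tree[nxt] < r:
                r -= self.tree[nxt]
                pos = nxt
            step >>= 1
        self.total -= self.w[pos]
        self._add(pos + 1, -self.w[pos])
        self.w[pos] = 0.0
        return pos

rng = random.Random(0)
sampler = WeightedSampler([1.0, 4.0, 2.0, 8.0])
print([sampler.draw(rng) for _ in range(4)])  # a permutation of 0..3
```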

    3.3.3 MiDaS: Full-Fledged Version.

    As suggested by Observation 3, the hyperparameter \(\alpha\) in MiDaS-Basic should be tuned carefully. We propose MiDaS, the full-fledged version of our sampling method that automatically tunes \(\alpha\) .
    Based on the strong correlations in Observations 4 and 5, MiDaS tunes \(\alpha\) using a linear regressor \(\mathcal {M}\) that maps (a) the skewness of the degree distribution in the input hypergraph \(\mathcal {G}\) and (b) the sampling portion to (c) a best-performing \(\alpha\) value. In our experiments in Section 3.4, \(\mathcal {M}\) was fitted using the best-performing \(\alpha\) values5 on the considered datasets with five different sampling portions.6 For a fair comparison, when evaluating MiDaS on a dataset, we used only the remaining datasets for fitting \(\mathcal {M}\) .
    The \(\alpha\) value obtained by the linear regression model \(\mathcal {M}\) is further tuned using hill climbing [58]. As the objective function \(\mathcal {L}(\mathcal {G}, \mathcal {\hat{G}})\) , MiDaS uses the D-statistic (see Section 2.5) between the degree distributions in the input hypergraph \(\mathcal {G}\) and a sample \(\mathcal {\hat{G}}\) . For speed, we search for \(\alpha\) within a given discrete search space \(\mathcal {S}\) ,7 aiming at minimizing \(\mathcal {L^{\prime }}(\alpha):=\mathcal {L}(\mathcal {G},\text {MiDaS-Basic} (\mathcal {G}, p, \alpha))\) . A search ends when it (a) reaches an \(\alpha\) that is a local minimum of \(\mathcal {L^{\prime }}\) within \(\mathcal {S}\) or (b) reaches an end of \(\mathcal {S}\) . Algorithm 2 describes MiDaS.
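    The hill-climbing step can be sketched as follows, reusing `midas_basic`, `degree_distribution`, and `d_statistic` from the earlier sketches; `search_space` stands for the discrete grid \(\mathcal {S}\) , and all names are illustrative rather than the released implementation.

```python
def normalize(counter):
    """Turn a Counter of counts into a {value: probability} dictionary."""
    total = sum(counter.values())
    return {k: v / total for k, v in counter.items()}

def tune_alpha(hyperedges, p, alpha_init, search_space):
    """Starting from the regressor's estimate, move to the neighboring
    alpha in S while doing so decreases the degree D-statistic."""
    space = sorted(search_space)
    idx = min(range(len(space)), key=lambda i: abs(space[i] - alpha_init))

    def loss(i):
        sample = midas_basic(hyperedges, p, space[i])
        f = normalize(degree_distribution(hyperedges))
        f_hat = normalize(degree_distribution(sample))
        return d_statistic(f, f_hat)

    cur = loss(idx)
    for step in (+1, -1):                  # try each direction once
        while 0 <= idx + step < len(space):
            nxt = loss(idx + step)
            if nxt >= cur:
                break                      # local minimum w.r.t. S
            idx, cur = idx + step, nxt
    return space[idx]
```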

    3.4 Evaluation

    We review our experiments designed to answer the following questions:
    Q1.
    Quality: How well does MiDaS preserve the ten structural properties (P1–P10) of real-world hypergraphs?
    Q2.
    Consistency: Does MiDaS perform well regardless of the sampling portions?
    Q3.
    Speed: How fast is MiDaS compared to the competitors?

    3.4.1 Experimental Settings.

    We use the 11 datasets described in Section 2.4. We compare MiDaS with the simple methods described in Section 2.3 and two more sophisticated approaches: Hybrid random walk (HRW) [78] and metropolis graph sampling (MGS), which are described below. We use a machine with an i9-10900K CPU and 64 GB RAM in all cases except one. When running MGS-Avg-Del on the tags-math dataset, we use a machine with an AMD Ryzen 9 3900X CPU and 128 GB RAM due to its large memory requirement. The sample quality in each method is averaged over three trials.
    Hybrid random walk (HRW) [78]: HRW is a recent approach for efficient hypergraph sampling, utilizing a random walk that alternates between nodes and hyperedges. When transitioning from a node, a random walker moves to a neighboring hyperedge selected with a probability proportional to its size; and when transitioning from a hyperedge, it selects a node within it uniformly at random. We limit the maximum length of each walk to twice the number of nodes. The resulting sub-hypergraph consists of all hyperedges visited during the random walk and all nodes contained in them, potentially including unvisited nodes. If the target number of hyperedges is not reached, the walk restarts from unvisited nodes until the goal is met. In our experiments, we use two advanced versions of HRW, HRW-NB and HRW-SK, which are specifically designed to prevent backtracking and the revisiting of nodes, respectively (refer to [78] for details).
    Metropolis graph sampling (MGS): We also adapt MGS [27] for Problem 1. For a sub-hypergraph \(\mathcal {\hat{G}}\) , we define \(\varrho ^{*}(\mathcal {\hat{G}}) := \frac{1}{\exp (k \cdot \Delta _{\mathcal {G}}(\mathcal {\hat{G}}))}\) . Then, the acceptance probability of a move from a state \(\mathcal {\hat{G}}\) to a state \({\mathcal {\hat{G}}}^{\prime }\) is \(\min (1, \frac{\varrho ^{*}({\mathcal {\hat{G}}}^{\prime })}{\varrho ^{*}(\mathcal {\hat{G}})}) = \min (1, \exp (k \cdot (\Delta _{\mathcal {G}}(\mathcal {\hat{G}}) - \Delta _{\mathcal {G}}({\mathcal {\hat{G}}}^{\prime }))))\) . This algorithm thus favors moves that decrease a predefined objective function \(\Delta _{\mathcal {G}}(\mathcal {\hat{G}})\) . In our experiments, we use the \(k\) values in the search space \(\lbrace 1, 10, 100, 10000\rbrace\) . Depending on \(\Delta _{\mathcal {G}}(\mathcal {\hat{G}})\) , we divide MGS into MGS-Deg and MGS-Avg. The former aims at preserving node degrees by setting \(\Delta _{\mathcal {G}}(\mathcal {\hat{G}})\) to the D-statistic between degree distributions in \(\mathcal {G}\) and \(\mathcal {\hat{G}}\) . The latter aims at preserving node degrees, hyperedge sizes, node-pair degrees, and hyperedge intersection sizes at the same time by setting \(\Delta _{\mathcal {G}}(\mathcal {\hat{G}})\) to the average of the D-statistics from their distributions. The four statistics are chosen since they are cheap to compute at every step. We further divide each of MGS-Deg and MGS-Avg into three versions depending on how to move between states, as follows (a sketch of one Metropolis move for the replace variant is given after the list):
    Add: MGS-Deg-Add (MGS-DA) and MGS-Avg-Add (MGS-AA) start from \(\mathcal {\hat{E}}=\emptyset\) and repeatedly propose to add a hyperedge to \(\mathcal {\hat{E}}\) until \(|\mathcal {\hat{E}}| = \lfloor |\mathcal {E}| \cdot p \rfloor\) holds.
    Replace: MGS-Deg-Rep (MGS-DR) and MGS-Avg-Rep (MGS-AR) initialize \(\mathcal {\hat{E}}\) so that \(|\mathcal {\hat{E}}| = \lfloor |\mathcal {E}| \cdot p \rfloor\) by using RHS. They repeat 3,000 times proposing to replace a hyperedge in \(\mathcal {\hat{E}}\) with one outside \(\mathcal {\hat{E}}\) .
    Delete: MGS-Deg-Del (MGS-DD) and MGS-Avg-Del (MGS-AD) start from \(\mathcal {\hat{E}}= \mathcal {E}\) and repeatedly propose to remove a hyperedge from \(\mathcal {\hat{E}}\) until \(|\mathcal {\hat{E}}| = \lfloor |\mathcal {E}| \cdot p \rfloor\) holds.
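    The acceptance rule above instantiates, for the replace variant, as the following sketch of a single Metropolis move; names are hypothetical, and `delta` stands for the objective \(\Delta _{\mathcal {G}}(\cdot)\) (e.g., the degree D-statistic).

```python
import math

def mgs_replace_step(sampled, outside, delta, k, rng):
    """One MGS-Rep move: propose swapping a sampled hyperedge with one
    outside the sample; accept with prob. min(1, exp(k*(old - new)))."""
    i = rng.randrange(len(sampled))
    j = rng.randrange(len(outside))
    old = delta(sampled)
    sampled[i], outside[j] = outside[j], sampled[i]      # tentative swap
    new = delta(sampled)
    accept = 1.0 if new <= old else math.exp(k * (old - new))
    if rng.random() >= accept:
        sampled[i], outside[j] = outside[j], sampled[i]  # revert the swap
    return sampled, outside
```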

    3.4.2 Quality: How Well Does MiDaS Preserve the ten Structural Properties of Real-world Hypergraphs?.

    We compare all 15 considered sampling methods numerically using distances (D-statistics or relative differences), rankings, and Z-Scores, as described in Section 2.5. The results under five sampling portions (10%–50%) are averaged in Table 5, and some are visualized in Table 6. MiDaS provides overall the most representative samples, among the 15 considered methods, in terms of both average rankings and average Z-Scores. Especially, MiDaS best preserves node degrees, density, overlapness, and diameter. Compared to RHS, the D-statistics in degree distributions drop significantly, and the differences in density, overlapness, and diameters drop even more significantly. That is, better preserving node degrees in MiDaS helps resolve the weaknesses of RHS. While MiDaS is outperformed by RHS in preserving hyperedge sizes, intersection sizes, relative singular values, and connected component sizes, the gaps between their D-statistics or differences are less than 0.05.
    | Property | Metric | RNS | RDN [46] | RW [64] | FF [35] | RHS | TIHS [2] | HRW-NB [78] | HRW-SK [78] | MGS-DA [27] | MGS-DR [27] | MGS-DD [27] | MGS-AA [27] | MGS-AR [27] | MGS-AD [27] | MiDaS |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
    | Degree | D-stat. | 0.291 | 0.285 | 0.317 | 0.302 | 0.302 | 0.283 | 0.282 | 0.269 | 0.241 | 0.257 | 0.217 | 0.285 | 0.270 | 0.259 | 0.133 |
    | Degree | Rank | 9.309 | 8.582 | 10.018 | 10.236 | 11.545 | 8.709 | 7.964 | 6.509 | 6.600 | 7.182 | 4.127 | 9.909 | 9.073 | 7.291 | 2.909 |
    | Degree | Z-Score | 0.253 | 0.202 | 0.717 | 0.598 | 0.261 | 0.199 | 0.015 | -0.127 | -0.317 | -0.109 | -0.487 | 0.113 | -0.004 | -0.082 | -1.234 |
    | Int. Size | D-stat. | 0.093 | 0.033 | 0.038 | 0.035 | 0.007 | 0.033 | 0.083 | 0.080 | 0.014 | 0.024 | 0.053 | 0.002 | 0.002 | 0.008 | 0.024 |
    | Int. Size | Rank | 10.600 | 9.273 | 9.727 | 10.673 | 3.491 | 9.764 | 13.673 | 13.491 | 4.764 | 5.927 | 8.855 | 2.545 | 3.200 | 5.036 | 8.964 |
    | Int. Size | Z-Score | 0.627 | -0.026 | 0.097 | 0.230 | -0.767 | -0.001 | 1.632 | 1.573 | -0.583 | -0.554 | -0.004 | -0.818 | -0.792 | -0.595 | -0.019 |
    | Pair Degree | D-stat. | 0.132 | 0.111 | 0.089 | 0.112 | 0.112 | 0.090 | 0.045 | 0.045 | 0.092 | 0.089 | 0.064 | 0.089 | 0.075 | 0.063 | 0.094 |
    | Pair Degree | Rank | 9.764 | 10.891 | 8.291 | 10.673 | 11.527 | 8.109 | 3.345 | 3.618 | 8.800 | 9.382 | 7.273 | 7.927 | 7.327 | 5.709 | 7.364 |
    | Pair Degree | Z-Score | 0.675 | 0.693 | 0.067 | 0.570 | 0.449 | -0.014 | -0.977 | -0.948 | 0.132 | 0.133 | -0.272 | 0.082 | -0.101 | -0.341 | -0.148 |
    | Size | D-stat. | 0.227 | 0.105 | 0.121 | 0.057 | 0.009 | 0.085 | 0.292 | 0.295 | 0.020 | 0.034 | 0.099 | 0.007 | 0.003 | 0.023 | 0.051 |
    | Size | Rank | 13.109 | 10.691 | 10.691 | 8.164 | 3.582 | 9.691 | 14.073 | 14.109 | 4.927 | 5.418 | 8.800 | 2.418 | 1.673 | 4.600 | 8.055 |
    | Size | Z-Score | 1.148 | 0.062 | 0.305 | -0.353 | -0.770 | -0.108 | 1.829 | 1.836 | -0.699 | -0.591 | -0.049 | -0.796 | -0.818 | -0.603 | -0.393 |
    | SV | D-stat. | 0.122 | 0.158 | 0.154 | 0.164 | 0.084 | 0.154 | 0.104 | 0.105 | 0.101 | 0.087 | 0.115 | 0.085 | 0.085 | 0.096 | 0.125 |
    | SV | Rank | 7.945 | 10.491 | 9.691 | 11.673 | 4.236 | 10.000 | 6.018 | 5.691 | 6.382 | 5.091 | 7.527 | 5.000 | 4.727 | 5.491 | 8.455 |
    | SV | Z-Score | 0.241 | 0.794 | 0.557 | 0.873 | -0.633 | 0.532 | -0.151 | -0.140 | -0.413 | -0.508 | 0.162 | -0.465 | -0.551 | -0.338 | 0.038 |
    | CC | D-stat. | 0.160 | 0.132 | 0.175 | 0.142 | 0.097 | 0.135 | 0.145 | 0.143 | 0.104 | 0.102 | 0.103 | 0.101 | 0.100 | 0.101 | 0.115 |
    | CC | Rank | 9.345 | 7.382 | 9.800 | 8.618 | 4.055 | 7.782 | 7.327 | 6.655 | 5.364 | 5.000 | 5.873 | 4.327 | 3.545 | 5.200 | 6.982 |
    | CC | Z-Score | 1.124 | 0.153 | 0.933 | 0.325 | -0.462 | 0.167 | 0.136 | 0.072 | -0.364 | -0.470 | -0.337 | -0.354 | -0.517 | -0.375 | -0.029 |
    | GCC | Diff. | 0.120 | 0.153 | 0.085 | 0.119 | 0.100 | 0.083 | 0.081 | 0.092 | 0.097 | 0.096 | 0.093 | 0.106 | 0.099 | 0.075 | 0.081 |
    | GCC | Rank | 8.836 | 11.273 | 8.145 | 9.891 | 8.109 | 8.182 | 5.727 | 6.636 | 6.509 | 7.600 | 9.018 | 7.673 | 7.764 | 7.636 | 7.000 |
    | GCC | Z-Score | 0.517 | 0.962 | -0.032 | 0.401 | -0.116 | -0.019 | -0.360 | -0.248 | -0.238 | -0.203 | 0.014 | -0.163 | -0.208 | -0.275 | -0.033 |
    | Density | Diff. | 0.374 | 0.540 | 0.488 | 0.516 | 0.523 | 0.426 | 0.508 | 0.511 | 0.400 | 0.500 | 0.520 | 0.492 | 0.501 | 0.509 | 0.202 |
    | Density | Rank | 4.818 | 9.364 | 7.491 | 8.145 | 10.927 | 6.836 | 8.909 | 8.891 | 5.655 | 8.491 | 9.473 | 7.564 | 8.818 | 9.782 | 2.473 |
    | Density | Z-Score | -0.490 | 0.412 | -0.012 | 0.045 | 0.313 | -0.317 | 0.326 | 0.347 | -0.282 | 0.182 | 0.307 | 0.123 | 0.189 | 0.236 | -1.379 |
    | Overlapness | Diff. | 0.546 | 0.550 | 0.616 | 0.531 | 0.523 | 0.460 | 0.433 | 0.446 | 0.404 | 0.472 | 0.451 | 0.501 | 0.500 | 0.489 | 0.202 |
    | Overlapness | Rank | 9.891 | 7.909 | 8.818 | 8.455 | 11.982 | 6.727 | 5.473 | 6.236 | 6.836 | 8.691 | 7.218 | 10.073 | 10.145 | 8.964 | 2.582 |
    | Overlapness | Z-Score | 0.389 | 0.017 | 0.516 | -0.060 | 0.353 | -0.296 | -0.236 | -0.163 | -0.097 | 0.143 | 0.032 | 0.264 | 0.251 | 0.213 | -1.326 |
    | Diameter | Diff. | 0.344 | 0.139 | 0.117 | 0.109 | 0.195 | 0.117 | 0.122 | 0.132 | 0.157 | 0.162 | 0.158 | 0.204 | 0.182 | 0.182 | 0.079 |
    | Diameter | Rank | 10.945 | 7.582 | 8.273 | 6.582 | 9.382 | 6.582 | 6.818 | 6.855 | 6.273 | 8.655 | 8.564 | 9.600 | 9.782 | 9.618 | 4.491 |
    | Diameter | Z-Score | 1.169 | -0.105 | -0.020 | -0.246 | 0.258 | -0.318 | -0.268 | -0.211 | -0.317 | 0.024 | 0.048 | 0.316 | 0.224 | 0.214 | -0.768 |
    | Average | Rank | 9.456 | 9.344 | 9.095 | 9.311 | 7.884 | 8.238 | 7.933 | 7.869 | 6.211 | 7.144 | 7.673 | 6.704 | 6.605 | 6.933 | 5.927 |
    | Average | Z-Score | 0.565 | 0.317 | 0.313 | 0.238 | -0.111 | -0.018 | 0.195 | 0.199 | -0.318 | -0.195 | -0.059 | -0.170 | -0.233 | -0.194 | -0.529 |
    Table 5. MiDaS Yields Overall the Most Representative Sub-hypergraphs
    We compare \({\bf 15}\) sampling methods on \({\bf 11}\) real-world hypergraphs with five different sampling portions. We report their distances (D-statistics or relative differences), Z-Scores, and rankings, as described in Section 2.5. The smaller the measures are, the more representative the samples are.
    Table 6. Representative Sampling Results in Six Datasets from Different Domains when the Sampling Portion is 0.3. Note that the Samples Obtained by MiDaS Effectively Preserve Various Properties (Specifically, P1–P10) of the Original Hypergraph (Shown in Black)

    3.4.3 Consistency: Does MiDaS Perform Well Regardless of the Sampling Portions?.

    We demonstrate the robustness of MiDaS to sampling portions. In Figure 7, we show how D-statistics in degree distributions, average Z-Scores, and average rankings change depending on sampling proportions. MiDaS is consistently best regardless of sampling portions with few exceptions. MGS methods preserve intersection sizes, node-pair degrees, hyperedge sizes, relative singular values, and connected component sizes better than MiDaS by small margins, and as a result, MiDaS is outperformed by some MGS methods in terms of average ranking in a few settings.
    Fig. 7. In the case of the representative sampling problem, MiDaS consistently outperforms other methods, with few exceptions, across various sampling portions.

    3.4.4 Speed: How Fast is MiDaS Compared to the Competitors?.

    We measure the running times of all considered sampling methods in each dataset with five sampling portions. We compare the sum of running times in Figure 8. Despite its additional overhead for automatic hyperparameter tuning, MiDaS significantly outperforms both MGS and MiDaS-Basic with grid search in speed. Notably, it does so without compromising sample quality when compared to the grid search, particularly when the search space for \(\alpha\) is fixed. The speed and sample quality are plotted together in Figure 2. Additionally, in Appendix A.3, we examine the running time with respect to the input hypergraph size.
    Fig. 8. Total running time for representative sampling with five different sampling portions in six datasets from different domains. MiDaS consistently outpaces MGS in speed. Furthermore, the automatic hyperparameter tuning in MiDaS significantly enhances its performance speed when compared to the grid search method (refer to MiDaS-Grid).

    4 Back-In-Time Hypergraph Sampling

    In this section, our focus shifts to back-in-time hypergraph sampling. In Section 4.1, we distinguish this concept from representative sampling by establishing a formal problem definition. Similar to the previous problem, in Section 4.2, we examine the characteristics of both intuitive sampling approaches (outlined in Section 2.3) and MiDaS, which is designed for representative sampling. Addressing the limitations of MiDaS in the context of back-in-time sampling, we introduce our approach, MiDaS-B, in Section 4.3. This method is an adaptation of MiDaS, specifically tailored for back-in-time sampling. In Section 4.4, we demonstrate the efficacy of MiDaS-B through experiments, where we follow the evaluation methodology detailed in Section 2.5.

    4.1 Problem Definition

In back-in-time sampling, we consider a time-evolving hypergraph, a snapshot of which is provided as input. Our objective is to construct a sub-hypergraph that closely approximates a past snapshot of the same size (i.e., with the same number of hyperedges as the sample). It is important to note that the ground-truth past snapshots of the given hypergraph are not provided, meaning that the target of the sampling is not directly observable; this reflects the challenge posed by real-world situations. The problem is formally defined in Problem 2.
    Problem 2 (Back-In-Time Hypergraph Sampling).
Given: (a) a snapshot \(\mathcal {G}= (\mathcal {V}, \mathcal {E})\) at time \(T\) of a time-evolving hypergraph \(\mathcal {\tilde{G}}\) and (b) a sampling portion \(p\in (0,1)\),
Find: a sub-hypergraph \(\mathcal {\hat{G}}=(\mathcal {\hat{V}}, \mathcal {\hat{E}})\), where \(\mathcal {\hat{V}}\subseteq \mathcal {V}\) and \(\mathcal {\hat{E}}\subseteq \mathcal {E}\),
to Match: the ten structural properties P1–P10 of the snapshot \(\mathcal {\bar{G}}= (\mathcal {\bar{V}}, \mathcal {\bar{E}})\) at time \(\bar{T}\) (\(\lt T\)) of \(\mathcal {\tilde{G}}\), where \(|\mathcal {\bar{E}}|=\lfloor |\mathcal {E}| \cdot p \rfloor\).
    Similar to the representative hypergraph sampling problem, finding the optimal sub-hypergraph that fulfills the objective of Problem 2 is challenging. Therefore, we develop an effective heuristic to address this challenge. To evaluate the performance of these sampling algorithms, we compare the sub-hypergraph sampled by each algorithm with the target past snapshot of the input hypergraph and quantify their similarity, as described in Section 2.5.
Comparison with Representative Sampling: Both representative sampling and back-in-time sampling share the high-level goal of obtaining a structurally similar but smaller sub-hypergraph from the input hypergraph. Consequently, both approaches can be considered for various applications, including those discussed in Section 1. However, they differ in their specific objectives. Representative sampling aims at replicating the structural properties of the input hypergraph itself, while accounting for the difference in scale between the sub-hypergraph and the input hypergraph. In contrast, back-in-time sampling uses the past snapshot of the input hypergraph as the reference; since this target snapshot contains the same number of hyperedges as the sub-hypergraph (see Problem 2), the comparison is direct rather than scale-adjusted.

    4.2 Observations

    As the initial step in tackling Problem 2, we conduct an analysis of the strengths and weaknesses of the six intuitive approaches described in Section 2.3. Additionally, we examine the following two additional sampling algorithms:
    Ordered Node Sampling (ONS): We add nodes one by one in the order of their appearance in the input hypergraph \(\mathcal {\tilde{G}}\) and return the induced sub-hypergraph once its hyperedge count reaches the target count. It is important to note that this method relies on the ground-truth node appearance order, which is not provided according to the problem definition.
MiDaS (w. Oracle): This method is an adaptation of MiDaS-Basic to the problem of back-in-time sampling. As in MiDaS-Basic, which is originally designed for representative sampling, each hyperedge \(e\) is sampled with a probability proportional to \(w(e)^{\alpha }\), where \(w(e) := \min _{v \in e} d_{v}\) (a minimal sketch of this weighted selection follows this list). The value of \(\alpha\) is determined through a grid search for each dataset based on the comparison (with respect to all ten properties) with the ground-truth past snapshot of each dataset.8 It is important to note that this method relies on the ground-truth past snapshot, which is not provided according to the problem definition; this is why we include “oracle” in its name. Note that this method also differs from MiDaS, which automatically tunes \(\alpha\) for representative sampling, not for back-in-time sampling.
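For concreteness, the following is a minimal Python sketch (not the released implementation; all names are illustrative) of the weighted hyperedge selection underlying MiDaS-Basic and MiDaS (w. Oracle): hyperedges are drawn without replacement with probability proportional to \((\min _{v \in e} d_{v})^{\alpha }\), where node degrees are computed on the input hypergraph.

import random
from collections import Counter

def sample_hyperedges(hyperedges, portion, alpha, seed=0):
    """Draw floor(|E| * portion) hyperedges without replacement,
    each with probability proportional to (min node degree)^alpha."""
    rng = random.Random(seed)
    degree = Counter(v for e in hyperedges for v in e)   # d_v in the input hypergraph
    weights = {i: min(degree[v] for v in e) ** alpha for i, e in enumerate(hyperedges)}
    sampled = []
    for _ in range(int(len(hyperedges) * portion)):
        indices, ws = zip(*weights.items())
        i = rng.choices(indices, weights=ws)[0]          # weighted draw
        del weights[i]                                   # without replacement
        sampled.append(hyperedges[i])
    nodes = {v for e in sampled for v in e}              # incident nodes form the node set
    return nodes, sampled

This naive loop costs \(O(|\mathcal {E}|)\) per draw; Lemma 1 in Section 4.3.3 shows how a balanced binary tree reduces each draw to a logarithmic cost.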
We evaluate these eight algorithms on 11 datasets under five different sampling portions (specifically, \(10\%, 30\%, 50\%, 70\%,\) and \(90\%\)) and summarize the results in Tables 7 and 8. Below, we provide a detailed analysis of each algorithm.
Table 7.
Property (Metric) | RNS | RDN [46] | RW [64] | FF [35] | ONS | RHS | TIHS [2] | MiDaS (w. Oracle)
Degree (Dstat.) | 0.11 | 0.28 | 0.31 | 0.33 | 0.22 | 0.06 | 0.28 | 0.05
Degree (Rank, Z-Score) | 3.11 (-0.59) | 5.60 (0.55) | 6.49 (0.99) | 6.93 (0.76) | 4.15 (-0.02) | 2.05 (-1.06) | 5.78 (0.52) | 1.53 (-1.16)
Int. Size (Dstat.) | 0.07 | 0.04 | 0.04 | 0.03 | 0.02 | 0.03 | 0.04 | 0.03
Int. Size (Rank, Z-Score) | 5.84 (0.79) | 5.49 (0.26) | 4.91 (0.15) | 4.38 (-0.05) | 3.18 (-0.50) | 3.36 (-0.35) | 4.98 (0.05) | 3.49 (-0.34)
Pair Degree (Dstat.) | 0.05 | 0.10 | 0.10 | 0.13 | 0.07 | 0.04 | 0.09 | 0.03
Pair Degree (Rank, Z-Score) | 3.62 (-0.30) | 5.80 (0.62) | 4.98 (0.27) | 6.49 (0.75) | 3.75 (-0.23) | 3.47 (-0.52) | 5.22 (0.23) | 2.31 (-0.82)
Size (Dstat.) | 0.10 | 0.09 | 0.12 | 0.09 | 0.04 | 0.09 | 0.09 | 0.09
Size (Rank, Z-Score) | 4.25 (-0.00) | 4.42 (-0.07) | 5.16 (0.43) | 4.89 (0.14) | 2.33 (-0.90) | 5.15 (0.21) | 4.38 (-0.05) | 5.05 (0.24)
SV (Dstat.) | 0.08 | 0.11 | 0.10 | 0.11 | 0.09 | 0.05 | 0.10 | 0.05
SV (Rank, Z-Score) | 4.69 (0.23) | 5.29 (0.41) | 4.47 (0.20) | 5.25 (0.36) | 4.49 (0.05) | 2.20 (-0.87) | 5.04 (0.32) | 2.84 (-0.70)
CC (Dstat.) | 0.26 | 0.32 | 0.34 | 0.33 | 0.31 | 0.29 | 0.31 | 0.29
CC (Rank, Z-Score) | 1.49 (-0.75) | 4.49 (0.10) | 4.96 (0.39) | 6.11 (1.09) | 4.36 (-0.01) | 2.22 (-0.42) | 4.36 (-0.11) | 2.58 (-0.30)
GCC (Diff.) | 0.10 | 0.07 | 0.10 | 0.08 | 0.05 | 0.07 | 0.08 | 0.06
GCC (Rank, Z-Score) | 5.15 (0.34) | 4.62 (0.04) | 5.04 (0.35) | 4.91 (0.08) | 3.42 (-0.41) | 4.09 (-0.12) | 4.75 (0.05) | 3.56 (-0.34)
Density (Diff.) | 1.78 | 17.81 | 16.89 | 20.85 | 12.96 | 1.13 | 18.24 | 0.89
Density (Rank, Z-Score) | 2.62 (-0.73) | 6.65 (1.01) | 5.78 (0.61) | 6.56 (0.68) | 4.15 (0.02) | 2.60 (-0.91) | 5.67 (0.47) | 1.55 (-1.15)
Overlapness (Diff.) | 4.16 | 58.55 | 56.98 | 71.39 | 41.01 | 2.71 | 60.24 | 2.00
Overlapness (Rank, Z-Score) | 3.31 (-0.54) | 5.49 (0.49) | 6.56 (1.01) | 6.76 (0.77) | 4.13 (-0.07) | 2.33 (-0.92) | 5.60 (0.39) | 1.45 (-1.13)
Diameter (Diff.) | 1.24 | 1.16 | 0.89 | 0.81 | 0.65 | 0.69 | 0.87 | 0.64
Diameter (Rank, Z-Score) | 4.02 (-0.12) | 5.36 (0.36) | 5.58 (0.50) | 5.71 (0.36) | 4.38 (0.09) | 2.91 (-0.63) | 4.93 (0.19) | 2.75 (-0.75)
Average (Rank, Z-Score) | 3.81 (-0.17) | 5.32 (0.38) | 5.39 (0.49) | 5.80 (0.49) | 3.83 (-0.20) | 3.04 (-0.56) | 5.07 (0.21) | 2.71 (-0.64)
Table 7. Back-in-time Sampling Performances of Six Intuitive Sampling Methods and Two Additional Methods
Rankings and Z-Scores (in parentheses) are averaged over five sampling portions (specifically, \(10\%, 30\%, 50\%, 70\%,\) and \(90\%\)) and across the 11 datasets. The best results with respect to each property are highlighted in bold. Note that, while MiDaS (w. Oracle) performs the best overall, it has limitations in preserving hyperedge sizes.
    Table 8.
    Table 8. Back-in-time Hypergraph Sampling Results in Six Datasets from Different Domains when the Sampling Portion is 0.5
    Refer to Section 4.2 for a detailed analysis of each algorithm. MiDaS (w. Oracle) demonstrates superior performance overall but tends to favor larger hyperedges compared to the target snapshot.

    4.2.1 Random Node Sampling (RNS).

Bias toward small hyperedges: In comparison to the input hypergraph, RNS exhibits a bias toward smaller hyperedges, implying that smaller hyperedges are more likely to be sampled than larger ones. This bias arises from the fact that sampling a large hyperedge necessitates the sampling of a large number of nodes (i.e., all nodes within it). Interestingly, this bias aligns well with the target snapshot, positioning RNS as the second-best algorithm in the preservation of hyperedge-size distributions, surpassing RHS. This outcome contrasts with the findings in representative sampling, where RHS effectively maintains the distribution of hyperedge sizes.
    Preference for dense sub-hypergraphs: RNS demonstrates a preference for sampling dense sub-hypergraphs, often surpassing the density of the target snapshot. This preference differs from the observation that RNS tends to result in weaker connectivity in representative hypergraph sampling.

    4.2.2 Random Degree Node (RDN), Random Walk (RW), and Forest-Fire (FF).

Excessive high-degree nodes: RDN, RW, and FF result in the sampling of sub-hypergraphs that are denser than those sampled by RNS. Given the existing bias of RNS toward dense sub-hypergraphs, these approaches result in excessively high density. Additionally, they sample an excessive number of nodes with high degrees and pair degrees.
    Larger hyperedges than RNS : RDN, RW, and FF tend to sample larger hyperedges than RNS, leading to a greater disparity in size distribution from the target snapshot.

    4.2.3 Ordered Node Sampling (ONS).

Too many high-degree nodes compared to RNS: Although ONS leverages the actual appearance order of nodes in the input hypergraph, the induced sub-hypergraph it generates differs significantly from the target snapshot, exhibiting higher degrees, density, and overlapness.
    Accurate preservation of hyperedge sizes: ONS performs the best in terms of preserving hyperedge sizes among the baseline methods.

    4.2.4 Random Hyperedge Sampling (RHS).

    Poor at preserving hyperedge sizes: As previously discussed, the target snapshot exhibits a preference for smaller hyperedges when compared to the input hypergraph. However, RHS samples hyperedges uniformly at random from the input hypergraph regardless of their sizes, and consequently, RHS has limitations in preserving the hyperedge size distribution of the target snapshot.
    Tendency toward sparse sub-hypergraphs: RHS has a tendency to sample sparse sub-hypergraphs that have lower density and more nodes with lower degrees.
    Bias toward larger connected component: Even though RHS generates sparse sub-hypergraphs, it tends to include more large hyperedges compared to the target snapshot. As a result, the generated hypergraphs tend to have larger connected components.

    4.2.5 Totally-Induced Hyperedge Sampling (TIHS).

    Smaller hyperedges than RHS : TIHS is capable of sampling smaller hyperedges when compared to RHS. However, in comparison to the target snapshot, TIHS still exhibits a tendency to sample larger hyperedges.
    Strong connectivity: Because adding induced hyperedges strengthens the connectivity, TIHS tends to generate hypergraphs with greater density and overlapness compared to the target snapshot.

    4.2.6 MiDaS (w. Oracle) .

    Best preservation of multiple properties: MiDaS (w. Oracle), an adaptation of MiDaS-Basic for back-in-time sampling with access to the target snapshot, achieves the best performance in preserving multiple properties overall.
    Continued weakness in hyperedge size preservation: Despite its overall effectiveness, MiDaS (w. Oracle) has limitations in preserving the distribution of hyperedge sizes in the target snapshot.

    4.2.7 Summary.

    To summarize, the node selection methods commonly encounter the problem of sampling dense sub-hypergraphs. Although MiDaS (w. Oracle) performs the best overall among the baselines, MiDaS (w. Oracle), along with other hyperedge selection methods, has limitations in accurately preserving hyperedge sizes. These methods tend to sample a greater number of larger hyperedges compared to the target snapshot.

    4.3 Proposed Approach: MiDaS-B

    Our analysis of MiDaS (w. Oracle) in Section 4.2 and its dependency on the ground-truth past snapshot for hyperparameter tuning raises the following two questions: (a) how can we effectively preserve hyperedge sizes while maintaining the overall structural properties? (b) how can we perform hyperparameter tuning when the target snapshot is unavailable? In response to these questions, we propose MiDaS-B, an algorithm designed for back-in-time hypergraph sampling. MiDaS-B addresses these challenges by incorporating a hyperedge-size-related term into the hyperedge sampling probabilities, effectively controlling associated biases. Furthermore, it leverages evolutionary characteristics that are prevalent in real-world hypergraphs as an alternative objective for hyperparameter tuning, eliminating the need for reliance on the ground-truth target snapshot.

    4.3.1 MiDaS-B-Basic: Preliminary Version.

    Our analysis in Section 4.2 reveals that, while MiDaS (w. Oracle) is the most effective algorithm, it lacks the ability to accurately reproduce the preference for small hyperedges seen in target snapshots. To address this limitation, we enhance MiDaS by introducing a hyperedge-size-related bias into its sampling process. Specifically, we integrate a new hyperparameter \(\beta\) into the hyperedge weight function as follows:
\begin{equation} \omega (e; \alpha , \beta) = \frac{(\min _{v \in e} d_{v})^{\alpha }}{|e|^{\beta }}, \end{equation}
(3)
where the numerator \((\min _{v \in e} d_{v})^{\alpha }\) corresponds to the hyperedge weight function used in MiDaS. Each hyperedge \(e\) is sampled with a probability proportional to its weight \(\omega (e; \alpha , \beta)\). We refer to this variant, which uses \(\omega (e; \alpha , \beta)\) but lacks automatic tuning of \(\alpha\) and \(\beta\), as MiDaS-B-Basic.
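For concreteness, a minimal sketch (illustrative names) of the weight function of Equation (3) is given below; setting \(\beta = 0\) recovers the MiDaS-Basic weight, and the resulting weights can be plugged into any weighted hyperedge sampler, such as the one sketched in Section 4.2.

from collections import Counter

def midas_b_weights(hyperedges, alpha, beta):
    """omega(e; alpha, beta) = (min_{v in e} d_v)^alpha / |e|^beta."""
    degree = Counter(v for e in hyperedges for v in e)
    return [min(degree[v] for v in e) ** alpha / len(e) ** beta
            for e in hyperedges]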
    Effects of \(\alpha\) and \(\beta\) : When sampling hyperedges, the weight function in Equation (3) allows us to control the biases regarding node degrees and hyperedge sizes through the hyperparameters \(\alpha\) and \(\beta\) , respectively. To better understand the relationship between the hyperparameter values and the structural properties (i.e., P1–P10), we compute the correlation coefficient9 between the values of each hyperparameter10 and the differences11 w.r.t. each property between the sampled sub-hypergraphs and the ground-truth past snapshot. The summarized results can be found in Table 9, and detailed visual results for the Coauth-Geology dataset are presented in Figure 9. Increasing \(\alpha\) prioritizes hyperedges with high-degree nodes, leading to denser sub-hypergraphs. This results in higher values for the degrees of nodes, global clustering coefficients, density, and overlapness; and results in smaller effective diameters. However, its effect on hyperedge size is limited, as indicated by the value close to zero (specifically, 0.0493) in Table 9. On the other hand, increasing \(\beta\) introduces a bias toward smaller hyperedges. This leads to a reduction in the degrees of nodes, global clustering coefficients, and overlapness; and leads to an increase in effective diameters. In conclusion, the two hyperparameters, \(\alpha\) and \(\beta\) , have distinct effects on the properties. Thus, when reproducing the preference for small hyperedges in target snapshots by increasing \(\beta\) , the tendency of \(\alpha\) toward higher node degrees, overlapness, and smaller effective diameters can offset the corresponding tendency of \(\beta\) toward lower node degrees, overlapness, and larger effective diameters.
    Table 9.
Parameter | Degree | Int. Size | Pair Degree | Size | SV | CC | GCC | Density | Overlapness | Diameter
\(\alpha\) | 0.3887 | 0.0036 | 0.1141 | 0.0493 | 0.1683 | 0.0445 | 0.0859 | 0.4596 | 0.4473 | -0.0580
\(\beta\) | -0.1098 | 0.0222 | 0.0842 | -0.2748 | 0.0093 | -0.0474 | -0.3435 | -0.0013 | -0.2466 | 0.1636
    Table 9. Correlation Coefficients between Hyperparameter Values and the Structural Properties of Sub-hypergraphs Obtained by MiDaS-B-Basic
    The two hyperparameters \(\alpha\) and \(\beta\) exhibit distinctly different effects in terms of direction and magnitude. In particular, \(\beta\) exhibits a strong negative correlation with sampled hyperedge sizes, while \(\alpha\) demonstrates a weak positive correlation with these sizes.
    Fig. 9.
    Fig. 9. Effects of the hyperparameters \(\alpha\) and \(\beta\) of MiDaS-B-Basic in the Coauth-Geology dataset when the sampling portion is \(50\%\) . Figures (a), (b), (c), and (d) illustrate the changes in four properties depending on \(\alpha\) values when \(\beta\) is fixed at 0. Figures (f), (g), (h), and (i) depict the changes in the same four properties depending on \(\beta\) values when \(\alpha\) is fixed at 0. The black lines correspond to the results from the target snapshot.

    4.3.2 Hyperparameter Tuning without Oracle.

    To enhance the practicality and applicability of our algorithm to real-world hypergraphs, we aim at developing a hyperparameter tuning method that does not rely on the availability of past snapshots of the input hypergraph (i.e., oracle). To achieve this, we take into account the evolutionary characteristics of real-world hypergraphs and seek hyperparameter values that accurately replicate realistic hypergraph evolution. The hyperparameter tuning method is described in Algorithm 3.
    Power-law patterns in real-world hypergraph evolution: As visualized in Figure 10, there are two key observations characterizing the evolution of real-world hypergraphs: (1) the fraction of intersecting hyperedge pairs exhibits power-law patterns over time [35] and (2) the average hyperedge size exhibits power-law patterns over time. Leveraging these pervasive patterns, we tune the hyperparameters \(\alpha\) and \(\beta\) of our sampling algorithm by aiming at maximizing the power-law fitness in these two features. Specifically, we arrange hyperedges based on their sampling order and treat this sequence as the chronological arrival order of hyperedges in a time-evolving hypergraph (lines 4–7). Then, we measure the mean absolute error of a linear regression model fitted to the log-log scale representations of these two properties over time (line 10). To determine the hyperparameter values, we search for the combination that minimizes the mean absolute error within a designated search space (lines 2–12).
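A minimal sketch of this fitness computation (assuming NumPy; names are illustrative): given the time series of one tracked property (e.g., the fraction of intersecting hyperedge pairs after each sampled hyperedge), we fit a least-squares line in log-log space and score it by mean absolute error.

import numpy as np

def powerlaw_fitness_error(values):
    """Mean absolute error of a linear fit to (log t, log y); lower is better."""
    t = np.arange(1, len(values) + 1, dtype=float)
    y = np.asarray(values, dtype=float)
    mask = y > 0                                    # the log is undefined at zero
    log_t, log_y = np.log(t[mask]), np.log(y[mask])
    slope, intercept = np.polyfit(log_t, log_y, 1)  # power law: y ~ t^slope
    return float(np.mean(np.abs(log_y - (slope * log_t + intercept))))

Algorithm 3 would aggregate this error over the two tracked properties and keep the \((\alpha , \beta)\) pair with the smallest total error.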
    Fig. 10.
    Fig. 10. Pervasive evolutionary patterns in real-world hypergraphs. Both the fraction of intersecting hyperedge pairs and the average hyperedge size exhibit power-law patterns over time. The red lines represent the fitted power-law lines, and the black lines represent the actual observations.
Rejection conditions: To further refine the search, we establish rejection conditions for hyperparameter values based on observations from RHS. In particular, RHS consistently demonstrates lower density and larger connected components when compared to the target snapshot. The rejection conditions are designed to filter out hyperparameter values that produce clearly undesirable sub-hypergraphs (lines 8–9), which can be identified through comparisons with RHS. Specifically, the rejection conditions are as follows:
    Density Condition: Sampled sub-hypergraphs should not be sparser than those obtained by RHS.
    Connected Component (CC) Condition: Sampled sub-hypergraphs should not have larger connected components than those obtained by RHS.
    Specifically, these rejection conditions are applied by comparing the average density and the average size of the largest connected component over time12 from our sampling algorithm with the corresponding values from RHS.
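A minimal sketch of this rejection check (illustrative names), assuming the time-averaged density and largest-connected-component size have been tracked for both the candidate \((\alpha , \beta)\) pair and RHS:

def is_rejected(cand_avg_density, cand_avg_lcc, rhs_avg_density, rhs_avg_lcc):
    """Reject hyperparameter values yielding clearly undesirable sub-hypergraphs."""
    sparser_than_rhs = cand_avg_density < rhs_avg_density   # Density Condition
    larger_cc_than_rhs = cand_avg_lcc > rhs_avg_lcc         # CC Condition
    return sparser_than_rhs or larger_cc_than_rhs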

    4.3.3 MiDaS-B: Full-fledged Version.

    We introduce \(\text {MiDaS-B}\) , the full-fledged version of our back-in-time sampling algorithm that combines the hyperedge weight function Equation (3) and the evolutionary-pattern-based automatic hyperparameter tuning method. The pseudocode for MiDaS-B is given in Algorithm 4. Given (a) a hypergraph \(\mathcal {G}=(\mathcal {V},\mathcal {E})\) (i.e., a snapshot of the input hypergraph), (b) a sampling portion \(p\) , and (c) hyperparameter search spaces \(\mathcal {S}_{\alpha }\) and \(\mathcal {S}_{\beta }\) , MiDaS-B returns a sub-hypergraph \(\mathcal {\hat{G}}=(\mathcal {\hat{V}},\mathcal {\hat{E}})\) of \(\mathcal {G}\) . Specifically, after tuning its hyperparameters \(\alpha\) and \(\beta\) (line 1), MiDaS-B samples a target number of hyperedges with the sampling probability proportional to Equation (3) (lines 2–6), and then it returns the sub-hypergraph consisting of the sample hyperedges (line 7).13
    Theoretical analysis: We analyze the time complexities of Algorithms 3 and 4.
    Lemma 1 (Time Complexity of Hyperedge Sampling).
    The time complexity of sampling hyperedges for a pair of \(\alpha ^{\star }\) and \(\beta ^{\star }\) values (i.e., lines 2-6 of Algorithm 4) is \(O(p \cdot |\mathcal {E}| \cdot \log C + \sum _{e \in \mathcal {E}} |e|)\) , where \(C\) is the size of the set \(\lbrace (\min _{v \in e}d_{v}, |e|) : e \in \mathcal {E}\rbrace\) .
    Proof.
It takes \(O(\sum _{e\in \mathcal {E}} |e|)\) time to compute \(\omega (e)\) (i.e., \(\min _{v \in e} d_{v}\)) and \(|e|\) for every hyperedge \(e\). It takes \(O(|\mathcal {E}|)\) time to build a balanced binary tree with \(C\) leaf nodes, where each leaf node points to a list of all hyperedges that share the same \((\min _{v \in e} d_{v}, |e|)\). Then, it takes \(O(C)=O(|\mathcal {E}|)\) time in total to store in each node \(i\) the sum of the weights of the hyperedges pointed to by any node in the sub-tree rooted at \(i\), if we store them from the leaf nodes to the root. The height of the tree is \(O(\log C)\); thus, the process of drawing each hyperedge – which involves starting from the root and iteratively selecting a child based on weights until a leaf is reached, followed by selecting the hyperedge associated with that leaf – and the subsequent updating of weights accordingly take \(O(\log C)\) time. Drawing \(p \cdot |\mathcal {E}|\) hyperedges takes \(O(p \cdot |\mathcal {E}| \cdot \log C)\) time, and since \(O(|\mathcal {E}|)=O(\sum _{e\in \mathcal {E}} |e|)\), the total time complexity is \(O(p \cdot |\mathcal {E}| \cdot \log C+\sum _{e\in \mathcal {E}} |e|)\). □
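The following is a minimal sketch of the tree-based sampler described in the proof, simplified to a Fenwick (binary indexed) tree with one position per hyperedge, which yields \(O(\log |\mathcal {E}|)\) per draw rather than the \(O(\log C)\) obtained by grouping hyperedges with equal \((\min _{v \in e} d_{v}, |e|)\) pairs; all names are illustrative.

import random

class TreeWeightedSampler:
    """Sample indices without replacement, P(i) proportional to weights[i]."""
    def __init__(self, weights, seed=0):
        self.n = len(weights)
        self.w = list(weights)
        self.tree = [0.0] * (self.n + 1)      # Fenwick tree of weight prefix sums
        for i, wi in enumerate(weights, 1):
            self._update(i, wi)
        self.total = float(sum(weights))
        self.rng = random.Random(seed)

    def _update(self, i, delta):              # O(log n) point update
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def draw(self):                           # O(log n) weighted draw
        target = self.rng.random() * self.total
        i, step = 0, 1 << self.n.bit_length()
        while step:                           # find the largest i with prefix sum <= target
            j = i + step
            if j <= self.n and self.tree[j] <= target:
                target -= self.tree[j]
                i = j
            step >>= 1
        self.total -= self.w[i]               # i is the 0-based index of the drawn item
        self._update(i + 1, -self.w[i])       # zero out its weight so it cannot repeat
        self.w[i] = 0.0
        return i

Drawing \(\lfloor p \cdot |\mathcal {E}| \rfloor\) hyperedges with weights \(\omega (e; \alpha , \beta)\) then matches the complexity in Lemma 1, up to the \(\log C\) versus \(\log |\mathcal {E}|\) factor.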
    Theorem 2 (Time Complexity of Automatic Hyperparameter Tuning).
    The time complexity of Algorithm 3 is \(O(|\mathcal {S}_{\alpha }|\cdot |\mathcal {S}_{\beta }| \cdot (|\mathcal {E}|\cdot \log C + \sum _{e \in \mathcal {E}} \sum _{v \in e} d_{v} + \sum _{e \in \mathcal {E}} |e|^{2} \alpha (|\mathcal {V}|))).\)
    Proof.
We first present the time complexity for a specific pair \(\alpha ^{\star }\) and \(\beta ^{\star }\) (i.e., lines 4–7 of Algorithm 3). By Lemma 1, hyperedge sampling takes \(O(|\mathcal {E}| \cdot \log C + \sum _{e \in \mathcal {E}} |e|)\) since the sampling portion \(p\) is set to 1.0, which corresponds to the sampling of the entire set of hyperedges. In addition, in order to check the rejection conditions and measure power-law fitness, we compute the following metrics over time: (1) the fraction of intersecting hyperedge pairs, (2) the average hyperedge size, (3) the average density, and (4) the size of the largest connected component. (1) Computing the fraction of intersecting hyperedge pairs takes \(O(\sum _{e \in \mathcal {E}} \sum _{v \in e} d_{v})\): for each newly added hyperedge \(e\), we examine at most \(\sum _{v \in e} d_{v}\) intersecting hyperedges to calculate their unique count. (2) Computing the average hyperedge size and (3) the average density takes \(O(\sum _{e \in \mathcal {E}} |e|)\). (4) Computing the size of the largest connected component takes \(O(\sum _{e \in \mathcal {E}} |e|^{2} \alpha (|\mathcal {V}|))\): when a new hyperedge is added, we examine all node pairs within the hyperedge, identify the components they belong to, and merge them if they belong to different components. We employ a disjoint-set data structure for this, and \(\alpha (n)\) represents the extremely slow-growing inverse Ackermann function. Lastly, given the metrics over time, evaluating the rejection conditions and power-law fitness, which requires linear regression, takes \(O(|\mathcal {E}|)\) time, and if we record the metrics only a constant number of times (refer to Footnote 13), it takes \(O(1)\) time. Therefore, the total time complexity is \(O(|\mathcal {E}|\log C + \sum _{e \in \mathcal {E}} \sum _{v \in e} d_{v} + \sum _{e \in \mathcal {E}} |e|^{2} \alpha (|\mathcal {V}|))\).
    Since \(|\mathcal {S}_{\alpha }|\cdot |\mathcal {S}_{\beta }|\) pairs of \(\alpha ^{\star }\) and \(\beta ^{\star }\) values are considered, the total time complexity becomes \(O(|\mathcal {S}_{\alpha }|\cdot |\mathcal {S}_{\beta }| \cdot (|\mathcal {E}|\cdot \log C + \sum _{e \in \mathcal {E}} \sum _{v \in e} d_{v} + \sum _{e \in \mathcal {E}} |e|^{2} \alpha (|\mathcal {V}|)))\) . □
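To illustrate the disjoint-set bookkeeping in the proof, here is a minimal sketch (illustrative names) that tracks the size of the largest connected component as hyperedges arrive. It uses union by size with path compression, which gives the \(\alpha (|\mathcal {V}|)\) amortized factor above; for connectivity alone, \(|e| - 1\) unions over consecutive node pairs suffice, although the proof conservatively counts all \(O(|e|^{2})\) pairs.

class DisjointSet:
    def __init__(self):
        self.parent, self.size = {}, {}
        self.largest = 0                      # size of the largest component so far

    def find(self, v):
        if v not in self.parent:              # lazily register new nodes
            self.parent[v], self.size[v] = v, 1
            self.largest = max(self.largest, 1)
        root = v
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[v] != root:         # path compression
            self.parent[v], v = root, self.parent[v]
        return root

    def union(self, u, v):
        ru, rv = self.find(u), self.find(v)
        if ru == rv:
            return
        if self.size[ru] < self.size[rv]:     # union by size
            ru, rv = rv, ru
        self.parent[rv] = ru
        self.size[ru] += self.size[rv]
        self.largest = max(self.largest, self.size[ru])

def largest_cc_over_time(hyperedge_sequence):
    """Yield the largest-connected-component size after each hyperedge arrives."""
    dsu = DisjointSet()
    for e in hyperedge_sequence:
        nodes = list(e)
        dsu.find(nodes[0])                    # handles single-node hyperedges
        for u, v in zip(nodes, nodes[1:]):
            dsu.union(u, v)
        yield dsu.largest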
Theorem 3 (Time Complexity of MiDaS-B).
The time complexity of Algorithm 4 is \(O(|\mathcal {S}_{\alpha }|\cdot |\mathcal {S}_{\beta }| \cdot (|\mathcal {E}|\cdot \log C + \sum _{e \in \mathcal {E}} \sum _{v \in e} d_{v} + \sum _{e \in \mathcal {E}} |e|^{2} \alpha (|\mathcal {V}|)))\).
    Proof.
    From Lemma 1 and Theorem 2, it is clear that the time complexity of Algorithm 3, a subroutine of Algorithm 4, is the dominant factor in the time complexity of Algorithm 4. Therefore, the Big-O notation for the time complexity of Algorithm 4 is the same as that of Algorithm 3. □

    4.4 Evaluation

    In this section, we review our experiments for evaluating the output quality, consistency, and speed of MiDaS-B, our proposed algorithm for back-in-time hypergraph sampling.

    4.4.1 Experimental Settings.

In our experiments, we examine the performance of 11 back-in-time sampling algorithms on 11 datasets with five different sampling portions (specifically, \(10\%, 30\%, 50\%, 70\%,\) and \(90\%\)). We conduct all experiments on a machine equipped with an i9-10900K CPU and 64 GB RAM. The sample quality of each algorithm is averaged over three independent trials.
The competing algorithms include the six intuitive algorithms (Section 2.3), ONS (Section 4.2), the two variants (NB and SK) of HRW (Section 3.4.1), and MiDaS-B (Section 4.3.3). We exclude MiDaS (w. Oracle), which requires the ground-truth past snapshots as input, and instead, we include MiDaS (w/o. Oracle), a variant of MiDaS-Basic that tunes its hyperparameter \(\alpha\) using the same rejection conditions and power-law fitness as MiDaS-B. For both MiDaS (w/o. Oracle) and MiDaS-B, we use \(\mathcal {S}_{\alpha } = \lbrace 0, 2^{-2}, 2^{-1}, 2^{0}, 2^{1}\rbrace\) and \(\mathcal {S}_{\beta } = \lbrace -2^{0},-2^{-1}, -2^{-2}, 0, 2^{-2}, 2^{-1}, 2^{0} \rbrace\) as the search spaces for \(\alpha\) and \(\beta\), respectively.

    4.4.2 Quality of MiDaS-B.

We compare the performances of MiDaS-B and ten competing methods across various datasets and sampling portions. The results in Table 10 are based on relative differences, rankings, and Z-Scores, as described in Section 2.5. These results are averaged over five sampling portions (10%, 30%, 50%, 70%, and 90%) and 11 real-world hypergraph datasets. In terms of average rankings and average Z-Scores, MiDaS-B consistently outperforms all competing methods, demonstrating superior performance, especially in preserving degrees, density, and overlapness. In particular, MiDaS-B maintains hyperedge size distributions better than MiDaS (w/o. Oracle), demonstrating the benefit of incorporating hyperedge sizes into its sampling process. Refer to Table 11 for a visual presentation of detailed results on each dataset. Collectively, our findings strongly support the effectiveness of MiDaS-B compared to competing methods, particularly in preserving multiple properties and hyperedge size distributions.
    Table 10.
Property (Metric) | RNS | RDN [46] | RW [64] | FF [35] | ONS | RHS | TIHS [2] | HRW [78] (NB) | HRW [78] (SK) | MiDaS (w/o. Oracle) | MiDaS-B
Degree (Dstat.) | 0.113 | 0.280 | 0.311 | 0.326 | 0.222 | 0.059 | 0.283 | 0.111 | 0.107 | 0.076 | 0.062
Degree (Rank) | 4.800 | 8.200 | 9.455 | 9.855 | 6.255 | 2.964 | 8.491 | 4.400 | 4.218 | 3.982 | 3.109
Degree (Z-Score) | -0.410 | 0.772 | 1.255 | 1.006 | 0.175 | -0.871 | 0.739 | -0.558 | -0.570 | -0.659 | -0.880
Int. Size (Dstat.) | 0.068 | 0.039 | 0.038 | 0.033 | 0.023 | 0.032 | 0.035 | 0.066 | 0.065 | 0.032 | 0.031
Int. Size (Rank) | 6.964 | 6.455 | 5.964 | 5.127 | 3.727 | 4.236 | 5.982 | 9.036 | 8.909 | 4.636 | 4.691
Int. Size (Z-Score) | 0.400 | -0.055 | -0.160 | -0.364 | -0.670 | -0.489 | -0.174 | 1.152 | 1.153 | -0.393 | -0.401
Pair Degree (Dstat.) | 0.050 | 0.098 | 0.097 | 0.125 | 0.070 | 0.039 | 0.091 | 0.054 | 0.054 | 0.039 | 0.039
Pair Degree (Rank) | 5.073 | 8.018 | 7.018 | 8.855 | 5.236 | 5.291 | 7.309 | 5.036 | 5.291 | 4.036 | 4.564
Pair Degree (Z-Score) | -0.170 | 0.772 | 0.373 | 0.861 | -0.187 | -0.399 | 0.342 | -0.289 | -0.273 | -0.565 | -0.466
Size (Dstat.) | 0.100 | 0.086 | 0.119 | 0.091 | 0.044 | 0.094 | 0.086 | 0.279 | 0.282 | 0.101 | 0.085
Size (Rank) | 5.091 | 5.236 | 6.182 | 5.545 | 2.509 | 5.873 | 5.182 | 9.709 | 9.909 | 5.964 | 4.527
Size (Z-Score) | -0.143 | -0.340 | 0.073 | -0.405 | -0.785 | -0.357 | -0.386 | 1.530 | 1.551 | -0.312 | -0.426
SV (Dstat.) | 0.079 | 0.107 | 0.101 | 0.109 | 0.091 | 0.052 | 0.102 | 0.075 | 0.078 | 0.055 | 0.055
SV (Rank) | 6.455 | 7.455 | 6.509 | 7.455 | 6.564 | 3.145 | 7.273 | 5.400 | 5.673 | 3.982 | 3.818
SV (Z-Score) | 0.328 | 0.521 | 0.342 | 0.490 | 0.152 | -0.728 | 0.454 | -0.192 | -0.156 | -0.617 | -0.595
CC (Dstat.) | 0.264 | 0.319 | 0.337 | 0.331 | 0.309 | 0.288 | 0.306 | 0.322 | 0.319 | 0.303 | 0.292
CC (Rank) | 1.818 | 6.018 | 6.818 | 8.182 | 5.727 | 2.982 | 5.782 | 5.836 | 5.127 | 3.964 | 2.655
CC (Z-Score) | -0.829 | 0.114 | 0.437 | 1.320 | -0.011 | -0.458 | -0.132 | 0.120 | 0.053 | -0.206 | -0.409
GCC (Diff.) | 0.097 | 0.074 | 0.099 | 0.078 | 0.050 | 0.067 | 0.080 | 0.119 | 0.119 | 0.070 | 0.065
GCC (Rank) | 6.509 | 5.764 | 6.327 | 6.091 | 4.364 | 5.073 | 5.836 | 7.782 | 7.727 | 5.000 | 5.055
GCC (Z-Score) | 0.211 | -0.099 | 0.218 | -0.048 | -0.538 | -0.299 | -0.080 | 0.629 | 0.581 | -0.283 | -0.291
Density (Diff.) | 1.780 | 17.810 | 16.894 | 20.853 | 12.965 | 1.128 | 18.241 | 2.330 | 2.337 | 0.818 | 0.790
Density (Rank) | 3.818 | 9.382 | 8.255 | 8.945 | 6.236 | 3.618 | 8.036 | 5.582 | 6.109 | 3.145 | 2.109
Density (Z-Score) | -0.643 | 1.140 | 0.793 | 0.841 | 0.169 | -0.807 | 0.635 | -0.260 | -0.179 | -0.697 | -0.992
Diameter (Diff.) | 1.240 | 1.160 | 0.894 | 0.806 | 0.649 | 0.691 | 0.870 | 0.607 | 0.613 | 0.613 | 0.815
Diameter (Rank) | 5.945 | 7.636 | 7.964 | 7.891 | 6.509 | 4.527 | 7.055 | 4.618 | 5.036 | 3.891 | 4.655
Diameter (Z-Score) | 0.065 | 0.499 | 0.678 | 0.503 | 0.241 | -0.458 | 0.331 | -0.412 | -0.364 | -0.604 | -0.480
Overlapness (Diff.) | 4.157 | 58.550 | 56.985 | 71.393 | 41.007 | 2.708 | 60.242 | 4.933 | 5.000 | 2.229 | 2.416
Overlapness (Rank) | 4.909 | 8.018 | 9.455 | 9.418 | 6.036 | 3.745 | 8.018 | 4.491 | 4.709 | 3.709 | 3.218
Overlapness (Z-Score) | -0.391 | 0.682 | 1.264 | 0.967 | 0.076 | -0.783 | 0.567 | -0.511 | -0.397 | -0.669 | -0.805
Average (Rank) | 5.138 | 7.218 | 7.395 | 7.736 | 5.316 | 4.145 | 6.896 | 6.189 | 6.271 | 4.231 | 3.840
Average (Z-Score) | -0.158 | 0.401 | 0.527 | 0.517 | -0.138 | -0.565 | 0.230 | 0.121 | 0.140 | -0.500 | -0.574
    Table 10. Among the 11 Sampling Methods Evaluated on 11 Real-world Hypergraphs with five Different Sampling Portions (Specifically, 10%, 30%, 50%, 70%, and 90%), MiDaS-B Demonstrates Overall the Best Performance in Back-in-time Sampling
    To assess the effectiveness of each method, we report their distances (D-statistics or relative differences), Z-Scores, and rankings. A smaller value indicates better performance, reflecting the capability of a method to closely reproduce the target snapshots of the input hypergraphs.
    Table 11.
Table 11. Back-in-time Hypergraph Sampling Results in Six Datasets from Different Domains when the Sampling Portion is 0.5. Note that Samples Obtained by MiDaS-B Effectively Preserve Various Properties (Specifically, P1–P10) of the Target Snapshot (Shown in Black)

    4.4.3 Consistency of MiDaS-B.

    Figure 11 presents the overall performances (in terms of average rankings and average Z-Scores) of different algorithms across various sampling portions. As depicted in Figure 11(a)–(b), MiDaS-B consistently outperforms all competitors. Furthermore, as shown in Figure 11(c), where we report the D-statistics for hyperedge size distributions, MiDaS-B preserves hyperedge sizes consistently better than MiDaS (w/o. Oracle). These results highlight the consistently superior performance of MiDaS-B across diverse sampling scenarios.
    Fig. 11.
    Fig. 11. (a)–(b) MiDaS-B consistently achieves the best results in terms of average rankings and Z-Scores, regardless of the sampling portions. (c) MiDaS-B preserves hyperedge sizes better than MiDaS (w/o. Oracle) consistently across all sampling portions.

    4.4.4 Speed of MiDaS-B.

In Figure 12, we compare the running times of all competing algorithms in each dataset with a fixed sampling portion of 0.9. MiDaS-B exhibits relatively long computational times due to its extensive search across all candidate pairs of \(\alpha\) and \(\beta\) values (specifically, 35 pairs in our experimental setup). For each pair, MiDaS-B conducts hyperedge sampling and concurrently tracks the evolution of structural properties (specifically, the fraction of intersecting hyperedge pairs, the average hyperedge size, the average density, and the average size of the largest connected component) over time for the computation of fitness and the application of the rejection conditions. Nevertheless, MiDaS-B terminates within a reasonable period (specifically, less than 10,000 seconds) for all datasets considered, demonstrating its practical efficiency for real-world scenarios. For reference, the sampling time alone, excluding the computation of the four tracked properties, amounts to 344 seconds on the Coauth-Geology dataset. Furthermore, in Appendix A.3, we analyze the running time of MiDaS-B with respect to the input hypergraph size.
    Fig. 12.
    Fig. 12. Total running time for back-in-time sampling in six datasets from different domains, with a fixed sampling portion of 0.9. Although MiDaS-B involves a comprehensive search for hyperparameters, it terminates within a reasonable period (specifically, less than 10,000 seconds) for all datasets considered, demonstrating its practical efficiency for real-world scenarios.

    4.4.5 Effectiveness of Each Component of MiDaS-B.

    We demonstrate the effectiveness of each component of MiDaS-B: (a) bias associated with hyperedge sizes and (b) automatic hyperparameter tuning. To this end, we compare MiDaS-B with the following variants:
    MiDaS (w. Oracle) (Section 4.2): The hyperparameter \(\alpha\) of MiDaS-Basic is tuned through a grid search for each dataset based on the comparison (with respect to all ten properties) with the ground-truth past snapshot of each dataset. It is important to note that this method relies on the ground-truth past snapshot, which is not provided according to the problem definition, and that is why we include “oracle” in its name.
    MiDaS-B (w. Oracle): The hyperparameters \(\alpha\) and \(\beta\) of MiDaS-B-Basic are tuned through a grid search for each dataset based on the comparison (with respect to all ten properties) with the ground-truth past snapshot of each dataset. Note that this method also relies on the ground-truth past snapshot.
    Top k: MiDaS-B-Basic with the \(k\) th best combination of \(\alpha\) and \(\beta\) values, which remains fixed across all datasets. The rankings of the combinations are determined based on the similarity (with respect to all ten properties) between the resulting sub-hypergraphs and the ground-truth past snapshots. Specifically, the rankings are computed based on the average of min-max normalized rankings and Z-Scores across all datasets and sampling portions.
    Bias Associated with Hyperedge Sizes: As shown in Table 12, MiDaS-B (w. Oracle) demonstrates superior performance over MiDaS (w. Oracle), indicating that considering the hyperedge size in the hyperedge sampling probability (i.e., the introduction of \(\beta\) ) effectively mitigates the limitations of MiDaS (w. Oracle) while preserving its strengths.
    Table 12.
Property (Metric) | MiDaS (w. Oracle) | MiDaS-B (w. Oracle) | Top 1 (Out of 35) | Top 5 (Out of 35) | Top 10 (Out of 35) | Top 35 (Out of 35) | MiDaS-B
Degree (Dstat.) | 0.053 | 0.050 | 0.059 | 0.059 | 0.067 | 0.157 | 0.062
Degree (Rank) | 2.491 | 2.982 | 3.291 | 3.582 | 4.200 | 6.655 | 3.527
Degree (Z-Score) | -0.484 | -0.502 | -0.226 | -0.284 | -0.137 | 1.829 | -0.196
Int. Size (Dstat.) | 0.032 | 0.031 | 0.032 | 0.032 | 0.033 | 0.044 | 0.031
Int. Size (Rank) | 3.400 | 3.564 | 4.073 | 3.109 | 3.909 | 5.182 | 3.491
Int. Size (Z-Score) | -0.230 | -0.034 | 0.017 | -0.281 | -0.143 | 0.776 | -0.105
Pair Degree (Dstat.) | 0.032 | 0.040 | 0.043 | 0.039 | 0.041 | 0.065 | 0.039
Pair Degree (Rank) | 2.473 | 3.545 | 3.945 | 4.455 | 3.418 | 5.091 | 3.800
Pair Degree (Z-Score) | -0.532 | -0.151 | 0.093 | 0.129 | -0.253 | 0.796 | -0.082
Size (Dstat.) | 0.093 | 0.062 | 0.081 | 0.094 | 0.119 | 0.178 | 0.085
Size (Rank) | 3.473 | 1.982 | 2.636 | 3.655 | 5.636 | 6.436 | 2.909
Size (Z-Score) | -0.170 | -0.809 | -0.455 | -0.133 | 0.456 | 1.548 | -0.438
SV (Dstat.) | 0.052 | 0.049 | 0.053 | 0.052 | 0.052 | 0.088 | 0.055
SV (Rank) | 3.782 | 3.036 | 3.545 | 3.127 | 3.655 | 5.600 | 3.673
SV (Z-Score) | -0.197 | -0.272 | -0.047 | -0.429 | -0.180 | 1.103 | 0.022
CC (Dstat.) | 0.294 | 0.289 | 0.289 | 0.288 | 0.296 | 0.339 | 0.292
CC (Rank) | 3.382 | 2.491 | 2.455 | 2.691 | 4.036 | 5.527 | 2.509
CC (Z-Score) | -0.070 | -0.367 | -0.317 | -0.243 | -0.029 | 1.283 | -0.369
GCC (Diff.) | 0.062 | 0.048 | 0.064 | 0.067 | 0.083 | 0.112 | 0.065
GCC (Rank) | 2.909 | 2.345 | 3.600 | 3.709 | 5.255 | 4.873 | 3.873
GCC (Z-Score) | -0.304 | -0.608 | 0.005 | -0.065 | 0.345 | 0.620 | 0.007
Density (Diff.) | 0.892 | 0.893 | 0.771 | 1.128 | 1.134 | 4.311 | 0.790
Density (Rank) | 2.764 | 2.855 | 2.509 | 4.327 | 4.582 | 5.745 | 2.800
Density (Z-Score) | -0.420 | -0.483 | -0.388 | 0.130 | 0.245 | 1.251 | -0.334
Overlapness (Diff.) | 2.002 | 2.474 | 2.061 | 2.708 | 2.815 | 18.976 | 2.416
Overlapness (Rank) | 2.636 | 2.964 | 2.800 | 4.073 | 4.309 | 6.818 | 3.127
Overlapness (Z-Score) | -0.548 | -0.567 | -0.262 | -0.175 | -0.178 | 1.996 | -0.266
Diameter (Diff.) | 0.639 | 0.539 | 0.808 | 0.691 | 0.689 | 0.623 | 0.815
Diameter (Rank) | 2.982 | 2.891 | 3.709 | 4.036 | 4.509 | 4.618 | 3.982
Diameter (Z-Score) | -0.304 | -0.475 | 0.103 | -0.004 | 0.059 | 0.421 | 0.200
Average (Rank) | 3.029 | 2.865 | 3.256 | 3.676 | 4.351 | 5.655 | 3.369
Average (Z-Score) | -0.326 | -0.427 | -0.148 | -0.136 | 0.018 | 1.162 | -0.156
    Table 12. Effectiveness of the Key Components of MiDaS-B
    The results are averaged and compared across 11 real-world hypergraphs with five different sampling portions (i.e., 10%, 30%, 50%, 70%, and 90%). The overall superiority of MiDaS-B (w. Oracle) over MiDaS (w. Oracle) highlights the effectiveness of the hyperedge-related bias in MiDaS-B (w. Oracle). Furthermore, the comparable overall performance of MiDaS-B, which does not require ground-truth past snapshots, to that of Top 1, which relies on such snapshots, demonstrates the effectiveness of the automatic hyperparameter tuning by MiDaS-B.
Automatic Hyperparameter Tuning: Recall that MiDaS-B tunes its hyperparameters \(\alpha\) and \(\beta\) without access to the ground-truth past snapshot of the given hypergraph. Nevertheless, MiDaS-B exhibits a competitive performance, comparable even to the Top 1 (out of 35), as shown in Table 12. Specifically, MiDaS-B outperforms the Top 1 in terms of average Z-Scores but underperforms it in terms of average rankings. Recall that the Top 1 employs the best combination of \(\alpha\) and \(\beta\) values across all datasets, relying on access to the ground-truth past snapshot. MiDaS-B outperforms the Top 5 (out of 35) in terms of both average Z-Scores and average rankings. Notably, MiDaS-B preserves hyperedge sizes and connected component sizes even better than MiDaS (w. Oracle), which tunes hyperparameters for each dataset using the ground-truth past snapshots.

    5 Related Work

    In this section, we conduct a review of relevant studies categorized into five subsections: (a) graph simplification, (b) graph sampling, (c) hypergraph sampling, (d) structural properties of real-world hypergraphs, and (e) others.

    5.1 Graph Simplification

The ubiquity of large-scale graphs in real-world applications, often comprising millions of nodes and edges, presents formidable computational challenges. Consequently, various works have emerged to simplify graphs, each with distinct objectives. Graph simplification may involve reducing graph size while preserving specific graph properties, such as community structures [53], pairwise distances [56], cuts [31], or eigenvalues [61]. Note that, in general, simplified graphs may not always be subgraphs of the original graphs. Spectral sparsifiers [61], for instance, approximate the Laplacian quadratic form of the original graph with a subgraph. Another significant direction involves graph condensation [30], where the goal is to learn a simplified graph structure, along with node attributes, to minimize the performance gap between machine learning models (e.g., graph neural networks) trained on the simplified graph and the original graph. In a related context, complex but implicit relationships between entities can be summarized in the form of a simple graph (or a similarity matrix) to be leveraged by classification methods, such as \(k\)-nearest neighbors [79, 80, 81, 82], with further improvement using quantum computing [49].

    5.2 Graph Sampling

    Graph sampling, also known as subgraph sampling, represents a specific approach to graph simplification where simplified graphs are selected among the subgraphs of the original graphs. This process enhances the interpretability of the simplified graphs and strengthens the connections (i.e., correspondences) between them and the original graphs. Graph sampling also has been explored with diverse objectives, such as graphical inference [72], graph visualization [23, 28, 57], online-social-network crawling [24, 39, 50, 69], and triangle-count estimation [40, 51, 62].
    Our work is most closely related to representative sampling and back-in-time sampling [46]. Representative sampling aims at finding a subgraph that accurately represents the structural characteristics of the original graph [2, 27, 46, 65]. Back-in-time sampling aims at finding a subgraph that closely approximates a past snapshot of a given graph with a specified size [46]. Note that, in both tasks, the objective is to obtain general-purpose subgraphs, without presuming specific use cases for the sampled subgraphs.
    In terms of methodologies, most graph-sampling methods can be categorized into two main types: (a) node selection methods and (b) edge selection methods. Node-selection methods [27, 46, 53, 55] involve selecting a subset of nodes and obtaining the induced subgraph. Edge-selection methods [2, 40, 45, 46, 51, 62, 74] select a subset of edges and construct the subgraph using these edges and their incident nodes.
In addition, research on graph evolution, especially studies aiming at identifying the temporal order of nodes or edges in a graph, is closely related to back-in-time sampling. Such studies exploit evolutionary patterns of real-world graphs and network growth models inspired by them [45, 55, 74], which have also been leveraged for back-in-time sampling [46].14

    5.3 Hypergraph Sampling

    Hypergraphs have recently gained significant attention in various domains, including recommendation [67], entity ranking [12], misinformation detection [48], node classification [22, 29], and clustering [84]. These domains leverage the high-order relationships between nodes embedded in hyperedges to achieve better performance in various tasks. For the use of hypergraphs in machine learning, refer to an extensive survey [3].
Despite the increasing interest in hypergraphs, hypergraph sampling remains a relatively unexplored area. Yang et al. [70] proposed a novel sampling method for hypergraphs, but it is tailored to a specific task, namely node embedding. In addition, there have been several studies attempting to generate hypergraphs that reproduce the properties of real-world hypergraphs [4, 13, 18, 19, 20, 33, 35, 42]; refer to a survey [41] on this topic for details. However, these generation processes involve the creation of new nodes and hyperedges, whereas our focus is on sampling from existing ones.
    Particularly related to back-in-time sampling, Comrie and Kleinberg [16] introduced the problem of identifying the temporal order of hyperedges. However, their approach is customized for ordering hyperedges within a hypergraph ego-network, typically containing a limited number (specifically, at most 20) of hyperedges, rather than considering the entire hypergraph. Specifically, they employed a classifier trained on features derived from hypergraph ego-networks to predict the temporal order of hyperedges in a supervised manner.

    5.4 Structural Properties of Real-world Hypergraphs

    With respect to hypergraph sampling, it is important to decide the non-trivial structural properties of hypergraphs to be preserved. Kook et al. [35] discovered unique patterns in real-world hypergraphs regarding (a) hyperedge sizes, (b) intersection sizes, (c) the singular values of the incidence matrix, (d) edge density, and (e) diameter. Lee et al. [42] reported a number of properties regarding the overlaps of hyperedges by which real-world hypergraphs are clearly distinguished from random hypergraphs. In addition, Do et al. [19] uncovered several patterns regarding the connections between subsets of a fixed number of nodes (e.g., connected component sizes) in real-world hypergraphs; and Kim et al. [32] explored those related to transitivity (i.e., the propensity to form clusters). For directed hypergraphs, where nodes in each hyperedge are divided into heads and tails, Kim et al. [33] explored various empirical patterns regarding reciprocity (i.e., the inclination to form mutual connections).
    Dynamic changes in the structural properties of time-evolving hypergraphs have been analyzed from various perspectives. At the node level, the same subsets of nodes tend to appear repeatedly [6]. Moreover, this tendency becomes stronger, as these subsets appear at a greater number of hyperedges, spanning diverse sizes [15]. At the hyperedge level, the repetition [6], recency [6], burstiness [8], and persistency [8] in the appearance of hyperedges have been investigated. At the hypergraph level, studies have revealed trends such as diminishing overlaps, densification, and shrinking diameter [35]. Furthermore, certain studies have explored properties related to sub-hypergraphs, including triads of nodes [5], triads of hyperedges [44], and ego networks [16]. For more static and dynamic structural patterns in real-world hypergraphs, refer to a survey [41].

    5.5 Others

    Throughout the article, we use the term “bias” to refer to the statistical prioritization of specific nodes or hyperedges during the sampling process. This concept is related but distinct from the concept of bias associated with sensitive node attributes, which we typically aim at minimizing to ensure fairness during graph representation learning [52, 76, 77]. It is important to note that our objective is not to minimize the sampling bias; rather, our goal is to adjust the sampling bias to align the sampled distribution more closely with the target distribution. To achieve this objective, our methods may even increase the bias toward high-degree nodes during sampling, especially when the target distribution consists of more high-degree nodes.

    6 Conclusions and Future Directions

    In this section, we provide conclusions and outline future research directions.

    6.1 Conclusions

    In this work, we tackle two hypergraph sampling problems: representative sampling and back-in-time sampling. For representative sampling, we propose MiDaS, a fast and effective algorithm designed to overcome the limitations of RHS by automatically adjusting the amount of bias toward high-degree nodes. For back-in-time sampling, we propose MiDaS-B, which is built upon the mechanism of MiDaS but integrates a bias related to hyperedge size to overcome the limitations of MiDaS. MiDaS-B is also equipped with an automatic hyperparameter tuning method that leverages the evolutionary patterns of real-world hypergraphs without requiring the ground-truth past snapshot. Our extensive experiments on 11 real-world hypergraphs with five different sampling portions demonstrate the superiority of MiDaS (or MiDaS-B) for representative sampling (respectively, back-in-time sampling) compared to 14 (respectively, 10) competing methods.
    Our contributions are summarized as follows:
    Problem Formulation: To the best of our knowledge, we are the first to formulate the problem of representative sampling and back-in-time sampling from real-world hypergraphs. Our formulation is based on ten pervasive structural properties of real-world hypergraphs.
    Observations: We examine the characteristics of a number of intuitive sampling approaches in 11 datasets, and our findings guide the development of our more effective algorithms.
    Algorithm Design: We propose MiDaS and MiDaS-B for representative sampling and back-in-time sampling, respectively. Their superiority is validated through extensive experiments conducted across 11 datasets and five different sampling portions.
    For reproducibility, we make our code and datasets available at https://github.com/young917/MiDaS.

    6.2 Future Directions

    Applications: Recall that our methods are designed for sampling general-purpose sub-hypergraphs, with the primary goal of preserving a wide range of hypergraph structural properties. While they can exhibit versatility, they may not be optimal for specific objectives or applications. We plan to explore the effectiveness of our methods in diverse applications and further investigate sub-hypergraph sampling techniques tailored to specific applications.
Diverse types of hypergraphs: Furthermore, our current focus is primarily on homogeneous hypergraphs, without explicit consideration of the potential presence of various node types and/or hyperedge types. In our future research, we plan to expand our scope to include heterogeneous hypergraphs.
Scalability: Moreover, considering the vast size of real-world hypergraphs, we acknowledge the necessity for more efficient sampling strategies, particularly for back-in-time sampling. In addition, the dynamic nature of real-world hypergraphs requires efficient updates of the sampled sub-hypergraph over time. We plan to address these challenges through technological advancements (e.g., quantum computing) as well as algorithmic innovation.

    Footnotes

    1
    This article extends our previous work [14] on representative hypergraph sampling. In this extended version, we formulate a novel problem of back-in-time hypergraph sampling, whose goal is to accurately approximate a past snapshot of a given size of the input hypergraph (Section 4.1). Unlike representative sampling, we do not have access to the target (i.e., the past snapshot of a given size) in back-in-time sampling, and this unique challenge necessitates the development of a new algorithm. Thus, for the new problem, we examine a number of intuitive approaches (Section 4.2), and based on the examination, we design MiDaS-B, a novel algorithm for back-in-time sampling (Section 4.3). Finally, we demonstrate the empirical superiority of MiDaS-B (Section 4.4).
    2
    That is, \(F(x):=\sum _{i=1}^{x}f(i)\) and \(\hat{F}(x):=\sum _{i=1}^{x}\hat{f}(i)\) .
    3
    They are chosen among 0, \(2^{-3}\) , \(2^{-2.5}\) , \(2^{-2}\) , \(\ldots\) \(2^{5.5}\) , and \(2^{6}\) .
    4
    The skewness is defined as \(\frac{ \mathrm{E}_{v} [ (d_{v} - \mathrm{E}_{v}[d_{v}])^{3} ]}{ \mathrm{E}_{v}[ (d_{v} - \mathrm{E}_{v} \left[ d_{v} \right])^{2}]^{3/2} }\) .
    5
    We chose \(\alpha\) minimizing \(\mathcal {L}(\mathcal {G}, \mathcal {\hat{G}})\) among \(\mathcal {S}=\lbrace 0, 2^{-3}, 2^{-2.5}, \ldots , 2^{5.5}, 2^6\rbrace\) .
    6
    We set \(p\) to \(10\%\) , \(20\%\) , \(30\%\) , \(40\%\) , or \(50\%.\)
    7
    We set \(\mathcal {S}=\lbrace 0, 2^{-3}, 2^{-2.5}, \ldots , 2^{5.5}, 2^6\rbrace\) throughout the article.
    8
    We use the search space \(\mathcal {S}=\lbrace 0, 2^{-2}, 2^{-1}, 2^{0}, 2^{1}\rbrace\) for \(\alpha\) in back-in-time hypergraph sampling problem.
    9
    The correlation coefficients for a hyperparameter are averaged over the datasets, the sampling portions, and the values of the other hyperparameter.
    10
    We vary \(\alpha\) within \(\lbrace 0, 2^{-2}, 2^{-1}, 2^{0}, 2^{1}\rbrace\) and \(\beta\) within \(\lbrace -2^{0},-2^{-1}, -2^{-2}, 0, 2^{-2}, 2^{-1}, 2^{0} \rbrace\) .
    11
    We measure signed differences. Specifically, for each of P1–P6, which are probability density functions, we measure \(\hat{F}(x^*) - F(x^*)\) where \(x^* = \text{arg\,max}_{x\in \mathcal {D}} \lbrace | \hat{F}(x) - F(x) |\rbrace\) , where \(F\) (or \(\hat{F}\) ) is the cumulative sum of the function \(f\) (respectively, \(\hat{f}\) ) for \(\mathcal {\bar{G}}\) (respectively, \(\mathcal {\hat{G}}\) ), and \(\mathcal {D}\) is the domain of \(f\) and \(\hat{f}\) . For each of P7–P10, which are scalars, we subtract the corresponding values of the ground-truth past snapshot from those of the sampled sub-hypergraph.
    12
    Similar to when computing the power-law fitness, we consider each hyperedge sampling order as a chronological arrival order of hyperedges.
    13
    In our implementation, in Algorithm 3, evolutionary properties critical for evaluating rejection conditions and calculating power-law fitness (e.g., the fraction of intersecting hyperedge pairs) are dynamically updated with each new hyperedge addition. For computational efficiency, these property values are recorded at regular intervals, specifically after every \(\lceil \frac{|\mathcal {E}|}{500} \rceil\) hyperedges are added, for subsequent utilization (i.e., rejection conditions and power-law fitness functions). Additionally, we maintain the sequence \(\mathcal {E}^{\prime }\) of hyperedges obtained with \(\alpha ^{\star }\) and \(\beta ^{\star }\) (i.e., those leading to the best fitness) in Algorithm 3 and reuse the first \(\lfloor |\mathcal {E}| \cdot p \rfloor\) hyperedges in the sequence as \(\mathcal {\hat{E}}\) in Algorithm 4. It is important to note that this modification enhances the efficiency of Algorithm 4 while maintaining its original semantics.
    14
    Adapting these methods to hypergraphs is non-trivial. Leskovec et al. [45], for instance, concentrated on the temporal order of nodes, but determining the order of hyperedges is crucial in the context of hypergraph sampling. Specifically, in Section 4.2, we demonstrate that selecting induced hyperedges based solely on the ground-truth order of node appearance (i.e., ONS) does not yield satisfactory results. Additionally, likelihood-based frameworks built on graph growth models [55, 74] are not easily applicable to hypergraphs due to the flexibility in hyperedge sizes.
    15
    These hypergraphs are characterized by node degrees that follow a power-law distribution with an exponent of \(-1\) ; hyperedge sizes following a power-law distribution with an exponent of \(-4\) and a maximum size at 100; and a node count of 1,000.
    16
For HyperFF, we use \((p, q) \in [(0.5, 0.2), (0.55, 0.15), (0.55, 0.2), (0.55, 0.25), (0.55, 0.3)]\). For HyperLap, we use \((d_{\alpha }, e_{\beta }, L) \in [(-0.75, -5.0, 4), (-1.0, -4.0, 3), (-1.25, -4.0, 3), (-1.25, -7.0, 3), (-1.25, -7.0, 4), (-1.5, -5.0, 3)]\). For THera, we use \((d_{\alpha }, e_{\beta }, c, p, \alpha) \in [(-0.75, -4.0, 8, 0.75, 9.0), (-0.75, -4.0, 8, 0.5, 9.0), (-1.25, -4.0, 8, 0.75, 9.0), (-1.25, -4.0, 8, 0.5, 9.0), (-1.0, -4.0, 8, 0.5, 9.0)]\). For HyperPA, we use \((d_{\alpha }, e_{\beta }) \in [(-0.5, -4.0), (-0.75, -2.0), (-0.75, -6.0), (-1.0, -5.0), (-1.25, -5.0)]\).

    A Appendix

A.1 Theoretical Analysis of Observation 3

We theoretically support Observation 3 by analyzing the relation between hyperedge weighting and the bias of degree distributions in samples.
    Definition 1.
    Let \(S\) be a hyperedge sampling algorithm and \(\phi _S(e) \ge 0\) be the weight of a hyperedge \(e\) for being selected by \(S\) . Then, we define the probability \(p_S(e)\) of \(e\) being selected by \(S\) as
    \begin{equation*} p_S(e) = \frac{\phi _S(e)^\alpha }{\sum _{e^{\prime } \in \mathcal {E}}\phi _S(e^{\prime })^\alpha } = \frac{1}{Z_S(\alpha)} \phi _S(e)^\alpha , \end{equation*}
    where \(Z_S(\alpha)\) is the normalization constant, and \(\alpha\) is a parameter.
    Definition 2.
    Given a sampling algorithm \(S\) , we denote by \(l_S(k)\) the probability of sampling a hyperedge that contains a node whose degree is lower than or equal to \(k\) from \(\mathcal {G}=(\mathcal {V},\mathcal {E})\) . That is,
    \begin{align*} l_S(k) = \sum \nolimits _{e \in \mathcal {E}(\mathcal {V}_{k})} p_S(e) \end{align*}
    where \(\mathcal {V}_{k} = \lbrace v \in \mathcal {V}: d(v) \le k \rbrace\) and \(\mathcal {E}({A}) = \lbrace e \in \mathcal {E}: e \cap A \ne \emptyset \rbrace\) .
Using \(l_S(k)\), we can define the probability of sampling a hyperedge that contains only nodes whose degrees are higher than \(k\), which we denote by \(h_S(k) = 1 - l_S(k)\). The quantity \(h_S(k)\) lets us compare the degree bias of different sampling algorithms:
    Definition 3.
    For any \(k\ge 0\) , if \(h_{A}(k) \lt h_{B}(k)\) , we say Algorithm \(B\) is more biased toward nodes with degree higher than \(k\) , compared to Algorithm \(A\) .
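To make Definitions 1–3 concrete, the following minimal sketch (illustrative names) computes \(l_S(k)\) and \(h_S(k)\) for the MiDaS-Basic weight \(\phi _S(e) = \min _{v \in e} d_{v}\).

from collections import Counter

def l_and_h(hyperedges, alpha, k):
    """Return (l_S(k), h_S(k)) for phi_S(e) = min node degree."""
    degree = Counter(v for e in hyperedges for v in e)
    weights = [min(degree[v] for v in e) ** alpha for e in hyperedges]
    Z = sum(weights)                          # normalization constant Z_S(alpha)
    # E(V_k): hyperedges containing at least one node of degree <= k
    l = sum(w for e, w in zip(hyperedges, weights)
            if any(degree[v] <= k for v in e)) / Z
    return l, 1.0 - l                         # h_S(k) = 1 - l_S(k)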
Selecting a node with a degree higher than \(k\) can be divided into two cases: (a) selecting a hyperedge where at least one node has a degree higher than \(k\) but not all (i.e., \(\lbrace e \in \mathcal {E}: \exists v \in e \text{ such that } d(v) \le k \text{ and } \exists v^{\prime } \in e \text{ such that } d(v^{\prime }) \gt k \rbrace\)) and (b) selecting a hyperedge where all nodes have degrees higher than \(k\) (i.e., \(\lbrace e \in \mathcal {E}: \forall v \in e, d(v) \gt k \rbrace\)). Because the former case also increases the probability of sampling a node with a degree less than or equal to \(k\), the latter case contributes more to a strong bias toward nodes with degrees higher than \(k\).
    Below, we are going to use \(M_{\omega }(\alpha)\) to refer to random hyperedge sampling with \(\omega (e)^{\alpha }\) as the hyperedge weight function. Note that the case of \(\alpha = 0\) (i.e., \(M_{\omega }(0)\) ) corresponds to RHS; and the case of \(\omega (e)=\min _{v \in e} d_{v}\) corresponds to MiDaS-Basic.
    Theorem 4.
    Given \(k\) , \(h_{M_{\omega }(\alpha)}(k)\) is an increasing function of \(\alpha\) , if \(k\) satisfies
    \begin{equation} \mathbb {MAX}_{e \in \mathcal {E}(\mathcal {V}_{k})} \ln \omega (e) \lt \mathbb {AVG}_{e \in \mathcal {E}} \ln \omega (e). \end{equation}
    (4)
    Proof.
\(h_{M_{\omega }(\alpha)}(k)\) is an increasing function of \(\alpha\) if \(\frac{\partial l_{M_{\omega }(\alpha)}(k)}{\partial \alpha }\lt 0\) holds for every \(\alpha\). This condition can be rearranged as
\begin{equation} \sum _{e \in \mathcal {E}(\mathcal {V}_{k})} \frac{\omega (e)^{\alpha } \ln \omega (e)}{ \sum _{e^{\prime } \in \mathcal {E}(\mathcal {V}_{k})} \omega (e^{\prime })^{\alpha }} \lt \sum _{e \in \mathcal {E}} \frac{\omega (e)^{\alpha } \ln \omega (e)}{\sum _{e^{\prime } \in \mathcal {E}} \omega (e^{\prime })^{\alpha }}. \tag{5} \end{equation}
For any set of hyperedges \(\mathcal {\hat{E}}\), the following holds for every \(\alpha \in [0, \infty)\):
\begin{equation} \mathbb {AVG}_{e \in \mathcal {\hat{E}}} \ln \omega (e) \le \sum _{e \in \mathcal {\hat{E}}} \frac{\omega (e)^{\alpha } \ln \omega (e)}{ \sum _{e^{\prime } \in \mathcal {\hat{E}}} \omega (e^{\prime })^{\alpha }} \le \mathbb {MAX}_{e \in \mathcal {\hat{E}}} \ln \omega (e). \tag{6} \end{equation}
By Equation (6), the left-hand side of Equation (5) is at most \(\mathbb {MAX}_{e \in \mathcal {E}(\mathcal {V}_{k})} \ln \omega (e)\), while the right-hand side is at least \(\mathbb {AVG}_{e \in \mathcal {E}} \ln \omega (e)\).
Thus, if Equation (4) is satisfied for the given \(k\), then \(\frac{\partial l_{M_{\omega }(\alpha)}(k)}{\partial \alpha }\lt 0\) holds for every \(\alpha\); that is, \(h_{M_{\omega }(\alpha)}(k)\) is an increasing function of \(\alpha\). □
    We have the following corollary and lemma from Theorem 4.
    Corollary 1.
For every \(k\) that satisfies Equation (4), MiDaS-Basic becomes more biased toward nodes with degrees higher than \(k\) as \(\alpha\) increases.
    Lemma 2.
Given any sampling algorithm, if \(k=k^{\prime }\) satisfies the condition of Equation (4), then every \(k \lt k^{\prime }\) also satisfies the condition.
    Proof.
The proof is straightforward: the left-hand side of Equation (4) is non-decreasing in \(k\), while the right-hand side does not depend on \(k\). □
We analyze MiDaS-Basic (i.e., \(\omega (e)=\min _{v \in e} d_{v}\)), MiDaS-Basic-Max (i.e., \(\omega (e)=\max _{v \in e} d_{v}\)), and MiDaS-Basic-Avg (i.e., \(\omega (e)=\mathrm{avg}_{v \in e} d_{v}\)), which are described in Appendix A.2. Specifically, we examine whether they have \(k\) satisfying Equation (4) in the 11 real-world hypergraphs. Based on Lemma 2, we examine \(k^{*}\), i.e., the maximum \(k\) value satisfying Equation (4). We find that MiDaS-Basic has such a \(k\) in all datasets; the results on three datasets are shown in Figure 13. In contrast, MiDaS-Basic-Max and MiDaS-Basic-Avg have no \(k\) satisfying Equation (4) in most datasets, and even when such \(k\) exists, their \(k^{*}\) values are smaller than those of MiDaS-Basic, as summarized below.
Fig. 13. Equation (4) is satisfied more easily (i.e., it is satisfied for a wider range of \(k\) values) in MiDaS-Basic than in its variants. Note that Equation (4) is a sufficient condition for the bias toward high-degree nodes to grow as \(\alpha\) increases.
    Observation 6.
On all 11 considered real-world hypergraphs, Equation (4) is satisfied for a wider range of \(k\) values in MiDaS-Basic than in its variants. Recall that Equation (4) is a sufficient condition for the bias toward high-degree nodes to grow as \(\alpha\) increases.
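As a hedged sketch of how this check can be reproduced (function names and data layout are ours), the code below computes \(k^{*}\), the maximum \(k\) satisfying Equation (4), for a given weight function \(\omega\); by Lemma 2, every \(k \lt k^{*}\) then satisfies Equation (4) as well.

```python
import numpy as np

def max_k_satisfying_eq4(hyperedges, degrees, omega):
    """k* = the largest k for which Equation (4) holds:
    MAX_{e in E(V_k)} ln omega(e) < AVG_{e in E} ln omega(e).
    Returns None if no k satisfies the condition."""
    log_w = np.log([omega(e) for e in hyperedges])
    avg_all = log_w.mean()
    # hyperedge e enters E(V_k) exactly when k reaches min_{v in e} d(v)
    entry = np.array([min(degrees[v] for v in e) for e in hyperedges])
    k_star, running_max = None, -np.inf
    for k in np.unique(entry):              # candidate k values, ascending
        running_max = max(running_max, log_w[entry == k].max())
        if running_max < avg_all:
            k_star = int(k)
        else:                               # LHS of Eq. (4) never decreases in k
            break
    return k_star

# e.g., omega = lambda e: min(degrees[v] for v in e)   # MiDaS-Basic
#       omega = lambda e: max(degrees[v] for v in e)   # MiDaS-Basic-Max
```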

    A.2 Ablation Study of MiDaS-Basic

Below, to justify the design choices made in MiDaS-Basic, we compare it with three variants (a sketch of their weighting and sampling schemes follows the list):
MiDaS-Basic-Max: This variant uses \((\max _{v \in e}d_{v})^\alpha\) for hyperedge weighting.
MiDaS-Basic-Avg: This variant uses \((\text{avg}_{v \in e} d_{v})^\alpha\) for hyperedge weighting.
MiDaS-Basic-NS: This variant draws nodes with probability proportional to \(d_{v}^{\alpha }\) and returns the induced sub-hypergraph.
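For concreteness, here is a minimal sketch of the three weighting schemes and of the node-sampling variant (all names are ours; implementation details in the paper may differ):

```python
import numpy as np

def variant_weight(e, degrees, scheme):
    """Hyperedge weight omega(e), before raising it to the power alpha."""
    ds = [degrees[v] for v in e]
    return {"min": min(ds),                     # MiDaS-Basic
            "max": max(ds),                     # MiDaS-Basic-Max
            "avg": sum(ds) / len(ds)}[scheme]   # MiDaS-Basic-Avg

def ns_sample(hyperedges, degrees, alpha, n_nodes, seed=None):
    """MiDaS-Basic-NS: draw nodes with probability proportional to d_v^alpha,
    then return the induced sub-hypergraph (hyperedges all of whose nodes
    were drawn)."""
    rng = np.random.default_rng(seed)
    nodes = list(degrees)
    p = np.array([degrees[v] for v in nodes], float) ** alpha
    p /= p.sum()
    idx = rng.choice(len(nodes), size=n_nodes, replace=False, p=p)
    kept = {nodes[i] for i in idx}
    return [e for e in hyperedges if set(e) <= kept]
```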
Examination of Observation 3: We check whether the variants also show the tendency that the bias of degree distributions in samples is controlled by \(\alpha\) (Observation 3) across all settings. As shown in Table 13, neither MiDaS-Basic-Max nor MiDaS-Basic-Avg fully exhibits this tendency; among the variants, only MiDaS-Basic-NS does.
Table 13. The Biases of Degree Distributions in Sub-hypergraphs Sampled by the Three Variants Cannot Be Fully Controlled by \(\alpha\) when the Sampling Portion is 0.3
Preservation of Degree Distribution: In Figure 14, we visually compare the degree distributions under the best-performing \(\alpha\) value, i.e., the one minimizing the degree D-statistic, for each variant. Consistent with the above, since only MiDaS-Basic and MiDaS-Basic-NS exhibit Observation 3, only these two methods yield degree distributions close to those of the original hypergraphs.
Fig. 14. Preservation of degree distribution: results under the best-performing \(\alpha\), which minimizes the degree D-statistic for each algorithm. Both MiDaS-Basic and MiDaS-Basic-NS preserve the degree distribution well.
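Assuming the degree D-statistic is the two-sample Kolmogorov–Smirnov statistic between the degree distributions of the original hypergraph and the sample, it can be computed as in this sketch (function name ours):

```python
from scipy.stats import ks_2samp

def degree_d_statistic(original_degrees, sampled_degrees):
    """Two-sample Kolmogorov-Smirnov D-statistic between two degree
    distributions; smaller values mean the sample preserves the
    original degree distribution better."""
    return ks_2samp(original_degrees, sampled_degrees).statistic
```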
Evaluation Table: Although MiDaS-Basic-NS preserves the degree distribution as well as MiDaS-Basic does, all three variants, including MiDaS-Basic-NS, are outperformed by MiDaS-Basic in most cases, as shown in Table 14. Moreover, MiDaS-Basic consistently outperforms its variants regardless of the sampling portion, as shown in Figure 15.
Table 14. MiDaS-Basic Gives Overall More Representative Samples than its Three Variants. We report rankings and Z-Scores (in parentheses) averaged over all 11 datasets and five different sampling portions (10%, 20%, 30%, 40%, and 50%).

| | MiDaS-Basic-Max | MiDaS-Basic-Avg | MiDaS-Basic-NS | MiDaS-Basic |
| --- | --- | --- | --- | --- |
| Degree | 2.85 (0.40) | 2.62 (0.24) | 2.15 (-0.20) | 1.91 (-0.45) |
| Int. Size | 1.91 (-0.34) | 2.07 (-0.18) | 3.27 (0.57) | 2.27 (-0.06) |
| Pair Degree | 2.07 (-0.13) | 2.04 (-0.32) | 2.85 (0.36) | 2.56 (0.09) |
| Size | 2.96 (0.53) | 2.33 (0.06) | 2.55 (0.14) | 1.69 (-0.73) |
| SV | 2.04 (-0.05) | 1.75 (-0.24) | 2.82 (0.41) | 2.16 (-0.07) |
| CC | 2.47 (0.39) | 2.04 (0.12) | 1.80 (-0.19) | 1.65 (-0.32) |
| GCC | 2.55 (0.01) | 2.24 (0.05) | 2.87 (0.34) | 1.87 (-0.41) |
| Density | 3.20 (0.65) | 2.69 (0.39) | 2.00 (-0.42) | 1.62 (-0.62) |
| Overlapness | 2.98 (0.46) | 2.55 (0.28) | 2.09 (-0.31) | 1.91 (-0.43) |
| Diameter | 2.64 (0.25) | 2.27 (-0.03) | 2.80 (0.24) | 1.82 (-0.46) |
| AVG | 2.57 (0.22) | 2.26 (0.04) | 2.52 (0.09) | 1.95 (-0.35) |
Fig. 15. MiDaS-Basic consistently outperforms its variants in terms of degree preservation, average rankings, and average Z-Scores. The results justify our design choices.

    A.3 Scalability of MiDaS and MiDaS-B

We analyze the scalability of MiDaS and MiDaS-B with respect to the input hypergraph size (i.e., the number of hyperedges in the input hypergraph). To this end, we generate synthetic hypergraphs of varying sizes, with up to \(10^9\) hyperedges, using HyperCL [42]. We report the running time of MiDaS aggregated across five distinct sampling portions (\(10\%, \ldots , 50\%\)) for each dataset, as in Section 3.4.4; as in Section 4.4.4, we measure the running time of MiDaS-B with a fixed sampling portion of 0.9. As shown by a regression-line slope close to 1 in the log-log scale in Figure 16(a), MiDaS exhibits linear scalability with respect to the number of hyperedges. In contrast, as shown in Figure 16(b), MiDaS-B exhibits super-linear scalability, lying between linear and quadratic growth rates (i.e., the slope is between 1 and 2 in the log-log scale); it successfully handles hypergraphs with up to \(10^7\) hyperedges. Recall that the super-linear complexity of MiDaS-B arises from computing the structural evolution over time; for a detailed mathematical analysis, refer to Theorem 3 and its proof.
Fig. 16. Scalability of MiDaS and MiDaS-B with respect to the input hypergraph size. MiDaS exhibits near-linear scalability with respect to the number of hyperedges (i.e., the slope is close to 1 in the log-log scale), and MiDaS-B exhibits super-linear scalability, lying between linear and quadratic growth rates.
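The slopes read off Figure 16 can be estimated by ordinary least squares in log-log space; a minimal sketch (ours):

```python
import numpy as np

def loglog_slope(num_hyperedges, runtimes):
    """Fit log(runtime) = slope * log(#hyperedges) + c.
    A slope near 1 indicates linear scalability; near 2, quadratic."""
    slope, _intercept = np.polyfit(np.log(num_hyperedges), np.log(runtimes), 1)
    return slope
```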

    A.4 Parameter Sensitivity of MiDaS-Basic and MiDaS-B-Basic

The impact of \(\alpha\) in MiDaS-Basic: We analyze the sensitivity of MiDaS-Basic to the parameter \(\alpha\), quantifying the impact of varying \(\alpha\) on both average node degrees and average hyperedge sizes within the sampled sub-hypergraph. As seen in Figure 17(a)–(b), in line with Observation 3, average node degrees within the sub-hypergraph noticeably increase as \(\alpha\) increases, whereas average hyperedge sizes change only slightly. This is because MiDaS-Basic samples each hyperedge \(e\) with probability proportional to \((\min _{v \in e} d_{v})^{\alpha }\), without explicit consideration of hyperedge sizes. Nonetheless, sampled hyperedge sizes tend to decrease slightly as \(\alpha\) increases, since smaller hyperedges are more likely to have large \(\min _{v \in e} d_{v}\) values.
Fig. 17. Parameter sensitivity of MiDaS-Basic in the email-eu and contact-primary datasets at a sampling portion of \(50\%\).
Further, we assess the degree D-statistic and the size D-statistic with respect to \(\alpha\). As seen in Figure 17(c)–(d), the best-performing \(\alpha\), i.e., the one yielding the smallest D-statistic, differs across properties. However, in line with Observation 2, Figure 17(e)–(f) shows that the optimal \(\alpha\) for achieving the smallest degree D-statistic closely aligns with that for achieving favorable average rankings and average Z-Scores. Note also that the optimal \(\alpha\) differs across datasets.
The impact of \(\alpha\) and \(\beta\) in MiDaS-B-Basic: We further explore the sensitivity of MiDaS-B-Basic to the parameters \(\alpha\) and \(\beta\). As seen in Equation (3), \(\alpha\) controls the inclination toward sampling nodes with higher degrees, while \(\beta\) controls the inclination toward sampling hyperedges of smaller sizes. This tendency is clearly observed in Figure 18(a)–(b). However, because smaller hyperedges are more likely to have large \(\min _{v \in e} d_{v}\) values, the average node degrees in the sub-hypergraph are influenced by both \(\alpha\) and \(\beta\).
Fig. 18. Parameter sensitivity of MiDaS-B-Basic in the email-eu and contact-primary datasets at a sampling portion of \(50\%\).
In addition, we analyze how the sampling quality is influenced by these parameters, focusing on degree D-statistics, size D-statistics, average rankings, and average Z-Scores. Unlike for MiDaS-Basic, parameter values yielding low degree D-statistics do not necessarily coincide with those achieving favorable average rankings and average Z-Scores; the best-performing parameter values differ across properties. Determining the optimal parameter values therefore requires considering degrees, sizes, and other properties collectively. Moreover, the best-performing parameter values also differ across datasets.
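Since no single property identifies the best \((\alpha, \beta)\), one practical (hypothetical) recipe is a small grid search that aggregates several D-statistics; `evaluate` below is an assumed callback that draws a sample with the given parameters and returns its per-property D-statistics.

```python
import itertools

def grid_search(evaluate, alphas, betas):
    """Pick the (alpha, beta) pair minimizing the average of the
    per-property D-statistics returned by evaluate(alpha, beta)."""
    best_pair, best_score = None, float("inf")
    for a, b in itertools.product(alphas, betas):
        stats = evaluate(a, b)   # e.g., {"degree": 0.12, "size": 0.08, ...}
        score = sum(stats.values()) / len(stats)  # simple average; rankings
        # or Z-Scores could be aggregated instead, as in the paper's tables
        if score < best_score:
            best_pair, best_score = (a, b), score
    return best_pair
```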

    A.5 Evaluation of MiDaS and MiDaS-B on Random Hypergraphs with Diverse Structures

To show the extended performance and versatility of MiDaS and MiDaS-B, we evaluate them on random hypergraph datasets. We generate 20 random hypergraphs using four hypergraph generators: HyperFF [35], HyperLap [42], THera [33], and HyperPA [19]. As their inputs, we use (a) node degrees following a power-law distribution with exponent \(d_{\alpha }\) and (b) hyperedge sizes following a power-law distribution with exponent \(e_{\beta }\). Specifically, we generate hypergraphs using HyperCL [42] with (a) and (b) as the inputs and extract all required inputs from the generated hypergraphs. We consider five distinct parameter (i.e., exponent) settings for each generator, resulting in 20 random hypergraphs with diverse structures. For HyperFF, we use \((p, q) \in [(0.5, 0.2), (0.55, 0.15), (0.55, 0.2), (0.55, 0.25), (0.55, 0.3)]\); for HyperLap, \((d_{\alpha }, e_{\beta }, L) \in [(-0.75, -5.0, 4), (-1.0, -4.0, 3), (-1.25, -4.0, 3), (-1.25, -7.0, 3), (-1.25, -7.0, 4), (-1.5, -5.0, 3)]\); for THera, \((d_{\alpha }, e_{\beta }, c, p, \alpha) \in [(-0.75, -4.0, 8, 0.75, 9.0), (-0.75, -4.0, 8, 0.5, 9.0), (-1.25, -4.0, 8, 0.75, 9.0), (-1.25, -4.0, 8, 0.5, 9.0), (-1.0, -4.0, 8, 0.5, 9.0)]\); and for HyperPA, \((d_{\alpha }, e_{\beta }) \in [(-0.5, -4.0), (-0.75, -2.0), (-0.75, -6.0), (-1.0, -5.0), (-1.25, -5.0)]\).
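As a hedged illustration of such power-law inputs (the cited generators' own sampling routines may differ), degree and size sequences can be drawn as follows:

```python
import numpy as np

def powerlaw_sequence(n, exponent, x_min=1, x_max=1000, seed=None):
    """Draw n integers with P(x) proportional to x^exponent on [x_min, x_max];
    exponent is negative (e.g., d_alpha for degrees, e_beta for sizes)."""
    rng = np.random.default_rng(seed)
    xs = np.arange(x_min, x_max + 1)
    p = xs.astype(float) ** exponent
    p /= p.sum()
    return rng.choice(xs, size=n, p=p)

# e.g., degrees = powerlaw_sequence(10_000, exponent=-1.0)
#       sizes   = powerlaw_sequence(2_000, exponent=-4.0, x_min=2)
```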
Performance of MiDaS: In the representative hypergraph sampling problem, we compare MiDaS with 14 competitors: six simple methods (RNS, RDN, RW, FF, RHS, and TIHS), two HRW-based methods [78], and six MGS variants [27]. We report their average performance across five distinct sampling portions (i.e., \(10\%, 20\%, 30\%, 40\%, 50\%\)) in Table 15. Notably, MiDaS demonstrates superior performance even on these random datasets, outperforming all 14 competitors in terms of both average rankings and average Z-Scores. Consistent with the results on real-world hypergraphs, MiDaS performs particularly well in preserving degrees, density, overlapness, and effective diameter.
Table 15. Representative Sampling from Random Hypergraphs. MiDaS yields overall the most representative sub-hypergraphs. We compare 15 sampling methods on 20 random hypergraphs with five different sampling portions.

| Property | Metric | RNS | RDN [46] | RW [64] | FF [35] | RHS | TIHS [2] | HRW-NB [78] | HRW-SK [78] | MGS-Deg-Add [27] | MGS-Deg-Rep [27] | MGS-Deg-Del [27] | MGS-Avg-Add [27] | MGS-Avg-Rep [27] | MGS-Avg-Del [27] | MiDaS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Degree | Dstat. | 0.380 | 0.406 | 0.382 | 0.374 | 0.467 | 0.389 | 0.494 | 0.497 | 0.344 | 0.296 | 0.358 | 0.373 | 0.322 | 0.350 | 0.262 |
| Degree | Rank | 8.200 | 8.810 | 7.886 | 7.467 | 13.352 | 8.010 | 9.990 | 10.048 | 7.790 | 4.324 | 7.905 | 8.533 | 6.219 | 6.962 | 4.219 |
| Degree | Z-Score | 0.177 | 0.195 | -0.031 | -0.128 | 1.055 | 0.034 | 0.561 | 0.598 | -0.198 | -0.781 | -0.022 | 0.061 | -0.506 | -0.121 | -0.893 |
| Int. Size | Dstat. | 0.033 | 0.017 | 0.016 | 0.018 | 0.009 | 0.014 | 0.224 | 0.225 | 0.015 | 0.034 | 0.022 | 0.005 | 0.007 | 0.020 | 0.013 |
| Int. Size | Rank | 10.533 | 8.171 | 8.162 | 7.990 | 4.362 | 7.590 | 14.190 | 14.019 | 5.743 | 8.819 | 8.371 | 2.867 | 3.667 | 7.752 | 7.581 |
| Int. Size | Z-Score | 0.038 | -0.314 | -0.303 | -0.290 | -0.588 | -0.372 | 2.024 | 2.028 | -0.493 | 0.282 | -0.183 | -0.706 | -0.609 | -0.088 | -0.427 |
| Pair Degree | Dstat. | 0.173 | 0.130 | 0.115 | 0.095 | 0.188 | 0.117 | 0.227 | 0.228 | 0.153 | 0.115 | 0.146 | 0.126 | 0.095 | 0.112 | 0.108 |
| Pair Degree | Rank | 11.905 | 7.943 | 6.105 | 5.667 | 13.457 | 6.390 | 8.600 | 8.600 | 9.943 | 7.667 | 8.924 | 7.610 | 4.876 | 6.257 | 5.914 |
| Pair Degree | Z-Score | 0.906 | -0.039 | -0.332 | -0.451 | 1.210 | -0.286 | 0.229 | 0.247 | 0.331 | -0.244 | 0.277 | -0.182 | -0.752 | -0.391 | -0.523 |
| Size | Dstat. | 0.145 | 0.109 | 0.085 | 0.056 | 0.021 | 0.097 | 0.362 | 0.365 | 0.033 | 0.057 | 0.070 | 0.008 | 0.006 | 0.031 | 0.066 |
| Size | Rank | 12.581 | 11.029 | 9.095 | 6.810 | 3.867 | 10.333 | 14.257 | 14.400 | 5.352 | 7.124 | 8.124 | 1.733 | 1.762 | 5.429 | 7.733 |
| Size | Z-Score | 0.585 | 0.198 | -0.103 | -0.404 | -0.720 | 0.072 | 2.033 | 2.081 | -0.594 | -0.264 | -0.137 | -0.898 | -0.921 | -0.615 | -0.313 |
| SV | Dstat. | 0.095 | 0.252 | 0.252 | 0.268 | 0.089 | 0.252 | 0.107 | 0.104 | 0.131 | 0.122 | 0.122 | 0.120 | 0.118 | 0.120 | 0.211 |
| SV | Rank | 4.181 | 10.705 | 10.638 | 12.190 | 2.981 | 10.838 | 6.381 | 6.362 | 6.933 | 7.333 | 5.857 | 6.829 | 6.733 | 6.600 | 10.305 |
| SV | Z-Score | -0.738 | 0.805 | 0.835 | 1.111 | -0.950 | 0.828 | -0.367 | -0.416 | -0.352 | -0.191 | -0.400 | -0.349 | -0.191 | -0.194 | 0.568 |
| CC | Dstat. | 0.007 | 0.000 | 0.001 | 0.002 | 0.017 | 0.001 | 0.000 | 0.001 | 0.006 | 0.003 | 0.008 | 0.008 | 0.003 | 0.005 | 0.006 |
| CC | Rank | 4.886 | 1.210 | 1.181 | 1.400 | 6.400 | 2.038 | 1.190 | 1.724 | 4.067 | 3.295 | 5.124 | 4.600 | 3.457 | 4.790 | 3.429 |
| CC | Z-Score | 0.298 | -0.356 | -0.361 | -0.335 | 0.616 | -0.195 | -0.372 | -0.279 | 0.094 | 0.015 | 0.325 | 0.186 | 0.021 | 0.285 | 0.061 |
| GCC | Diff. | 0.172 | 0.120 | 0.090 | 0.073 | 0.196 | 0.098 | 0.142 | 0.158 | 0.175 | 0.167 | 0.193 | 0.169 | 0.154 | 0.177 | 0.105 |
| GCC | Rank | 9.819 | 7.676 | 6.429 | 6.038 | 9.667 | 7.133 | 7.248 | 7.505 | 9.295 | 8.600 | 9.333 | 8.771 | 7.695 | 8.495 | 6.295 |
| GCC | Z-Score | 0.550 | -0.113 | -0.365 | -0.443 | 0.366 | -0.251 | -0.157 | -0.039 | 0.203 | 0.053 | 0.450 | 0.114 | -0.121 | 0.145 | -0.391 |
| Density | Diff. | 0.477 | 0.350 | 0.341 | 0.317 | 0.654 | 0.341 | 0.593 | 0.594 | 0.507 | 0.550 | 0.604 | 0.561 | 0.558 | 0.616 | 0.229 |
| Density | Rank | 6.552 | 4.552 | 4.590 | 3.933 | 13.562 | 4.400 | 10.343 | 10.295 | 8.057 | 8.657 | 11.210 | 9.124 | 9.457 | 11.257 | 2.610 |
| Density | Z-Score | -0.082 | -0.738 | -0.814 | -1.002 | 0.981 | -0.796 | 0.604 | 0.610 | 0.145 | 0.327 | 0.681 | 0.445 | 0.362 | 0.758 | -1.482 |
| Overlapness | Diff. | 0.563 | 0.382 | 0.378 | 0.348 | 0.668 | 0.372 | 0.527 | 0.528 | 0.536 | 0.550 | 0.591 | 0.583 | 0.579 | 0.616 | 0.296 |
| Overlapness | Rank | 8.981 | 4.619 | 4.524 | 3.829 | 14.305 | 4.362 | 7.524 | 7.619 | 9.571 | 8.657 | 10.800 | 10.638 | 10.200 | 11.467 | 2.895 |
| Overlapness | Z-Score | 0.381 | -0.684 | -0.780 | -1.086 | 1.125 | -0.768 | 0.130 | 0.139 | 0.341 | 0.192 | 0.640 | 0.606 | 0.410 | 0.735 | -1.381 |
| Diameter | Diff. | 0.217 | 0.204 | 0.193 | 0.190 | 0.343 | 0.198 | 0.113 | 0.120 | 0.238 | 0.136 | 0.220 | 0.289 | 0.197 | 0.202 | 0.108 |
| Diameter | Rank | 7.771 | 9.114 | 9.152 | 9.171 | 11.943 | 8.676 | 5.057 | 5.086 | 7.714 | 6.876 | 7.829 | 9.067 | 8.819 | 9.200 | 4.524 |
| Diameter | Z-Score | -0.070 | 0.282 | 0.261 | 0.260 | 0.890 | 0.218 | -0.726 | -0.671 | -0.095 | -0.235 | 0.011 | 0.283 | 0.176 | 0.201 | -0.785 |
| Average | Rank | 8.541 | 7.383 | 6.776 | 6.450 | 9.390 | 6.977 | 8.478 | 8.566 | 7.447 | 7.135 | 8.348 | 6.977 | 6.289 | 7.821 | 5.550 |
| Average | Z-Score | 0.204 | -0.076 | -0.199 | -0.277 | 0.398 | -0.152 | 0.396 | 0.430 | -0.062 | -0.085 | 0.164 | -0.044 | -0.213 | 0.071 | -0.557 |
Performance of MiDaS-B: In the back-in-time hypergraph sampling problem, we compare MiDaS-B with ten other back-in-time sampling algorithms. The performances reported in Table 16 are averaged across five distinct sampling portions (i.e., \(10\%, 30\%, 50\%, 70\%, 90\%\)). MiDaS-B achieves the second-best performance; only ONS outperforms it in terms of average rankings and Z-Scores. The outstanding performance of ONS can be attributed to the nature of hyperedges generated by HyperFF and HyperPA: whenever a new node is introduced, these generators subsequently produce one or more hyperedges that include the new node. ONS, which is designed to sample induced hyperedges by following the ground-truth generating order of nodes, performs well under these specific conditions. In practice, however, ground-truth node orders are seldom available, and MiDaS-B performs exceptionally well even without access to this information.
Table 16. Back-in-time Sampling from Random Hypergraphs. Among the 11 sampling methods evaluated on 20 random hypergraphs with five different sampling portions, MiDaS-B achieves the second-best performance, with ONS being the only algorithm outperforming it in terms of average rankings and Z-Scores. Note that MiDaS-B exhibits exceptional performance even without access to ground-truth node orders, while ONS requires them.

| Property | Metric | RNS | RDN [46] | RW [64] | FF [35] | ONS | RHS | TIHS [2] | HRW-NB [78] | HRW-SK [78] | MiDaS (w/o Oracle) | MiDaS-B |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Degree | Dstat. | 0.195 | 0.329 | 0.333 | 0.346 | 0.206 | 0.222 | 0.326 | 0.211 | 0.207 | 0.199 | 0.189 |
| Degree | Rank | 4.505 | 7.419 | 7.905 | 7.248 | 5.267 | 5.886 | 7.533 | 5.257 | 5.143 | 4.981 | 4.838 |
| Degree | Z-Score | -0.425 | 0.573 | 0.664 | 0.508 | -0.224 | -0.123 | 0.602 | -0.211 | -0.255 | -0.496 | -0.614 |
| Int. Size | Dstat. | 0.028 | 0.014 | 0.019 | 0.019 | 0.016 | 0.021 | 0.014 | 0.064 | 0.061 | 0.015 | 0.013 |
| Int. Size | Rank | 7.438 | 5.248 | 5.962 | 5.086 | 3.505 | 6.000 | 5.286 | 9.067 | 8.619 | 4.676 | 5.114 |
| Int. Size | Z-Score | 0.334 | -0.291 | -0.146 | -0.378 | -0.835 | -0.016 | -0.273 | 1.161 | 1.069 | -0.375 | -0.249 |
| Pair Degree | Dstat. | 0.106 | 0.099 | 0.104 | 0.103 | 0.076 | 0.110 | 0.092 | 0.084 | 0.085 | 0.103 | 0.109 |
| Pair Degree | Rank | 6.219 | 6.124 | 6.238 | 6.305 | 4.714 | 6.600 | 5.581 | 5.752 | 5.933 | 6.114 | 6.419 |
| Pair Degree | Z-Score | 0.093 | -0.000 | 0.058 | 0.158 | -0.353 | 0.168 | -0.122 | -0.064 | -0.010 | -0.017 | 0.089 |
| Size | Dstat. | 0.075 | 0.056 | 0.053 | 0.044 | 0.033 | 0.055 | 0.053 | 0.193 | 0.192 | 0.042 | 0.039 |
| Size | Rank | 6.705 | 5.676 | 5.190 | 4.324 | 3.181 | 6.095 | 5.486 | 9.981 | 9.943 | 4.562 | 4.657 |
| Size | Z-Score | 0.075 | -0.258 | -0.390 | -0.516 | -0.676 | -0.041 | -0.276 | 1.563 | 1.554 | -0.575 | -0.459 |
| SV | Dstat. | 0.094 | 0.167 | 0.167 | 0.176 | 0.076 | 0.097 | 0.166 | 0.103 | 0.104 | 0.089 | 0.085 |
| SV | Rank | 5.257 | 6.695 | 6.629 | 6.962 | 4.648 | 4.524 | 6.943 | 5.352 | 5.600 | 4.781 | 4.057 |
| SV | Z-Score | -0.216 | 0.447 | 0.451 | 0.458 | -0.247 | -0.153 | 0.459 | -0.109 | -0.114 | -0.480 | -0.589 |
| CC | Dstat. | 0.009 | 0.006 | 0.006 | 0.007 | 0.006 | 0.010 | 0.006 | 0.006 | 0.006 | 0.012 | 0.014 |
| CC | Rank | 3.248 | 1.362 | 1.286 | 1.486 | 1.143 | 3.590 | 1.648 | 1.295 | 1.724 | 4.686 | 4.705 |
| CC | Z-Score | 0.143 | -0.249 | -0.250 | -0.213 | -0.266 | 0.254 | -0.202 | -0.251 | -0.190 | 0.559 | 0.565 |
| GCC | Diff. | 0.068 | 0.071 | 0.080 | 0.090 | 0.070 | 0.092 | 0.071 | 0.126 | 0.122 | 0.098 | 0.096 |
| GCC | Rank | 5.162 | 5.733 | 6.333 | 6.619 | 4.152 | 5.505 | 5.657 | 7.676 | 7.619 | 5.771 | 5.771 |
| GCC | Z-Score | -0.322 | -0.130 | 0.049 | 0.052 | -0.372 | -0.176 | -0.144 | 0.613 | 0.583 | -0.059 | -0.094 |
| Density | Diff. | 1.239 | 12.929 | 12.735 | 15.717 | 0.981 | 4.019 | 12.859 | 4.879 | 4.840 | 3.469 | 3.393 |
| Density | Rank | 3.648 | 7.524 | 7.571 | 7.600 | 5.219 | 5.590 | 7.514 | 6.190 | 6.067 | 4.705 | 4.019 |
| Density | Z-Score | -0.669 | 0.509 | 0.549 | 0.563 | -0.243 | -0.169 | 0.495 | -0.060 | -0.053 | -0.415 | -0.505 |
| Diameter | Diff. | 0.310 | 0.609 | 0.647 | 0.612 | 0.398 | 0.424 | 0.594 | 0.320 | 0.334 | 0.403 | 0.407 |
| Diameter | Rank | 4.400 | 7.543 | 8.162 | 7.800 | 4.895 | 5.867 | 7.410 | 4.543 | 4.810 | 4.962 | 5.610 |
| Diameter | Z-Score | -0.504 | 0.562 | 0.700 | 0.515 | -0.298 | -0.113 | 0.557 | -0.462 | -0.422 | -0.289 | -0.246 |
| Overlapness | Diff. | 5.721 | 27.916 | 27.924 | 35.639 | 2.655 | 9.603 | 27.828 | 8.391 | 8.341 | 8.656 | 8.822 |
| Overlapness | Rank | 4.390 | 7.581 | 8.200 | 8.267 | 5.524 | 5.610 | 7.590 | 4.410 | 4.333 | 5.114 | 4.943 |
| Overlapness | Z-Score | -0.471 | 0.544 | 0.705 | 0.797 | -0.063 | -0.273 | 0.539 | -0.571 | -0.541 | -0.275 | -0.390 |
| Average | Rank | 5.097 | 6.090 | 6.348 | 6.170 | 4.225 | 5.527 | 6.065 | 5.952 | 5.979 | 5.035 | 5.013 |
| Average | Z-Score | -0.196 | 0.171 | 0.239 | 0.194 | -0.358 | -0.064 | 0.163 | 0.161 | 0.162 | -0.242 | -0.249 |

    References

    [1]
    Charu C. Aggarwal, Yuchen Zhao, and S. Yu Philip. 2011. Outlier detection in graph streams. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering.
    [2]
    Nesreen Ahmed, Jennifer Neville, and Ramana Rao Kompella. 2011. Network sampling via edge-based node selection with graph induction. Department of Computer Science Technical Reports 11-016 (2011), 1747–1756. Retrieved from https://docs.lib.purdue.edu/cstech/1747/
    [3]
    Alessia Antelmi, Gennaro Cordasco, Mirko Polato, Vittorio Scarano, Carmine Spagnuolo, and Dingqi Yang. 2023. A survey on hypergraph representation learning. Computing Surveys 56, 1 (2023), 1–38.
    [4]
Naheed Anjum Arafat, Debabrota Basu, Laurent Decreusefond, and Stéphane Bressan. 2020. Construction and random generation of hypergraphs with prescribed degree and dimension sequences. In Database and Expert Systems Applications: 31st International Conference, DEXA 2020, Bratislava, Slovakia, September 14–17, 2020, Proceedings, Part II 31.
    [5]
    Austin R. Benson, Rediet Abebe, Michael T. Schaub, Ali Jadbabaie, and Jon Kleinberg. 2018. Simplicial closure and higher-order link prediction. Proceedings of the National Academy of Sciences 115, 48 (2018), E11221–E11230.
    [6]
    Austin R. Benson, Ravi Kumar, and Andrew Tomkins. 2018. Sequences of sets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
    [7]
    Enrico Bertini and Giuseppe Santucci. 2011. Improving visual analytics environments through a methodological framework for automatic clutter reduction. Journal of Visual Languages and Computing 22, 3 (2011), 194–212.
    [8]
    Giulia Cencetti, Federico Battiston, Bruno Lepri, and Márton Karsai. 2021. Temporal properties of higher-order interactions in social networks. Scientific Reports 11, 1 (2021), 1–10.
    [9]
    Jie Chen, Tengfei Ma, and Cao Xiao. 2018. Fastgcn: Fast learning with graph convolutional networks via importance sampling. In Proceedings of the International Conference on Learning Representations.
    [10]
    Jianfei Chen, Jun Zhu, and Le Song. 2018. Stochastic training of graph convolutional networks with variance reduction. arXiv:1710.10568. Retrieved from https://arxiv.org/abs/1710.10568
    [11]
    Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. 2019. Cluster-gcn: An efficient algorithm for training deep and large graph convolutional networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
    [12]
    Uthsav Chitra and Benjamin Raphael. 2019. Random walks on hypergraphs with edge-dependent vertex weights. In Proceedings of the International Conference on Machine Learning.
    [13]
    Philip S. Chodrow. 2020. Configuration models of random hypergraphs. Journal of Complex Networks 8, 3 (2020), cnaa018.
    [14]
    Minyoung Choe, Jaemin Yoo, Geon Lee, Woonsung Baek, U. Kang, and Kijung Shin. 2022. Midas: Representative sampling from real-world hypergraphs. In Proceedings of the ACM Web Conference 2022.
    [15]
    Hyunjin Choo and Kijung Shin. 2022. On the persistence of higher-order interactions in real-world hypergraphs. In Proceedings of the 2022 SIAM International Conference on Data Mining.
    [16]
    Cazamere Comrie and Jon Kleinberg. 2021. Hypergraph ego-networks and their temporal evolution. In Proceedings of the 2021 IEEE International Conference on Data Mining.
    [17]
    Qingguang Cui, Matthew Ward, Elke Rundensteiner, and Jing Yang. 2006. Measuring data abstraction quality in multiresolution visualizations. IEEE Transactions on Visualization and Computer Graphics 12, 5 (2006), 709–716.
    [18]
    Antoine Deza, Asaf Levin, Syed M. Meesum, and Shmuel Onn. 2019. Hypergraphic degree sequences are hard. Bulletin of the European Association for Theoretical Computer Science 127 (2019), 63–64.
    [19]
    Manh Tuan Do, Se-eun Yoon, Bryan Hooi, and Kijung Shin. 2020. Structural patterns and generative models of real-world hypergraphs. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
    [20]
    Martin Dyer, Catherine Greenhill, Pieter Kleer, James Ross, and Leen Stougie. 2021. Sampling hypergraphs with given degrees. Discrete Mathematics 344, 11 (2021), 112566.
    [21]
    Dhivya Eswaran and Christos Faloutsos. 2018. Sedanspot: Detecting anomalies in edge streams. In Proceedings of the 2018 IEEE International Conference on Data Mining.
    [22]
Yifan Feng, Haoxuan You, Zizhao Zhang, Rongrong Ji, and Yue Gao. 2019. Hypergraph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence.
    [23]
    Anna C. Gilbert and Kirill Levchenko. 2004. Compressing network graphs. In Proceedings of the LinkKDD workshop at the 10th ACM Conference on KDD.
    [24]
    Minas Gjoka, Maciej Kurant, Carter T. Butts, and Athina Markopoulou. 2011. Practical recommendations on crawling online social networks. IEEE Journal on Selected Areas in Communications 29, 9 (2011), 1872–1892.
    [25]
    William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Proceedings of the Advances in Neural Information Processing Systems.
    [26]
    Shuguang Hu, Xiaowei Wu, and TH Hubert Chan. 2017. Maintaining densest subsets efficiently in evolving hypergraphs. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management.
    [27]
    Christian Hübler, Hans-Peter Kriegel, Karsten Borgwardt, and Zoubin Ghahramani. 2008. Metropolis algorithms for representative subgraph sampling. In Proceedings of the 2008 8th IEEE International Conference on Data Mining.
    [28]
    Yuntao Jia, Jared Hoberock, Michael Garland, and John Hart. 2008. On the visualization of social and other scale-free networks. IEEE Transactions on Visualization and Computer Graphics 14, 6 (2008), 1285–1292.
    [29]
    Jianwen Jiang, Yuxuan Wei, Yifan Feng, Jingxuan Cao, and Yue Gao. 2019. Dynamic hypergraph neural networks. In Proceedings of the International Joint Conference on Artificial Intelligence. 2635–2641.
    [30]
    Wei Jin, Lingxiao Zhao, Shichang Zhang, Yozen Liu, Jiliang Tang, and Neil Shah. 2022. Graph condensation for graph neural networks. In Proceedings of the International Conference on Learning Representations.
    [31]
    David R. Karger. 1994. Random sampling in cut, flow, and network design problems. In Proceedings of the 26th Annual ACM Symposium on Theory of Computing.
    [32]
    Sunwoo Kim, Fanchen Bu, Minyoung Choe, Jaemin Yoo, and Kijung Shin. 2023. How transitive are real-world group interactions?–Measurement and reproduction. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
    [33]
Sunwoo Kim, Minyoung Choe, Jaemin Yoo, and Kijung Shin. 2023. Reciprocity in directed hypergraphs: Measures, findings, and generators. Data Mining and Knowledge Discovery 37, 6 (2023), 2330–2388.
    [34]
    Bryan Klimt and Yiming Yang. 2004. The enron corpus: A new dataset for email classification research. In Proceedings of the European Conference on Machine Learning.
    [35]
    Yunbum Kook, Jihoon Ko, and Kijung Shin. 2020. Evolution of real-world hypergraphs: Patterns and models without oracles. In Proceedings of the 2020 IEEE International Conference on Data Mining.
    [36]
    Vaishnavi Krishnamurthy, Michalis Faloutsos, Marek Chrobak, Jun-Hong Cui, Li Lao, and Allon G. Percus. 2007. Sampling large internet topologies for simulation purposes. Computer Networks 51, 15 (2007), 4284–4302.
    [37]
    Vaishnavi Krishnamurthy, Michalis Faloutsos, Marek Chrobak, Li Lao, J-H Cui, and Allon G. Percus. 2005. Reducing large internet topologies for faster simulations. In Proceedings of the International Conference on Research in Networking.
    [38]
    Maciej Kurant, Minas Gjoka, Yan Wang, Zack W Almquist, Carter T. Butts, and Athina Markopoulou. 2012. Coarse-grained topology estimation via graph sampling. In Proceedings of the 2012 ACM Workshop on Workshop on Online Social Networks.
    [39]
    Chul-Ho Lee, Xin Xu, and Do Young Eun. 2012. Beyond random walk and metropolis-hastings samplers: Why you should not backtrack for unbiased graph sampling. ACM SIGMETRICS Performance Evaluation Review 40, 1 (2012), 319–330.
    [40]
    Dongjin Lee, Kijung Shin, and Christos Faloutsos. 2020. Temporal locality-aware sampling for accurate triangle counting in real graph streams. The VLDB Journal 29, 6 (2020), 1501–1525.
    [41]
    Geon Lee, Fanchen Bu, Tina Eliassi-Rad, and Kijung Shin. 2024. A survey on hypergraph mining: Patterns, tools, and generators. arXiv:2401.08878. Retrieved from https://arxiv.org/abs/2401.08878
    [42]
    Geon Lee, Minyoung Choe, and Kijung Shin. 2021. How do hyperedges overlap in real-world hypergraphs?-Patterns, measures, and generators. In Proceedings of the Web Conference 2021.
    [43]
    Geon Lee, Jihoon Ko, and Kijung Shin. 2020. Hypergraph motifs: Concepts, algorithms, and discoveries. PVLDB 13, 11 (2020), 2256–2269.
    [44]
    Geon Lee and Kijung Shin. 2021. Thyme+: Temporal hypergraph motifs and fast algorithms for exact counting. In Proceedings of the 2021 IEEE International Conference on Data Mining.
    [45]
    Jure Leskovec, Lars Backstrom, Ravi Kumar, and Andrew Tomkins. 2008. Microscopic evolution of social networks. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
    [46]
    Jure Leskovec and Christos Faloutsos. 2006. Sampling from large graphs. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
    [47]
    Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. 2005. Graphs over time: Densification laws, shrinking diameters and possible explanations. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining.
    [48]
    Fangfang Li, Zhi Liu, Junwen Duan, Xingliang Mao, Heyuan Shi, and Shichao Zhang. 2023. Exploiting conversation-branch-tweet hypergraph structure to detect misinformation on social media. ACM Transactions on Knowledge Discovery from Data 18, 2 (2023), 1–20.
    [49]
    Jiaye Li, Jian Zhang, Jilian Zhang, and Shichao Zhang. 2023. Quantum KNN classification with K Value selection and neighbor selection. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2023).
    [50]
    Rong-Hua Li, Jeffrey Xu Yu, Lu Qin, Rui Mao, and Tan Jin. 2015. On random walk based graph sampling. In Proceedings of the 2015 IEEE 31st International Conference on Data Engineering.
    [51]
    Yongsub Lim and U. Kang. 2015. Mascot: Memory-efficient and accurate sampling for counting local triangles in graph streams. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
    [52]
    Jing Ma, Ruocheng Guo, Mengting Wan, Longqi Yang, Aidong Zhang, and Jundong Li. 2022. Learning fair node representations with graph counterfactual fairness. In Proceedings of the 15th ACM International Conference on Web Search and Data Mining.
    [53]
    Arun S. Maiya and Tanya Y. Berger-Wolf. 2010. Sampling community structure. In Proceedings of the 19th International Conference on World Wide Web.
    [54]
    Rossana Mastrandrea, Julie Fournet, and Alain Barrat. 2015. Contact patterns in a high school: A comparison between data collected using wearable sensors, contact diaries and friendship surveys. PloS One 10, 9 (2015), e0136497.
    [55]
    Saket Navlakha and Carl Kingsford. 2011. Network archaeology: Uncovering ancient networks from present-day interactions. PLoS Computational Biology 7, 4 (2011), e1001119.
    [56]
    David Peleg and Alejandro A Schäffer. 1989. Graph spanners. Journal of Graph Theory 13, 1 (1989), 99–116.
    [57]
    Davood Rafiei. 2005. Effectively visualizing large networks through sampling. In Proceedings of the IEEE Visualization.
    [58]
    Stuart Russell and Peter Norvig. 2002. Artificial Intelligence: A Modern Approach. Pearson.
    [59]
    C. Seshadhri, Ali Pinar, and Tamara G. Kolda. 2014. Wedge sampling for computing clustering coefficients and triangle counts on large graphs. Statistical Analysis and Data Mining: The ASA Data Science Journal 7, 4 (2014), 294–307.
    [60]
    Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June Hsu, and Kuansan Wang. 2015. An overview of microsoft academic service (mas) and applications. In Proceedings of the 24th International Conference on World Wide Web.
    [61]
    Daniel A. Spielman and Shang-Hua Teng. 2011. Spectral sparsification of graphs. SIAM Journal on Computing 40, 4 (2011), 981–1025.
    [62]
    Lorenzo De Stefani, Alessandro Epasto, Matteo Riondato, and Eli Upfal. 2017. Triest: Counting local and global triangles in fully dynamic streams with fixed memory size. ACM Transactions on Knowledge Discovery from Data 11, 4 (2017), 1–50.
    [63]
    Juliette Stehlé, Nicolas Voirin, Alain Barrat, Ciro Cattuto, Lorenzo Isella, Jean-François Pinton, Marco Quaggiotto, Wouter Van den Broeck, Corinne Régis, Bruno Lina, et al. 2011. High-resolution measurements of face-to-face contact patterns in a primary school. PloS One 6, 8 (2011), e23176.
    [64]
    Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan. 2008. Random walk with restart: Fast solutions and applications. Knowledge and Information Systems 14, 3 (2008), 327–346.
    [65]
    Elli Voudigari, Nikos Salamanos, Theodore Papageorgiou, and Emmanuel J. Yannakoudakis. 2016. Rank degree: An efficient algorithm for graph sampling. In Proceedings of the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining.
    [66]
    Michael E. Wall, Andreas Rechtsteiner, and Luis M. Rocha. 2003. Singular value decomposition and principal component analysis. In A Practical Approach to Microarray Data Analysis (2003), 91–109.
    [67]
    Jianling Wang, Kaize Ding, Liangjie Hong, Huan Liu, and James Caverlee. 2020. Next-item recommendation with sequential hypergraphs. In Proceedings of the 43rd international ACM SIGIR Conference on Research and Development in Information Retrieval.
    [68]
Michael M. Wolf, Alicia M. Klinvex, and Daniel M. Dunlavy. 2016. Advantages to modeling relational data using hypergraphs versus graphs. In Proceedings of the 2016 IEEE High Performance Extreme Computing Conference.
    [69]
Xin Xu, Chul-Ho Lee, and Do Young Eun. 2014. A general framework of hybrid graph sampling for complex network analysis. In Proceedings of the IEEE International Conference on Computer Communications. 2795–2803.
    [70]
    Dingqi Yang, Bingqing Qu, Jie Yang, and Philippe Cudre-Mauroux. 2019. Revisiting user mobility and social relationships in lbsns: A hypergraph embedding approach. In Proceedings of the World Wide Web Conference.
    [71]
    Hao Yin, Austin R. Benson, Jure Leskovec, and David F. Gleich. 2017. Local higher-order graph clustering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
    [72]
Jaemin Yoo, U. Kang, Mauro Scanagatta, Giorgio Corani, and Marco Zaffalon. 2020. Sampling subgraphs with guaranteed treewidth for accurate and efficient graphical inference. In Proceedings of the 13th International Conference on Web Search and Data Mining.
    [73]
    Se-eun Yoon, Hyungseok Song, Kijung Shin, and Yung Yi. 2020. How much and when do we need higher-order information in hypergraphs? A case study on hyperedge prediction. In Proceedings of the Web Conference 2020.
    [74]
    Jean-Gabriel Young, Guillaume St-Onge, Edward Laurence, Charles Murphy, Laurent Hébert-Dufresne, and Patrick Desrosiers. 2019. Phase transition in the recoverability of network history. Physical Review X 9, 4 (2019), 041056.
    [75]
    Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, and Viktor Prasanna. 2020. Graphsaint: Graph sampling based inductive learning method. In Proceedings of the International Conference on Learning Representations.
    [76]
    Guixian Zhang, Debo Cheng, Guan Yuan, and Shichao Zhang. 2024. Learning fair representations via rebalancing graph structure. Information Processing & Management 61, 1 (2024), 103570.
    [77]
    Guixian Zhang, Debo Cheng, and Shichao Zhang. 2023. Fpgnn: Fair path graph neural network for mitigating discrimination. World Wide Web 26, 5 (2023), 3119–3136.
    [78]
    Lingling Zhang, Zhiwei Zhang, Guoren Wang, and Ye Yuan. 2023. Efficiently sampling and estimating hypergraphs by hybrid random walk. In Proceedings of the 2023 IEEE 39th International Conference on Data Engineering.
    [79]
Shichao Zhang and Jiaye Li. 2023. KNN classification with one-step computation. IEEE Transactions on Knowledge and Data Engineering 35, 3 (2023), 2711–2723.
    [80]
Shichao Zhang, Jiaye Li, and Yangding Li. 2023. Reachable distance function for KNN classification. IEEE Transactions on Knowledge and Data Engineering 35, 7 (2023), 7382–7396.
    [81]
    Shichao Zhang, Jiaye Li, Wenzhen Zhang, and Yongsong Qin. 2022. Hyper-class representation of data. Neurocomputing 503, C (2022), 200–218.
    [82]
    Shichao Zhang, Xuelong Li, Ming Zong, Xiaofeng Zhu, and Ruili Wang. 2017. Efficient kNN classification with different numbers of nearest neighbors. IEEE Transactions on Neural Networks and Learning Systems 29, 5 (2017), 1774–1785.
    [83]
    Peixiang Zhao, Charu Aggarwal, and Gewen He. 2016. Link prediction in graph streams. In Proceedings of the 2016 IEEE 32nd International Conference on Data Engineering.
    [84]
    Xiaofeng Zhu, Shichao Zhang, Yonghua Zhu, Pengfei Zhu, and Yue Gao. 2020. Unsupervised spectral feature selection with dynamic hyper-graph learning. IEEE Transactions on Knowledge and Data Engineering 34, 6 (2020), 3016–3028.
    [85]
    Difan Zou, Ziniu Hu, Yewen Wang, Song Jiang, Yizhou Sun, and Quanquan Gu. 2019. Layer-dependent importance sampling for training deep and large graph convolutional networks. In Proceedings of the Advances in Neural Information Processing Systems.
