
Private Graph Data Release: A Survey

Published: 22 February 2023

Abstract

The application of graph analytics to various domains has yielded tremendous societal and economic benefits in recent years. However, the increasingly widespread adoption of graph analytics comes with a commensurate increase in the need to protect private information in graph data, especially in light of the many privacy breaches in real-world releases of graph data that were supposed to protect sensitive information. This article provides a comprehensive survey of private graph data release algorithms that seek to achieve the fine balance between privacy and utility, with a specific focus on provably private mechanisms. Many of these mechanisms are natural extensions of the Differential Privacy framework to graph data, but we also investigate more general privacy formulations like Pufferfish Privacy that address some of the limitations of Differential Privacy. We also provide a wide-ranging survey of the applications of private graph data release mechanisms to social networks, finance, supply chain, and health care. This article should benefit practitioners and researchers alike in the increasingly important area of private analytics and data release.

1 Introduction

Graph analytics refers to the methods and tools used to study and understand the relationships that exist between vertices and edges within and between graphs [106]. In contrast to traditional analytics on tabular data, graph analytics is concerned with the flows, the structure, and the relationships between the vertices and the edges that compose graph data. Examples of common statistics of interest for such graph data include node degrees and the related degree distribution, centrality metrics (e.g., degree, closeness, betweenness, etc.), subgraph counts (e.g., triangle, \(k\)-star, etc.), and various distance metrics (e.g., diameter, eccentricity, etc.) [131].
The application of graph analytics to various domains has yielded tremendous societal and economic benefits. Indeed, they are used to understand and track the spread of diseases within communities; to study proteins, chemical configurations, and interactions in the design of novel medicines; to uncover irregular fraud patterns in financial records; and to increase the resilience of supply chains in various sectors such as agrifood, healthcare, and manufacturing [82, 116, 150]. As in traditional data analytics, these benefits are even greater when graph analytics are applied to collections of graph data from various stakeholders, e.g., uncovering international money laundering by analyzing graph data from different financial institutions across countries [158]. In References [93, 124], the authors estimate the global value of graph analytics to be about USD 600 million in 2019. They further project that this value could grow to USD 2 to 3 billion by 2026.
One critical challenge in having graph data available across different parties is trust. The parties making graph data available need to trust that confidential and private information within such data will be protected, whereas the analysts using such data need to trust that they still contain the information required for their needs. Such trust is challenging to achieve and maintain. Indeed, there have been many cases where the release of graph data has led to the inadvertent exposure of sensitive personal information. For example, specific graph analytics applied to an easily available social network graph could accurately predict an individual’s sexual orientation [59]. Further works proposed attacks to extract sensitive information from graphs that have been treated with some form of anonymization, including identity disclosure and attribute disclosure [8, 9, 20, 60].
Several authors have studied the fine balance between privacy and utility for various types of data [16, 85, 132]. The privacy mechanisms they propose are often categorized as either non-provable or provable [60, 161]. Provable privacy mechanisms offer mathematical guarantees about the privacy protection or the utility that they can provide. Non-provable privacy mechanisms do not provide such strong theoretical guarantees and thus are more empirical in nature. Many of these provable methods are based on a formal definition of privacy known as Differential Privacy (DP) [34]. Differential Privacy is a property of a release mechanism, i.e., an algorithm that can be used to publish information about a dataset. A differentially private release mechanism essentially guarantees that its outputs (e.g., a synthetic dataset or a query answer) will be almost indistinguishable from the output it would produce if any one individual’s data were removed from the dataset. As such, differentially private release mechanisms provide a degree of plausible deniability for every individual whose data may be included in the dataset.
Some surveys have studied the category of non-provable privacy mechanisms for graph data. For example, Reference [161] catalogs several non-provable methods into three classes, namely \(k\)-anonymity, probabilistic, and generalization. Similarly, Reference [20] describes non-provable approaches for the release of entire graphs, thus not considering the release of graph statistics or metrics only. More recent surveys have included some descriptions of provable privacy mechanisms for graph data. For example, Reference [60] briefly describes the emerging DP-based mechanisms for graph data at the time and recognizes that DP-based methods for graph data were still in their infancy. Another study [3] provides a more thorough survey of some DP-based graph mechanisms, including edge DP and node DP methods. However, it focuses solely on DP-based mechanisms that were proposed in the context of social network graphs and does not discuss other domain applications. A recent preprint [62] also describes DP-based mechanisms only for social network graphs. Finally, a recently published survey on privacy preserving data publishing [92] also has a dedicated section on graph data. It also considers only the application domain of social networks. Furthermore, it does not distinguish between provable and non-provable approaches and provides only a brief overview of DP-based mechanisms and no mention of other provable (but non-DP-based) methods. Thus, to the best of our knowledge, there has not yet been a thorough survey of provable privacy mechanisms for graph data, both DP and non-DP, and which considers a wider set of domain applications. This article fills that gap. It provides a comprehensive study of the published state-of-the-art methods used to provide provable privacy guarantees for graph analytics in a variety of application domains.
To guide our survey of privacy mechanisms for graph data, we propose a tree-based classification of the papers that we review in this article. At a high level, this classification differentiates between mechanisms that release an entire privacy-enhanced graph and others that release only statistics or query responses on given graph data. We then further differentiate between non-provable and provable mechanisms. The goal of this proposed taxonomy is to allow data custodians to easily navigate the large list of existing mechanisms and identify the mechanisms that could be fit-for-purpose for their needs. We acknowledge that other taxonomies could have been equally adopted and that some of these alternatives may be more useful for other readers of this survey (e.g., scientists looking for knowledge gaps in the area). Thus, we also discuss another alternative classification of some existing works along different application domains.
The remainder of this article is organized as follows. Section 2 presents our methodology. It introduces the taxonomy used for the surveyed contributions, describes the different related categories in that taxonomy, and describes the criteria for a given work to be assigned to a specific category. Section 3 focuses on mechanisms that return responses to specific statistical queries on graph data. In contrast, Section 4 focuses on mechanisms that return entire graphs. While Sections 3 and 4 specifically discuss non-provable and DP-based provable mechanisms, Section 5 introduces other provable privacy definitions, which address some of the shortcomings of DP in the context of graph data. Section 6 provides an overview of the different domain applications where the previously surveyed contributions may be applied and some example use cases. Section 7 discusses some existing empirical studies, and some common limitations shared by several of the surveyed contributions, which suggest future research opportunities. Section 8 concludes this survey.

2 Method and Background

We present the structure that we use to organize the different surveyed privacy mechanisms. We then review the background concepts on DP, formally define DP in the context of graph data, and describe some of the fundamental differentially private release mechanisms.

2.1 Method

Figure 1 presents the taxonomy that we used to classify existing contributions on privacy-preserving graph analytics. It also links each part of this classification to the sections of this article. This structure provides a clear delineation of the contributions of this survey and identifies its key focus areas. Table 1 provides an overview of the surveyed published contributions following our proposed taxonomy. We begin with a top-level differentiation between graph release mechanisms that release an entire (transformed) graph and query release mechanisms that release (transformed) responses to specific graph queries. We then progress to more specific areas as we traverse down the taxonomy and differentiate between various notions of privacy. At the second level of our taxonomy we differentiate between provable and non-provable privacy. We then restrict our attention to provable privacy and differentiate between differentially private query release mechanisms and query release mechanisms that are based on other provable notions of privacy (e.g., Pufferfish Privacy). We concentrate most of our attention on differentially private mechanisms due to DP’s widespread use and acceptance in the research community. Finally, at the bottom level of our taxonomy we differentiate between the various notions of graph differential privacy.
Fig. 1. Classification of the surveyed research on private graph data release.
Graph Statistics or Query Release
- Non-Provable: [8, 78, 101]
- Provable
  - DP based
    - Edge DP: Outlink [151], Clustering Coefficient [157], Eigenvectors [5, 156], Graph Clustering [100], Community Detection [104], Edge Weight [26, 86], Egocentric Betweenness Centrality [125]
      - Subgraph Counting: [24, 66, 89, 107, 118, 170]
      - Degree Sequence: [49, 67, 117]
      - Cut Query: [11, 43, 153]
    - Node DP: Erdős–Rényi Model Parameter [14, 15, 136], Difference Sequence [139]
      - Degree Sequence: [30, 69, 120]
      - Subgraph Counting: [12, 32, 69]
    - Edge Weight DP: [135]
    - Local DP: [146, 168]
    - Graph Mining
      - Frequent Pattern Mining: [138, 166]
      - Subgraph Discovery: [71]
      - Clustering: [115]
      - Graph Embedding: [120, 165, 171]
      - Graph Neural Networks: [31, 57, 99, 111, 128]
  - Beyond DP
    - Correlated Data DP: [7, 72, 74, 87, 145, 174]
    - Pufferfish Privacy: [51, 73, 74, 140, 167]
    - Others: [38, 39, 121]
Graph Release
- Non-Provable: [20, 21, 77, 130, 149, 161]
- Provable
  - Generative Models: [40, 49, 56, 64, 80, 91, 96, 129, 155, 156, 164, 170]
  - Graph Matrix: [11, 13, 18, 23, 47, 65, 153, 156, 164]
  - Local DP: [37, 119]
  - Iterative Refinement: [43, 117, 118]
Table 1. Classification of the Surveyed Papers on Private Graph Data Release
Notice that Figure 1 suggests that the primary focus of this article is on provably private mechanisms that release statistics about graph data. Non-provably private mechanisms are discussed for completeness and to provide context for our discussion of provable mechanisms. For more details on the non-provably private release of graph data, the interested reader is referred to References [20, 161].

2.2 Background

Table 2 introduces the notation that we will use in this survey. Following the terminology in Reference [34], given a universe \(\mathcal {X}\) of \(n\) distinct data values, we consider a dataset \(D \in \mathbb{N}^{|\mathcal {X}|}\) as a length-\(n\) vector of counts, where \(D_i\) is the number of times the \(i\)th element in \(\mathcal {X}\) occurs in the dataset. The \(L_1\) norm of the dataset is defined as
\begin{equation*} ||D||_1 = \sum _{i=1}^n D_i. \end{equation*}
The distance between two datasets \(D^{(1)}, D^{(2)} \in \mathbb{N}^{|\mathcal {X}|}\) is then defined as
\begin{equation*} ||D^{(1)}-D^{(2)}||_1 = \sum _{i=1}^n \left|D^{(1)}_i-D^{(2)}_i\right|\!, \end{equation*}
the total count difference between the two datasets. We say two datasets are neighboring, denoted by \(D^{(1)} \sim D^{(2)}\), if they differ in at most one coordinate (or record), i.e.,
\begin{equation} ||D^{(1)}-D^{(2)}||_1 \le 1. \end{equation}
(1)
The classical definition of differential privacy [34] is then as follows:
\(||\cdot||_1\): \(L_1\) norm; \(\mathcal {X}\): data universe
\(\mathcal {M}\): randomized algorithm; \(\mathbb{N}\): natural numbers
\(\mathcal {O}\): output from a randomized algorithm; \(\mathbb {R}^n\): real numbers of dimension \(n\)
\(D^n\): dataset domain of dimension \(n\); \(D\): dataset
\(G\): graph; \(V\): graph vertices
\(E\): graph edges; \(M_E\): edge matrix
\(S(G)\): degree sequence of \(G\); \(d(\cdot)\): degree of \((\cdot)\)
\(\bar{d}\): average degree; \(\varepsilon\): privacy budget
\(\delta\): privacy budget approximation; \(q(\cdot)\): query function
\(GS_f\): global sensitivity of \(f\); \(LS_f\): local sensitivity of \(f\)
\(RS_f\): restricted sensitivity of \(f\); \(SS_f\): smooth sensitivity of \(f\)
\(\mathrm{Lap}(\lambda)\): Laplace mechanism with scale factor \(\lambda\); \(\mu\): data projection
\(\Delta (g)\): \(L_1\) sensitivity of \(g(\cdot)\); \(\Delta q\): range sensitivity of \(q(\cdot)\)
\(\theta\): degree threshold; \(x \oplus y\): symmetric difference between sets \(x\) and \(y\)
\(\omega\): graph weight function
Table 2. Notation Used in This Survey
Definition 2.1.
A randomized algorithm, \(\mathcal {M}\), guarantees \((\varepsilon ,\delta)\)-differential privacy if, for any two neighboring datasets \(D^{(1)}\) and \(D^{(2)}\) and any subset of outputs \(\mathcal {O}\subseteq {\it range}(\mathcal {M})\), we have
\begin{align} \mathrm{Pr}{[\mathcal {M}(D^{(1)}) \in \mathcal {O}]} \le \exp (\varepsilon){\mathrm{Pr}{[\mathcal {M}(D^{(2)}) \in \mathcal {O}]}}+\delta . \end{align}
(2)
We refer to \(\varepsilon \gt 0\) as the privacy budget, with smaller values of \(\varepsilon\) providing stronger privacy protection. When \(\delta = 0\), we sometimes say that we have pure differential privacy. When \(\delta \gt 0\), we sometimes say that we have approximate differential privacy. Furthermore, when \(\delta \gt 0\), its value is typically less than the inverse of the number of records in the dataset. This precludes the (blatantly non-private) mechanism that simply returns a random record in response to a query.
Differential Privacy as defined in Definition 2.1 is largely a syntactic construct. The semantics of Definition 2.1, in terms of the indistinguishability of the prior and posterior probabilities after seeing the result returned by a Differential Privacy mechanism, can be found in Reference [70].
For graph data, differential privacy can be expressed as follows. Suppose we have records from a universe \(\mathcal {X}\); a graph \(G \in 2^{\mathcal {X} \times \mathcal {X}}\) is then one whose nodes are records from \(\mathcal {X}\).
Definition 2.2.
A randomized algorithm \(\mathcal {M}\) with domain \(2^{\mathcal {X} \times \mathcal {X}}\) is \((\epsilon ,\delta)\)-differentially private for a distance function \(d\) if for any subset \(\mathcal {O}\subseteq {\it range}({\mathcal {M}})\) and \(G_1, G_2 \in 2^{\mathcal {X}\times \mathcal {X}}\) such that \(d(G_1,G_2) \le k\):
\begin{equation*} \Pr [\mathcal {M}(G_1) \in \mathcal {O}] \le \exp (\epsilon)\Pr [\mathcal {M}(G_2) \in \mathcal {O}] + \delta , \end{equation*}
where the probability is over the randomness in the mechanism \(\mathcal {M}\).
There are several forms of differential privacy that are particularly relevant in our context. These include node and edge differential privacy, both of which are described in detail in Section 3.2. These forms of differential privacy can be understood in terms of node and edge neighboring graphs. For node differential privacy, the distance function \(d\) is the symmetric difference between the node sets of two graphs. For edge differential privacy, the distance function \(d\) is the symmetric difference between the edge sets of the two graphs.
Importantly, there are some simple differentially private mechanisms upon which many more complicated mechanisms are based. Chief among these is the Laplace mechanism [33, 34]. The Laplace mechanism works by adding noise drawn from the Laplace distribution to the result of a real-valued query. A fundamental result in differential privacy is that if we choose the scale for the Laplace noise appropriately, then the Laplace mechanism preserves \((\varepsilon , 0)\) differential privacy.
We denote a random variable drawn from a Laplace (symmetric exponential) distribution with mean 0 and scale \(\lambda\) (or equivalently a variance \(\sigma ^2=2\lambda ^2\)) as \(Y \sim \mathrm{Lap}(\lambda)\). Recall that the Laplace distribution has the following probability density function:
\begin{equation*} f(x|\lambda) = \frac{1}{2\lambda } \exp {\left(-\frac{|x|}{\lambda }\right)}. \end{equation*}
The scale of the noise used in the Laplace mechanism depends on both the value of the privacy budget \(\varepsilon\) and on the global sensitivity of the underlying query. The global sensitivity of a query is a measure of the largest possible amount of change in a function when one record is removed from a dataset. More precisely, if we let \(\Delta f\) denote the global sensitivity of a query \(f\), then we have
\begin{equation} \Delta f = \max _{D^{(1)} \sim D^{(2)}} ||f(D^{(1)}) - f(D^{(2)})||_1. \end{equation}
(3)
To ensure that the Laplace mechanism preserves \((\varepsilon , 0)\) differential privacy, we must choose \(\lambda \ge \Delta f/\varepsilon\). The global sensitivity therefore is related to how much noise is required in the worst case to protect the privacy of an individual record in the dataset.
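As a concrete illustration, the following Python sketch applies the Laplace mechanism to a real-valued query answer; the function and example values are illustrative only, and the caller is assumed to supply the global sensitivity of the query.

```python
# A minimal sketch of the Laplace mechanism, assuming the caller supplies
# the global sensitivity of the query being answered.
import numpy as np

def laplace_mechanism(true_answer, global_sensitivity, epsilon, rng=None):
    """Return true_answer plus Laplace noise calibrated for (epsilon, 0)-DP."""
    rng = np.random.default_rng() if rng is None else rng
    scale = global_sensitivity / epsilon  # lambda = Delta f / epsilon
    return true_answer + rng.laplace(loc=0.0, scale=scale)

# Example: an edge-count query under edge differential privacy has global
# sensitivity 1, since neighboring graphs differ by exactly one edge.
noisy_edge_count = laplace_mechanism(true_answer=9, global_sensitivity=1, epsilon=0.5)
```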
There are other statistical-distribution-based mechanisms used to implement differential privacy. The exponential mechanism [94] is a particularly important mechanism for producing differentially private outputs, particularly with respect to categorical answers. It uses a function \(q(D, \mathcal {O})\) to represent how good an output \(\mathcal {O}\) is for a dataset \(D\) (or, equivalently, a graph \(G\)). The exponential mechanism is the natural building block for answering queries with arbitrary utilities (and arbitrary non-numeric range), while preserving differential privacy [34]. Given some arbitrary range \(\mathcal{R}\), the exponential mechanism is defined with respect to some utility function \(q : \mathbb{N}^{|\mathcal {X} |} \times \mathcal{R} \rightarrow \mathbb {R}\), which maps dataset and output pairs to utility scores. Intuitively, for a fixed dataset (e.g., a graph \(G\)), the user prefers that the mechanism outputs some element of \(\mathcal{R}\) with the maximum possible utility score.
As with the Laplace mechanism, the precise shape of the distribution from which outputs are drawn depends on the sensitivity of a function. In the case of the Exponential mechanism, the relevant quantity is the sensitivity of the utility function with respect to its dataset argument. More precisely, the relevant quantity is
\begin{equation} \Delta q = \max _{r \in \mathcal {R}}\max _{x,y,\Vert x-y\Vert _1 \le 1} |q(x,r)-q(y,r)|. \end{equation}
(4)
Notice \(\Delta q\) does not describe the sensitivity of \(q\) with respect to its range argument. That is, we are interested in how the utility of each output changes for neighboring datasets but not in how the utility of each dataset changes for neighboring outputs. To ensure that the exponential mechanism preserves \((\varepsilon , 0)\) differential privacy, we need the probability mass of each output to be proportional to \(\exp \left(\varepsilon q(x,r) / 2\Delta q\right)\).
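The following Python sketch illustrates the exponential mechanism over a finite output range; the function name is illustrative, and the caller is assumed to supply the utility function \(q\) and its sensitivity \(\Delta q\).

```python
# A minimal sketch of the exponential mechanism over a finite set of outputs,
# assuming the caller supplies the utility function and its sensitivity.
import numpy as np

def exponential_mechanism(dataset, outputs, utility, delta_q, epsilon, rng=None):
    """Sample an output with probability proportional to
    exp(epsilon * q(dataset, r) / (2 * delta_q))."""
    rng = np.random.default_rng() if rng is None else rng
    scores = np.array([epsilon * utility(dataset, r) / (2.0 * delta_q)
                       for r in outputs])
    scores -= scores.max()                      # for numerical stability
    probabilities = np.exp(scores) / np.exp(scores).sum()
    return outputs[rng.choice(len(outputs), p=probabilities)]
```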
The application of these fundamental differentially private mechanisms in the context of graph release will be described in the following sections.

3 Private Release of Graph Statistics or Queries

There are two different approaches to protecting sensitive information from an analyst who can ask questions about a particular graph. The first approach is to act as a proxy for the analyst by querying the graph on their behalf. Then the exact query responses can be transformed in some privacy-preserving way before being passed to the analyst. The second approach is to release a synthetic graph that is a close approximation to the true one but is guaranteed to be private according to some privacy framework. The analyst can use the synthetic graph to compute answers to their queries. This section surveys works of the first type. The second type will be surveyed in Section 4.
We refer to methods of the first type as private query release mechanisms. These mechanisms typically take a graph, a class of graph queries, and some additional privacy parameters as inputs. They return privacy-preserving responses to the specified graph queries as outputs. Frequently, they do this by computing non-private query responses on the underlying graph and then transforming those responses in some way to ensure that they are privacy preserving. Although these mechanisms are query dependent, their advantages are at least twofold. First, they explicitly define what information about the graph will be provided to the analyst. This makes it possible to identify which graph features need to be considered in the design of these release mechanisms and which do not. Second, the noise needed to protect against a known class of queries is in general much less than that required to protect against the much larger class of all possible queries that an analyst could ask about a synthetic graph. As a general rule, adding less noise results in better query utility.
Section 3.1 briefly addresses the class of non-provable mechanisms for privately releasing graph statistics, while Section 3.2 extensively discusses the class of provable mechanisms, in particular those that are based on DP.

3.1 Non-Provable Private Release of Graph Statistics

There is a reasonably large literature on statistical disclosure control [55, 159] for releasing statistics on tabular data, including techniques like value generalization, cell suppression [27], micro-aggregation [35], randomization [41], and anonymization [88]. Most of these techniques have been applied to graph data, but their naïve application is often inefficient or ineffective.
In Reference [8], in the context of social networks where the curator replaces names with meaningless unique identifiers, both active and passive attacks are presented and shown to be effective in revealing the existence of edges between users. Some of these attacks can be carried out using a single anonymized copy of the network and require relatively little effort from the attacker. Similar work has been done by Korolova et al. [78], which shows an attacker recovering a significant fraction of sensitive edges through link analysis and getting a good picture of the whole network. Narayanan and Shmatikov [101] also propose an algorithm to re-identify anonymized participants that were represented as graph vertices and apply it to social networks from Twitter and Flickr. All these results show that mathematically rigorous definitions of privacy are required.

3.2 Provable Private Release of Graph Statistics using Differential Privacy

Provable privacy techniques provide mathematical guarantees about what an analyst can learn from the answers to a series of queries about a dataset. These techniques differ from the traditional methods by precisely quantifying and controlling the amount of information that can be leaked using tunable parameters. Among all provable privacy techniques, perhaps the most well studied and widely recognized privacy definition is Differential Privacy (Definition 2.1). Since the paper of Dwork et al. [33] that formalizes the concept of noise addition according to the sensitivity of a query function, there has been an enormous body of work on DP and its applications in the context of tabular data. It was inevitable that DP would be employed to manage privacy concerns in graph data, starting from an early work that introduces smooth sensitivity [107]. The key characteristic of graph data is that records are represented and stored as graphs, in which nodes represent dataset entities and edges represent relationships between entities. This also gives rise to two distinct concepts of graph differential privacy—edge differential privacy and node differential privacy—that will be discussed in detail in the following sections.
The fundamental difference between the two versions of graph differential privacy is how a pair of neighboring graphs is defined. In the standard form of DP (e.g., in References [33, 34]), two datasets \(x\) and \(y\) are neighbors if they differ by at most one record. Hay et al. [49] formalizes the concepts of edge and node differential privacy for graphs by generalizing the definition of neighboring datasets using the symmetric difference between two sets. The symmetric difference \(x \oplus y\) between two sets \(x\) and \(y\) (not necessarily the same size) is the set of elements in either \(x\) or \(y\), but not in both, i.e., \(x \oplus y = (x \cup y) \setminus (x \cap y)\). With this, neighboring graphs in the context of edge and node differential privacy are defined as follows:
Definition 3.1.
Given a graph \(G=(V, E)\), a graph \(G^{\prime }=(V^{\prime },E^{\prime })\) is an edge neighboring graph of \(G\) if it differs from \(G\) by exactly one edge, i.e., \(|V \oplus V^{\prime }|+|E \oplus E^{\prime }|=1\).
Example 3.2.
The graphs in Figure 2(a) and (b) are edge neighboring graphs, because \(|V \oplus V^{\prime }| + |E \oplus E^{\prime }| = 0 + 1 = 1\), i.e., they differ by exactly one edge 12.
Fig. 2. Examples of an edge neighboring graph and a node neighboring graph.
Definition 3.3.
Given a graph \(G=(V, E)\), a graph \(G^{\prime }=(V^{\prime },E^{\prime })\) is a node neighboring graph of \(G\) if it differs from \(G\) by exactly one node and the edges incident to the node, i.e., \(|V \oplus V^{\prime }|=1\) and \(E \oplus E^{\prime } = \lbrace uv \mid u = V \oplus V^{\prime } \text{ or } v = V \oplus V^{\prime }\rbrace\).
Example 3.4.
The graphs in Figure 2(a) and (c) are node neighboring graphs, because \(|V \oplus V^{\prime \prime }| = 1\) and \(E \oplus E^{\prime \prime } = \lbrace 32, 34, 35\rbrace\), i.e., they differ by exactly one node 3 and the edge differences are the edges incident to 3 in \(G\).
By generalizing Definition 3.1 to multiple edge differences, Hay et al. [49] states the following.
Definition 3.5.
Given a graph \(G=(V, E)\), a graph \(G^{\prime }=(V^{\prime },E^{\prime })\) is a \(k\)-edge neighboring graph of \(G\) if it differs from \(G\) by at most \(k\) edges. That is, \(|V \oplus V^{\prime }|+|E \oplus E^{\prime }| \le k\).
The connection between \(k\)-edge differential privacy and node differential privacy depends on \(k\) and node degrees in a graph. If \(k\) is larger than the maximum degree in the graph, then \(k\)-edge differential privacy is stronger than node differential privacy. Otherwise, it may simultaneously protect multiple relationships for one node or several nodes. With the three versions of neighboring graphs given above, it is straightforward to formalize edge and node differential privacy based on Definition 2.2.
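The following Python sketch (using the networkx library) checks whether two graphs are edge neighbors or node neighbors in the sense of Definitions 3.1 and 3.3; the function names are illustrative.

```python
# A sketch of edge- and node-neighbor checks via symmetric differences,
# following Definitions 3.1 and 3.3; graphs are undirected networkx graphs.
import networkx as nx

def edge_set(g):
    return {frozenset(e) for e in g.edges}  # undirected edges as unordered pairs

def are_edge_neighbors(g1, g2):
    v_diff = set(g1.nodes) ^ set(g2.nodes)
    e_diff = edge_set(g1) ^ edge_set(g2)
    return len(v_diff) + len(e_diff) == 1

def are_node_neighbors(g1, g2):
    v_diff = set(g1.nodes) ^ set(g2.nodes)
    if len(v_diff) != 1:
        return False
    v = next(iter(v_diff))
    # every differing edge must be incident to the single differing node
    return all(v in e for e in edge_set(g1) ^ edge_set(g2))
```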
Generally speaking, it is more difficult to satisfy differential privacy on graph data than on tabular data, because graph queries typically have higher sensitivities than statistical queries. The two versions of graph differential privacy address different privacy concerns. Edge differential privacy protects the relationship between two entities while node differential privacy protects the existence of an entity and its relationships with others. Frequently, edge differential privacy is easier to achieve than node differential privacy.
Example 3.6.
Consider the graph \(G\) that is depicted in Figure 2(a). It has the degree sequence \((4,3,3,3,3,2)\). It has an edge neighboring graph \(G^{\prime }\) and node neighboring graph \(G^{\prime \prime }\), which are depicted in Figure 2(b) and (c), respectively. Notice that \(G^{\prime }\) has the degree sequence \((3,3,3,3,2,2)\) and \(G^{\prime \prime }\) has the degree sequence \((3,3,2,2,2,0)\). So the sensitivities of the degree sequence function for the graph \(G\) under the edge and node differential privacy frameworks are 2 and 6, respectively.

3.2.1 Edge Differential Privacy.

As discussed above, edge differential privacy can protect relationships between entities in a network from a malicious analyst who can query that network. Graph queries usually have high sensitivities, since changing a single edge can substantially change their answers. In spite of that, there are many examples of mechanisms that preserve edge differential privacy and that can be used to release a wide variety of graph statistics. These include the protection of edge weights [26, 86] that may reflect communication frequency, the vertex clustering coefficient [157] for analyzing a vertex’s connectivity to its neighbors, eigenvalues and eigenvectors [5, 156] for analyzing characteristics of network adjacency matrices, egocentric betweenness centrality [125] for analyzing the importance of a vertex linking two parts of a network, and community detection [104]. There are also techniques that preserve variations of edge differential privacy, such as graph clustering under a weaker version of \(k\)-edge differential privacy [100] and popularity graphs under outlink privacy [151].
In the rest of this subsection, we focus on reviewing published works for subgraph counting, degree sequence/distribution, and cut queries. We focus on these classes of queries, because they have been relatively well studied due to their usefulness for other related studies. The main building blocks of edge differentially private mechanisms are the Laplace and the exponential mechanisms. The former is used in combination with different types of sensitivities to calculate the right amount of noise that needs to be added. The latter is often used for parameter selection or non-numerical outputs.
3.2.1.1 Subgraph Counting. Subgraph counting queries count the number of times a certain subgraph appears in a given graph. Common subgraphs include triangles and stars and their generalizations to \(k\)-triangles (i.e., a subgraph consisting of \(k\) triangles, all of which share a common edge) and \(k\)-stars (i.e., a subgraph with \(k+1\) nodes in which a central node has degree \(k\) and the other \(k\) nodes have degree 1). The work in Reference [107] is one of the earliest on making subgraph-counting queries satisfy edge differential privacy. It introduces the notion of smooth sensitivity for a given graph, which can be used in place of global sensitivity (Equation (3)) to reduce the amount of noise required to ensure that the queries preserve DP.
Definition 3.7.
Given a dataset \(D_x \in D^n\), for a real-valued function \(f: D^n \rightarrow \mathbb {R}^1\), the local sensitivity of \(f\) at \(D_x\) is
\begin{equation*} LS_f(D_x) = \max _{D_y \in D^n \,:\, d(D_x,D_y)=1} ||f(D_x) - f(D_y)||_1. \end{equation*}
In the worst case, the local sensitivity of a function is the same as the global sensitivity, but it can be much smaller for a given dataset \(D_x\). For triangle-counting queries, the global sensitivity is \(|V| - 2\), but the local sensitivity is \(\max _{i,j \in [n]} a_{ij}\), where \(a_{ij}\) is the number of common neighbors between adjacent nodes \(i\) and \(j\). For example, the graph \(G\) in Figure 3(a) has seven triangles. The local and global sensitivities of the triangle counting queries at \(G\) are 3 and 6, respectively.
Fig. 3. Triangle counting query and its local sensitivity.
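To make these quantities concrete, the following sketch computes the triangle count and its local sensitivity (the maximum number of common neighbors over adjacent node pairs, as stated above) with networkx; the function names are illustrative.

```python
# A sketch of the triangle count and its local sensitivity under edge DP,
# following the formula quoted above (max common neighbors over adjacent pairs).
import networkx as nx

def triangle_count(g):
    # nx.triangles counts each triangle once per participating node
    return sum(nx.triangles(g).values()) // 3

def triangle_local_sensitivity(g):
    return max((len(set(g[u]) & set(g[v])) for u, v in g.edges), default=0)

# The corresponding global sensitivity is |V| - 2, which can be much larger.
```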
Unfortunately, using local sensitivity directly can reveal sensitive information about the underlying graph. For example, the local sensitivity of a triangle-counting query gives the maximum number of common neighbors between two vertices in the given graph. A more suitable candidate is the smallest smooth upper bound of the local sensitivity, namely the \(\beta\)-smooth sensitivity \(SS_{f,\beta }\). For \(\beta \gt 0\), we have
\begin{equation} SS_{f,\beta }(D_x) = \max _{D_y \in D^n} \left(LS_f(D_y) \cdot e^{-\beta \, d(D_x,D_y)} \right)\!. \end{equation}
(5)
Intuitively, the smooth sensitivity of a function \(f\) of a dataset \(D_x\) comes from a dataset \(D_y\) that is close to \(D_x\) and also has a large local sensitivity.
Example 3.8.
To calculate the smooth sensitivity of the triangle-counting query for the graph \(G\) in Figure 3(a), we go through all \(k\)-edge-neighbors of \(G\), calculate their local sensitivities, and then pick the one that satisfies Equation (5). In this case, the smooth sensitivity \(SS_{f,0.1} \approx 4.1\) happens at a neighboring graph \(G^{\prime }\) as shown in Figure 3(b), where \(d(G,G^{\prime })=2\) and \(LS_f(G^{\prime }) = 5\).
The advantage of using smooth sensitivity is demonstrated in Reference [107] through a few examples, including privately releasing the cost of the minimum spanning tree and the number of triangles in a graph. Since then, smooth sensitivity has been used widely in subsequent works to improve the utility of differentially private query responses, including many works in graph differential privacy.
Two mechanisms based on local sensitivity are proposed in Reference [66] to release \(k\)-star counts and \(k\)-triangle counts. The first mechanism is a direct extension from Reference [107] to \(k\)-star counting queries. It is achieved by an efficient algorithm to compute the smooth sensitivity in time \(O(n \log n + m)\), where \(n\) and \(m\) are the numbers of nodes and edges in a graph, respectively. The second mechanism relies on a bound on the smooth sensitivity rather than an efficient algorithm to compute its exact value. For this mechanism, the local sensitivity of the \(k\)-triangle counting queries for \(k \ge 2\) is itself masked using its own local sensitivity, which gives the second-order local sensitivity \(LS_f^{\prime }\) with a simple upper bound,
\begin{equation} LS_f^{\prime }(G) \le 3 \binom{a_{max}}{k - 1} + a_{max} \binom{a_{max}}{k-2}, \end{equation}
(6)
where \(a_{max}\) is the maximum number of common neighbors between a pair of vertices in \(G\). The mechanism releases the true \(k\)-triangle count with Laplace noise proportional to the above upper bound and runs in time \(O(md)\), where \(m\) is the number of edges and \(d\) is the maximum degree.
Example 3.9.
Let \(k=2\) and observe that the number of 2-triangles in Figure 3(a) is 9. A closed form equation in Reference [66] shows that the local sensitivity at \(G\) is 8, which happens when deleting the edge 45. Since \(a_{max} = a_{45}=3\), the upper bound \(LS^{\prime }(G) \le 12\) by Equation (6). So the mechanism output is 8 plus Laplace noise (proportional to 12) plus an additional term.
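The bound in Equation (6) is straightforward to evaluate, as in the following sketch, which reproduces the numbers of Example 3.9 (the function name is illustrative).

```python
# A sketch of the upper bound in Equation (6) on the second-order local
# sensitivity of k-triangle counting, given the maximum number of common
# neighbors a_max over adjacent node pairs.
from math import comb

def k_triangle_second_order_bound(a_max, k):
    return 3 * comb(a_max, k - 1) + a_max * comb(a_max, k - 2)

# Example 3.9: a_max = 3 and k = 2 give 3*3 + 3*1 = 12.
assert k_triangle_second_order_bound(3, 2) == 12
```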
As stated in Reference [66], the released answers for 2-stars and 3-stars are useful in moderately dense graphs, and for triangles and 2-triangles the answers are useful in dense graphs but not so for sparse graphs. An application of private \(k\)-triangle counting appears in Reference [89], where the authors use it in conjunction with private alternating k-star counting and private alternating k-twopath counting to estimate the parameters for exponential random graph models.
In contrast to the Laplace-based approaches, a different method for subgraph counting based on the exponential mechanism is proposed in Reference [170]. In this work, the authors use a ladder function as the utility function. Under certain conditions of the ladder function, the proposed mechanism is differentially private. An optimal choice of a ladder function is the generalized local sensitivity at distance \(t\) (i.e., the maximum change of a function’s output between a given dataset \(D^{(1)}\) and a neighboring dataset \(D^{(2)}\) such that \(d(D^{(1)},D^{(2)}) \le t\)). This function is only efficiently computable for triangle and \(k\)-star counting queries. For \(k\)-clique and \(k\)-triangle counting queries, a more efficient choice is a convergent upper bound of the local sensitivity at distance \(t\). Empirical evaluations in Reference [170] on real graph data show substantial improvements in the exponential-mechanism-based methods over the Laplace-mechanism-based methods with global sensitivity, smooth sensitivity [107], second-order local sensitivity [66], and a recursive strategy [24].
To reduce the noise effect in the worst case, Proserpio et al. [118] scales down the influence of troublesome records in weighted datasets using a platform called weighted Privacy Integrated Query (wPINQ), which is built upon the PINQ [95] platform that guarantees all acceptable queries satisfy differential privacy based on the Laplace and exponential mechanisms. In wPINQ, weights are scaled down differently for various built-in operators such as Select, Join, and GroupBy. An earlier work [117] demonstrates how wPINQ can produce private degree distributions and joint degree distributions of weighted graphs for the purpose of generating private synthetic graphs. Proserpio et al. [118] then demonstrates wPINQ for triangle by degrees counts and square by degrees counts and shows how these statistics can be combined with Markov Chain Monte Carlo (MCMC) to generate private synthetic graphs.
3.2.1.2 Degree Sequence. Another widely studied graph statistic is the degree sequence of a graph. Given a graph \(G\), its degree sequence \(S(G)\) is a monotonic non-decreasing sequence of node degrees. It can be used to compute the average or maximum degree of a graph. It can also be used to recover a graph, provided there is a graph consistent with the given degree sequence. A degree sequence can be transformed into a degree distribution, which is a useful feature for graph classification, e.g., the degree distribution of a scale-free network follows a power law. When perturbing the existence of an edge in a given graph, the global sensitivity of its degree sequence is 2, which behaves much better than that of subgraph-counting query functions.
Hay et al. [49] proposes a constraint inference-based algorithm that can be used as a post-processing step to improve the quality of degree sequences released using private mechanisms. Given a noisy degree sequence \(\tilde{S}(G)\) produced by a private mechanism, constraint inference finds a monotonic non-decreasing degree sequence \(\bar{S}(G)\) (in the same vertex order) based on isotonic regression such that the difference \(||\tilde{S}(G)-\bar{S}(G)||_2\) is minimized.
Example 3.10.
The graph in Figure 3(a) has the degree sequence \(S(G) = (2,3,3,3,3,4,5,5)\). A possible private degree sequence is \(\tilde{S}(G) = (3,2,5,3,-4,16,6,3)\), produced by the Laplace mechanism with noise sampled from \(Lap(\frac{2}{0.5})\). The distance between these two is \(||S(G) - \tilde{S}(G)||_2 \approx 14\). After applying constraint inference to \(\tilde{S}(G)\), the post-processed degree sequence is \(\overline{S}(G) = (2,2,2,2,2,8,8,8)\) with the distance reduced to \(||S(G) - \overline{S}(G)||_2 \approx 6\).
Constraint inference is commonly adopted in subsequent works to improve query utility. The advantages of constraint inference are its computational efficiency (100 million nodes were processed in just a few seconds in Reference [49]), its applicability to a wide range of graph statistics, and the fact that its error increases only linearly with the number of unique degrees. An issue with constraint inference is that the post-processed degree sequence can be non-graphical, i.e., not consistent with any graph. In Example 3.10, the post-processed degree sequence \((2,2,2,2,2,8,8,8)\) is non-graphical. This could be a problem if the private degree sequence is used to carry out statistical inference or generate synthetic graphs. To overcome this, Reference [67] presents an additional optimization step after constraint inference in the domain of all graphical degree sequences.
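Constraint inference amounts to an \(L_2\) projection onto monotone sequences, which can be computed with the standard pool-adjacent-violators algorithm; a sketch follows (the function name is illustrative). Rounding its output to integers recovers the sequence quoted in Example 3.10.

```python
# A sketch of constraint inference as post-processing: project a noisy degree
# sequence onto the set of non-decreasing sequences in the L2 sense, using the
# pool-adjacent-violators algorithm.
def constraint_inference(noisy_sequence):
    blocks = []  # each block is [mean, size]
    for value in noisy_sequence:
        blocks.append([float(value), 1])
        # merge blocks while the monotonicity constraint is violated
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, s2 = blocks.pop()
            m1, s1 = blocks.pop()
            blocks.append([(m1 * s1 + m2 * s2) / (s1 + s2), s1 + s2])
    result = []
    for mean, size in blocks:
        result.extend([mean] * size)
    return result

# Example 3.10: rounding the projection of (3,2,5,3,-4,16,6,3) to integers
# gives (2,2,2,2,2,8,8,8).
print(constraint_inference([3, 2, 5, 3, -4, 16, 6, 3]))
```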
A mechanism that produces both the private degree distribution of a graph and the joint degree distribution of a graph is the aforementioned [117], which is essentially based on the Laplace and exponential mechanisms.
3.2.1.3 Cut Query. Sometimes one is interested in the number of interactions between two groups of entities in a network, for example, the number of sales interactions between two companies or the number of collaborations between two groups of researchers from different organizations. In graph data, these can be measured by cut queries between two vertex subsets. More precisely, given a weighted graph \(G=(V,E)\) and two non-empty subsets \(V_S, V_T \subseteq V\), an s-t-cut query returns the total weight of the edges crossing between \(V_S\) and \(V_T\).
Gupta et al. [43] employs an iterative database construction (IDC) framework that generalizes the median mechanism [126] and multiplicative weights mechanism [48] with tighter bounds. It iteratively compares the response \(Q^t(D^t)\) from an approximated dataset \(D^t\) with a noisy response \(Q^t(D)+Lap(\cdot)\) from the given dataset \(D\) and updates the current approximation to \(D^{t+1}\) if the two responses are not close. However, it requires graphs that are sufficiently dense.
A different approach is taken in Reference [11] to release answers of cut queries with less additive noise. They show that one can apply the Johnson–Lindenstrauss (JL) transform [63] to an updated edge matrix of a weighted graph to obtain a sanitized graph Laplacian matrix. This sanitized Laplacian matrix can then be used to approximate s-t-cut queries while preserving \((\epsilon ,\delta)\)-edge differential privacy. Before applying the JL transform, each weight \(w_{uv}\) in the edge matrix \(M_E\) is updated by \(w_{uv} = \frac{w}{n} + (1-\frac{w}{n}) w_{uv}\), where \(n\) is the number of vertices and \(w\) is calculated from pre-determined parameters. Once the update is complete, the JL transform is applied by the step \(\overline{L} = \frac{1}{r}M_E^T\cdot M^T\cdot M\cdot M_E\), where entries of the matrix \(M\) are sampled i.i.d. from \(N(0,1)\). This method only adds (with high probability) constant noise (w.r.t. graph size) to cut query answers and hence provides superior results compared to Reference [43] for small cuts \(\le O(n)\). In fact, the JL transform-based strategy can serve as a more general approach to publishing randomized graphs that satisfy edge differential privacy for any downstream graph queries.
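The following sketch outlines the JL-transform approach just described for cuts of the form \((V_S, V \setminus V_S)\); the parameters \(w\) and \(r\) would in practice be derived from \((\epsilon ,\delta)\), but they are left as plain inputs here, and the function names are illustrative.

```python
# A sketch of the JL-transform-based sanitization of [11]: overlay a small
# complete-graph weight, project the edge matrix with a Gaussian matrix, and
# answer cut queries from the sanitized Laplacian. Parameters w and r are
# placeholders for values that would be derived from the privacy parameters.
import itertools
import numpy as np

def sanitized_laplacian(weights, n, w, r, rng=None):
    """weights: dict mapping pairs (u, v) with u < v to non-negative weights."""
    rng = np.random.default_rng() if rng is None else rng
    pairs = list(itertools.combinations(range(n), 2))
    B = np.zeros((len(pairs), n))                         # signed edge matrix M_E
    for row, (u, v) in enumerate(pairs):
        w_uv = w / n + (1 - w / n) * weights.get((u, v), 0.0)   # weight update
        B[row, u], B[row, v] = np.sqrt(w_uv), -np.sqrt(w_uv)
    M = rng.standard_normal((r, len(pairs)))              # JL projection matrix
    MB = M @ B
    return (MB.T @ MB) / r                                # (1/r) M_E^T M^T M M_E

def cut_query(laplacian, subset, n):
    x = np.zeros(n)
    x[list(subset)] = 1.0
    return float(x @ laplacian @ x)   # approximate weight of edges leaving subset
```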
The above JL transform-based method is revised for sparse graphs in Reference [153]. The authors observe that the weight update in Reference [11] is for all possible pairs of vertices, which is the same as overlaying a complete graph on \(G\). This is done to ensure the graph corresponding to the updated edge weights is well connected prior to applying the JL transform. To maintain the sparsity of the given graph \(G\), a \(d\)-regular expander graph \(E\) is used when performing weight update in \(G\) by \(L_H = \frac{w}{d} L_E + (1-\frac{w}{d}) L_G\). (An expander graph is a sparse graph such that every small vertex subset is well connected to the rest of the graph.) This is followed by a step of Gaussian noise addition to ensure differential privacy. More generally, combining this sanitization process with a graph sparsification technique preserves graph differential privacy for cut queries for arbitrary graphs.

3.2.2 Node Differential Privacy.

The distinction between node and edge differential privacy originates from how neighboring graphs are defined. In the context of node differential privacy, a pair of neighboring graphs differ by exactly one node and the edges incident to that node (Definition 3.3). The advantage of this is that it gives a higher level of protection to an entity’s privacy, including its existence in the dataset and its relations to others. The disadvantage is that it tends to give rise to high query sensitivity. Hence, it is more difficult to achieve node differential privacy with useful query responses. Today, there are two main types of solutions that deal with such high query sensitivity. The first is based on a top-down projection to lower-degree graphs. The second is based on a bottom-up generalization using Lipschitz extensions.
The top-down methods project the given graph to a graph with a bounded maximum degree, then answer queries using the projected graph with noise that is proportional to the sensitivities of the query function and the projection map. A projection could be a naïve removal of all high-degree nodes or a random edge removal that reduces high node degrees. These projections are generic and easy to implement but suffer from high sensitivity and potential information loss.
The bottom-up methods first answer queries on bounded degree graphs and then extend the answers to arbitrary graphs using Lipschitz extensions. The extended answers can then be released with additive noise according to existing differential privacy mechanisms, such as the Laplace mechanism. The main drawback is designing an appropriate Lipschitz extension for each query function, which is a non-trivial task in general. Other than that, Lipschitz extension-based methods are usually preferred because of their generalizations to arbitrary graphs to avoid information loss.
For the same reason as explained in Section 3.2.1, we only review published works for releasing degree sequence/distribution and subgraph counting queries. Due to the difficulty in obtaining high utility private mechanisms, there are fewer works concerning node differential privacy than there are concerning edge differential privacy. That said, some recent works have studied private estimation of generative graph model parameters [14, 15, 136], such as the edge probability in the Erdős–Rényi model.
Sometimes, instead of a static graph, there may be a sequence of dynamic graphs. In this case, one may want to release some common graph statistics such as degree sequence or subgraph counts for the entire graph sequence. A study in Reference [139] presents a mechanism to publish difference sequence, which is the difference between the graph statistic for two graphs adjacent in time.
Before proceeding, we give the definitions of Lipschitz function and Lipschitz extension. Intuitively, a function is Lipschitz if the difference between its images is bounded by a constant times the difference between the corresponding preimages. Formally, it is defined as follows:
Definition 3.11 (Lipschitz Constant).
Given two metric spaces \((X, d_X)\) and \((Y, d_Y)\), a function \(f:X \rightarrow Y\) is called \(c\)-Lipschitz (or has a Lipschitz constant \(c\)) if there exists a real constant \(c \ge 0\) such that for all \(x_1, x_2 \in X\), \(d_Y(f(x_1),f(x_2)) \le c \cdot d_X(x_1, x_2).\)
In the context of graph differential privacy, the graph space with node or edge edit distance as the metric is mapped by the function \(f\) to a real valued space with, for example, \(L_1\)-norm as the metric. Choices of \(f\) include degree sequence, subgraph counting, node centrality score, and so on. The global sensitivity of a function is the smallest Lipschitz constant that upper bounds the maximum changes in the codomain.
A \(c\)-Lipschitz function may be extended to another Lipschitz function that takes on a larger domain with the same codomain. Such an extension is called a Lipschitz extension and is essential for extending a node differentially private query-answering mechanism from a restricted graph domain (e.g., bounded degree graphs) to the general graph domain.
Definition 3.12 (Lipschitz Extension).
Given two metric spaces \((X, d_X)\) and \((Y, d_Y)\) and a \(c\)-Lipschitz function \(f^{\prime }:X^{\prime } \rightarrow Y\) with the domain \(X^{\prime } \subseteq X\), a function \(f:X \rightarrow Y\) is a Lipschitz extension of \(f^{\prime }\) from \(X^{\prime }\) to \(X\) with stretch \(s \ge 1\) if
(1) the two functions are identical on the restricted domain, i.e., \(f(x)=f^{\prime }(x)\) for all \(x \in X^{\prime }\), and
(2) the extended function \(f\) is \(sc\)-Lipschitz.
The desire for high-utility node differentially private mechanisms motivates the search for efficiently computable Lipschitz extensions with low Lipschitz constants and stretches from restricted graphs to arbitrary graphs. The following subsections review some results in this area, with a focus on degree-sequence and subgraph-counting queries, using both the top-down projection and the bottom-up Lipschitz-extension approaches.
3.2.2.1 Degree Sequence. Kasiviswanathan et al. [69] describes a mechanism that privately releases the degree distribution of a graph using a simple projection function \(f_T:G \rightarrow G_\theta\) that discards all nodes in \(G\) whose degrees are higher than a threshold \(\theta\). They show that if \(U_s(G)\) is a smooth upper bound on the local sensitivity of \(f_T\) and \(\Delta _{\theta }f\) is the global sensitivity of a query \(f\) on graphs with bounded maximum degree \(\theta\), then the function composition \(f \circ f_T\) has a smooth upper bound \(U_s(G) \cdot \Delta _{\theta }f\). The authors give explicit formulas for computing \(\Delta _{\theta }f\) and the local and smooth sensitivities of the truncation function \(f_T\). Furthermore, they prove that randomizing the truncation cutoff in a range close to the given threshold \(\theta\) is likely to reduce its smooth sensitivity. The presented mechanism is proved to satisfy node differential privacy with Cauchy noise. It runs in time \(O(|E|)\) and produces private degree distributions with \(L_1\) error \(O(\frac{\bar{d}^{\alpha } \ln n \ln \theta }{\epsilon ^2 \theta ^{\alpha -2}} + \frac{\theta ^3 \ln \theta }{n \epsilon ^2})\) for graphs with \(n\) nodes and average degree \(\bar{d}\), provided the graphs satisfy certain constraints and \(\alpha\)-decay, which is a mild assumption on the tail of the graph’s degree distribution. The authors prove that if \(\alpha \gt 2\) and \(\bar{d}\) is polylogarithmic in \(n\), then this error goes to 0 as \(n\) increases.
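The truncation projection itself is simple to state in code; the following sketch drops every node whose degree exceeds the threshold \(\theta\) (the function name is illustrative), after which queries are answered on the bounded-degree graph with noise calibrated as described above.

```python
# A sketch of the naive truncation projection f_T of [69]: discard all nodes
# whose degree exceeds theta and answer queries on the resulting graph.
import networkx as nx

def truncate(g, theta):
    keep = [v for v, degree in g.degree() if degree <= theta]
    return g.subgraph(keep).copy()
```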
The naïve truncation of Reference [69] suffers from high local sensitivity due to the deletion of a large number of edges, especially in dense graphs. To address this, a projection based on edge addition is proposed in Reference [30] for releasing private degree histograms. The authors prove that by adding edges to the empty graph in a stable edge ordering, the final graph not only has more edges than the one obtained through naïve projection but also is maximal in terms of edge addition. Given two (node) neighboring graphs, an edge ordering of a graph is stable if, for any pair of edges that appears in both graphs, their relative ordering stays the same in the edge ordering of both graphs. The mechanism for releasing degree histograms employs the exponential mechanism to choose the optimal degree threshold \(\theta\) and bin aggregation \(\Omega\) for reducing sensitivity and then adds \(Lap(\frac{2\theta +1}{\epsilon _2})\) noise to the aggregated bins. This mechanism satisfies \((\epsilon _1+\epsilon _2)\)-node differential privacy, where \(\epsilon _1\) and \(\epsilon _2\) are the privacy budgets for the exponential and Laplace mechanisms, respectively. It runs in time \(O(\Theta \cdot |E|)\), where \(\Theta\) is an integer upper bound of \(\theta\). An extension that releases cumulative histograms requires only \(Lap(\frac{\theta +1}{\epsilon })\) noise and has the additional benefit that the result can be post-processed by the constraint-inference algorithm [49]. The extension relies on the exponential mechanism to select an optimal \(\theta\), so it has the same time complexity and privacy bound.
A key method based on the Lipschitz extension for releasing private degree sequence is [120]. Since the global sensitivity for bounded degree graphs \(G_\theta\) satisfies \(\Delta _{S(G_{\theta })} \le 2\theta\), it implies that \(S(G_{\theta })\) has a Lipschitz constant \(2\theta\). By constructing a flow graph \(FG(G)\) from the given graph \(G\) and a degree threshold \(\theta\), Raskhodnikova and Smith [120] present a Lipschitz extension of \(S(G_{\theta })\) from the set of bounded-degree graphs to arbitrary graphs with low stretch via a strongly convex optimization problem. More specifically, the flow graph \(FG(G)\) has a source node \(s\), a sink node \(t\) and two sets of nodes \(V_l\) and \(V_r\) that are exact copies of the nodes in \(G\). The source \(s\) is connected to all of \(V_l\) via directed edges with capacity \(\theta\), and similarly for \(V_r\) and \(t\). Each node \(x \in V_l\) is connected to a node \(y \in V_r\) via a directed edge with capacity 1 if there is an edge \(xy\) in \(G\). Given the flow graph, Reference [120] solves for an optimal flow \(f\) that minimizes the objective function \(\Phi (f)=||(f_{s\cdot },f_{\cdot t})-\vec{\theta }||_2^2\), where \(f_{s\cdot }\) and \(f_{\cdot t}\) are the vectors of flows leaving \(s\) and entering \(t\) respectively and \(\vec{\theta } = (\theta ,\ldots ,\theta)\) has length \(2n\). The authors prove that the sorted \(f_{s\cdot }\) is an efficiently computable Lipschitz extension of the degree-sequence function with a stretch of 1.5.
The same flow graph construction is used in Reference [69] for subgraph counting queries. The result in Reference [120] is consistent with Reference [69]’s Lipschitz extension on edge count, but Reference [120] minimizes the above objective function \(\Phi (f)\) rather than maximize the network flow, because the maximum flow may not be unique and the two formulations have different sensitivities.
With some additional work that replaces the scoring function in the standard exponential mechanism, Reference [120] also proves that using the adjusted exponential mechanism to select the degree threshold leads to a Lipschitz extension with low sensitivity, and hence better utility in the private outputs. Combining the Lipschitz extension and the adjusted exponential mechanism, the authors are also able to release private degree distributions with improvement on the error bound of Reference [69].
Example 3.13.
Consider the graph \(G\) shown in Figure 2(a). Its flow graph is shown in Figure 4. The sorted vector of out-flows is \(f_{s\cdot }=(4,3,3,3,3,2)\), which is exactly the degree sequence of the original graph. The threshold \(\theta\) is chosen to be 4 on purpose, which matches the maximum degree in the original graph, so no information is lost during the problem transformation.
Fig. 4. The flow graph constructed from the graph in Figure 2(a) with degree bound \(\theta =4\). The maximum flow and edge capacities are shown on the edges. The arbitrary choice of \(\theta\) can affect the final output accuracy.
3.2.2.2 Subgraph Counting. The concept of restricted sensitivity is introduced in Reference [12] to provide a top-down projection-based method for releasing private subgraph counts. This projection is the opposite of the edge-addition projection of Reference [30] in the sense that it removes edges by following a canonical edge ordering. An edge \(e\) is removed from the graph if it is incident to a vertex whose degree is higher than a threshold \(\theta\) and \(e\) is not among that vertex’s first \(\theta\) edges in the ordering. The composition of the projection with a subgraph-counting query \(f\) results in a function \(f_{\mathcal {H}}\) that has global sensitivity (in the context of edge DP) and smooth sensitivity (in the context of node DP) proportional to the restricted sensitivity \(RS_f(\mathcal {H}_{\theta })\) and \(RS_f(\mathcal {H}_{2\theta }),\) respectively, where \(\mathcal {H}_{\theta }\) is the set of graphs with bounded degree \(\theta\). The article theoretically shows the advantage of restricted sensitivity in local-profile and subgraph-counting queries when adding Laplace noise and gives explicit upper bounds for both query classes.
A key early work [69] studies Lipschitz extension for private subgraph counting. For edge-counting queries, flow graphs are constructed as described for Reference [120] above (see also Figure 4). The maximum flow satisfies \(f_{max}(G_{\theta })=2|E(G_{\theta })|\) and \(f_{max}(G) \le 2|E(G)|\) for bounded-degree graphs and arbitrary graphs, respectively. So \(f_{max}\) is an efficiently computable Lipschitz extension of the edge-counting queries with global sensitivity \(\Delta _{f_{max}} \le 2\theta\). To achieve \(\epsilon\)-node DP, one could then alternate between \(|E|+Lap(\frac{2n}{\epsilon })\) and \(\frac{f_{max}}{2}+Lap(\frac{2\theta }{\epsilon })\), depending on how close the former is to the truth. The same technique can be generalized to concave functions, where a function \(h\) is concave if its increments \(h(i+1)-h(i)\) are non-increasing as \(i\) goes from 0 to \(n-2\).
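A sketch of the flow-graph construction and the resulting maximum flow follows; node labels and the function name are illustrative, and networkx is used for the flow computation.

```python
# A sketch of the flow graph of [69, 120]: the source feeds each left copy of
# a vertex with capacity theta, each right copy feeds the sink with capacity
# theta, and every edge xy of G yields unit-capacity arcs between the copies.
# On graphs with maximum degree at most theta the maximum flow equals 2|E|.
import networkx as nx

def flow_graph_max_flow(g, theta):
    fg = nx.DiGraph()
    for v in g.nodes:
        fg.add_edge("s", ("L", v), capacity=theta)
        fg.add_edge(("R", v), "t", capacity=theta)
    for u, v in g.edges:
        fg.add_edge(("L", u), ("R", v), capacity=1)
        fg.add_edge(("L", v), ("R", u), capacity=1)
    flow_value, _ = nx.maximum_flow(fg, "s", "t")
    return flow_value
```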
For small subgraph counting queries such as triangle counts, Reference [69] proposes a linear programming formulation that maximizes \(\sum _C x_C\) over variables \(x_C \in [0,1]\), one for each candidate copy \(C\) of the subgraph of interest in \(G\), subject to constraints that bound how many copies each vertex can participate in; in the integral solution, \(x_C=1\) exactly when \(C\) is a copy of the subgraph of interest and \(x_C=0\) otherwise. The maximum value \(v_{LP}(G)\) of this linear program satisfies the requirements of being a Lipschitz extension of subgraph counting queries \(f_H(G)\) and has sensitivity bounded by \(\Delta _{v_{LP}(G)} \le 6\theta ^2\) for subgraphs with three nodes. As with private edge counts, small subgraph counts can also be released with \(\epsilon\)-node DP by choosing between \(f_H(G)+Lap(\frac{6n^2}{\epsilon })\) and \(v_{LP}(G)+Lap(\frac{6\theta ^2}{\epsilon })\), depending on how close the former is to the true value.
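The following sketch sets up a linear program of this general shape for triangle counting with scipy. The per-vertex cap of \(\theta (\theta -1)/2\) fractional triangles used here is a simplifying assumption chosen for illustration; the exact constraint set of Reference [69] differs in its details.

from itertools import combinations
import networkx as nx
from scipy.optimize import linprog

def lp_triangle_extension(G, theta):
    # One variable x_C in [0, 1] per candidate triangle C of G.
    triangles = [c for c in combinations(G.nodes(), 3)
                 if G.has_edge(c[0], c[1]) and G.has_edge(c[1], c[2]) and G.has_edge(c[0], c[2])]
    if not triangles:
        return 0.0
    nodes = list(G.nodes())
    # One inequality per vertex: cap the fractional triangles containing it (assumed cap).
    A_ub = [[1.0 if v in tri else 0.0 for tri in triangles] for v in nodes]
    b_ub = [theta * (theta - 1) / 2] * len(nodes)
    # linprog minimizes, so negate the objective to maximize the sum of x_C.
    res = linprog(c=[-1.0] * len(triangles), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, 1)] * len(triangles), method="highs")
    return -res.fun  # v_LP(G), the value of the Lipschitz-extended triangle count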
Another related work [32] applies a stable edge ordering scheme [30] to a sequence of node pairs to obtain a projected graph, in which each node appears in a bounded number of triangles. The proposed private mechanisms can be used for releasing triangle count distributions, cumulative triangle count distributions and local clustering coefficient distributions.

3.2.3 Edge Weight Differential Privacy.

The previous two subsections focused on protecting graph structure that may not be known to the public. When the graph structure is public and a system protects the edge weights or related statistics of the graph, such as distances between vertices, the aforementioned edge and node differential privacy frameworks may not be appropriate. Instead, a more suitable setting is to define differential privacy in the context of neighboring weight functions for the given graph. This is formally introduced in Reference [135]. One use case is a navigation system that has access to a public map and road user-traffic data and is required to keep user data private. Another use case is the World Wide Web, as mentioned in Reference [115]. Two related but different works for protecting edge weights are References [26, 86]. The former considers edge weights as counts from the dataset, so the Laplace mechanism is used to protect counting queries. The latter focuses on neighboring graphs that differ in at most one edge weight.
Given a graph and a weight function on the edges, the private edge weight differential privacy model in Reference [135] is based on the following definition of neighboring weight functions:
Definition 3.14.
Given a graph \(G=(V,E)\), two weight functions \(\omega , \omega ^{\prime }: E \rightarrow \mathbb {R}^+\) are neighboring if the total weight difference is at most 1, i.e.,
\begin{equation*} ||\omega - \omega ^{\prime }||_1 := \sum _{e \in E} |\omega (e) - \omega ^{\prime }(e)| \le 1. \end{equation*}
Under this framework, the total weight change between a pair of neighboring weight functions is at most 1, so Reference [135] uses the Laplace mechanism to release the distance between a pair of nodes and approximate distances between all pairs of nodes. A similar work [115] defines neighboring weight functions in terms of the \(L_{\infty }\)-norm and uses the exponential mechanism to release a private approximation of the minimum spanning tree topology for a given graph.
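Under Definition 3.14, a change of at most 1 in the \(L_1\)-norm of the weight function changes any shortest-path distance by at most 1, so the Laplace mechanism with scale \(1/\epsilon\) suffices for a single pairwise distance. A minimal sketch (using networkx on a hypothetical weighted graph) follows:

import networkx as nx
import numpy as np

def private_distance(G, source, target, eps, rng=np.random.default_rng()):
    # Shortest-path distance has global sensitivity 1 with respect to neighboring
    # weight functions (Definition 3.14), so Laplace noise of scale 1/eps suffices.
    d = nx.dijkstra_path_length(G, source, target, weight="weight")
    return d + rng.laplace(scale=1.0 / eps)

# Hypothetical usage:
# G = nx.Graph(); G.add_edge("a", "b", weight=2.5); G.add_edge("b", "c", weight=1.0)
# print(private_distance(G, "a", "c", eps=0.5))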

3.2.4 Distributed Private Query Release.

So far, we have reviewed published works for releasing graph statistics in the edge and node differential privacy frameworks and some variants. One thing they have in common is a trustworthy centralized data curator who collects sensitive information from participants and answers analysts’ queries in a private manner. This is also known as the centralized differential privacy model. In contrast, in the local differential privacy (LDP) model [68] there is no central data curator and each individual holds their sensitive information locally. When an analyst wants to calculate a global statistic over the population, each individual first answers the query locally in a private manner, and the analyst then collects the private local statistics from all the individuals and tries to get an aggregate view. Because it affords contributors strong privacy protection, local differential privacy has become a popular research topic in recent years.
An earlier work [146] proposes mechanisms for releasing private triangle, three-hop path, and \(k\)-clique counts in a localized setting called decentralized differential privacy under the edge DP framework. In this new privacy model, each node shares its subgraph count in a private manner that not only protects its connected edges but also edges in its extended local view (ELV), i.e., the two-hop neighborhood. To protect its ELV, a node must add noise proportional to the counting-query sensitivity calculated from neighboring ELVs, which are ELVs of the nodes in neighboring global graphs. This definition of a neighboring ELV leads to large query sensitivity, because two neighboring ELVs could have different sets of nodes. Thus, the authors present a two-phase framework to reduce noise magnitude using local sensitivity. The first phase determines an upper bound over all user local sensitivities. Each node in the second phase then shares its count privately according to this upper bound. The first phase includes two steps, which estimate the second-order local sensitivities [66] and derive an upper bound for the local sensitivities from that estimation.
Another work [168] presents a local framework for graph with differentially private release (LF-GDPR), which is claimed to be the first LDP-enabled graph metric estimation framework for general graph analysis. It designs an LDP solution for a graph metric estimation task by local perturbation, collector/curator-side aggregation, and calibration. It makes the assumption that the target graph metrics can be derived from atomic metrics, in particular the adjacency bit vector and node degree. An optimal solution is then described for the allocation of privacy budget between an adjacency bit vector (derived from a graph adjacency matrix) and node degree. LF-GDPR is stated to enable solution generality and estimation accuracy from LDP perturbation of these two atomic metrics, from which a further range of graph metrics can be derived. The authors show how LF-GDPR can be used to compute clustering coefficients and perform community detection.
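The two atomic metrics can be perturbed locally with standard LDP primitives. The sketch below is our illustration rather than the LF-GDPR implementation: it applies randomized response to each adjacency bit and Laplace noise to the degree, splitting a total budget \(\epsilon\) between the two.

import numpy as np

def perturb_adjacency_bits(bits, eps, rng=np.random.default_rng()):
    # Randomized response: keep each bit with probability e^eps / (e^eps + 1), flip it otherwise.
    p_keep = np.exp(eps) / (np.exp(eps) + 1.0)
    keep = rng.random(len(bits)) < p_keep
    bits = np.asarray(bits)
    return np.where(keep, bits, 1 - bits)

def perturb_degree(degree, eps, rng=np.random.default_rng()):
    # Adding or removing one incident edge changes the degree by 1, so scale = 1/eps.
    return degree + rng.laplace(scale=1.0 / eps)

def local_report(adj_bits, eps_total, split=0.5, rng=np.random.default_rng()):
    eps_a, eps_d = split * eps_total, (1 - split) * eps_total
    return (perturb_adjacency_bits(adj_bits, eps_a, rng),
            perturb_degree(int(sum(adj_bits)), eps_d, rng))

The split parameter stands in for the privacy budget allocation that LF-GDPR optimizes; a fixed 50/50 split is used here only to keep the sketch simple.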

3.2.5 Private Graph Data Mining.

This section surveys complex differentially private graph queries used in Graph Data Mining [4].
3.2.5.1 Frequent Pattern Mining. Frequent subgraph mining counts the number of occurrences of a given subgraph in a collection of graphs. A subgraph is considered frequent if its number of occurrences exceeds a user-specified threshold. The problem has wide applications in bioinformatics and social network analysis. While mining subgraphs of interest is an attractive practical problem, privacy concerns arise when the collection of graphs contains sensitive information.
The first differentially private subgraph mining mechanism [138] uses the exponential mechanism to release the top \(k\) most frequent subgraphs. The main challenge of applying the exponential mechanism directly to obtain frequent subgraphs is calculating the normalizing constant in the sampling distribution, because it is infeasible to enumerate the output space. For this reason, the authors use the Metropolis-Hasting algorithm to sample frequent subgraphs from a Markov chain, in which each chain node is a subgraph and each chain edge represents an operation, which is one of edge addition, deletion, and node addition. The proposed mechanism is shown to be \(\varepsilon\)-differentially private if the Markov chain converges to a stationary distribution.
Another related mechanism [166] proposes to return frequent subgraphs with different numbers of edges, up to a maximal number of edges. To release private subgraph counts, it adds Laplace noise that is calculated using a lattice-based method to reduce the noise magnitude. Since subgraphs with \(m\) edges can be obtained from subgraphs with \(m-1\) edges by adding one edge, this subgraph inclusion property can be used to create a lattice in which each lattice point is a subgraph. This lattice partitions a collection of graphs into mutually disjoint subsets based on the lattice paths of the frequent subgraphs. This technique eliminates irrelevant graphs from the domain, thus reducing the sensitivity of subgraph-counting queries and the amount of noise for a given level of DP.
3.2.5.2 Subgraph Discovery. An earlier work [71] provides a different privacy model over social network data to identify a targeted group of individuals in a graph. This model is of particular interest in domains like criminal intelligence and healthcare. It provides privacy guarantees for individuals that do not belong to the targeted group of interest. To achieve that, it introduces a graph-search algorithm based on a general notion of proximity statistics, which measure how close a given individual is to the targeted set of individuals in the graph. This algorithm performs an iterative search over the graph to find \(k\) targeted disjoint connected components. In each iteration, the algorithm starts from a given node \(v\) and finds the set of all nodes that are part of the same connected component as \(v\). The search for a new component is randomized with noise sampled from the Laplace distribution so that node DP is satisfied. The privacy cost of the algorithm increases with the number of targeted disjoint connected components (subgraphs defined on targeted individuals), and not with the total number of nodes examined. Thus, the privacy cost can be small if the targeted individuals appear only in a small number of connected components in the graph.
3.2.5.3 Clustering. Pinot et al. [115] recently propose a method that combines a sanitizing mechanism (such as the exponential mechanism) with a minimum spanning tree-based clustering algorithm. Their approach provides an accurate method for clustering nodes in a graph while preserving edge DP. The proposed algorithm is able to recover arbitrarily shaped clusters based on the release of a private approximate minimum spanning tree of the graph, by performing cuts iteratively to reveal the clusters. At every iteration, it uses the exponential mechanism to find the next edge to be added to the current tree topology while keeping the weights private, which provides a tradeoff between the degree of privacy and the accuracy of the clustering result.
3.2.5.4 Graph Embedding. Graph embedding [46] is a relatively new graph-analysis paradigm that encodes the vertices of a graph into a low-dimensional vector space in a way that captures the structure of the graph. Reference [165] studies the use of matrix factorization to achieve DP in graph embedding. The application of Laplace and exponential mechanisms can incur high utility loss in existing random-walk-based embedding techniques because of the large amount of edge sampling required and the sensitivity of stochastic gradients. Thus, that study proposes a perturbed objective function for the matrix factorization, which achieves DP on the learned embedding representations. However, to bound the global sensitivity of the target non-private function, it requires complex analytic calculations that scale poorly. A following work [171] proposes to use a Lipschitz condition [120] on the objective function of matrix factorization and a gradient clipping strategy to bound the global sensitivity, with composite noise added in the gradient descent to guarantee privacy and enhance utility.
3.2.5.5 Graph Neural Networks. Graph Neural Networks (GNNs) [175] are designed to improve the computational efficiency and generalization ability of graph embedding techniques. GNNs have superior performance in learning node representations for various graph-inference tasks, including node classification and missing-value imputation, edge or link prediction, and node clustering [162]. While the use of DP in traditional graph analysis and statistics applications is now reasonably well established, there are significantly fewer studies on differentially private GNN training methods [31].
A recent work [99] shows how the procedure of differentially private stochastic gradient descent (DP-SGD) [2] can be transferred from database queries to multi-graph learning tasks where each graph can be seen as an individual entity in a multi-graph dataset. However, this approach cannot be applied to GNNs in a single-graph setting, because the individual data points (nodes or edges) in a graph cannot be separated without breaking up the graph structure.
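At its core, DP-SGD clips each per-example gradient (here, one gradient per graph) to an \(L_2\) norm bound and adds Gaussian noise to the clipped sum before averaging. The minimal numpy sketch below shows one update step and is independent of any particular GNN library; the per-graph gradients are assumed to have been computed elsewhere.

import numpy as np

def dp_sgd_step(params, per_graph_grads, lr, clip_norm, noise_multiplier,
                rng=np.random.default_rng()):
    # per_graph_grads: list of flat numpy arrays, one gradient per graph in the batch.
    clipped = []
    for g in per_graph_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))  # clip to L2 norm <= clip_norm
    noisy_sum = np.sum(clipped, axis=0) + rng.normal(
        scale=noise_multiplier * clip_norm, size=params.shape)    # Gaussian mechanism on the sum
    return params - lr * noisy_sum / len(per_graph_grads)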
Another study [57] applies a random graph splitting method to graph convolutional networks, partitioning a given graph into smaller batches to approximate sub-sampling amplification and then applying differentially private versions of gradient-based techniques like DP-SGD and Adam [76] for training. This method provides higher training efficiency and privacy amplification by sub-sampling. However, it results in tighter privacy bounds than when applied to the whole population. To address this problem, Reference [128] proposes to apply local DP [146] on the node features, without protecting the graph structure. Following the strategy of Private Aggregation of Teacher Ensembles [109], Reference [111] recently proposes different teacher-student models to allow the differentially private release of GNNs.
Several recent works [50, 172, 173] discuss the possibility of performing privacy attacks against GNNs and quantify the privacy leakage of GNNs trained on sensitive graph data. For example, Reference [173] shows that DP in its canonical form cannot defend against a possible privacy attack while preserving utility. Thus, more research is required to investigate how differential privacy noise can be added to graphs to protect sensitive structural information against these privacy attacks.

4 Private Graph Releases

Besides answering graph queries privately, the other popular way of protecting sensitive information in graph data is by generating synthetic graphs similar to the original ones. The major advantage of such methods is that they are independent of graph queries and hence can be used to answer any subsequent graph queries with little or no risk of privacy leakage. Recent works [19, 169] propose generic strategies for evaluating the utility and privacy tradeoff in synthetic graphs to give guidance on how existing private mechanisms perform. In general, a private graph release mechanism can be evaluated on a variety of graph statistics to reflect its reconstruction accuracy from different perspectives. Such evaluations can be compared with other private mechanisms or with non-private graph-generation mechanisms to determine the impact on utility under different privacy settings.
This section surveys the main approaches that have been proposed for releasing synthetic graphs. Section 4.1 touches on non-provable methods and refers readers to a few existing survey papers in that area. Our focus is principally on provable methods, which are discussed in detail in Section 4.2.

4.1 Synthetic Graphs with No Provable Privacy Guarantee

There have been numerous studies on how to release a synthetic graph that is a close approximation to the original graph, while making it difficult to reconstruct the original graph or identify individuals in it. If one does not need any guarantee of the level of protection on the generated synthetic graphs, then there are plenty of options for doing so, ranging from edge/vertex perturbation-based to sampling-based to generalization-based techniques, and so on.
The main problem with these methods is that they do not provide any mathematical guarantees of privacy. For example, consider \(k\)-anonymized graphs. In a \(k\)-anonymized dataset, each record is indistinguishable from at least \(k - 1\) other records with respect to certain identifying attributes [130, 149]. A \(k\)-anonymized dataset can still have major privacy problems due to a lack of diversity of sensitive attributes: the degree of privacy protection does not depend on the size of the quasi-identifier attribute set but is rather determined by the number of distinct sensitive values associated with each quasi-identifier attribute set [90]. Moreover, attackers often have background knowledge, and \(k\)-anonymity does not guarantee privacy against background knowledge attacks [90]. For more details and the milestone works in this line of research, interested readers can consult these surveys [20, 21, 77, 161].

4.2 Synthetic Graphs with Provable Privacy Guarantee

The previous section described graph release mechanisms whose protection relies on assumptions about the amount of background information adversaries have about the sensitive data. As such assumptions are difficult to anticipate, those mechanisms do not provide strong privacy guarantees. This section reviews alternative approaches, which are based on the edge differential privacy notion and thus provide provable privacy guarantees.

4.2.1 Generative Graph Models.

Before privacy was addressed in public graph data, researchers had been working on generative graph models to replicate an underlying unknown data-generating process. These models usually have parameters that can be estimated from a given class of graphs. Thus, one way to synthesize a provably private graph is to ensure the parameters of such generative models are estimated in a way that is differentially private. For this reason, there is a close connection between private query-answering mechanisms and private synthetic-graph generation mechanisms.
Pygmalion [129] is an example of such mechanisms, where a \(dK\)-graph model [91] is used to capture the number of connected \(k\)-node subgraphs with different degree combinations into \(dK\)-series. These \(dK\)-series are then sorted and partitioned into disjoint unions of close sub-series, each of which is made private by the Laplace mechanism. A further noise reduction to the entire series is performed using the constraint inference method [49]. Pygmalion is applied to three real graphs with tens to hundreds of thousands of nodes and tested under some popular graph metrics (e.g., degree distribution, assortativity, graph diameter, etc.) and two application-level tasks, spam filtering and influencer identification. Across a range of privacy budgets, the generated synthetic graphs show only a limited loss of utility compared with non-private alternatives.
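To illustrate the general recipe on the simplest case, the sketch below perturbs a \(dK\)-1 series (the degree histogram) with Laplace noise under edge DP; Pygmalion's actual pipeline operates on partitioned \(dK\)-2 series and applies constrained inference afterwards, both of which are omitted here.

import networkx as nx
import numpy as np

def private_dk1_series(G, eps, rng=np.random.default_rng()):
    # dK-1 series: the number of nodes of each degree. Adding or removing one edge
    # changes the degrees of two nodes, moving each between two histogram bins,
    # so the L1 sensitivity of this histogram is 4 under edge DP.
    degrees = [d for _, d in G.degree()]
    hist = np.bincount(degrees, minlength=max(degrees) + 1).astype(float)
    noisy = hist + rng.laplace(scale=4.0 / eps, size=hist.shape)
    return np.clip(noisy, 0, None)  # simple post-processing: no negative counts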
A following work [155] improves the utility of the \(dK\)-graph model method by adding Laplace noise proportional to smooth sensitivity (Equation (5)). The test subjects, i.e., the \(dK\)-1 and \(dK\)-2 based mechanisms, outperformed a non-private stochastic Kronecker graph (SKG) generator on four real networks as demonstrated in Reference [155]. Although \(dK\)-2 has higher utility than \(dK\)-1, it is only superior for very large privacy budgets. A more recent work [56] reduces the noise magnitude by adding a microaggregation step to the \(dK\)-series before adding Laplace noise. This microaggregation step partitions \(dK\)-series into clusters of similar series and replaces each cluster with a cluster prototype, providing an aggregated series with lower sensitivity.
Another method [96] uses an SKG model [80] that recursively creates self-similar graphs by repeatedly taking the Kronecker product of the adjacency matrix of an initiator graph with itself. The adjacency matrix entries of the initiator graph are the estimated SKG model parameters from a given graph. Following Reference [40], four different subgraph counts (i.e., edges, triangles, 2-stars, 3-stars) are selected as the SKG parameters. The number of triangles is released using smooth-sensitivity-based Laplace noise. The other three subgraphs are counted from a private degree sequence produced by the constraint inference algorithm [49]. This private SKG-based mechanism produces synthetic graphs with graph statistics comparable to those produced by two non-private SKG models.
A different approach in Reference [164] utilizes the Hierarchical Random Graph (HRG) model [25] to encode a network in terms of its edge probabilities. An HRG model of a given graph consists of a dendrogram and associated probabilities. The dendrogram is a rooted binary tree, where each leaf corresponds to a node in the given graph and each internal node has an associated probability. The probability that two nodes are connected in the original graph is captured by the probability of their lowest common ancestor in the dendrogram. The authors use an MCMC procedure to select a good dendrogram. This procedure samples the space by varying the subtree rooted at a randomly picked internal node of the dendrogram. This step also ensures dendrogram privacy, as the MCMC plays a similar role to the exponential mechanism. This HRG-based mechanism shows superior performance over the \(dK\)-2-based mechanism [155] and a spectral method [156] when applied to four real networks.
A mechanism that releases private graphs with node attributes is Reference [64], which is based on the Attributed Graph Model (AGM) [113]. AGM has three classes of model parameters, each of which can be privately estimated. First, node attribute parameters are estimated using counting queries with the Laplace mechanism due to their low sensitivity. Second, attribute-edge correlation parameters are also estimated using counting queries, but on a projection [12] of the original graph into one with bounded maximum degree. Third, edge-generation parameters are modeled by a generative model, called TriCycLe, which simulates the degree sequence and clustering coefficients. This model is used with two parameters, the degree sequence and the triangle count, which are privately estimated by the methods of References [49] and [170], respectively.

4.2.2 Graph Matrix Perturbations.

Besides generative graph models, another approach is to release private approximations of the original adjacency and Laplacian matrices using matrix perturbation strategies.
Examples of such mechanisms are proposed in Reference [156] to release the largest \(k\) eigenvalues \(\lbrace \lambda _1, \dots , \lambda _k\rbrace\) and eigenvectors \(\lbrace u_1, \dots , u_k\rbrace\) of an adjacency matrix, which can be turned into a lower rank adjacency matrix by \(M_A^k = \sum _{i=1}^k \lambda _i u_i u_i^T\). A first mechanism directly adds Laplace noise to the top \(k\) eigen pairs in proportion to their global sensitivity. A second mechanism uses a Gibbs sampler [52] to sample the largest \(k\) eigenvectors from the matrix Bingham-von Mises-Fisher distribution, which is a probability distribution over orthonormal matrices, such as the eigenvector matrix. The first mechanism outperforms the second one in many experimental settings on real network data.
Another contribution [23] focuses on generating private graphs from original graph data in which the existence of an edge may be correlated with the existence of other edges. It introduces an additional parameter \(k\) that controls the maximum number of correlated edges and evenly splits the privacy budget \(\epsilon\) into portions of \(\epsilon /k\). This setting is consistent with the \(k\)-edge differential privacy framework [49]. Once the privacy budget is split, the adjacency matrices are sanitized as if they were not correlated. The adjacency matrix perturbation process consists of node relabeling, dense region discovery, and edge reconstruction using the exponential mechanism. The first two steps find high-density regions in the adjacency matrix, which can then be reconstructed accurately [43]. Each of these three steps receives a portion of \(\epsilon /k\) as its privacy budget. The proposed density-based exploration and reconstruction (DER) mechanism outperforms a simple Laplace mechanism in all test cases [23].
A drawback of the HRG [164] and the DER [23] mechanisms is their quadratic running time in the number of nodes. A more efficient matrix perturbation mechanism is Top-m Filter [103], which runs linearly in the number of edges. It starts by adding Laplace noise to each cell in the adjacency matrix, then keeps only the cells with the largest noisy values as edges in the perturbed matrix.
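The core idea is easy to illustrate directly. The published algorithm uses a threshold trick to avoid materializing the full noisy matrix and thereby stays linear in the number of edges; the quadratic version below is for exposition only.

import numpy as np

def top_m_filter_naive(adj, eps, m, rng=np.random.default_rng()):
    # adj: symmetric 0/1 adjacency matrix; m: number of edges to keep in the output.
    n = adj.shape[0]
    iu = np.triu_indices(n, k=1)                       # work on the upper triangle only
    noisy = adj[iu] + rng.laplace(scale=1.0 / eps, size=len(iu[0]))
    keep = np.argsort(noisy)[-m:]                      # the m cells with the largest noisy values
    out = np.zeros_like(adj)
    rows, cols = iu[0][keep], iu[1][keep]
    out[rows, cols] = 1
    out[cols, rows] = 1
    return out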
In Reference [18], the authors focus on releasing private adjacency matrices for weighted directed graphs, and define different neighboring graphs than in edge and node DP. They use the Laplace mechanism to add noise to the adjacency matrices according to the sensitivity within blocks of entries, as some edge weights are less sensitive than others and so should be treated differently. They then propose an automated method to partition the matrix entries without any prior knowledge of the graph.
Some earlier works on publishing low-rank private approximations of matrices could be used to publish sanitized private adjacency or Laplacian matrices. For example, in Reference [13] the authors study the connection between the singular value decomposition and the eigendecomposition of matrices. More precisely, given a matrix \(M_A\), they use the top \(k\) eigenvalues of a perturbed matrix \(M_A\cdot M_A^T\) with carefully calculated Gaussian noise to privately approximate \(M_A\). The truncated eigenvalues can then be used to obtain a rank-\(k\) approximation of the original matrix. A subsequent work improves on the utility of the approximated low-rank matrices [47], but it is inapplicable to graph data because the input matrices must have unbalanced dimensions to satisfy a low-coherence assumption. Although proved under the assumption that the input matrices are symmetric and positive semidefinite, the work in Reference [65] can be generalized to symmetric matrices and still improves the quality of the released low-rank matrices under \(\epsilon\)-differential privacy. The strategy is to use the exponential mechanism to sample a rank-1 approximation of the given matrix \(M_A\) with a utility function proportional to \(\exp (z^T\cdot M_A\cdot z)\), which requires \(M_A\) to be positive semidefinite. This step is repeated \(k\) times, and the sampled rank-1 vectors \(v_i\) are accumulated in the form \(v_i^T\cdot M_A\cdot v_i+Lap(\frac{k}{\epsilon })\) to get a final private rank-\(k\) approximation of \(M_A\).

4.2.3 Distributed Private Graph Release.

In the above approaches, a single data custodian knows about the entire input graph, then applies a DP-based mechanism to release a synthetic version of it. As we discussed in Section 3.2.4, LDP can be used to protect sensitive information of individuals from an untrustworthy data custodian [68]. In this setting, each data source locally perturbs sensitive data before sending them to the data curator to construct a representative graph.
LDPGen [119] is one of the first LDP-based private synthetic graph generation techniques. It groups nodes with similar degree vectors into the same cluster. Inter- and intra-cluster edges can then be generated to obtain a private synthetic graph. More specifically, the clustering step starts with a random node clustering. Each node \(u\) then shares a noisy degree vector \((\tilde{\sigma }_1^u, \dots , \tilde{\sigma }_{k_0}^u)\) under the current clustering scheme, where \(\sigma _1^u\) is the degree of \(u\) in the first cluster. Each noisy degree vector satisfies the LDP property. The data curator updates the clustering scheme once all noisy degree vectors are received. The updated clustering scheme is communicated back to all individuals, who refine their private degree vectors and share them with the data curator again for a further update round. When compared against two simpler mechanisms [119], LDPGen generates synthetic graphs with higher utility under different use cases, such as community discovery.
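The per-node report in LDPGen can be sketched as follows, where clusters is a hypothetical mapping from node identifiers to cluster labels known to each node from the previous round, and k is the current number of clusters.

import numpy as np

def noisy_degree_vector(neighbors, clusters, k, eps, rng=np.random.default_rng()):
    # Each node reports, per cluster, how many of its neighbors fall into that cluster.
    vec = np.zeros(k)
    for v in neighbors:
        vec[clusters[v]] += 1
    # Adding or removing one incident edge changes exactly one entry by 1,
    # so Laplace noise of scale 1/eps suffices for edge LDP on this vector.
    return vec + rng.laplace(scale=1.0 / eps, size=k)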
Another LDP-based approach focuses on preserving the structural utility of the original graph \(G\) [37]. To generate the synthetic graph, the authors first split \(G\) into multiple subgraphs. They then carefully select a set of subgraphs without any mutual influence to be sanitized. They use the HRG model [25] to capture the local features of each selected subgraph. LDP is then introduced into each HRG, and the corresponding subgraph of \(G\) is regenerated according to the sanitized HRG, producing a privatized version of \(G\). The local noise added to each HRG preserves more structural information than applying global differential privacy to the original graph, yielding synthetic graphs with higher utility.

4.2.4 Iterative Refinement.

Some strategies iteratively search for a synthesized graph that is both private and close to the original graph under the guidance of an objective function. For example, the IDC mechanism [43] discards poor approximations until a good candidate is found. In Reference [43], the graph data consist of weighted edges between pairs of nodes, and the synthetic graph-release mechanism for linear queries (including cut queries) uses the Laplace mechanism to add noise to edge weights, with an additional linear programming step that solves for a close approximation whose weights are restricted to \([0,1]\), thereby removing negative edge weights. In References [117, 118], an MCMC-based mechanism starts from a random graph generated from a differentially private degree sequence and then searches for an optimal graph with edge-swapping operations to better fit the wPINQ measurements, while remaining consistent with the given private degree sequence.
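The first step of the mechanism in Reference [43] can be sketched as Laplace noise on the edge weights followed by a projection back onto \([0,1]\). In the sketch below, the linear-programming step is replaced by simple clipping and the noise scale is a placeholder; both are rough stand-ins for the calibrated mechanism described in the paper.

import numpy as np

def noisy_projected_weights(weights, eps, rng=np.random.default_rng()):
    # weights: dict mapping node pairs to edge weights in [0, 1].
    noisy = {e: w + rng.laplace(scale=1.0 / eps) for e, w in weights.items()}
    # Crude projection onto the feasible set [0, 1]; Reference [43] instead solves a
    # linear program for the closest consistent approximation.
    return {e: float(np.clip(w, 0.0, 1.0)) for e, w in noisy.items()}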

5 Beyond Differential Privacy: Limitations and Alternatives

As we have seen throughout this article, differential privacy is by far the most popular framework for analyzing and designing provably private graph data release algorithms. In this section, we discuss some known limitations of the differential privacy framework, especially as applied to graph data, and describe other formal privacy definitions as alternative frameworks.

5.1 Differential Privacy on Correlated Data

Correlation among records in a dataset can affect the privacy guarantees of DP mechanisms [72]. Indeed, the real-world concept of privacy often cannot be modeled properly using only the presence or absence of an entity’s record in the data; it must also account for the entity’s participation in the data-generating process. Consider, for example, two social network graphs \(G_1\) and \(G_2\) that differ in only one edge and a randomized algorithm that evolves \(G_1\) into \(G_1^{\prime }\) and \(G_2\) into \(G_2^{\prime }\) using a common Forest Fire model [81]. A query on the number of edges between two communities in the evolved graph, at best, cannot be answered in a differentially private way with sufficient utility and, at worst, is answered in a way vulnerable to attacks, because the privacy issue is modeled incorrectly as a single edge difference between \(G_1^{\prime }\) and \(G_2^{\prime }\) instead of between \(G_1\) and \(G_2\). Indeed, path dependency in a graph-evolution model like Forest Fire results in correlated data in \(G_1^{\prime }\) and \(G_2^{\prime }\) that allow their origins \(G_1\) and \(G_2\) to be easily distinguished.
More generally [72], under almost any reasonable formalization, the assumption that evidence of participation can be encapsulated by exactly one record is implied by the assumption that all the records are generated independently, although not necessarily from the same distribution. This is formally proved in Reference [74] and the applicability of DP is limited by that independence assumption. This is a serious limitation, because real-world data, especially graph data, often exhibit strong correlations between records. These correlations between records exist due to behavioral, social, or genetic relationships between users [87, 174]. For example, in social network data, it is highly likely that the locations of friends exhibit strong correlations, since they tend to visit the same places.
Following the theoretical work in Reference [72], Liu et al. [87] use an inference attack to demonstrate the vulnerability of applying differential privacy mechanisms on correlated graph data. In their experiments on social network data, they show that an adversary can infer sensitive information about a user from private query outputs by exploiting her social relationships with other users that share similar interests. Such social and behavioral correlations between users have also been used to perform de-anonymization attacks on released statistical datasets [145]. In another related work [7], the authors show that an adversary can infer information about an individual in a statistical genomic dataset using information about other members of the individual’s household. For example, they can infer the susceptibility of an individual to a contagious disease by using the correlation between the genomes of family members.

5.2 Dependent Differential Privacy

The notion of dependent differential privacy (DDP) [87] considers correlations between records in a statistical dataset to counter inference attacks by adversaries who have prior information about the probabilistic dependence between these records. DDP introduces the concept of a dependence coefficient that quantifies the level of correlation between two records. Summing the dependence coefficients between a record and all records correlated with it yields a quantification of how changes in that record can affect the related records in the dataset. The maximum dependence coefficient allows a user to calculate the dependent sensitivity for answering a query over a correlated dataset. This sensitivity measure can then be used to instantiate the Laplace mechanism to achieve privacy while minimizing noise. However, in practice, the effectiveness of the DDP framework on tabular data is limited by how well the correlation among the data can be modeled, which is a challenging problem in itself.
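As a rough illustration of how the dependence coefficients enter the mechanism, the sketch below scales the base query sensitivity by the largest summed coefficient and instantiates the Laplace mechanism with the result; the exact definition of dependent sensitivity in Reference [87] involves details that are omitted here.

import numpy as np

def ddp_laplace(query_value, base_sensitivity, dependence, eps, rng=np.random.default_rng()):
    # dependence: square matrix where dependence[i][j] quantifies how much record j
    # can change when record i changes (with dependence[i][i] = 1).
    dep = np.asarray(dependence)
    dependent_sensitivity = base_sensitivity * dep.sum(axis=1).max()
    return query_value + rng.laplace(scale=dependent_sensitivity / eps)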
In contrast, correlations in graphs are inherent, since the relationships between nodes explicitly represent how the nodes are connected. Assuming each node has a degree of \(m\), a modification of an attribute value of a node potentially causes changes in at most \(m-1\) other nodes due to the probabilistic dependence relationships between nodes. Thus, the dependence coefficient measure can be used in graph data to quantify the amount of noise that needs to be added to each query answer considering the nodes and their surrounding neighbors in the graph. As an example, in Reference [174] the authors use probabilistic graphical models to explicitly represent the dependency between records and show how the structure of correlated data can be carefully exploited to introduce noise into query responses while achieving higher utility.

5.3 Pufferfish Privacy

The above works on correlated data have led to further studies on generalizing DP as a framework that can be customized to the needs of a given application. We now discuss one such framework called Pufferfish [74] that makes it easier to generate new privacy definitions with rigorous statistical guarantees about the leakage of sensitive information. Building on Pufferfish, several classes of privacy definitions have recently been proposed, including Blowfish [51] and Bayesian DP [167].
The Pufferfish framework is defined by the following components: a set \(\mathcal {S}\) of potential secrets, a set \(\mathcal {S}_{\text{pairs}} \subseteq \mathcal {S} \times \mathcal {S}\) of mutually exclusive pairs of secrets to be protected, and a set \(\mathcal {D}\) of data-generation processes. The set \(\mathcal {S}\) serves as an explicit specification of what we would like to protect, for example, the record of an entity \(x\) is/is not in the dataset. The set \(\mathcal {S}_{\text{pairs}}\) represents all the pairs of secrets that should remain indistinguishable from each other given the query response. Finally, \(\mathcal {D}\) represents a set of assumptions about how the data evolved (or was generated) that reflects the adversary’s belief about the data, such as probability distributions, or variable correlations.
Definition 5.1 ([74]).
Given \(\mathcal {S}\), \(\mathcal {S}_{\text{pairs}}\), \(\mathcal {D}\), and a privacy parameter \(\epsilon \gt 0\), a randomized algorithm \(\mathcal {M}\) satisfies \(\epsilon\)-\(\mathit {Pufferfish}(\mathcal {S}, \mathcal {S}_{\text{pairs}}, \mathcal {D})\) privacy if \(\forall \mathcal {O}\in \mathit {range}(\mathcal {M})\), \(\forall (s_i,s_j) \in \mathcal {S}_{\text{pairs}}\), \(\forall \theta \in \mathcal {D}\), and for each dataset \(D\) that can be generated from \(\theta\), we have
\[\begin{gather} Pr[\mathcal {M}(D) = \mathcal {O}\,|\, s_i,\theta ] \le e^\epsilon Pr[\mathcal {M}(D) = \mathcal {O}\,|\, s_j,\theta ], \end{gather}\]
(7)
\[\begin{gather} Pr[\mathcal {M}(D) = \mathcal {O}\,|\, s_j,\theta ] \le e^\epsilon Pr[\mathcal {M}(D) = \mathcal {O}\,|\, s_i,\theta ] , \end{gather}\]
(8)
where the probabilities are taken over the randomness in \(\theta\) and \(\mathcal {M}\).
One can show that Equations (7) and (8) are equivalent to the following condition on the odds ratio of \(s_i\) and \(s_j\) before and after seeing the query output \(\mathcal {O}\):
\begin{equation} e^{-\epsilon } \le \frac{Pr[s_i \,|\, \mathcal {M}(D) = \mathcal {O}, \theta ] / Pr[s_j \,|\, \mathcal {M}(D) = \mathcal {O}, \theta ]}{ Pr[s_i \,|\, \theta ] / Pr[s_j \,|\, \theta ]} \le e^\epsilon . \end{equation}
(9)
Recall that in the Pufferfish framework, each probability distribution \(\theta \in \mathcal {D}\) corresponds to an attacker’s probabilistic beliefs and background knowledge. Thus, for small values of \(\epsilon\), Equation (9) states that observing the query output \(\mathcal {O}\) provides little to no information gain to attackers \(\theta\) who are trying to infer whether \(s_i\) or \(s_j\) is true. When the records in the dataset are assumed to be mutually independent, so that no correlation exists, the Pufferfish privacy definition coincides with \(\epsilon\)-differential privacy.
To the best of our knowledge, the only mechanism that currently provides Pufferfish privacy is the Wasserstein mechanism [140], which is a generalization of the Laplace mechanism for DP. In Reference [140], the authors prove that it always adds less noise than other lesser-used mechanisms such as group differential privacy [110]. This makes Pufferfish-based techniques more applicable to graph data. For example, consider a connected graph \(G=(V,E)\) where each node \(v\in V\) represents an individual and each edge \((v_i,v_j)\in E\) represents the relationship between two individuals. The set of edges can be interpreted as values in the domain that an adversary must not distinguish between; i.e., the set of discriminative secrets is \(S^G_{pairs}=\lbrace (s_{v_i},s_{v_j}): \forall (v_i,v_j)\in E\rbrace\). Following Definition 5.1, one can then add Laplace noise whose scale grows with \(w/\epsilon\), where \(w\) is the average size of a connected component in \(G\), to achieve \(\epsilon\)-Pufferfish\((V\times V, S^G_{pairs}, G)\) privacy.

5.4 Other Provable Privacy Definitions

Inferential Privacy [39], which measures the difference between an adversary’s beliefs about sensitive inferences before and after observing any released data, is a notion similar to Pufferfish privacy. It relies on modeling correlated data as a Markov chain and adding noise proportional to a parameter that measures the correlation. Its mechanisms are less general than the Wasserstein mechanism [140], but are applicable to a broader class of models than the Markov quilt mechanisms.
Other notions such as Adversarial Privacy [121] are weaker than DP, but give higher utility when querying social networks under certain assumptions. Adversarial privacy is achieved in graph queries if the prior and posterior of a data point after seeing the query output are almost indistinguishable. In Reference [121], the authors restrict adversaries’ prior distributions to a special class of distributions and prove that adversarial privacy is equivalent to \(\epsilon\)-indistinguishability, a generalization of \(\epsilon\)-differential privacy, in the sense that neighboring datasets \(D_1\) and \(D_2\) satisfy either \(|D_1|=|D_2|=n\) and \(|D_1 \oplus D_2|=2\) or \(|D_1|=|D_2|+1\) and \(D_2\subsetneq D_1\). They further state that adversarial privacy can be applied to stable queries, such as subgraph counting queries.
Compared to the above, Zero-Knowledge Privacy [38] proposes a stronger privacy definition to protect individual privacy where differential privacy may fail. It argues that the standard concept of differential privacy may not protect individual privacy in some social networks, where specific auxiliary information about an individual may be known. Given an adversary who has access to a mechanism that runs on a dataset, Zero-Knowledge Privacy requires that what the adversary learns from the mechanism’s outputs is similar to what it could learn from certain aggregate information alone. The choice of aggregate information is central to the privacy concept; for example, it can be the result of any computation on a subset of \(k\) randomly chosen samples from the given dataset with an individual’s record concealed. With this setting, Zero-Knowledge Privacy satisfies composition and group privacy just as differential privacy does. Using an extended formulation for graphs with bounded degrees, Gehrke et al. [38] prove Zero-Knowledge Privacy for the average-degree query and graph-edit-distance queries (in terms of edge addition and deletion), as well as Boolean queries like whether a graph is connected, Eulerian, or acyclic.

6 Application Domains and Example Use Cases

The previous sections provided a domain-agnostic description and classification of mechanisms to release graph data and answer graph queries with enhanced privacy. In contrast, this section discusses these same mechanisms in the context of domain applications and industry sectors where private graph analytics are used in practice to provide value. Table 3 summarizes the application domains discussed in this section and the related surveyed contributions. Moreover, some recent technology such as the Internet-of-Things (IoT) can be applied to many of these domains and produces data that are best captured and analyzed using graph-based models and analytics. Indeed, distributed IoT devices readily map to graph vertices, while their relationships, interactions, and measured data map to graph edges. The supplemental material (online only) briefly introduces use cases of IoT technology in the context of graph analytics for these application domains.
Application Domains | Example Graph Analytics (private and non-private)
Social Networks | [58, 61, 101, 121, 151, 152, 176]
Financial Services | [21, 22, 49, 67, 71, 116, 118, 123, 134, 138, 142, 158, 166]
Supply Chains | [10, 42, 102, 108, 122, 141, 150, 154]
Health | [29, 30, 73, 75, 82, 98, 107, 110, 114, 137, 143, 147]
Table 3. Example of Application Domains of Graph Analytics with Some Related Surveyed Contributions

6.1 Social Networks and Related Services

Over the past few decades, social networks have become a global platform for individuals to connect with each other. These platforms allow third-party businesses and their advertising partners to access an unprecedented level of information, which can be used to reach potential customers or for better social targeting [77]. However, the sensitive information of individuals contained in social networks can be leaked through insecure data sharing [3]. For example, in early 2018, it was reported that up to 87 million Facebook users’ personal information and social relationships might have been shared with a political consulting company, Cambridge Analytica, without the individuals’ authorization [58]. Differentially private techniques can be used to ensure the privacy of individuals is preserved when social network data are queried or published [101, 121, 176].
One real-life application of graph analytics to social networks is described in Reference [133], where the authors use graph metrics on social interactions between 82 students and lecturers in the context of courses at Qassim University. They identify patterns associated with learning outcomes, and use these graph-derived insights to design and apply interventions to improve student engagement, marks, and knowledge acquisition. For example, they use in/out degree sequence metrics to estimate students’ level of activity, or closeness and betweenness centrality metrics to identify roles in collaborating groups of students. As surveyed earlier in this article, many provably private mechanisms have been proposed for these specific graph metrics [30, 69, 120, 125].
Community detection is another common graph analysis, which is routinely performed on social network data to mine complex topologies and understand the relationships and interactions between individuals and groups [151]. For example, in Reference [84], a team of social science researchers performs such analytics on graph data from students of a well-known U.S. college to identify and study ethnic and cultural communities, and derives several structural and longitudinal insights about them (e.g., tastes in a given music genre are more commonly shared within a given ethnic group). As illustrated by that real-world study, third parties usually conduct such community detection research, which could lead to significant privacy breaches [58]. Ji et al. [61] propose a community detection algorithm that protects the privacy of the network topology and node attributes. It formulates community detection as a maximum log-likelihood problem that is decomposed into a set of convex sub-problems on the relationships and attributes of one individual in the network. It then achieves DP by adding noise to the objective functions of these sub-problems.

6.2 Financial Services

The financial sector is one of the earliest to embrace the Big Data revolution of the past decades [144]. While initial applications focused on extracting insights on internal data within a single organization to provide added benefits to their stakeholders and customers, more recent applications aim at data across businesses, types of institutions, and countries, to produce further utility and financial benefits. Graph analytics are the natural tools to derive knowledge from the networks of data points, which arise when combining diverse datasets from multiple sources.
The financial sector has been forecast to spend more than USD 9 billion annually to combat fraud [148]. Several graph-based techniques have been proposed and surveyed to tackle this challenge [116]. In a specific real-world example [158], forensic analysis researchers apply several graph metrics, such as cycle detection, degree distribution, and PageRank, on graph data collected as part of the so-called Know-Your-Customer process of a private bank. Using these graph analytics, they achieve a reduction of 20–30% in false positives for the detection of suspicious financial activities by expert analysts. However, these approaches often do not consider the privacy and confidentiality issues that constrain the sharing of data between different financial organizations. Many of the previously surveyed works provide privacy-enhanced alternatives to these techniques. For example, some methods exploit the density of groups of nodes and their interconnections [22], which can be obtained through provably private degree sequences, as described in References [49, 67, 118].
Modeling customer behaviors in financial services is another application of graph analytics in the finance sector. It allows businesses such as banks and insurers to better understand their customers, leading to tailored products and services [1]. In one existing trial [44], data analysts use a credit scoring prediction solution, which models SMEs’ financial behaviors as graphs and applies a process based on adjacency matrices, correlation distances, node degree, and closeness to compute their credit scores. They partner with a European FinTech registered as a Credit Rating Agency and apply this solution to its data. This produces credit scoring models that are significant predictors of loan defaults. Identifying and characterizing patterns in financial graph data is another class of techniques for building such models of customer behavior [21, 123]. Several provably private methods were previously described in Section 3.2.5 to allow this pattern mining [71, 138, 142, 166].

6.3 Supply Chain

Supply chains are fundamentally graphs. Indeed, regardless of the sector, they capture participants (e.g., food producer, manufacturer, transporter, retailer, etc.) and their relationships (e.g., transactions, raw material supply/delivery, subcontracting services, etc.), which form the nodes and edges of the corresponding graphs. Thus it is not surprising that various graph analytics have been applied to different types of supply chains for many purposes.
Resilience is critical in supply chains, and several graph analytics contributions focus on addressing this concern. In a recent industrial use-case [54], researchers at Ford Motor Company propose a graph model to capture the flows and relationships of materials from suppliers to finished products in their automotive supply chains. They further develop a novel graph-based metric called Time-to-Stockout, which allows them to estimate the resilience of parts of these chains with respect to both market-side demand and supply-side inventory. They apply their graph model and metric to real industrial data from the Ford supply chain to demonstrate their effectiveness. Other more traditional graph metrics, such as adjacency matrices, can be used to identify and update the weakest parts of a given chain to increase its robustness [154]. Specific graph models also exist to assess and increase structural redundancy in supply chains [150]. One example of such novel models defines a supply chain resilience index to quantify resilience based on major structural enablers in the graph and their interrelationships [141]. In another example, an approach based on node degree is proposed to evaluate the resilience of different Australian-based supply chains (e.g., lobster and prawn fisheries, iron ore mining) [150].
Optimization is another applied area of graph analytics in supply chains. In some real-world use cases, the graph analytics provided by the Neo4J tool have been used to lower response time in product quality management where several suppliers are involved [102], and to lower cost and complexity in inventory, payment, and delivery management [122]. In another case study on a laptop manufacturing supply chain, researchers use a novel graph-based cost function and an algorithm based on similarity measures to optimize the reconfiguration of a supply chain, lowering the overall manufacturing cost of the laptops [42].

6.4 Health

With the recent COVID-19 pandemic, the healthcare industry has been developing various systems that can mine insights from healthcare data to support diagnoses, predictions, and treatments [6]. Graph analytics allow researchers to effectively and efficiently process large, connected data. However, due to the highly sensitive nature of patient data, such data requires strong privacy guarantees before it can be released or used in healthcare applications.
Differential privacy has been proposed as a possible approach to allow the release of healthcare data with sufficient guarantee against possible privacy attacks [114]. For example, as we reviewed in Section 3.2.2, node differential privacy can be applied to select nodes in a patient graph to detect outbreaks of diseases [82]. Next, we discuss a few applications where graph-based differential privacy techniques have been used in the healthcare domain.
Along with the development of IoT technologies, smart healthcare services are receiving significant attention. They focus on disease prevention by continuously monitoring a person’s health and providing real-time customized services. Such devices enable the collection of a vast amount of personal health data, which can be modeled as graphs. However, such graph data needs to be treated with appropriate privacy techniques before they are used in data analytic processes as individuals can be re-identified by tracking and analyzing their health data [29].
An earlier work [75] proposes a novel local differential privacy mechanism for releasing health data collected by wearable devices. The proposed approach first identifies a small number of important data points from an entire data stream, perturbs these points under local differential privacy, and then reports the perturbed data to a data analyst, instead of reporting all the graph data. Compared to other approaches that release private graph data in the form of histograms, such as Reference [30], local differential privacy-based approaches provide a significant improvement in utility while preserving privacy against possible attacks [7]. Further, the development of differentially private data analysis systems, such as GUPT [98], allows the efficient allocation of different levels of privacy for different user queries following the smooth sensitivity principle [107].
In another recent contribution [147], the authors analyze different challenges present in healthcare data analysis. In their analysis, they argue that different algorithms should be used to approximate group influences to understand the privacy-fairness tradeoffs in graph data. Indeed, some of the techniques discussed in Section 5 can potentially result in a high influence of majority groups in datasets as opposed to minority ones, imposing an asymmetric valuation of data by the analysis model. This calls for novel privacy techniques, such as Pufferfish privacy [73], to be used in such clinical settings to make sure minority class memberships are represented appropriately in data analysis models without compromising privacy.

7 Empirical Studies and Open Research Questions

This section describes existing open source DP tools and discusses their limitations. It also introduces a novel DP library that overcomes some of these limitations. It then discusses possible research directions related to graph queries that are difficult to make provably private.

7.1 Implementations

Despite the extensive literature devoted to graph differential privacy, there is a paucity of open source tools that implement these techniques. As such, it is frequently the case that researchers that want to use or extend these results will need to build their own tools to do so. In many cases, a differentially private release mechanism for graph statistics can be decomposed into two pieces: a computation step that produces exact query responses on an underlying graph and a perturbation step that adds noise to the exact query responses to achieve some level of differential privacy.
Computing the exact query responses involves working with the underlying graph data structure. There are many open source tools that provide efficient implementations of common graph algorithms; see, for example, References [17, 28, 45, 112]. In some cases, these algorithms can be used directly to compute the statistics of interest. In other cases, these algorithms are used to produce lower-sensitivity approximations to a query of interest. For example, in Reference [69] network flows on graphs derived from the original graph are used to compute query responses on arbitrary graphs with sensitivity similar to that of the query restricted to graphs with bounded degree.
The perturbation step then amounts to a straightforward use of the appropriate differentially private release mechanism. While many such mechanisms are mathematically simple, creating secure implementations is fraught with peril. As with cryptographic libraries, it is important to consider issues such as secure random number generation and robustness with respect to various side-channel attacks. Furthermore, because many differentially private release mechanisms operate on real numbers, as opposed to integers or floating point numbers that can be represented exactly in a computer, naïve implementations of some mechanisms can be insecure [97]. There are a number of open source differential privacy libraries, such as References [36, 53, 127, 160]. These libraries vary widely in the kinds of functionality that they provide, the security guarantees that they can offer, and the performance that they can achieve. While most open source libraries are sufficient for some tasks, e.g., research involving numerical simulations that are based on synthetic data, some may be insufficient for use in applications that will be used to protect real sensitive data.
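The two-step pattern is easy to illustrate: the exact statistic is computed with an off-the-shelf graph library and then perturbed. The sketch below uses networkx for the computation step and a plain numpy Laplace sample for the perturbation step; as discussed above, a production system should draw its noise from a vetted DP library instead, and the sensitivity bound shown assumes the input graph has maximum degree at most theta.

import networkx as nx
import numpy as np

def private_triangle_count(G, theta, eps, rng=np.random.default_rng()):
    exact = sum(nx.triangles(G).values()) // 3   # computation step: exact triangle count
    # Perturbation step: under edge DP, one edge change creates or destroys at most
    # theta - 1 triangles in a graph whose maximum degree is at most theta.
    return exact + rng.laplace(scale=(theta - 1) / eps)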
We have produced reference implementations3 of some of the algorithms described in References [69, 107]. These reference implementations use a new differential privacy library, RelM,4 developed by one of the authors to perturb the data prior to release. RelM provides secure implementations of many of the differentially private release mechanisms described in Reference [34]. In cases where the computation of differentially private query responses cannot be decomposed into separate computation and perturbation steps, the situation is grimmer. We are unaware of any libraries that provide such functionality. Thus, further research on such algorithms will require the use of bespoke tools.

7.2 Empirical Studies

Many differentially private release mechanisms provide utility guarantees in the form of a bound on the probability that the difference between the perturbed and exact query responses exceeds a given threshold. These worst-case guarantees do not necessarily describe how a release mechanism will perform when applied to a given dataset. As such, many authors provide the results of experiments run on example datasets to demonstrate the average-case behavior of their mechanism.
While many datasets have been used in such experiments, several have been used frequently enough to comprise a de facto standard corpus. In particular, the datasets provided by the Stanford Network Analysis Project [83] have been used by multiple authors to test release mechanisms intended for use with a wide variety of graph analytics. Of these, datasets describing the collaboration network for papers submitted to various categories of the e-print arXiv (ca-HepPh, ca-HepTh, and ca-GrQc), the Enron email communication network (email-Enron), ego networks from various social networks (ego-Facebook, ego-Twitter, com-LiveJournal), and voting data for the election of Wikipedia administrators (wiki-Vote) were particularly popular.
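As a reproducibility aid, a SNAP dataset such as ca-GrQc can be loaded directly from its edge-list format with NetworkX; the local file path below is an assumption, and the file must first be downloaded from the SNAP website [83].

```python
import networkx as nx

# SNAP edge lists are whitespace-separated, with comment lines starting with '#'.
# Assumes ca-GrQc.txt has been downloaded and unzipped into the working directory.
G = nx.read_edgelist("ca-GrQc.txt", comments="#", nodetype=int)

print(G.number_of_nodes(), G.number_of_edges())
degree_histogram = nx.degree_histogram(G)   # counts of nodes with degree 0, 1, 2, ...
print("maximum degree:", len(degree_histogram) - 1)
```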
In addition, many authors describe the performance of proposed release mechanisms on random graphs. The most common model used to generate such graphs in the papers we surveyed was the Erdős–Rényi–Gilbert model. For mechanisms intended for use with scale-free networks, the Barabási–Albert model was frequently used to generate suitable random graphs.
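Both models are available as standard generators, so such experimental setups are easy to reproduce; the parameter values in the sketch below are arbitrary assumptions.

```python
import networkx as nx

# Erdős–Rényi–Gilbert G(n, p): each possible edge is present independently with probability p.
er_graph = nx.erdos_renyi_graph(n=1000, p=0.01, seed=42)

# Barabási–Albert preferential attachment: each new node attaches to m existing nodes,
# producing the heavy-tailed degree distribution typical of scale-free networks.
ba_graph = nx.barabasi_albert_graph(n=1000, m=5, seed=42)

print(nx.density(er_graph), nx.density(ba_graph))
```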
A recent work in Reference [163] has benchmarked selected edge DP and node DP mechanisms for privately releasing answers to degree sequence and subgraph counting queries. This work is implemented as a web-based platform, DPGraph, with built-in DP-based graph mechanisms, real graph data that frequently appear in related empirical studies, and visual evaluations to assist a user in selecting the appropriate mechanism and its privacy parameters for the project at hand.
The empirical studies carried out in Reference [163] and Reference [105] cover a wide range of DP-based graph mechanisms that have been reviewed in this article, including edge DP degree sequence [49], node DP degree sequence [30, 69, 120], and edge DP subgraph counting [24, 66, 107, 170].
The accuracy of these mechanisms is compared across a range of privacy budget values, and their running time is compared across networks of different sizes. In general, graph size, graph shape, and query type all have an impact on the performance of a mechanism, so the question of which mechanism to use, and with what parameter values, depends on the specific project. That said, the authors also draw some notable conclusions about the different mechanisms. For private degree sequence release, Reference [49] has the best utility and running time under edge DP, while Reference [30] performs best on both counts under node DP. The performance of subgraph counting mechanisms depends more on the subgraph type. For example, Reference [66] performs best for \(k\)-star counting, while Reference [170] shows the lowest error for other subgraphs.
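The kind of privacy budget sweep used in these studies is simple to reproduce for a baseline mechanism. The sketch below measures the mean per-bin error of a naive Laplace-perturbed degree histogram (edge DP, \(L_1\) sensitivity 4) across several values of \(\epsilon\); it is a baseline illustration under the stated assumptions, not the constrained-inference mechanism of Reference [49] or the node DP mechanism of Reference [30].

```python
import numpy as np
import networkx as nx

def noisy_degree_histogram(graph, epsilon, rng):
    """Edge DP release of the degree histogram via the Laplace mechanism.

    Adding or removing one edge changes the degree of two nodes by one,
    moving each between two histogram bins, so the L1 sensitivity is 4.
    """
    hist = np.asarray(nx.degree_histogram(graph), dtype=float)
    return hist + rng.laplace(scale=4.0 / epsilon, size=hist.shape)

G = nx.barabasi_albert_graph(2000, 3, seed=1)          # stand-in for a real network
exact = np.asarray(nx.degree_histogram(G), dtype=float)
rng = np.random.default_rng(0)

for epsilon in (0.1, 0.5, 1.0, 2.0):
    errors = [np.abs(noisy_degree_histogram(G, epsilon, rng) - exact).mean()
              for _ in range(20)]
    print(f"epsilon={epsilon:<4} mean L1 error per bin={np.mean(errors):.2f}")
```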
In summary, it is challenging to benchmark all surveyed DP-based graph mechanisms, as the optimal parameter choices depend on the privacy, utility, and computational efficiency requirements of each specific application. Furthermore, even if a scientific consensus could be reached on a lowest common denominator of parameters for a baseline scenario, the benchmark itself would need to be actively maintained and updated by the community to remain relevant as novel algorithms emerge.
Instead, this survey has made available a well-engineered DP for Graphs library5 that, in conjunction with systems like DPGraph, can be used by the community to test different algorithms on their own applications and graph data, and to develop new DP algorithms for graphs. This code repository, together with the results that the research community can post and share, will effectively become a live benchmark that continuously guides the community toward further innovation.

7.3 Useful But Difficult Graph Statistics

Most of the graph statistics reviewed in Section 3 are related to subgraph queries and degree distribution queries. The former include common subgraphs such as the triangle, \(k\)-triangle, and \(k\)-star. The latter include the degree distribution/sequence and the joint degree distribution. There are, however, many other graph statistics that are useful for understanding networks. One such class is centrality metrics. There are different types of centrality metrics, all of which measure the importance or influence of a node on the rest of the network. For example, degree centrality (equivalently, node degree) measures how influential a node could be by looking at its direct neighbors. Betweenness centrality measures the importance of a node by counting the number of shortest paths between all pairs of nodes that pass through it. Another important class is connectivity metrics. For example, assortativity measures how well nodes of similar types (e.g., degrees) connect to each other. Transitivity reflects the extent to which edges in a path are transitive, e.g., if \(x\) is a friend of \(y\) who is a friend of \(z\), then how likely is it for \(x\) and \(z\) to be friends?
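The non-private versions of these metrics are readily available in standard graph libraries, as the short sketch below illustrates on a toy graph; it is the provably private release of such statistics that remains difficult.

```python
import networkx as nx

G = nx.karate_club_graph()   # small example graph bundled with NetworkX

degree = nx.degree_centrality(G)            # normalized node degree
betweenness = nx.betweenness_centrality(G)  # fraction of shortest paths through each node
closeness = nx.closeness_centrality(G)      # inverse of average distance to other nodes
eigenvector = nx.eigenvector_centrality(G)  # spectral measure of influence

assortativity = nx.degree_assortativity_coefficient(G)  # degree-degree correlation
transitivity = nx.transitivity(G)  # fraction of connected triples closed into triangles

print(max(betweenness, key=betweenness.get), assortativity, transitivity)
```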
Some of these metrics are high-level statistics, meaning they are computed from lower-level statistics such as degrees, neighborhoods, or shortest paths. Hence, due to their high global sensitivity, they are more difficult to make private while keeping them useful. A recent analysis in Reference [79] concludes that the eigenvector, Laplacian, and closeness centralities are extremely difficult to make private using the smooth sensitivity technique. Research progress on private versions of such queries would significantly expand the application areas of private graph release mechanisms.

8 Conclusion

This article provided a thorough survey of published methods for the provably privacy-preserving release of graph data. It proposed a taxonomy to organize the existing contributions related to private graph data, and used that taxonomy as a structure to provide comprehensive descriptions of existing mechanisms, analytics, and their application domains. At the top level, this survey differentiated between private query release mechanisms and private graph release mechanisms. Within each category, we discussed existing non-provable and provable mechanisms, with significant emphasis on the latter type. Such provably private mechanisms offer mathematical guarantees on formally defined privacy properties, whereas non-provable methods lack such strong theoretical assurance. Most provable techniques have been based on the concept of DP, which we briefly introduced in the context of graph data.
For private graph statistics release, this survey further explored different classes of methods, such as node DP, edge DP, and local DP, with high-level descriptions of numerous existing works. Similarly, we described several families of DP-based methods for the private release of synthetic graphs, such as generative graph models, graph matrix perturbations, and iterative refinement. We then elucidated the limitations of DP as they pertain to graph data release, and discussed several alternatives to DP that provide provable privacy, such as Pufferfish privacy (and the related Wasserstein mechanism), Dependent DP, Adversarial Privacy, and Zero-Knowledge Privacy. We followed with a short review of several key application domains where the availability and use of private graph data analytics is critical (e.g., the social network, financial services, supply chain, and healthcare sectors), with an emphasis on specific use cases within each domain. Finally, this survey concluded with some open issues in the field of private graph analytics, such as the paucity of open source tools that implement the extensive set of surveyed mechanisms, and the fact that a wide range of useful graph statistics do not yet have provably private counterparts (e.g., closeness centrality), thus providing potential leads for future research in the area. This survey article should benefit practitioners and researchers alike in the increasingly important area of private graph data release and analysis.

Footnotes

1
Throughout this article, the following terms are used interchangeably: graph and network, node and vertex, edge and link.
2
Some versions of neighboring datasets allow a record in a dataset \(x\) to be replaced by a different value to obtain another dataset \(y\). This implies a distance of 2 if distance is measured by the \(L_1\)-norm, but a distance of 1 if it is measured by edit distance.
4
Code and documentation are available at https://github.com/anusii/RelM.

Supplementary Material

3569085-supp (3569085-supp.pdf)
Supplementary material

References

[1]
A. Beckett, P. Hewer, and B. Howcroft. 2000. An exposition of consumer behaviour in the financial services industry. Int. J. Bank Market. 18, 1 (2000), 15–26.
[2]
Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep learning with differential privacy. In Proceedings of the ACM Special Interest Group Conference on Security, Audit and Control (SIGSAC’16). ACM, 308–318.
[3]
Jemal H. Abawajy, Mohd Izuan Hafez Ninggal, and Tutut Herawan. 2016. Privacy preserving social network data publication. IEEE Commun. Surv. Tutor. 18, 3 (2016), 1974–1997.
[4]
Charu C. Aggarwal and Haixun Wang (Eds.). 2010. Managing and Mining Graph Data. Advances in Database Systems, Vol. 40. Springer.
[5]
Faraz Ahmed, Alex X. Liu, and Rong Jin. 2019. Publishing social network graph eigenspectrum with privacy guarantees. IEEE Trans. Netw. Sci. Eng. 7, 2 (2019), 892–906.
[6]
Rasim Alguliyev, Ramiz Aliguliyev, and Farhad Yusifov. 2021. Graph modelling for tracking the COVID-19 pandemic spread. Infect. Dis. Model. 6 (2021), 112–122.
[7]
Nour Almadhoun, Erman Ayday, and Özgür Ulusoy. 2020. Differential privacy under dependent tuples-the case of genomic privacy. Bioinformatics 36, 6 (2020), 1696–1703.
[8]
Lars Backstrom, Cynthia Dwork, and Jon M. Kleinberg. 2007. Wherefore art thou r3579x?: Anonymized social networks, hidden patterns, and structural steganography. In Proceedings of the International Conference on the World Wide Web. ACM, 181–190.
[9]
Ghazaleh Beigi and Huan Liu. 2020. A survey on privacy in social media: Identification, mitigation, and applications. ACM Trans. Data Sci. 1, 1 (2020), 1–38.
[10]
Federico Matteo Benčić, Pavle Skočir, and Ivana Podnar Žarko. 2019. DL-Tags: DLT and smart tags for decentralized, privacy-preserving, and verifiable supply chain management. IEEE Access 7 (2019), 46198–46209.
[11]
Jeremiah Blocki, Avrim Blum, Anupam Datta, and Or Sheffet. 2012. The Johnson-Lindenstrauss transform itself preserves differential privacy. In Proceedings of the 53rd IEEE Symposium on Foundations of Computer Science. IEEE, 410–419.
[12]
Jeremiah Blocki, Avrim Blum, Anupam Datta, and Or Sheffet. 2013. Differentially private data analysis of social networks via restricted sensitivity. In Proceedings of the Conference on Innovations in Theoretical Computer Science (ICTS’13). ACM, 87–96.
[13]
Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. 2005. Practical privacy: The SulQ framework. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS’05). ACM, 128–138.
[14]
Christian Borgs, Jennifer Chayes, Adam Smith, and Ilias Zadik. 2018. Revealing network structure, confidentially: Improved rates for node-private graphon estimation. In Proceedings of the IEEE Symposium on Foundations of Computer Science (FOCS’18). IEEE, 533–543.
[15]
Christian Borgs, Jennifer T. Chayes, and Adam Smith. 2015. Private graphon estimation for sparse graphs. Advances in Neural Information Processing Systems, Vol. 28.
[16]
Justin Brickell and Vitaly Shmatikov. 2008. The cost of privacy: Destruction of data-mining utility in anonymized data publishing. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (KDD’08). ACM, 70–78.
[17]
Seth Bromberger and other contributors. 2017. JuliaGraphs/LightGraphs.jl: An Optimized Graphs Package for the Julia Programming Language, Zenodo. DOI:
[18]
Solenn Brunet, Sébastien Canard, Sébastien Gambs, and Baptiste Olivier. 2016. Novel differentially private mechanisms for graphs. IACR Cryptol. ePrint Arch. 2016 (2016), 745.
[19]
Jordi Casas-Roma. 2020. DUEF-GA: Data utility and privacy evaluation framework for graph anonymization. Int. J. Inf. Secur. 19, 4 (2020), 465–478.
[20]
Jordi Casas-Roma, Jordi Herrera-Joancomartí, and Vicenç Torra. 2017. A survey of graph-modification techniques for privacy-preserving on networks. Artif. Intell. Rev. 47, 3 (2017), 341–366.
[21]
Deepayan Chakrabarti and Christos Faloutsos. 2006. Graph mining: Laws, generators, and algorithms. Comput. Surv. 38, 1 (2006), 2–es.
[22]
Varun Chandola, Arindam Banerjee, and Vipin Kumar. 2009. Anomaly detection: A survey. ACM Comput. Surv. 41, 3 (2009), 1–58.
[23]
Rui Chen, Benjamin C. M. Fung, Philip S. Yu, and Bipin C. Desai. 2014. Correlated network data publication via differential privacy. VLDB J. 23, 4 (2014), 653–676.
[24]
Shixi Chen and Shuigeng Zhou. 2013. Recursive mechanism: Towards node differential privacy and unrestricted joins. In Proceedings of the ACM International Conference on Management of Data (SIGMOD’13). ACM, 653–664.
[25]
Aaron Clauset, Cristopher Moore, and Mark E. J. Newman. 2006. Structural inference of hierarchies in networks. In ICML Workshop on Statistical Network Analysis. Springer, 1–13.
[26]
Sergiu Costea, Marian Barbu, and Razvan Rughinis. 2013. Qualitative analysis of differential privacy applied over graph structures. In Proceedings of the RoEduNet International Conference. IEEE, 1–4.
[27]
Lawrence H. Cox. 1980. Suppression methodology and statistical disclosure control. J. Am. Stat. Assoc. 75, 370 (1980), 377–385.
[28]
Gabor Csardi and Tamas Nepusz. 2006. The igraph software package for complex network research. InterJournal (2006), 1695.
[29]
Fida K. Dankar and Khaled El Emam. 2013. Practicing differential privacy in health care: A review. Trans. Data Priv. 6, 1 (2013), 35–67.
[30]
Wei-Yen Day, Ninghui Li, and Min Lyu. 2016. Publishing graph degree distribution with node differential privacy. In Proceedings of the ACM International Conference on Management of Data (SIGMOD’16). ACM, 123–138.
[31]
Abir De and Soumen Chakrabarti. 2021. Differentially private link prediction with protected connections. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. AAAI, 63–71.
[32]
Xiaofeng Ding, Xiaodong Zhang, Zhifeng Bao, and Hai Jin. 2018. Privacy-preserving triangle counting in large graphs. In Proceedings of the International Conference on Information and Knowledge Management (CIKM’18). ACM, 1283–1292.
[33]
Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference (TCC’06). Springer, 265–284.
[34]
Cynthia Dwork and Aaron Roth. 2014. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci. 9, 3-4 (2014), 211–407.
[35]
Ebaa Fayyoumi and B. John Oommen. 2010. A survey on statistical disclosure control and micro-aggregation techniques for secure statistical databases. Softw. Pract. Exp. 40, 12 (2010), 1161–1188.
[36]
Marco Gaboardi, M. Hay, and S. Vadhan. 2020. A Programming Framework for OpenDP. Technical Report. Harvard.
[37]
Tianchong Gao, Feng Li, Yu Chen, and XuKai Zou. 2018. Local differential privately anonymizing online social networks under HRG-based model. IEEE Trans. Comput. Soc. Syst. 5, 4 (2018), 1009–1020.
[38]
Johannes Gehrke, Edward Lui, and Rafael Pass. 2011. Towards privacy for social networks: A zero-knowledge based definition of privacy. In Theory of Cryptography Conference (TCC’11). Springer, 432–449.
[39]
Arpita Ghosh and Robert Kleinberg. 2017. Inferential privacy guarantees for differentially private mechanisms. In Proceedings of the ACM Conference on Innovations in Theoretical Computer Science. Dagstuhl Publishing, 9:1–9:3.
[40]
David F. Gleich and Art B. Owen. 2012. Moment-based estimation of stochastic Kronecker graph parameters. Internet Math. 8, 3 (2012), 232–256.
[41]
J. Gouweleeuw, P. Kooiman, L. Willenborg, and P.-P. De Wolf. 1998. Post randomisation for statistical disclosure control: Theory and implementation. J. Official Stat. 14, 4 (1998), 463–478.
[42]
Weihong Guo, Qi Tian, Zhengqian Jiang, and Hui Wang. 2018. A graph-based cost model for supply chain reconfiguration. J. Manufact. Syst. 48 (2018), 55–63.
[43]
Anupam Gupta, Aaron Roth, and Jonathan Ullman. 2012. Iterative constructions and private data release. In Theory of Cryptography Conference (TCC’12). Springer, 339–356.
[44]
Branka Hadji Misheva, Paolo Giudici, and Valentino Pediroda. 2018. Network-based models to improve credit scoring accuracy. In Proceedings of the IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA’18). IEEE, 623–630. DOI:
[45]
Aric A. Hagberg, Daniel A. Schult, and Pieter J. Swart. 2008. Exploring network structure, dynamics, and function using NetworkX. In Python in Science Conference. SciPy.org, 11–15.
[46]
William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Representation learning on graphs: Methods and applications. IEEE Data Eng. Bull. 40, 3 (2017), 52–74.
[47]
Moritz Hardt and Aaron Roth. 2012. Beating randomized response on incoherent matrices. In Proceedings of the ACM Symposium on Theory of Computing (STOC’12). ACM, 1255–1268.
[48]
Moritz Hardt and Guy N. Rothblum. 2010. A multiplicative weights mechanism for privacy-preserving data analysis. In Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science. IEEE Computer Society, 61–70.
[49]
Michael Hay, Chao Li, Gerome Miklau, and David Jensen. 2009. Accurate estimation of the degree distribution of private networks. In Proceedings of the IEEE International Conference on Data Mining (ICDM’09). IEEE, 169–178.
[50]
Xinlei He, Jinyuan Jia, Michael Backes, Neil Zhenqiang Gong, and Yang Zhang. 2021. Stealing links from graph neural networks. In Proceedings of the USENIX Security Symposium (USENIX Security’21). USENIX Association, 2669–2686.
[51]
Xi He, Ashwin Machanavajjhala, and Bolin Ding. 2014. Blowfish privacy: Tuning privacy-utility trade-offs using policies. In Proceedings of the ACM International Conference on Management of Data (SIGMOD’14). ACM, 1447–1458.
[52]
Peter D. Hoff. 2009. Simulation of the matrix Bingham–von Mises–Fisher distribution, with applications to multivariate and relational data. J. Comput. Graph. Stat. 18, 2 (2009), 438–456.
[53]
Naoise Holohan, Stefano Braghin, Pól Mac Aonghusa, and Killian Levacher. 2019. Diffprivlib: The IBM Differential Privacy Library. arxiv:1907.02444 [cs.CR]. Retrieved from https://arxiv.org/abs/1907.02444.
[54]
Young-Chae Hong and Jing Chen. 2022. Graph database to enhance supply chain resilience for industry 4.0. Int. J. Inf. Syst. Supply Chain Manage 15, 19 (2022).
[55]
Anco Hundepool, Josep Domingo-Ferrer, Luisa Franconi, Sarah Giessing, Eric Schulte Nordholt, Keith Spicer, and Peter-Paul de Wolf. 2012. Statistical Disclosure Control. Wiley.
[56]
Masooma Iftikhar, Qing Wang, and Yu Lin. 2020. dK-Microaggregation: Anonymizing graphs with differential privacy guarantees. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’20). Springer, 191–203.
[57]
Timour Igamberdiev and Ivan Habernal. 2021. Privacy-preserving graph convolutional networks for text classification. arXiv:2102.09604. Retrieved from https://arxiv.org/abs/2102.09604.
[58]
Jim Isaak and Mina J. Hanna. 2018. User data privacy: Facebook, Cambridge Analytica, and privacy protection. Computer 51, 8 (2018), 56–59.
[59]
Carter Jernigan and Behram F. T. Mistree. 2009. Gaydar: Facebook friendships expose sexual orientation. First Monday 14, 10 (September 2009), 2.
[60]
Shouling Ji, Prateek Mittal, and Raheem Beyah. 2017. Graph data anonymization, de-anonymization attacks, and de-anonymizability quantification: A survey. IEEE Commun. Surv. Tutor. 19, 2 (2017), 1305–1326.
[61]
Tianxi Ji, Changqing Luo, Yifan Guo, Jinlong Ji, Weixian Liao, and Pan Li. 2019. Differentially private community detection in attributed social networks. In Proceedings of the Asian Conference on Machine Learning (ACML’19). PMLR, 16–31.
[62]
Honglu Jiang, Jian Pei, Dongxiao Yu, Jiguo Yu, Bei Gong, and Xiuzhen Cheng. 2020. Applications of differential privacy in social network analysis: A survey. arxiv:2010.02973 [cs.SI]. Retrieved from https://arxiv.org/abs/2010.02973.
[63]
William B. Johnson and Joram Lindenstrauss. 1984. Extensions of Lipschitz maps into a Hilbert space. Contemp. Math. 26 (1984), 189–206.
[64]
Zach Jorgensen, Ting Yu, and Graham Cormode. 2016. Publishing attributed social graphs with formal privacy guarantees. In Proceedings of the ACM International Conference on Management of Data (SIGMOD’16). ACM, 107–122.
[65]
Michael Kapralov and Kunal Talwar. 2013. On differentially private low rank approximation. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA’13), Sanjeev Khanna (Ed.). SIAM, 1395–1414.
[66]
Vishesh Karwa, Sofya Raskhodnikova, Adam Smith, and Grigory Yaroslavtsev. 2011. Private analysis of graph structure. VLDB Endow. 4, 11 (2011), 1146–1157.
[67]
Vishesh Karwa and Aleksandra B. Slavković. 2012. Differentially private graphical degree sequences and synthetic graphs. In International Conference on Privacy in Statistical Databases (PSD’12). Springer, 273–285.
[68]
Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. 2011. What can we learn privately? SIAM J. Comput. 40, 3 (2011), 793–826.
[69]
Shiva Prasad Kasiviswanathan, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. 2013. Analyzing graphs with node differential privacy. In Theory of Cryptography Conference (TCC’13). Springer, 457–476.
[70]
Shiva P. Kasiviswanathan and Adam Smith. 2014. On the ‘semantics’ of differential privacy: A bayesian formulation. J. Priv. Confident. 6, 1 (2014), 1–16.
[71]
Michael Kearns, Aaron Roth, Zhiwei Steven Wu, and Grigory Yaroslavtsev. 2016. Private algorithms for the protected in social network search. Proc. Natl. Acad. Sci. U.S.A. 113, 4 (2016), 913–918.
[72]
Daniel Kifer and Ashwin Machanavajjhala. 2011. No free lunch in data privacy. In Proceedings of the ACM International Conference on Management of Data (SIGMOD’11). ACM, 193–204.
[73]
Daniel Kifer and Ashwin Machanavajjhala. 2012. A rigorous and customizable framework for privacy. In Proceedings of the ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS’12). ACM, 77–88.
[74]
Daniel Kifer and Ashwin Machanavajjhala. 2014. Pufferfish: A framework for mathematical privacy definitions. ACM Trans. Database Syst. 39, 1 (2014), 3:1–3:36.
[75]
Jong Wook Kim, Beakcheol Jang, and Hoon Yoo. 2018. Privacy-preserving aggregation of personal health data streams. PLoS One 13, 11 (2018), 1–15.
[76]
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, Yoshua Bengio and Yann LeCun (Eds.). ICLR, 13.
[77]
M. Kiranmayi and N. Maheswari. 2020. A review on privacy preservation of social networks using graphs. J. Appl. Secur. Res. 16 (2020), 1–34.
[78]
Aleksandra Korolova, Rajeev Motwani, Shubha U. Nabar, and Ying Xu. 2008. Link privacy in social networks. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM’08). ACM, 289–298.
[79]
Jesse Laeuchli, Yunior Ramírez-Cruz, and Rolando Trujillo-Rasua. 2022. Analysis of centrality measures under differential privacy models. Appl. Math. Comput. 412 (2022), 126546.
[80]
Jure Leskovec and Christos Faloutsos. 2007. Scalable modeling of real graphs using Kronecker multiplication. In Proceedings of the 24th Annual International Conference on Machine Learning (ICML’07). ACM, 497–504.
[81]
Jure Leskovec, Jon M. Kleinberg, and Christos Faloutsos. 2007. Graph evolution: Densification and shrinking diameters. ACM Trans. Knowl. Discov. Data 1, 1 (2007), 2.
[82]
Jure Leskovec, Andreas Krause, Carlos Guestrin, Christos Faloutsos, Jeanne VanBriesen, and Natalie Glance. 2007. Cost-effective outbreak detection in networks. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD’07). ACM, 420–429.
[83]
Jure Leskovec and Andrej Krevl. 2014. SNAP Datasets: Stanford Large Network Dataset Collection. Retrieved from http://snap.stanford.edu/data.
[84]
Kevin Lewis, Jason Kaufman, Marco Gonzalez, Andreas Wimmer, and Nicholas Christakis. 2008. Tastes, ties, and time: A new social network dataset using Facebook.com. Soc. Netw. 30, 4 (2008), 330–342.
[85]
Tiancheng Li and Ninghui Li. 2009. On the tradeoff between privacy and utility in data publishing. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (KDD’09). ACM, 517–526.
[86]
Xiaoye Li, Jing Yang, Zhenlong Sun, and Jianpei Zhang. 2017. Differential privacy for edge weights in social networks. Secur. Commun. Netw. 2017 (2017), 10.
[87]
Changchang Liu, Supriyo Chakraborty, and Prateek Mittal. 2016. Dependence makes you vulnerable: Differential privacy under dependent tuples. In Proceedings of the Network and Distributed System Security Symposium. The Internet Society, 21–24.
[88]
Kun Liu and Evimaria Terzi. 2008. Towards identity anonymization on graphs. In Proceedings of the International Conference on Management of Data, Jason Tsong-Li Wang (Ed.). ACM, 93–106.
[89]
Wentian Lu and Gerome Miklau. 2014. Exponential random graph estimation under differential privacy. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (KDD’14). ACM, 921–930.
[90]
Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, and Muthuramakrishnan Venkitasubramaniam. 2007. l-diversity: Privacy beyond k-anonymity. ACM Trans. Knowl. Discov. Data 1, 1 (2007), 3–es.
[91]
Priya Mahadevan, Dmitri Krioukov, Kevin Fall, and Amin Vahdat. 2006. Systematic topology analysis and generation using degree correlations. ACM SIGCOMM Comput. Commun. Rev. 36, 4 (2006), 135–146.
[92]
Abdul Majeed and Sungchang Lee. 2021. Anonymization techniques for privacy preserving data publishing: A comprehensive survey. IEEE Access 9 (2021), 8512–8545. DOI:
[93]
Markets and Markets. 2019. Graph Analytics Market by Component, Deployment Mode, Organization Size, Application, Vertical, and Region—Global Forecast to 2024. Markets and Markets Research Private Ltd.
[94]
Frank McSherry and Kunal Talwar. 2007. Mechanism design via differential privacy. In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS’07). IEEE, IEEE, 94–103.
[95]
Frank D. McSherry. 2009. Privacy integrated queries: An extensible platform for privacy-preserving data analysis. In Proceedings of the ACM International Conference on Management of Data (SIGMOD’09). ACM, 19–30.
[96]
Darakhshan J. Mir and Rebecca N. Wright. 2012. A differentially private estimator for the stochastic Kronecker graph model. In Proceedings of the EDBT/ICDT Workshops, Divesh Srivastava and Ismail Ari (Eds.). ACM, 167–176.
[97]
Ilya Mironov. 2012. On significance of the least significant bits for differential privacy. In Proceedings of the ACM Conference on Computer and Communications Security (CCS’12). ACM, 650–661. DOI:
[98]
Prashanth Mohan, Abhradeep Thakurta, Elaine Shi, Dawn Song, and David Culler. 2012. GUPT: Privacy preserving data analysis made easy. In Proceedings of the ACM International Conference on Management of Data (SIGMOD’12). ACM, 349–360.
[99]
Tamara T. Mueller, Johannes C. Paetzold, Chinmay Prabhakar, Dmitrii Usynin, Daniel Rueckert, and Georgios Kaissis. 2022. Differentially private graph classification with GNNs. arXiv:2202.02575. Retrieved from https://arxiv.org/abs/2202.02575.
[100]
Yvonne Mülle, Chris Clifton, and Klemens Böhm. 2015. Privacy-integrated graph clustering through differential privacy. In Proceedings of the EDBT/ICDT Workshops. ACM, 247–254.
[101]
Arvind Narayanan and Vitaly Shmatikov. 2009. De-anonymizing social networks. In Proceedings of the IEEE Symposium on Security and Privacy. IEEE, 173–187.
[102]
Neo4J. 2019. Case Study: Transparency-One Offers Total Supply Chain Transparency to Large Retailers and Manufacturers with Neo4j. Retrieved from https://neo4j.com/case-studies/transparency-one/.
[103]
Hiep H. Nguyen, Abdessamad Imine, and Michaël Rusinowitch. 2015. Differentially private publication of social graphs at linear cost. In Proceedings of the International Conference on Advances in Social Networks Analysis and Mining. IEEE, 596–599.
[104]
Hiep H. Nguyen, Abdessamad Imine, and Michaël Rusinowitch. 2016. Detecting communities under differential privacy. In Proceedings of the ACM Workshop on Privacy in the Electronic Society (WPES’16). ACM, 83–93.
[105]
Huiyi Ning, Sreeharsha Udayashankar, Sara Qunaibi, Karl Knopf, and Xi He. 2021. Benchmarking differentially private graph algorithms. In Workshop on Theory and Practice of Differential Privacy, ICML. JPC, 5.
[106]
M Usman Nisar, Arash Fard, and John A. Miller. 2013. Techniques for graph analytics on big data. In Proceedings of the IEEE International Congress on Big Data. IEEE, 255–262.
[107]
Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. 2007. Smooth sensitivity and sampling in private data analysis. In Proceedings of the 39th ACM Symposium on Theory of Computing (STOC’07). ACM, 75–84.
[108]
Nnamdi Johnson Ogbuke, Yahaya Y. Yusuf, Kovvuri Dharma, and Burcu A. Mercangoz. 2022. Big data supply chain analytics: Ethical, privacy and security challenges posed to business, industries and society. Prod. Plan. Contr. 33, 2-3 (2022), 123–137.
[109]
Iyiola E. Olatunji, Thorben Funke, and Megha Khosla. 2021. Releasing graph neural networks with differential privacy guarantees. arXiv:2109.08907. Retrieved from https://arxiv.org/abs/2109.08907.
[110]
Balaji Palanisamy, Chao Li, and Prashant Krishnamurthy. 2017. Group privacy-aware disclosure of association graph data. In Proceedings of the IEEE International Conference on Big Data. IEEE, 1043–1052.
[111]
Nicolas Papernot, Martín Abadi, Ulfar Erlingsson, Ian Goodfellow, and Kunal Talwar. 2016. Semi-supervised knowledge transfer for deep learning from private training data. In Proceedings of the International Conference on Learning Representations. ICLR, 16.
[112]
Tiago P. Peixoto. 2014. The graph-tool python library. Figshare. http://figshare.com/articles/graph_tool/1164194.
[113]
Joseph J. Pfeiffer III, Sebastian Moreno, Timothy La Fond, Jennifer Neville, and Brian Gallagher. 2014. Attributed graph models: Modeling network structure with correlated attributes. In Proceedings of the International World Wide Web Conference (WWW’14). ACM, 831–842.
[114]
Stephen R. Pfohl, Andrew M. Dai, and Katherine Heller. 2019. Federated and differentially private learning for electronic health records. arXiv preprint arXiv:1911.05861.
[115]
Rafael Pinot, Anne Morvan, Florian Yger, Cedric Gouy-Pailler, and Jamal Atif. 2018. Graph-based clustering under differential privacy. In Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence. HAL Archives, 329–338.
[116]
Tahereh Pourhabibi, Kok-Leong Ong, Booi H. Kam, and Yee Ling Boo. 2020. Fraud detection: A systematic literature review of graph-based anomaly detection approaches. Decis. Supp. Syst. 133 (2020), 113–303.
[117]
Davide Proserpio, Sharon Goldberg, and Frank McSherry. 2012. A workflow for differentially-private graph synthesis. In Proceedings of the ACM Workshop on Online Social Networks (WOSN’12). ACM, 13–18.
[118]
Davide Proserpio, Sharon Goldberg, and Frank McSherry. 2014. Calibrating data to sensitivity in private data analysis: A platform for differentially-private analysis of weighted datasets. VLDB Endow. 7, 8 (2014), 637–648.
[119]
Zhan Qin, Ting Yu, Yin Yang, Issa Khalil, Xiaokui Xiao, and Kui Ren. 2017. Generating synthetic decentralized social graphs with local differential privacy. In Proceedings of the SIGSAC Conference on Computer and Communications Security. ACM, 425–438.
[120]
Sofya Raskhodnikova and Adam Smith. 2016. Lipschitz extensions for node-private graph statistics and the generalized exponential mechanism. In Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science. IEEE, 495–504.
[121]
Vibhor Rastogi, Michael Hay, Gerome Miklau, and Dan Suciu. 2009. Relationship privacy: Output perturbation for queries with joins. In Proceedings of the ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM, 107–116.
[122]
Philip Rathle. 2019. Driving Innovation in Retail with Graph Technology. Retrieved from https://neo4j.com/whitepapers/retailers-graph-technology-neo4j/.
[123]
Saif Ur Rehman, Asmat Ullah Khan, and Simon Fong. 2012. Graph mining: A survey of graph mining techniques. In Proceedings of the International Conference on Digital Information Management (ICDIM’12). IEEE, 88–92.
[124]
Meticulous Research. 2020. Graph Analytics Market by Component, Deployment, Industry Size, Application, End User - Global Forecast to 2027. Meticulous Market Research Pvt. Ltd.
[125]
Leyla Roohi, Benjamin I. P. Rubinstein, and Vanessa Teague. 2019. Differentially-private two-party egocentric betweenness centrality. In Proceedings of the IEEE Conference on Computer Communications (INFOCOM’19). IEEE, 2233–2241.
[126]
Aaron Roth and Tim Roughgarden. 2010. Interactive privacy via the median mechanism. In Proceedings of the 42nd ACM Symposium on Theory of Computing (STOC’10), Leonard J. Schulman (Ed.). ACM, 765–774.
[127]
Benjamin I. P. Rubinstein and Francesco Aldà. 2017. Pain-Free Random Differential Privacy with Sensitivity Sampling. arxiv:1706.02562 [cs.LG]. Retrieved from https://arxiv.org/abs/1706.02562.
[128]
Sina Sajadmanesh and Daniel Gatica-Perez. 2021. Locally private graph neural networks. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (SIGSAC’21). ACM, 2130–2145.
[129]
Alessandra Sala, Xiaohan Zhao, Christo Wilson, Haitao Zheng, and Ben Y. Zhao. 2011. Sharing graphs using differentially private graph models. In Proceedings of the ACM Conference on Internet Measurement Conference (IMC’11). ACM, 81–98.
[130]
Pierangela Samarati. 2001. Protecting respondents identities in microdata release. IEEE Trans. Knowl. Data Eng. 13, 6 (2001), 1010–1027.
[131]
Alberto Sanfeliu and King-Sun Fu. 1983. A distance measure between attributed relational graphs for pattern recognition. IEEE Trans. Syst. Man Cybernet. 13, 3 (1983), 353–363.
[132]
Lalitha Sankar, S. Raj Rajagopalan, and H. Vincent Poor. 2013. Utility-privacy tradeoffs in databases: An information-theoretic approach. IEEE Trans. Inf. Forens. Secur. 8, 6 (2013), 838–852.
[133]
Mohammed Saqr, Uno Fors, Matti Tedre, and Jalal Nouri. 2018. How social network analysis can be used to monitor online collaborative learning and guide an informed intervention. PLoS One 13, 3 (March 2018), 1–22. DOI:
[134]
Umair Sarfraz, Masoom Alam, Sherali Zeadally, and Abid Khan. 2019. Privacy aware IOTA ledger: Decentralized mixing and unlinkable IOTA transactions. Comput. Netw. 148 (2019), 361–372.
[135]
Adam Sealfon. 2016. Shortest paths and distances with differential privacy. In Proceedings of the ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS’16). ACM, 29–41.
[136]
Adam Sealfon and Jonathan Ullman. 2021. Efficiently estimating Erdős-Rényi graphs with node differential privacy. J. Priv. Confident. 11 (February 2021), 14. DOI:
[137]
Sagar Sharma, Keke Chen, and Amit Sheth. 2018. Toward practical privacy-preserving analytics for IoT and cloud-based healthcare systems. IEEE Internet Comput. 22, 2 (2018), 42–51.
[138]
Entong Shen and Ting Yu. 2013. Mining frequent graph patterns with differential privacy. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (KDD’13). ACM, 545–553.
[139]
Shuang Song, Susan Little, Sanjay Mehta, Staal Vinterbo, and Kamalika Chaudhuri. 2018. Differentially private continual release of graph statistics. arXiv:1809.02575. Retrieved from https://arxiv.org/abs/1809.02575.
[140]
Shuang Song, Yizhen Wang, and Kamalika Chaudhuri. 2017. Pufferfish privacy mechanisms for correlated data. In Proceedings of the ACM International Conference on Management of Data (SIGMOD’17). ACM, 1291–1306.
[141]
Umang Soni, Vipul Jain, and Sameer Kumar. 2014. Measuring supply chain resilience using a deterministic modeling approach. Comput. Industr. Eng. 74 (2014), 11–25.
[142]
Peter Spirtes, Clark N. Glymour, Richard Scheines, and David Heckerman. 2000. Causation, Prediction, and Search. MIT Press.
[143]
Gautam Srivastava, Reza M. Parizi, Ali Dehghantanha, and Kim-Kwang Raymond Choo. 2019. Data sharing and privacy for patient iot devices using blockchain. In Proceedings of the International Conference on Smart City and Informatization. Springer, Berlin, 334–348.
[144]
Utkarsh Srivastava and Santosh Gopalkrishnan. 2015. Impact of big data analytics on banking sector: Learning for Indian Banks. Proc. Comput. Sci. 50 (2015), 643–652.
[145]
Mudhakar Srivatsa and Mike Hicks. 2012. Deanonymizing mobility traces: Using social network as a side-channel. In Proceedings of the ACM Conference on Computer and Communications Security (CCS’12). ACM, 628–637.
[146]
Haipei Sun, Xiaokui Xiao, Issa Khalil, Yin Yang, Zhan Qin, Hui Wang, and Ting Yu. 2019. Analyzing subgraph statistics from extended local views with decentralized differential privacy. In Proceedings of the ACM Conference on Computer and Communications Security (CCS’19). ACM, 703–717.
[147]
Vinith M. Suriyakumar, Nicolas Papernot, Anna Goldenberg, and Marzyeh Ghassemi. 2021. Chasing your long tails: Differentially private prediction in health care settings. In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAT’21). ACM, 723–734.
[148]
Morrow Susan and Maynard Nick. 2017. Online Payment Fraud: Emerging Threats, Key Vertical Strategies & Market Forecasts 2017-2022. Juniper Research.
[149]
Latanya Sweeney. 2002. k-anonymity: A model for protecting privacy. Int. J. Uncert. Fuzz. Knowl.-Bas. Syst. 10, 05 (2002), 557–570.
[150]
Wen Jun Tan, Allan N. Zhang, and Wentong Cai. 2019. A graph-based model to measure structural redundancy for supply chain resilience. Int. J. Prod. Res. 57, 20 (2019), 6385–6404.
[151]
Christine Task and Chris Clifton. 2012. A guide to differential privacy theory in social network analysis. In Proceedings of the IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM’12). IEEE, 411–417.
[152]
Youliang Tian, Zhiying Zhang, Jinbo Xiong, Lei Chen, Jianfeng Ma, and Changgen Peng. 2021. Achieving graph clustering privacy preservation based on structure entropy in social IoT. IEEE IoT J. 9 (2021), 2761–2777.
[153]
Jalaj Upadhyay. 2013. Random projections, graph sparsification, and differential privacy. In Proceedings of the Advances in Cryptology (ASIACRYPT’13), Lecture Notes in Computer Science, Vol. 8269. Springer, 276–295.
[154]
Stephan M. Wagner and Nikrouz Neshat. 2010. Assessing the vulnerability of supply chains using graph theory. Int. J. Prod. Econ. 126, 1 (2010), 121–129.
[155]
Yue Wang and Xintao Wu. 2013. Preserving differential privacy in degree-correlation based graph generation. Trans. Data Priv. 6, 2 (2013), 127.
[156]
Yue Wang, Xintao Wu, and Leting Wu. 2013. Differential privacy preserving spectral graph analysis. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’13). Springer, 329–340.
[157]
Yue Wang, Xintao Wu, Jun Zhu, and Yang Xiang. 2013. On learning cluster coefficient of private networks. Soc. Netw. Anal. Min. 3, 4 (2013), 925–938.
[158]
Mark Weber, Jie Chen, Toyotaro Suzumura, Aldo Pareja, Tengfei Ma, Hiroki Kanezashi, Tim Kaler, Charles E. Leiserson, and Tao B. Schardl. 2018. Scalable graph learning for anti-money laundering: A first look. In NeurIPS Workshop on Challenges and Opportunities for AI in Financial Services. ACM, 7.
[159]
Leon Willenborg and Ton de Waal. 2001. Elements of Statistical Disclosure Control. Springer.
[160]
Royce J. Wilson, Celia Yuxin Zhang, William Lam, Damien Desfontaines, Daniel Simmons-Marengo, and Bryant Gipson. 2019. Differentially Private SQL with bounded user contribution. arxiv:1909.01917 [cs.CR]. Retrieved from https://arxiv.org/abs/1909.01917.
[161]
Xintao Wu, Xiaowei Ying, Kun Liu, and Lei Chen. 2010. A Survey of Privacy-preservation of Graphs and Social Networks. Springer, 421–453.
[162]
Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S. Yu Philip. 2020. A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 32, 1 (2020), 4–24.
[163]
Siyuan Xia, Beizhen Chang, Karl Knopf, Yihan He, Yuchao Tao, and Xi He. 2021. DPGraph: A benchmark platform for differentially private graph analysis. In Proceedings of the International Conference on Management of Data. ACM, 2808–2812.
[164]
Qian Xiao, Rui Chen, and Kian-Lee Tan. 2014. Differentially private network data release via structural inference. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (KDD’14). ACM, 911–920.
[165]
Depeng Xu, Shuhan Yuan, Xintao Wu, and HaiNhat Phan. 2018. DPNE: Differentially private network embedding. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’18). Springer, 235–246.
[166]
Shengzhi Xu, Sen Su, Li Xiong, Xiang Cheng, and Ke Xiao. 2016. Differentially private frequent subgraph mining. In Proceedings of the International Conference on Data Engineering (ICDE’16). IEEE, 229–240.
[167]
Bin Yang, Issei Sato, and Hiroshi Nakagawa. 2015. Bayesian differential privacy on correlated data. In Proceedings of the ACM International Conference on Management of Data (SIGMOD’15). ACM, 747–762.
[168]
Qingqing Ye, Haibo Hu, Man Ho Au, Xiaofeng Meng, and Xiaokui Xiao. 2020. LF-GDPR: A framework for estimating graph metrics with local differential privacy. IEEE Trans. Knowl. Data Eng. 1 (2020), 16.
[169]
Cheng Zhang, Honglu Jiang, Xiuzhen Cheng, Feng Zhao, Zhipeng Cai, and Zhi Tian. 2019. Utility analysis on privacy-preservation algorithms for online social networks: An empirical study. Pers. Ubiq. Comput. 25 (2019), 1063–1079.
[170]
Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, and Xiaokui Xiao. 2015. Private release of graph statistics using ladder functions. In Proceedings of the ACM International Conference on Management of Data. ACM, 731–745.
[171]
Sen Zhang and Weiwei Ni. 2019. Graph embedding matrix sharing with differential privacy. IEEE Access 7 (2019), 10.
[172]
Zhikun Zhang, Min Chen, Michael Backes, Yun Shen, and Yang Zhang. 2022. Inference attacks against graph neural networks. In Proceedings of the USENIX Security Symposium (USENIX Security’22). USENIX Association, 18.
[173]
Zaixi Zhang, Qi Liu, Zhenya Huang, Hao Wang, Chengqiang Lu, Chuanren Liu, and Enhong Chen. 2021. GraphMI: Extracting private graph data from graph neural networks. In Proceedings of the 30th International Joint Conference on Artificial Intelligence (IJCAI’21), Zhi-Hua Zhou (Ed.). International Joint Conferences on Artificial Intelligence Organization, Main Track, 3749–3755. DOI:
[174]
Jun Zhao, Junshan Zhang, and H. Vincent Poor. 2017. Dependent differential privacy for correlated data. In Proceedings of the IEEE Globecom Workshops. IEEE, 1–7.
[175]
Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, and Maosong Sun. 2020. Graph neural networks: A review of methods and applications. AI Open 1 (2020), 57–81.
[176]
Tianqing Zhu, Gang Li, Wanlei Zhou, and S. Yu Philip. 2017. Differentially private data publishing and analysis: A survey. IEEE Trans. Knowl. Data Eng. 29, 8 (2017), 1619–1638.
