
Influence Maximization Revisited: Efficient Sampling with Bound Tightened

Published: 18 August 2022

Abstract

Given a social network G with n nodes and m edges, a positive integer k, and a cascade model C, the influence maximization (IM) problem asks for k nodes in G such that the expected number of nodes influenced by these k nodes under cascade model C is maximized. The state-of-the-art approximate solutions run in O(k(n+m)log n/ε²) expected time while returning a (1 − 1/e − ε)-approximate solution with probability at least 1 − 1/n. A key phase of these IM algorithms is the generation of random reverse reachable (RR) sets, and this phase significantly affects the efficiency and scalability of the state-of-the-art IM algorithms.
In this article, we present a study on this key phase and propose an efficient random RR set generation algorithm under the IC model. With the new algorithm, we show that the expected running time of existing IM algorithms under the IC model can be improved to O(k·n·log n/ε²) when, for every node v, the total weight of its incoming edges is bounded by a constant. For the general IC model, where the weights may be skewed, we present a sampling algorithm SKIP. To the best of our knowledge, it is the first index-free algorithm that achieves the optimal time complexity for the sorted subset sampling problem.
Moreover, existing approximate IM algorithms suffer from scalability issues on high-influence networks, where random RR sets are typically quite large. We tackle this challenging issue by reducing the average size of the random RR sets without sacrificing the approximation guarantee. As our experiments show, the proposed solution is orders of magnitude faster than the state of the art.
Besides, we investigate the forward propagation approach and derive its time complexity with our proposed subset sampling techniques. We also present a heuristic condition indicating when forward propagation should be used to estimate the expected influence of a given seed set.
Appendices

A Proof of Theorem 4

We introduce some notation and lemmas that are useful for proving Theorem 4. Let \(\bar{\mu }\) denote the total expected number of elements in \(T\) checked by SKIP to determine whether \(x_{j}\) is added to \(S\) (Lines 9–10). Thus, SKIP takes \(O(1+\bar{\mu })\) time in expectation, where the “1” accounts for the terminating step of the while loop when the condition \(j\gt h\) is met (Line 7).
Similarly, let \(\bar{\mu }(p_i,\ldots\,,p_{j})\) denote the expected number of elements checked when \(T=\lbrace x_i,\ldots ,x_j\rbrace\) ; e.g., \(\bar{\mu }=\bar{\mu }(p_1,\ldots\,,p_{h})\) . Suppose that SKIP runs on \(\lbrace x_i,\ldots ,x_j\rbrace\) and \(x_k\) is the first element checked. This happens with probability \((1-p_{i})^{k-i}p_{i}\) , since \(X\) follows the geometric distribution \(G(p_{i})\) and \(\Pr [X=k-i+1]=(1-p_i)^{k-i}p_i\) . After checking \(x_k\) , SKIP continues on \(\lbrace x_{k+1},\ldots\,,x_j\rbrace\) . Thus, by the Markov property,
\begin{equation*} \bar{\mu }(p_i,\ldots ,p_{j})=\sum _{k=i}^{j} \left((1-p_i)^{k-i}p_i \cdot \left(1+\bar{\mu }(p_{k+1},\ldots ,p_{j})\right)\right). \end{equation*}
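As a sanity check, the recursion above can be evaluated directly. The following Python sketch (our own illustration; the function names are not from the article) memoizes \(\bar{\mu }\) over suffixes, which works because every recursive call starts at some suffix \(p_{k+1},\ldots ,p_j\) of the original sequence.

```python
from functools import lru_cache

def expected_checks(ps):
    """Evaluate the recursion mu_bar(p_i, ..., p_j) =
    sum_k (1-p_i)^(k-i) * p_i * (1 + mu_bar(p_{k+1}, ..., p_j)),
    with mu_bar of the empty sequence equal to 0."""
    ps = tuple(ps)

    @lru_cache(maxsize=None)
    def mu(start):
        if start >= len(ps):
            return 0.0
        p = ps[start]  # jump probability used on this suffix
        return sum((1 - p) ** (k - start) * p * (1 + mu(k + 1))
                   for k in range(start, len(ps)))

    return mu(0)

# Two-element closed form from the proof of Lemma 20:
# mu_bar(p_i, p_{i+1}) = (2 + p_{i+1} - p_i) * p_i
p1, p2 = 0.6, 0.4
assert abs(expected_checks([p1, p2]) - (2 + p2 - p1) * p1) < 1e-12
# Monotonicity (Lemma 20): dropping the largest probability cannot
# increase the expected number of checked elements.
assert expected_checks([0.4, 0.3]) <= expected_checks([0.6, 0.4, 0.3])
```

The assertions reproduce, numerically, the base case and the statement of Lemma 20 below.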
The following lemma shows the monotonicity of \(\bar{\mu }(p_i,\ldots\,,p_{j})\) .
Lemma 20.
\(\bar{\mu }(p_{i+1},\ldots\,,p_{j})\le \bar{\mu }(p_{i},\ldots\,,p_{j})\) .
Proof.
We prove the lemma by induction. When \(i=j-1\) , we have \(\bar{\mu }(p_{i+1})=p_{i+1}\) and \(\bar{\mu }(p_{i},p_{i+1})=p_i+(1-p_i)p_i+p_i p_{i+1}=(2+p_{i+1}-p_i)p_i\) , which indicates that \(\bar{\mu }(p_{i+1})\le \bar{\mu }(p_{i},p_{i+1})\) . Assume that for any \(i\ge i^\ast\) , it holds that
\begin{equation*} \bar{\mu }(p_{i+1},\ldots\,,p_{j})\le \bar{\mu }(p_{i},\ldots\,,p_{j}). \end{equation*}
Now consider \(i=i^\ast -1\) . Similar to the recursion for \(\bar{\mu }(p_i,\ldots\,,p_{j})\) above, we have
\begin{equation*} \bar{\mu }(p_{i+1},\ldots\,,p_{j})=\sum _{k=i+1}^{j} \left((1-p_{i+1})^{k-i-1}p_{i+1} \cdot \left(1+\bar{\mu }(p_{k+1},\ldots\,,p_{j})\right)\right). \end{equation*}
For any \(\ell =i,\ldots\,,j\) , define
\begin{equation*} \Delta _\ell :=\sum _{k=i}^{\ell } (1-p_i)^{k-i}p_i -\sum _{k=i+1}^{\ell } (1-p_{i+1})^{k-i-1}p_{i+1} =(1-p_{i+1})^{\ell -i}-(1-p_i)^{\ell -i+1}. \end{equation*}
It is easy to verify that \(\Delta _\ell \ge 0\) for any \(\ell = i, \ldots\,, j\) owing to the fact that \(\lbrace p_i, \ldots\,, p_j\rbrace\) are in non-ascending order. Besides, the following conclusion holds by definition, and will be used later,
\begin{equation} \Delta _{\ell +1} + p_{i+1}(1-p_{i+1})^{\ell -i} = \Delta _{\ell } + p_i(1-p_i)^{\ell -i+1}. \tag{20} \end{equation}
Then, we have
\begin{align*} &\bar{\mu }(p_{i},\ldots\,,p_{j})=\sum _{k=i}^{j} \left((1-p_i)^{k-i}p_i \cdot \left(1+\bar{\mu }(p_{k+1},\ldots\,,p_{j})\right)\right) \\ &=p_i\left(1+\bar{\mu }(p_{i+1},\ldots\,,p_{j})\right)+\sum _{k=i+1}^{j} \left((1-p_{i})^{k-i}p_{i} \cdot \left(1+\bar{\mu }(p_{k+1},\ldots\,,p_{j})\right)\right)\\ &\ge \Delta _{i+1}\left(1+\bar{\mu }(p_{i+2},\ldots\,,p_{j})\right)+p_{i+1}\left(1+\bar{\mu }(p_{i+2},\ldots\,,p_{j})\right)\\ &\quad {+}\,\sum _{k=i+2}^{j} \left((1-p_{i})^{k-i}p_{i} \cdot \left(1+\bar{\mu }(p_{k+1},\ldots\,,p_{j})\right)\right). \end{align*}
The inequality is due to \(\bar{\mu }(p_{i+1},\ldots\,,p_{j})\ge \bar{\mu }(p_{i+2},\ldots\,,p_{j})\) and \(\Delta _{i+1}+p_{i+1}=p_{i}+(1-p_i)p_i\) by Equation (20).
Recursively, we have
\begin{align*} \bar{\mu }(p_{i},\ldots\,,p_{j}) &\ge \Delta _{i+2}\left(1+\bar{\mu }(p_{i+3},\ldots\,,p_{j})\right) {+}\sum _{k=i+1}^{i+2}\left((1-p_{i+1})^{k-i-1}p_{i+1} \cdot \left(1+\bar{\mu }(p_{k+1},\ldots\,,p_{j})\right)\right)\\ &\quad +\sum _{k=i+3}^{j} \left((1-p_{i})^{k-i}p_{i} \cdot \left(1+\bar{\mu }(p_{k+1},\ldots\,,p_{j})\right)\right)\\ &\ge \cdots \ge \Delta _{j}+\bar{\mu }(p_{i+1},\ldots\,,p_{j}). \end{align*}
Note that \(\Delta _j\ge 0\) as \(p_i\ge p_{i+1}\) , which immediately concludes the lemma.□
Lemma 21.
For any \(p_k\le p_k^\prime\) , \(\bar{\mu }(p_{i},\ldots\,,p_k,\ldots\,,p_{j})\le \bar{\mu }(p_{i},\ldots\,,p_k^\prime ,\ldots\,,p_{j})\) .
Proof.
We prove the lemma by induction. When \(i=j=k\) , it obviously holds that \(\bar{\mu }(p_k)=p_k\le p_k^\prime =\bar{\mu }(p_k^\prime)\) . Assume that \(\bar{\mu }(p_{i},\ldots\,,p_k,\ldots\,,p_{j})\le \bar{\mu }(p_{i},\ldots\,,p_k^\prime ,\ldots\,,p_{j})\) holds for any \(i\ge i^\ast\) . Now consider the following two cases when \(i=i^\ast -1\) .
Case (i) \(k\gt i\) . By the induction hypothesis, for any \(\ell +1\ge i+1\ge i^\ast\) we have \(\bar{\mu }(p_{\ell +1},\ldots\,,p_k,\ldots\,,p_{j})\le \bar{\mu }(p_{\ell +1},\ldots\,,p_k^\prime ,\ldots\,,p_{j})\) , and thus:
\begin{align*} \bar{\mu }(p_{i},\ldots\,,p_k,\ldots\,,p_{j}) &=\sum _{\ell =i}^{j} \left((1-p_i)^{\ell -i}p_i \cdot \left(1+\bar{\mu }(p_{\ell +1},\ldots\,,p_k,\ldots\,,p_{j})\right)\right)\\ &\le \sum _{\ell =i}^{j} \left((1-p_i)^{\ell -i}p_i \cdot \left(1+\bar{\mu }(p_{\ell +1},\ldots\,,p_k^\prime ,\ldots\,,p_{j})\right)\right)=\bar{\mu }(p_{i},\ldots\,,p_k^\prime ,\ldots\,,p_{j}). \end{align*}
Case (ii) \(k=i\) . Let \(\Gamma _\ell :=\sum _{m=k}^{\ell } (1-p_k^\prime)^{m-k}p_k^\prime -\sum _{m=k}^{\ell }(1-p_k)^{m-k}p_k\) , which implies that \(\Gamma _\ell =(1-p_k)^{\ell -k+1}-(1-p_k^\prime)^{\ell -k+1}\ge 0\) for any \(\ell =k,\ldots\,,j\) , since \(p_k\le p_k^\prime\) . Similarly to (20), we have
\begin{equation} -\Gamma _{\ell +1} + p_k^\prime (1-p_k^\prime)^{\ell -k +1} = - \Gamma _{\ell } + p_k(1-p_k)^{\ell -k+1}. \tag{21} \end{equation}
Then, we can get that
\begin{align*} & \bar{\mu }(p_k,\ldots\,,p_{j}) = \sum _{\ell =k}^{j} \left((1-p_k)^{\ell -k}p_k \cdot \left(1+\bar{\mu }(p_{\ell +1},\ldots\,,p_{j})\right)\right)\\ &= -\Gamma _k\left(1+\bar{\mu }(p_{k+1},\ldots\,,p_{j})\right)+p_k^\prime \left(1+\bar{\mu }(p_{k+1},\ldots\,,p_{j})\right)\\ &\quad +\,\sum _{\ell =k+1}^{j} \left((1-p_k)^{\ell -k}p_k \cdot \left(1+\bar{\mu }(p_{\ell +1},\ldots\,,p_{j})\right)\right)\\ &\le -\Gamma _{k+1}\left(1+\bar{\mu }(p_{k+2},\ldots\,,p_{j})\right) {+}\sum _{\ell =k}^{k+1} \left((1-p_k^\prime)^{\ell -k}p_k^\prime \cdot \left(1+\bar{\mu }(p_{\ell +1},\ldots\,,p_{j})\right)\right) \\ &\quad +\sum _{\ell =k+2}^{j} \left((1-p_k)^{\ell -k}p_k \cdot \left(1+\bar{\mu }(p_{\ell +1},\ldots\,,p_{j})\right)\right)\\ &\le \cdots \le -\Gamma _{j}+\bar{\mu }(p_k^\prime ,\ldots\,,p_{j}) \le \bar{\mu }(p_k^\prime ,\ldots\,,p_{j}), \end{align*}
where the inequality is due to Lemma 20 and (21). Thus, Lemma 21 follows.□
Given any \(\beta =2,\ldots\,,n\) , we divide \(V\) into \(L(\beta)+1\) buckets such that bucket \(B_k=\lbrace x_i:\beta ^{k}\le i\lt \beta ^{k+1}\rbrace\) with \(k=0,1,\ldots\,,L(\beta)\) , where \(L(\beta)=\lfloor \log _\beta n\rfloor\) . For \(x_i\in B_k\) , let \(\hat{p}_i:=p_{\beta ^k}\) which is an upper bound on \(p_i\) . The following lemma establishes an upper bound on \(\bar{\mu }\) .
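The bucketing just described is easy to implement. The sketch below (our own illustration, not code from the article) builds the upper bounds \(\hat{p}_i\) for a given \(\beta\) and numerically checks the two properties used in the sequel: \(\hat{p}_i\ge p_i\) , and \(\hat{p}_i\le p_{\lceil i/\beta \rceil }\) , which yields \(\hat{\mu }(\beta)\le \beta \mu\) .

```python
def bucket_upper_bounds(p, beta):
    """Given p[0..n-1] = p_1 >= ... >= p_n (1-indexed in the text)
    and beta >= 2, return p_hat where p_hat_i = p_{beta^k} for the
    bucket B_k = {x_i : beta^k <= i < beta^(k+1)} containing x_i."""
    n = len(p)
    p_hat = [0.0] * n
    lo = 1  # left boundary beta^k of the current bucket (1-indexed)
    while lo <= n:
        hi = min(lo * beta, n + 1)  # bucket covers indices [lo, hi)
        for i in range(lo, hi):
            p_hat[i - 1] = p[lo - 1]
        lo *= beta
    return p_hat

p = [0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.04, 0.03]
p_hat = bucket_upper_bounds(p, beta=2)
# p_hat_i upper-bounds p_i, since p is non-increasing and beta^k <= i.
assert all(ph >= pi for ph, pi in zip(p_hat, p))
# p_hat_i <= p_{ceil(i/beta)}, hence sum(p_hat) <= beta * sum(p).
assert all(p_hat[i - 1] <= p[-(-i // 2) - 1] for i in range(1, len(p) + 1))
assert sum(p_hat) <= 2 * sum(p)
```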
Lemma 22.
Given \(\beta =2,\ldots\,,n\) , \(\bar{\mu }\le \hat{\mu }(\beta)+L(\beta) + 1\) , where \(\hat{\mu }(\beta)=\sum _{i=1}^{n}\hat{p}_i\) .
Proof.
Define \(\bar{\mu }^\ast :=\bar{\mu }(\hat{p}_1,\ldots\,,\hat{p}_n)\) . By Lemma 21, we have \(\bar{\mu }\le \bar{\mu }^\ast\) . We just need to show that \(\bar{\mu }^\ast \le \hat{\mu }(\beta)+L(\beta)+1\) . By Lemma 20, we have
\begin{align*} \bar{\mu }^\ast &=\sum _{k=1}^{n} \left((1-\hat{p}_1)^{k-1}\hat{p}_1 \cdot \left(1+\bar{\mu }(\hat{p}_{k+1},\ldots\,,\hat{p}_{n})\right)\right)\\ &\le \sum _{k=1}^{\beta -1} \left((1-\hat{p}_1)^{k-1}\hat{p}_1 \cdot \left(1+\bar{\mu }(\hat{p}_{k+1},\ldots\,,\hat{p}_{n})\right)\right) {+}\sum _{k=\beta }^{n} \left((1-\hat{p}_1)^{k-1}\hat{p}_1 \cdot \left(1+\bar{\mu }(\hat{p}_{\beta },\ldots\,,\hat{p}_{n})\right)\right)\\ &\le \sum _{k=1}^{\beta -1} \left((1-\hat{p}_1)^{k-1}\hat{p}_1 \cdot \left(1+\bar{\mu }(\hat{p}_{k+1},\ldots\,,\hat{p}_{n})\right)\right) {+}(1-\hat{p}_1)^{\beta -1}\cdot \left(1+\bar{\mu }(\hat{p}_{\beta },\ldots\,,\hat{p}_{n})\right). \end{align*}
For \(\beta = 2\) , it is straightforward to verify that,
\begin{align*} \bar{\mu }^\ast &\le \hat{p}_1 \left(1 + \bar{\mu }(\hat{p}_{\beta },\ldots\,,\hat{p}_{n})\right) + (1- \hat{p}_1)\left(1+\bar{\mu }(\hat{p}_{\beta },\ldots\,,\hat{p}_{n})\right)\\ &= 1+\bar{\mu }(\hat{p}_{\beta },\ldots\,,\hat{p}_{n}) \le \hat{p}_1 + 1+\bar{\mu }(\hat{p}_{\beta },\ldots\,,\hat{p}_{n}). \end{align*}
For \(\beta \gt 2\) , we have
\begin{align*} \bar{\mu }(\hat{p}_{2},\ldots\,,\hat{p}_{n}) &\le \sum _{k=2}^{\beta -1} \left((1-\hat{p}_2)^{k-2}\hat{p}_2 \cdot \left(1+\bar{\mu }(\hat{p}_{k+1},\ldots\,,\hat{p}_{n})\right)\right) {+}(1-\hat{p}_2)^{\beta -2}\cdot \left(1+\bar{\mu }(\hat{p}_{\beta },\ldots\,,\hat{p}_{n})\right). \end{align*}
Note that \(\hat{p}_1=\hat{p}_2\) when \(\beta \gt 2\) . Thus, \((1-\hat{p}_1)^{k-1}\hat{p}_1 +\hat{p}_1(1-\hat{p}_2)^{k-2}\hat{p}_2=(1-\hat{p}_1)^{k-2} \hat{p}_1.\) As a result,
\begin{align*} \bar{\mu }^\ast &\le \hat{p}_1 + \sum _{k=2}^{\beta -1} \left(\hat{p}_1 (1 - \hat{p}_2)^{k-2} \hat{p}_2 + (1 - \hat{p}_1)^{k-1} \hat{p}_1 \right)\left(1+\bar{\mu }(\hat{p}_{k+1},\ldots\,,\hat{p}_{n}) \right) \\ &\quad + \left(\hat{p}_1 (1 - \hat{p}_2)^{\beta - 2} + (1 - \hat{p}_1)^{\beta - 1}\right) \left(1+\bar{\mu }(\hat{p}_{\beta },\ldots\,,\hat{p}_{n})\right)\\ &\le \hat{p}_1+\sum _{k=2}^{\beta -1} \left((1-\hat{p}_1)^{k-2}\hat{p}_1 \cdot \left(1+\bar{\mu }(\hat{p}_{k+1},\ldots\,,\hat{p}_{n})\right)\right){+}(1-\hat{p}_1)^{\beta -2}\cdot \left(1+\bar{\mu }(\hat{p}_{\beta },\ldots\,,\hat{p}_{n})\right)\\ &\le \cdots \le (\beta -2)\hat{p}_1+\hat{p}_1 \cdot \left(1+\bar{\mu }(\hat{p}_{\beta },\ldots\,,\hat{p}_{n})\right)+(1-\hat{p}_1)\cdot \left(1+\bar{\mu }(\hat{p}_{\beta },\ldots\,,\hat{p}_{n})\right)\\ &\le (\beta -1)\hat{p}_1+1+\bar{\mu }(\hat{p}_{\beta },\ldots\,,\hat{p}_{n}). \end{align*}
Therefore for \(\beta = 2, \ldots\,, n\) , we recursively have
\begin{align*} \bar{\mu }^\ast &\le (\beta -1)\hat{p}_1+1+\bar{\mu }(\hat{p}_{\beta },\ldots\,,\hat{p}_{n}) \le (\beta -1)\hat{p}_1+(\beta ^2-\beta)\hat{p}_{\beta }+2+\bar{\mu }(\hat{p}_{\beta ^2},\ldots\,,\hat{p}_{n})\\ &\le \cdots \le \sum _{k=0}^{L(\beta)-1}(\beta ^{k+1}-\beta ^{k})\hat{p}_{\beta ^k}+L(\beta)+\bar{\mu }(\hat{p}_{\beta ^{L(\beta)}},\ldots\,,\hat{p}_{n}). \end{align*}
Meanwhile, SKIP performs sampling with standard geometric distribution \(G(\hat{p}_{\beta ^{L(\beta)}})\) on \(\lbrace \hat{p}_{\beta ^{L(\beta)}},\ldots\,,\hat{p}_{n}\rbrace\) , which indicates that \(\bar{\mu }(\hat{p}_{\beta ^{L(\beta)}},\ldots\,,\hat{p}_{n})=(n-\beta ^{L(\beta)}+1)\hat{p}_{\beta ^{L(\beta)}} +1\) . Therefore, \(\bar{\mu }^\ast \le \hat{\mu }(\beta)+L(\beta) + 1\) .□
Now, we are ready to prove our main result.
Proof of Theorem 4
Recall that SKIP takes an expected time of \(O(1+\bar{\mu })\) . By Lemma 22, we know that \(\bar{\mu }\le \min _{\beta \in \lbrace 2,\ldots\,,n\rbrace }\lbrace \hat{\mu }(\beta)+L(\beta) + 1\rbrace\) . In addition, according to the definition of \(\hat{p}_i\) under a given \(\beta\) , it is easy to verify that \(\hat{p}_i\le p_{\lceil i/\beta \rceil }\) . Thus,
\begin{equation*} \hat{\mu }(\beta)=\sum _{i=1}^{n}\hat{p}_i\le \sum _{i=1}^{n}p_{\lceil i/\beta \rceil }\le \beta \sum _{i=1}^{n}p_i=\beta \mu . \end{equation*}
Therefore, \(\bar{\mu }\le \min _{\beta \in \lbrace 2,\ldots\,,n\rbrace }\lbrace \beta \mu +\log _\beta n + 1\rbrace\) .
When \(\mu \ge {(\log n)}/{2}\) , \(\bar{\mu }\le 2\mu +\log _2 n\le 6\mu\) by setting \(\beta =2\) . Thus, Theorem 4 holds, since \(O(1+\bar{\mu })=O(\mu)\) .
Next, we consider \(\mu \lt {(\log n)}/{2}\) . Define
\begin{equation*} \gamma :=\frac{({\log n})/{\mu }}{\log \left(({\log n})/{\mu }\right)}\quad \text{and}\quad \beta ^\ast :=\lceil \gamma \rceil . \end{equation*}
Thus, \(\beta ^\ast \mu =O(\frac{{\log n}}{\log (({\log n})/{\mu })})\) and \(\log _{\beta ^\ast } n=O(\frac{{\log n}}{\log (({\log n})/{\mu })})\) . Therefore,
\begin{equation*} O(1+\bar{\mu })=O\left(1+\frac{{\log n}}{\log (({\log n})/{\mu })}\right). \end{equation*}
This completes the proof.□
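For intuition, the trade-off \(\beta \mu +\log _\beta n + 1\) can be checked numerically. The sketch below (our own illustration; the parameter values \(n\) and \(\mu\) are arbitrary choices in the regime \(\mu \lt (\log n)/2\) , and we use natural logarithms throughout) compares the bound at \(\beta ^\ast =\lceil \gamma \rceil\) against the exact minimum over \(\beta\) .

```python
import math

def bound(beta, mu, n):
    # The upper bound  beta * mu + log_beta(n) + 1  on 1 + mu_bar.
    return beta * mu + math.log(n, beta) + 1

n, mu = 10**6, 0.5          # mu < (log n)/2, the non-trivial regime
ratio = math.log(n) / mu    # (log n) / mu
gamma = ratio / math.log(ratio)
beta_star = math.ceil(gamma)

best = min(bound(b, mu, n) for b in range(2, 1001))
# beta* need not be the exact minimizer, but it stays within a
# constant factor of the best possible choice of beta.
assert bound(beta_star, mu, n) <= 2 * best
```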

B Variant of Greedy Algorithm

In this appendix, we present another revised greedy algorithm, which differs from Greedy-Degree in Algorithm 9. For ease of explanation, we name it Greedy-Cost.
Recall that in the second phase of HIST, we can stop the RR set generation process as soon as we hit any sentinel node, thereby reducing the average size of the RR sets. Suppose we have a function \(C_\mathcal {R}(S)\) that represents the amount of cost reduction on the collection \(\mathcal {R}\) of RR sets when \(S\) is selected as the sentinel set. Then we can design the Greedy-Cost algorithm by replacing Line 3 in Algorithm 1, that is,
\begin{equation*} v \leftarrow \operatorname{arg\,max}_{u\in V}(\Lambda _\mathcal {R}(S^*_{k}\cup \lbrace u\rbrace))-\Lambda _{\mathcal {R}}(S^*_k), \end{equation*}
with the following statements:
\begin{align*} &\mathcal {M} \leftarrow \operatorname{arg\,max}_{u\in V}(\Lambda _\mathcal {R}(S^*_{k}\cup \lbrace u\rbrace))-\Lambda _{\mathcal {R}}(S^*_k) \\ &v \leftarrow \operatorname{arg\,max}_{v^{\prime } \in \mathcal {M}} C_\mathcal {R}(S^*_{k}\cup \lbrace v^{\prime }\rbrace) - C_\mathcal {R}(S^*_k), \end{align*}
where \(\mathcal {M}\) is the set of nodes with the largest marginal coverage. Obviously, if \(|\mathcal {M}| = 1\) , there is only one candidate, and it must be selected as the sentinel node.
In the following, we describe how to define the cost function \(C_\mathcal {R}(\cdot)\) . Recall that when generating an RR set \(R\) , the sampled nodes are added to \(R\) one by one (see Algorithm 2). Thus, we say the index of \(u\) is \(i\) if it is the \(i\) th node added to \(R\) , where \(i = 0, 1, 2, \ldots , |R|-1\) . Let \(l(u, R)\) be the function that returns the index of \(u\) in \(R\) , with \(l(u, R) = |R|\) if \(u \not\in R\) . Due to the existence of the sentinel set \(S\) , the generation of \(R\) can be stopped immediately when it reaches the sentinel node \(u^* \in S\) with the minimum index,
\begin{align*} u^* = \operatorname{arg\,min}_{u^{\prime } \in S} l(u^{\prime }, R). \end{align*}
Therefore, we define the cost reduction function on an RR set \(R\) caused by the sentinel set \(S\) as
\begin{equation*} C(R, S) = |R| - l(u^*, R). \end{equation*}
It is easy to see that if no node of \(S\) is hit by \(R\) , then the cost reduction \(C(R, S) = 0\) , since \(l(u, R) = |R|\) for each \(u \in S\) ; that is, the sentinel set cannot save any sampling cost when generating \(R\) . Summing the cost reduction over all RR sets in \(\mathcal {R}\) , the function \(C_\mathcal {R}(S)\) is defined as
\begin{equation*} C_\mathcal {R}(S) = \sum _{R \in \mathcal {R}} C(R, S). \end{equation*}
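The definitions above translate directly into code. The following Python sketch (the helper names are ours) computes \(C(R, S)\) and \(C_\mathcal {R}(S)\) for RR sets represented as lists of node ids in addition order.

```python
def node_index(u, R):
    """l(u, R): the position of u in R (0-based addition order),
    or |R| if u was never added to R."""
    try:
        return R.index(u)
    except ValueError:
        return len(R)

def cost_reduction(R, S):
    """C(R, S) = |R| - l(u*, R), where u* is the sentinel node in S
    with the minimum index; 0 if no node of S is hit by R."""
    return len(R) - min(node_index(u, R) for u in S)

def total_cost_reduction(rr_sets, S):
    """C_R(S): the total cost reduction over the collection of RR sets."""
    return sum(cost_reduction(R, S) for R in rr_sets)

# In R = [3, 1, 4, 5], node 4 is the first sentinel reached (index 2),
# so by definition C(R, S) = 4 - 2 = 2.
assert cost_reduction([3, 1, 4, 5], {4}) == 2
assert cost_reduction([3, 1, 4, 5], {9}) == 0   # S not hit: no saving
assert total_cost_reduction([[3, 1, 4, 5], [2, 7]], {4, 7}) == 3
```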


Published In: ACM Transactions on Database Systems, Volume 47, Issue 3 (September 2022), 173 pages. ISSN: 0362-5915; EISSN: 1557-4644. DOI: 10.1145/3544001.

Publisher: Association for Computing Machinery, New York, NY, United States.

Publication History: Received 1 June 2021; revised 1 February 2022; accepted 1 April 2022; published online (Online AM) 19 May 2022; published 18 August 2022. Published in TODS Volume 47, Issue 3.

    Author Tags

    1. Influence maximization
    2. sampling

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • Hong Kong RGC ECS
    • Hong Kong RGC CRF
    • Hong Kong ITC ITF
    • CUHK Direct Grant
    • NSFC of China
    • National Natural Science Foundation of China
    • Beijing Natural Science Foundation
    • PCL
    • HKUST(GZ)
