Abstract
Approximate triangle counting has emerged as a prominent problem in graph stream research in the past few years, with applications ranging from social network analysis to web topic mining and motif detection. Many graph stream sampling and approximate triangle counting algorithms have been proposed, most of which guarantee unbiased estimation. However, they either cannot bound the memory overhead, or their results carry too much uncertainty because the sampling space they use is excessively large. In this article, we propose RFES, a set of one-pass stream algorithms for counting the global number of triangles in a fully dynamic graph stream in an unbiased, low-variance, and high-precision manner. RFES comprises three algorithms, RFES-BASE, RFES-IMPR, and RFES-FD, which represent the basic, improved, and fully dynamic versions, respectively. Each algorithm is based on our proposed first-edge reservoir sampling method, which shrinks the sampling space and thereby reduces the uncertainty of the triangles in the sample. It can handle fully dynamic data with a lower theoretical estimation variance than state-of-the-art algorithms. Extensive experimental results demonstrate that our RFES algorithms are more accurate and take less time. The source code of RFES can be downloaded from: https://github.com/BioLab310/RFES.
Availability of supporting data
All data generated or analyzed during this study are included in this published article.
References
Newman MEJ (2003) The structure and function of complex networks. SIAM Rev 45:167–256. https://doi.org/10.1137/S003614450342480
Berry JW, Hendrickson B (2011) Tolerating the community detection resolution limit with edge weighting. Phys Rev E Stat Nonlinear Soft Matter Phys. 83:056119. https://doi.org/10.1103/PhysRevE.83.056119
Suri S, Vassilvitskii S (2011) Counting triangles and the curse of the last reducer. In: Proceedings of the 20th International Conference on World Wide Web (WWW ’11). ACM, Hyderabad
Li ZJ, Lu YT, Zhang WP, Li RH, Guo J, Huang X, Mao R (2018) Discovering hierarchical subgraphs of k-core-truss. Data Sci Eng 3(2):136–149
Eckmann JP, Moses E (2001) Curvature of co-links uncovers hidden thematic layers in the world wide web. Proc Natl Acad Sci USA 99:5825–5829. https://doi.org/10.2307/3058584
Yang Z, Wilson C et al (2014) Uncovering social network sybils in the wild. ACM Trans Knowl Dis Data 8:259–265. https://doi.org/10.1145/2556609
Shin K, Eliassi-Rad T, Faloutsos C (2016) Corescope: graph mining using k-core analysis - patterns, anomalies and algorithms. In: 2016 IEEE 16th International Conference on Data Mining (ICDM), pp 469–478. https://doi.org/10.1109/ICDM.2016.0058
Yang X, Song C, Yu M et al (2022) Distributed triangle approximately counting algorithms in simple graph stream. ACM Trans Knowl Dis Data 16(4):1–43. https://doi.org/10.1145/3494562
Kavassery-Parakkat N, Hanjani KM, Pavan A (2018) Improved triangle counting in graph streams: power of multi-sampling. In: IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp 33–40
Jayaram R, Kallaugher J (2021) An optimal algorithm for triangle counting in the stream. https://doi.org/10.4230/LIPICS.APPROX/RANDOM.2021.11
Cormode G, Jowhari H (2019) Lp samplers and their applications: a survey. ACM Comput Surv 52(1):1–31. https://doi.org/10.1145/3297715
Zhang LL, Jiang H et al (2020) Reservoir-based sampling over large graph streams to estimate triangle counts and node degrees. Future Generation Comput Syst 108:244–255. https://doi.org/10.1016/j.future.2020.02.077
Watts D, Strogatz S (1998) Collective dynamics of small world networks. Nature 393:440–442. https://doi.org/10.1038/30918
Pavan A, Tangwongsan K et al (2013) Counting and sampling triangles from a graph stream. Proc VLDB Endow 6(14):1870–1881. https://doi.org/10.14778/2556549.2556569
Pinar A, Jha M, Seshadhri C (2013) A space-efficient streaming algorithm for estimating transitivity and triangle counts using the birthday paradox. ACM Trans Knowl Dis Data 9:1–21. https://doi.org/10.1145/2700395
Lim Y, Jung M, Kang U (2018) Memory-efficient and accurate sampling for counting local triangles in graph streams: From simple to multigraphs. ACM Trans Knowl Dis Data 12:1–28. https://doi.org/10.1145/3022186
Stefani LD, Epasto A, Riondato M, Upfal E (2016) TRIÈST: counting local and global triangles in fully-dynamic streams with fixed memory size. In: International Conference on Knowledge Discovery and Data Mining, pp 825–834. https://doi.org/10.1145/2939672.2939771
Shin K, Kim J, Hooi B (2018) Think before you discard: accurate triangle counting in graph streams with deletions. Springer, Cham, pp 141–157. https://doi.org/10.1007/978-3-030-10928-8_9
Singh P, Srinivasan V, Thomo A (2021) Fast and scalable triangle counting in graph streams: The hybrid approach. In: International Conference on Advanced Information Networking and Applications, pp 107–119. https://doi.org/10.1007/978-3-030-75075-6_9
Jung M, Lim Y, Lee S (2019) FURL: fixed-memory and uncertainty reducing local triangle counting for graph streams. Data Min Knowl Dis 33:1225–1253
Gou X, Zou L (2021) Sliding window-based approximate triangle counting over streaming graphs with duplicate edges. In: SIGMOD/PODS ’21: International Conference on Management of Data. https://doi.org/10.1145/3448016.3452800
Han G, Sethu H (2017) Edge sample and discard: a new algorithm for counting triangles in large dynamic graphs. In: 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp 44–48
Seshadhri C, Pinar A, Kolda TG (2014) Wedge sampling for computing clustering coefficients and triangle counts on large graphs. Stat Anal Data Min 7:294–307. https://doi.org/10.1002/sam.11224
Turk A, Türkoğlu D (2019) Revisiting wedge sampling for triangle counting. In: Proceedings of the 2019 World Wide Web Conference (WWW’19). https://doi.org/10.1145/3308558.3313534
Vitter JS (1985) Random sampling with a reservoir. ACM Trans Math Softw 11:37–57. https://doi.org/10.1145/3147.3165
Al-Kateb M, Lee BS, Wang XS (2007) Adaptive-size reservoir sampling over data streams. In: International Conference on Scientific and Statistical Database Management, pp 1–22. https://doi.org/10.1109/ssdbm.2007.29
Al-Kateb M, Lee BS (2014) Stratified reservoir sampling over heterogeneous data streams. Inf Syst 39:199–216. https://doi.org/10.1016/j.is.2012.03.005
Wei H, Cao HW, Yan MY et al (2021) BSR-TC: adaptively sampling for accurate triangle counting over evolving graph streams. Int J Softw Eng Knowl Eng 31:1561–1581. https://doi.org/10.1142/S021819402140012X
Gemulla R, Lehner W, Haas PJ (2008) Maintaining bounded-size sample synopses of evolving datasets. VLDB J 17:173–202. https://doi.org/10.1007/s00778-007-0065-y
Shin K (2017) WRS: waiting room sampling for accurate triangle counting in real graph streams. In: 2017 IEEE International Conference on Data Mining (ICDM), pp 1087–1092. https://doi.org/10.1109/ICDM.2017.143
Skala M (2013) Hypergeometric tail inequalities: ending the insanity. arXiv preprint arXiv:1311.5939
Acknowledgements
Not applicable.
Funding
This work is supported by the National Natural Science Foundation of China (61772124).
Author information
Authors and Affiliations
Contributions
CY contributed to the evolution of the research ideas and overall research goals and performed the experiments; HL contributed the theoretical proofs and the writing and revision of the initial draft; FW contributed to the writing, translation, and revision of the main content of this paper; ZL contributed to data analysis and visualization and performed the relevant comparison experiments; TR contributed to the collection and processing of data; HM contributed to the construction of the experimental platform and the verification of the experimental design; YZ contributed to the review of the first draft and the paper’s funding support.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no competing interests.
Ethical approval and consent to participate
Not applicable.
Human and animal ethics
Not applicable.
Consent for publication
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Theoretical results for RFES-B
In this section, we present the theoretical results (statements and proofs) not included in the main body.
1.1 A.1: Expectation
Before proving Lemma 4.1, we first introduce the following technical lemma, which illustrates a key property of the standard reservoir sampling approach; it restates Lemma A.1 in [17], where it is proven correct.
Lemma A.1
[17] For any \(T>M\), let A be any subset of \(E^T\) of size \(|A |=M\). Then, at the end of time step T, we have:
\(Pr(S=A)=\left( {\begin{array}{c}T\\ M\end{array}}\right) ^{-1},\)
i.e., the set of edges in S at the end of time T is a subset of \(E^T\) of size M chosen uniformly at random from all subsets of \(E^T\) of the same size. With this property of standard reservoir sampling clarified, we proceed to a detailed proof of Lemma 4.1.
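To make the property concrete, here is a minimal Python sketch of the classic reservoir sampling scheme (Vitter's Algorithm R [31]) that Lemma A.1 characterizes; the function name and stream representation are ours, not taken from the RFES implementation.

```python
import random

def reservoir_sample(stream, M, rng=None):
    """Classic reservoir sampling (Algorithm R): after T edges have arrived,
    every size-M subset of the stream prefix is equally likely to be the
    sample S, which is exactly the uniformity property of Lemma A.1."""
    rng = rng or random.Random()
    S = []
    for T, edge in enumerate(stream, start=1):
        if T <= M:
            S.append(edge)                # the first M edges are kept deterministically
        elif rng.random() < M / T:        # keep the new edge with probability M/T
            S[rng.randrange(M)] = edge    # evict a uniformly chosen resident edge
    return S
```

For \(T\le M\) the sample is the whole prefix, matching Case 1 of the proof below.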
Proof of Lemma 4.1
Proof. If \(k>min\{M,T\}\), we have \(Pr(B\subseteq S)=0\), because it is impossible for B to be contained in S in these cases. From now on, we assume \(k \le min\{M,T\}\).
Case 1: If \(T \le M\), then all edges that have appeared in the graph stream so far are saved to the edge sample set S. Then, \(E^T\subseteq S\) and \(Pr(B\subseteq S)=1=\varphi ^{-1}_{k,T}\).
Case 2: Assume that \(T>M\), and let \(\mathcal {B}\) be the family of subsets of \(E^T\) that: 1. have size M, and 2. contain B. That is, \(\mathcal {B}=\{C\subseteq E^T: \vert C\vert =M, B\subseteq C\}\). Each \(C\in \mathcal {B}\) is obtained by adding \(M-k\) of the \(T-k\) edges of \(E^T\setminus B\) to B, so we can count \(\mathcal {B}\) as:
\(\vert \mathcal {B}\vert =\left( {\begin{array}{c}T-k\\ M-k\end{array}}\right) .\)
From this and Lemma A.1 we then have:
\(Pr(B\subseteq S)=\sum _{C\in \mathcal {B}}Pr(S=C)=\left( {\begin{array}{c}T-k\\ M-k\end{array}}\right) \left( {\begin{array}{c}T\\ M\end{array}}\right) ^{-1}=\varphi ^{-1}_{k,T}.\)
\(\square \)
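As a quick numeric sanity check of Lemma 4.1 (under the reconstruction above, \(\varphi ^{-1}_{k,T}=\binom{T-k}{M-k}/\binom{T}{M}=\prod _{j=0}^{k-1}\frac{M-j}{T-j}\)), take for instance:

```latex
% Example: reservoir of size M = 100 after T = 1000 edges
\Pr(B \subseteq S) = \prod_{j=0}^{k-1}\frac{M-j}{T-j}:
\qquad
k=1:\ \Pr = \tfrac{100}{1000} = 0.1,\ \varphi_{1,T} = 10;
\qquad
k=2:\ \Pr = \tfrac{100}{1000}\cdot\tfrac{99}{999} \approx 0.0099,\ \varphi_{2,T} \approx 101.
```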
Now, we can prove Theorem 4.1 on the unbiasedness of the estimation computed by RFES-BASE (and on its exactness for \(T \le M\)).
Proof of Theorem 4.1
Proof. If \(T\le M\), all the edges appearing in the graph stream are stored in S, so the counting result of RFES-BASE is exact under this condition. That is to say, when \(T \le M\), the value of the counter \(\mu ^T\) is the exact number of global triangles in the graph \(G^T\) at time T.
Next, we prove that the RFES-BASE algorithm is unbiased when \(T>M\). Assume now that \(T>M\) and that \(\vert \Delta ^T\vert >0\); otherwise, \(\mu ^T=\vert \Delta ^S\vert =0\) and the RFES-BASE estimation is deterministically correct. Let \(\lambda =(a,b,c)\in \Delta ^T\) be a triangle, where a, b, c are edges in \(E^T\); suppose a is the first edge of \(\lambda \) and c is the last edge of \(\lambda \). Let \(\delta ^T_{\lambda }\) be a random variable that takes value \(\varphi ^T\) if \(\lambda =(a,b,c) \in \Delta ^S\) (i.e., \(\{a\} \subseteq S\)) at the end of time step T, and 0 otherwise.
From Lemma 4.1, we have that
\(E[\delta ^T_{\lambda }]=\varphi ^T Pr(\{a\}\subseteq S)=\varphi ^T\varphi ^{-1}_{1,T}=1.\)
We can write \(\varphi ^T\mu ^T=\sum _{\lambda \in \Delta ^T}\delta ^T_{\lambda }\), and from this, (16), and linearity of expectation, we have:
\(E[\varphi ^T\mu ^T]=\sum _{\lambda \in \Delta ^T}E[\delta ^T_{\lambda }]=\vert \Delta ^T\vert .\)
\(\square \)
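The unbiasedness claim is easy to check empirically. The following sketch is ours: it reduces RFES-BASE to its counting core (membership of each triangle's first edge in the final reservoir, cf. the definition of \(\delta ^T_{\lambda }\) above) and reuses `reservoir_sample` from the earlier sketch.

```python
import random
from statistics import mean

def triangle_first_edges(stream):
    """For each triangle of the streamed graph, return its first (earliest) edge."""
    pos = {e: t for t, e in enumerate(stream)}       # arrival time of every edge
    adj = {}
    for u, v in stream:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    firsts, seen = [], set()
    for u, v in stream:
        for w in adj[u] & adj[v]:                    # w closes a triangle with (u, v)
            tri = tuple(sorted((u, v, w)))
            if tri not in seen:
                seen.add(tri)
                sides = [tuple(sorted(e)) for e in ((u, v), (u, w), (v, w))]
                firsts.append(min(sides, key=pos.get))
    return firsts

rng = random.Random(42)
stream = sorted({tuple(sorted(rng.sample(range(30), 2))) for _ in range(200)})
firsts = triangle_first_edges(stream)
M = 40
phi = len(stream) / M                                # varphi^T = T/M since T > M here

def one_estimate():
    S = set(reservoir_sample(stream, M, rng))        # reservoir from the sketch above
    return phi * sum(e in S for e in firsts)         # rescaled surviving first edges

runs = [one_estimate() for _ in range(3000)]
print(len(firsts), round(mean(runs), 2))             # the two numbers should nearly agree
```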
1.2 A.2: Variance
We now analyze the variance of the estimation returned by RFES-BASE for \(T>M\) (the variance is 0 for \(T\le M\)).
Proof of Theorem 4.2
Proof. Assume \(\vert \Delta ^T \vert >0\); otherwise, the estimation is deterministically correct and has variance 0, and the thesis holds. Let \(\alpha \in \Delta ^T\) be any triangle and let \(\delta ^T_{\alpha }\) be as in the proof of Theorem 4.1. Since \(E[\delta ^T_{\alpha }]=1\) by Theorem 4.1, we have \(Var(\delta ^T_{\alpha })=E[(\delta ^T_{\alpha })^2]-E^2[\delta ^T_{\alpha }]=(\varphi ^T)^2\frac{1}{\varphi ^T}-1=\varphi ^T-1\). From this and the definition of variance and covariance, we obtain:
\(Var[\varphi ^T\mu ^T]=\sum _{\alpha \in \Delta ^T}Var[\delta ^T_{\alpha }]+\sum _{\alpha \ne \beta \in \Delta ^T}Cov[\delta ^T_{\alpha },\delta ^T_{\beta }]=\vert \Delta ^T\vert (\varphi ^T-1)+\sum _{\alpha \ne \beta \in \Delta ^T}\left( E[\delta ^T_{\alpha }\delta ^T_{\beta }]-1\right) .\)   (A.6)
Assume now \(\vert \Delta ^T \vert \ge 2\); otherwise, we have \(\varrho ^T=\xi ^T=0\), and the thesis holds as the second term on the r.h.s. of (A.6) is 0. Let \(\alpha =(p,q,r)\) and \(\beta =(x,y,z)\) be two distinct triangles in \(\Delta ^T\), where p is the first edge of \(\alpha \), r is the last edge of \(\alpha \), x is the first edge of \(\beta \), and z is the last edge of \(\beta \). If \(\alpha \) and \(\beta \) do not share the first edge p (or x), we have \(\delta ^T_{\alpha }\delta ^T_{\beta }=\varphi ^T\varphi ^T=\varphi ^2_{1,T}\) if the first edges of \(\alpha \) and \(\beta \) are both in S at the end of time T, and \(\delta ^T_{\alpha }\delta ^T_{\beta }=0\) otherwise. From Lemma 4.1, we have that:
\(E[\delta ^T_{\alpha }\delta ^T_{\beta }]=\varphi ^2_{1,T}Pr(\{p,x\}\subseteq S)=\varphi ^2_{1,T}\varphi ^{-1}_{2,T}.\)   (A.7)
If instead \(\alpha \) and \(\beta \) share exactly the first edge p (or x), we have \(\delta ^T_{\alpha }\delta ^T_{\beta }=\varphi ^2_{1,T}\) if the shared first edge is in S at the end of time T, and \(\delta ^T_{\alpha }\delta ^T_{\beta }=0\) otherwise. From Lemma 4.1, we then have that:
\(E[\delta ^T_{\alpha }\delta ^T_{\beta }]=\varphi ^2_{1,T}Pr(\{p\}\subseteq S)=\varphi ^2_{1,T}\varphi ^{-1}_{1,T}=\varphi _{1,T}.\)   (A.8)
The thesis follows by combining (A.6), (A.7), and (A.8), recalling the definitions of \(\varrho ^T\) and \(\xi ^T\), and slightly reorganizing the terms. \(\square \)
Appendix B: Theoretical results for RFES-I
1.1 B.1: Expectation
Proof of Theorem 4.3
Proof. If \(T\le M\), all the edges appearing in the graph stream are stored in S, so RFES-IMPR behaves exactly like RFES-BASE, and the statement follows from Theorem 4.1.
Assume now \(T>M\) and \(\vert \Delta ^T\vert >0\); otherwise, the algorithm deterministically returns 0 as an estimation, and the thesis follows. Let \(\lambda \in \Delta ^T\) be any triangle, denote with a, b, and c the edges of \(\lambda \), and assume, w.l.o.g., that they appear in this order (not necessarily consecutively) on the stream. Let \(T_{\lambda }\) be the time step at which c appears on the stream. Let \(\delta _{\lambda }\) be a random variable that takes value \(\varphi _{1,T_{\lambda }}\) if the first edge a of \(\lambda \) is in S at the end of time step \(T_{\lambda }\), and 0 otherwise. Since it must be \(T_{\lambda } \ge 1\), from Lemma 4.1 we have that:
\(E[\delta _{\lambda }]=\varphi _{1,T_{\lambda }}Pr(\{a\}\subseteq S)=\varphi _{1,T_{\lambda }}\varphi ^{-1}_{1,T_{\lambda }}=1.\)   (B9)
When the third edge c of triangle \(\lambda \) arrives at time \(T_\lambda \), the algorithm uses the two nodes of c to update the neighbor node lists of the nodes of the related edges in S sampled before \(T_\lambda \), calls the UpdateCounter function, and then increases the counter by \(\vert \mathbb {N}^{S\_C}_{u^S\_{list},v^S\_{list}}\vert \varphi _{1, T_\lambda }\). Here \(\vert \mathbb {N}^{S\_C}_{u^S\_{list},v^S\_{list}}\vert \) denotes the number of common neighbor lists that changed (\(\le 2\)) after the neighbor node lists of the related edges are updated with the two vertices of c; a new edge in the graph stream can add at most two triangles at a time. Once the common neighbor node lists of the two nodes of an edge already in the sample set S change, the edge arriving at time \(T_\lambda \) forms new triangles with the edges in S. All such triangles, completed by the arrival of c, have corresponding random variables taking the same value \(\varphi _{1,T_\lambda }\). This means that the random variable \(\mu ^T\) can be expressed as:
\(\mu ^T=\sum _{\lambda \in \Delta ^T}\delta _{\lambda }.\)
From this, linearity of expectation, and (B9), we get:
\(E[\mu ^T]=\sum _{\lambda \in \Delta ^T}E[\delta _{\lambda }]=\vert \Delta ^T\vert .\)
\(\square \)
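To make the update step concrete, below is a heavily simplified Python sketch (ours; the function and variable names are hypothetical, and the paper's per-edge neighbor-list bookkeeping is compressed into one adjacency structure over sampled edges). It shows the essential RFES-IMPR idea: a triangle closed by the arriving edge c contributes \(\varphi _{1,T_\lambda }\) to the counter immediately, before any sampling decision about c, assuming \(\varphi _{1,T}=\max \{1,T/M\}\) as in the worked example of Appendix A.

```python
def update_counter(sample_adj, mu, c, T, M):
    """sample_adj: adjacency sets built from the edges currently in S.
    When edge c = (u, v) arrives at time T, every common neighbor of u and v
    in the sampled subgraph closes a triangle; each one adds
    varphi_{1,T} = max(1, T/M) to the running counter mu (cf. (B9))."""
    u, v = c
    phi = max(1.0, T / M)
    closed = sample_adj.get(u, set()) & sample_adj.get(v, set())
    return mu + phi * len(closed)
```

Because each closed triangle is weighted at its closing time \(T_\lambda \), the counter directly realizes \(\mu ^T=\sum _{\lambda \in \Delta ^T}\delta _{\lambda }\) as described above.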
1.2 B.2: Variance
Proof of Lemma 4.2
Proof. Consider first the case where all edges of \(\alpha =(p,q,r)\) appear on the stream before any edge of \(\beta =(x,y,z)\), i.e., \(T_p<T_q<T_r<T_x<T_y<T_z\).
The presence or absence of p in S at the beginning of the time step \(T_\beta \) (i.e., whether \(D_\alpha \) happens or not) has no effect whatsoever on the probability that x is in the sample set S at the beginning of the time step \(T_z\). Hence in this case, \(Pr(D_\beta \vert D_\alpha )=Pr(D_\beta )\).
Consider now the case where the first edge of triangle \(\alpha \) appears in the graph stream before the third edge of triangle \(\beta \). Define now the events:
—A: “at time \(T_\alpha \), the edge x has been sampled and remains in the sample set S”, and
—B: “during the period from \(T_r\) to \(T_z\) (including time points \(T_r\) and \(T_z\)), the incoming edges in the graph stream do not replace and delete x from S; that is, during this time, x is always kept in the sample set S”.
Therefore, we can write \(D_\beta \) as \(D_\beta =A\cap B\).
Hence,
\(Pr(D_\beta \vert D_\alpha )=Pr(A\cap B\vert D_\alpha )=Pr(A\vert D_\alpha )Pr(B\vert A\cap D_\alpha ).\)   (B12)
We now show that \(Pr(A\vert D_\alpha )\le Pr(A)\).
If we assume that \(T_r\le M\), then all the edges that appeared on the stream up until the beginning of \(T_r\) are in S. Therefore, \(Pr(A\vert D_\alpha )\le Pr(A)=1\).
Assume instead that \(T_r>M\). Among the \(\left( {\begin{array}{c}T_r\\ M\end{array}}\right) \) subsets of \(E^{T_r}\) of size M, there are \(\left( {\begin{array}{c}T_r-1\\ M-1\end{array}}\right) \) that contain edge p. At time \(T_r\), the storage of edge p in S affects the storage of edge x in S. Therefore, \(Pr(A\vert D_\alpha )=\frac{\left( {\begin{array}{c}T_r-1-1\\ M-1-1\end{array}}\right) }{\left( {\begin{array}{c}T_r-1\\ M-1\end{array}}\right) }=\frac{M-1}{T_r-1}\). According to our initial assumptions and from the fact that for any \(j\ge 0\) and any \(n\ge m > j\) it holds \(\frac{m-j}{n-j}\le \frac{m}{n}\), we then have \(Pr(A\vert D_\alpha )\le Pr(A)\). This implies, from (B12), that
\(Pr(D_\beta \vert D_\alpha )\le Pr(A)Pr(B\vert A\cap D_\alpha ).\)   (B13)
Consider now event B. When conditioned on A, event B is actually independent of event \(D_\alpha \). Thus, \(Pr(B\vert A\cap D_\alpha )=Pr(B\vert A)\). Putting this together with (B13), we obtain \(Pr(D_\beta \vert D_\alpha )=Pr(A\cap B\vert D_\alpha )=Pr(A\vert D_\alpha )Pr(B\vert A\cap D_\alpha )\le Pr(A)Pr(B\vert A\cap D_\alpha )=Pr(A)Pr(B\vert A)=Pr(A\cap B)=Pr(D_\beta )\), where the last equality follows from the fact that \(D_\beta =A\cap B\) by definition. \(\square \)
Proof of Theorem 4.4
Proof. Assume \(\vert \Delta ^T\vert >0\); otherwise, the RFES-IMPR estimation is deterministically correct and has variance 0, and the thesis holds. Let \(\alpha \in \Delta ^T \) and let \(\delta _\alpha \) be a random variable that takes value \(\varphi _{1, T_r}\) if the first edge of \(\alpha \) is in S at the end of the time step \(T_r\), and 0 otherwise. From the variance formula and its properties, we get \(Var[\delta _\alpha ]=E[(\delta _\alpha )^2]-E^2[\delta _\alpha ]=\varphi _{1, T_r}-1\le \varphi _{1, T}\). Therefore, the variance of the RFES-IMPR algorithm can be expressed as:
\(Var[\mu ^T]=\sum _{\alpha \in \Delta ^T}Var[\delta _{\alpha }]+\sum _{\alpha \ne \beta \in \Delta ^T}Cov[\delta _{\alpha },\delta _{\beta }]\le \vert \Delta ^T\vert (\varphi _{1,T}-1)+\sum _{\alpha \ne \beta \in \Delta ^T}\left( E[\delta _{\alpha }\delta _{\beta }]-1\right) .\)   (B14)
For any triangle \(\alpha \in \Delta ^T\), define \(q_\alpha =\varphi _{1,T_r}\). Assume now \(\vert \Delta ^T\vert \ge 2\); otherwise, we have \(\varrho ^T=\eta ^T=0\), and the thesis holds as the second term on the r.h.s. of (B14) is 0. Let now \(\alpha \) and \(\beta \) be two distinct triangles in \(\Delta ^T\) (the two triangles may share edges). When calculating \(E[\delta _\alpha \delta _\beta ]=q_\alpha q_\beta Pr(\delta _\alpha \delta _\beta =q_\alpha q_\beta )\), it is necessary that both first edges p and x are sampled and stored in S, and that the remaining four edges of the two triangles update the neighbor node lists of the two nodes of each first edge. The event “\(\delta _\alpha \delta _\beta =q_\alpha q_\beta \)” is the intersection of events \(D_\alpha \cap D_\beta \), where \(D_\alpha \) is the event that the first edge of \(\alpha \) is in S at the end of the time step \(T_\alpha \), and similarly for \(D_\beta \). Next, we calculate \(Pr(D_\alpha \cap D_\beta )\) in the various possible cases.
(1) If \(\alpha \) and \(\beta \) share an edge l, we perform a detailed analysis based on whether l is the first or the third edge of \(\alpha \) and \(\beta \).
\(\textcircled {1}\) \(\alpha \) and \(\beta \) share the first edge. We suppose \(T_\alpha < T_\beta \) and let event \(E_1\) be “at time \(T_\alpha \), the shared edge l is in S”, and event \(E_2\) be “in the time period from \(T_\alpha \) to \(T_\beta \), the edges replaced and deleted from S do not include l”. Then, \(Pr(D_\beta \vert D_\alpha )=Pr(E_1\vert D_\alpha )Pr(E_2\vert E_1 \cap D_\alpha )=Pr(E_1)Pr(E_2\vert E_1)\). From Lemma 4.1, we then have \(Pr(E_2\vert E_1)=\prod _{j=max\{T_\alpha ,M\}}^{T_\beta }((1-\frac{M}{j})+\frac{M}{j}\frac{M-1}{M})=\prod _{j=max\{T_\alpha ,M\}}^{T_\beta }\frac{j-1}{j}=\frac{max\{T_\alpha -1,M-1\}}{max\{M-1,T_\beta \}}\le \frac{1}{q_\beta }\frac{max\{M,T_\alpha -1\}}{M}\) (a telescoping product; see the identity displayed after this case analysis). Thus, \(Pr(D_\beta \vert D_\alpha )=Pr(E_1)Pr(E_2\vert E_1)\le \frac{1}{q_\alpha }\frac{1}{q_\beta }\frac{max\{M,T_\alpha -1\}}{M}\). From the conditional probability formula and Lemma 4.2, we have \(Pr(D_\beta \cap D_\alpha )=Pr(D_\beta \vert D_\alpha )Pr(D_\alpha )\le \frac{1}{q_\alpha }\frac{1}{q_\beta }\frac{1}{q_\alpha }\frac{max\{M,T_\alpha -1\}}{M}\le \frac{1}{q^2_\alpha q_\beta }\frac{max\{M, T_\alpha -1\}}{M}\).
\(\textcircled {2}\) \(\alpha \) and \(\beta \) share the third edge. Let \(T_l\) be the time when l appears in the graph stream; then \(T_l=T_\alpha =T_\beta \). The event “\(D_\alpha \cap D_\beta \)” occurs if and only if both first edges p and x are in S at the end of time \(T_l\). From Lemma 4.1, we have \(Pr(D_\beta \cap D_\alpha )=\frac{1}{\varphi _{2,T_l}}=\frac{1}{\varphi _{2,T_\alpha }}=\frac{1}{\varphi _{1,T_\alpha }}\frac{\varphi _{1,T_\alpha }}{\varphi _{2,T_\alpha }}\le \frac{1}{q_\alpha }\frac{\varphi _{1,T_\alpha }}{\varphi _{2,T_\alpha }}\le \frac{1}{q_\alpha }\frac{M-1}{T_\alpha -1}\le \frac{1}{q_\alpha }\frac{M-1}{T_\beta -1}\le \frac{1}{q_\alpha }\frac{M}{T_\beta }\le \frac{1}{q_\alpha }\frac{1}{q_\beta }\le \frac{1}{q_\alpha q_\beta }\).
(2) If \(\alpha \) and \(\beta \) do not share a common first edge or a common third edge, we can distinguish the following cases.
\(\textcircled {1}\) \(\alpha \) and \(\beta \) have a shared edge l, and l is the third edge of \(\alpha \) and the first edge of \(\beta \). Under this condition, both events \(D_\alpha \) and \(D_\beta \) occur independently of each other. Thus, \(Pr(D_\alpha \cap D_\beta )=Pr(D_\alpha )Pr(D_\beta )=\frac{1}{q_\alpha }\frac{1}{q_\beta }=\frac{1}{q_\alpha q_\beta }\).
\(\textcircled {2}\) \(\alpha \) and \(\beta \) do not share any edges and \(T_\alpha < T_x\). Both events \(D_\alpha \) and \(D_\beta \) occur independently of each other. Thus, \(Pr(D_\alpha \cap D_\beta )=Pr(D_\alpha )Pr(D_\beta )=\frac{1}{q_\alpha }\frac{1}{q_\beta }=\frac{1}{q_\alpha q_\beta }\).
\(\textcircled {3}\) \(\alpha \) and \(\beta \) may share the second edge, and \(T_p<T_x<T_\alpha <T_\beta \). Let event \(E_3\) be “at time \(T_\alpha \), edges p and x are in S”, and event \(E_4\) be “in the time period from \(T_\alpha \) to \(T_\beta \), the edges replaced and deleted from S do not include x”. Then, \(Pr(D_\beta \vert D_\alpha )=Pr(E_3\vert D_\alpha )Pr(E_4\vert E_3\cap D_\alpha )\). If \(T_\alpha \le M\), edges p and x must be in the sample set S. Consider instead the case \(T_\alpha >M\); the event \(D_\alpha \) holding means that p must be in S at that moment. At this time, all subsets of \(E^{T_\alpha }\) of size M containing edge p have an equal probability of being S, from Lemma A.1. There are \(\left( {\begin{array}{c}T_\alpha -1\\ M-1\end{array}}\right) \) such sets. Among these, there are \(\left( {\begin{array}{c}T_\alpha -2\\ M-2\end{array}}\right) \) sets that also contain x. Therefore, if \(T_\alpha >M\), we have \(Pr(E_3\vert D_\alpha )=\frac{\left( {\begin{array}{c}T_\alpha -2\\ M-2\end{array}}\right) }{\left( {\begin{array}{c}T_\alpha -1\\ M-1\end{array}}\right) }=\frac{M-1}{T_\alpha -1}\). Considering what we said before for the case \(T_\alpha \le M\), we then have \(Pr(E_3\vert D_\alpha )=min\{1,\frac{M-1}{T_\alpha -1}\}\). We also have \(Pr(E_4\vert E_3\cap D_\alpha )=\prod _{j=max\{T_\alpha ,M\}}^{T_\beta }((1-\frac{M}{j})+\frac{M}{j}\frac{M-1}{M})=\prod _{j=max\{T_\alpha ,M\}}^{T_\beta }\frac{j-1}{j}=\frac{max\{T_\alpha -1,M-1\}}{max\{M-1,T_\beta \}}\). Therefore, \(Pr(D_\beta \vert D_\alpha )=Pr(E_3\vert D_\alpha )Pr(E_4\vert E_3\cap D_\alpha )=min\{1,\frac{M-1}{T_\alpha -1}\}\frac{max\{T_\alpha -1,M-1\}}{max\{M-1,T_\beta \}}\). With a case analysis, one can show that \(Pr(D_\alpha \cap D_\beta )\le \frac{1}{q_\alpha q_\beta }\frac{max\{M,T_\alpha -1\}}{M} min\{1,\frac{M-1}{T_\alpha -1}\}\).
\(\textcircled {4}\) \(\alpha \) and \(\beta \) may share the second edge, and \(T_p<T_x<T_\beta <T_\alpha \). Let event \(E_5\) be “at time \(T_\beta \), edges p and x are in S”, and event \(E_6\) be “in the time period from \(T_\beta \) to \(T_\alpha \), the edges replaced and deleted from S do not include p”. Then, \(Pr(D_\alpha \vert D_\beta )=Pr(E_5\vert D_\beta )Pr(E_6\vert E_5\cap D_\beta )\). If \(T_\beta \le M\), edges p and x must be in the sample set S. Consider instead the case \(T_\beta >M\); the event \(D_\beta \) holding means that x must be in S at that moment. At this time, all subsets of \(E^{T_\beta }\) of size M containing edge x have an equal probability of being S, from Lemma A.1. There are \(\left( {\begin{array}{c}T_\beta -1\\ M-1\end{array}}\right) \) such sets. Among these, there are \(\left( {\begin{array}{c}T_\beta -2\\ M-2\end{array}}\right) \) sets that also contain p. Therefore, if \(T_\beta >M\), we have \(Pr(E_5\vert D_\beta )=\frac{\left( {\begin{array}{c}T_\beta -2\\ M-2\end{array}}\right) }{\left( {\begin{array}{c}T_\beta -1\\ M-1\end{array}}\right) }=\frac{M-1}{T_\beta -1}\). Considering what we said before for the case \(T_\beta \le M\), we then have \(Pr(E_5\vert D_\beta )=min\{1,\frac{M-1}{T_\beta -1}\}\). We also have \(Pr(E_6\vert E_5\cap D_\beta )=\prod _{j=max\{T_\beta ,M\}}^{T_\alpha }((1-\frac{M}{j})+\frac{M}{j}\frac{M-1}{M})=\prod _{j=max\{T_\beta ,M\}}^{T_\alpha }\frac{j-1}{j}=\frac{max\{T_\beta -1,M-1\}}{max\{M-1,T_\alpha \}}\). Therefore, \(Pr(D_\alpha \vert D_\beta )=Pr(E_5\vert D_\beta )Pr(E_6\vert E_5\cap D_\beta )=min\{1,\frac{M-1}{T_\beta -1}\}\frac{max\{T_\beta -1,M-1\}}{max\{M-1,T_\alpha \}}\). With a case analysis, one can show that \(Pr(D_\alpha \cap D_\beta )\le \frac{1}{q_\alpha q_\beta }\frac{max\{M,T_\beta -1\}}{M} min\{1,\frac{M-1}{T_\beta -1}\}\).
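The product manipulation used for \(Pr(E_2\vert E_1)\) in case \(\textcircled {1}\) and for \(Pr(E_4\vert E_3\cap D_\alpha )\) and \(Pr(E_6\vert E_5\cap D_\beta )\) in cases \(\textcircled {3}\) and \(\textcircled {4}\) is a plain telescoping identity:

```latex
% Telescoping product used in the case analysis above
\prod_{j=a}^{b}\frac{j-1}{j}
  = \frac{a-1}{a}\cdot\frac{a}{a+1}\cdots\frac{b-1}{b}
  = \frac{a-1}{b},
\qquad \text{e.g., } a=\max\{T_\alpha, M\},\; b=T_\beta .
```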
To recap, when considering two different triangles \(\alpha \) and \(\beta \), we have several cases discussed above. Therefore, we can finally express the upper bound value of the variance of the RFES-IMPR algorithm as: \(Var[\mu ^T]\le \vert \Delta ^T \vert (\varphi _{1,T}-1)+\varrho ^T(\frac{1}{q_\alpha }\frac{max\{M,T_\alpha -1\}}{M}-1)+\eta ^T max\{(\frac{max\{M,T_\alpha -1\}}{M} min\{1,\frac{M-1}{T_\alpha -1}\})-1, (\frac{max\{M,T_\beta -1\}}{M} min\{1,\frac{M-1}{T_\beta -1}\})-1\}\). \(\square \)
Appendix C: Theoretical results for RFES-FD
1.1 C.1: Expectation
Before proving Theorem 4.5, we need the following lemmas. The first restates Lemma A.4 in [17].
Lemma C.1
[17] For any \(T>0\) and any \(j(0\le j\le s^T)\), let \(\mathcal {B}^T\) be the collection of subsets of \(E^T\) of size j. For any \(B \in \mathcal {B}^T\) it holds
\(Pr(S=B \mid M^T=j)=\left( {\begin{array}{c}s^T\\ j\end{array}}\right) ^{-1}.\)
That is, conditioned on its size at the end of time step T, S is equally likely to be, at the end of time step T, any of the subsets of \(E^T\) of that size.
Lemma C.2
Recall the definition of \(\kappa ^T\) from (8) in the text. We have
Our algorithm only needs to ensure that the first edge of any triangle appears in the sample set S. Therefore, at time T, the minimum capacity of the sample set S is 1, which also differs from the TRIÈST algorithm.
The next lemma follows from Lemma C.1 in the same way as Lemma 4.1 follows from Lemma A.1.
Lemma C.3
For any time step \(T(T\ge 0)\) and any \(j(0\le j \le s^T)\), let B be any subset of \(E^T\) of size \(\vert B \vert =k\le s^T\). Then, at the end of time step T,
\(Pr(B\subseteq S \mid M^T=j)=\psi ^{-1}_{k,j,s^T}.\)
The next two lemmas discuss properties of RFES-FD for \(T<T^*\), where \(T^*\) is, as defined above, the first time at which \(\vert E^T\vert \) reaches size \(M+1\) (hence \(T^*\ge M+1\)). Lemma C.4 restates Lemma A.7 in [17], where it has been proven true.
Lemma C.4
[17] For all \(T<T^*\), we have:
(1) \(n^T_g=0\);
(2) \(S=E^T\); and
(3) \(M^T=s^T\).
Proof
The third conclusion of Lemma C.4 depends on the first two. Therefore, we concentrate on proving the first two conclusions, from which the third follows.
The proof is by induction on T. In the base case \(T=1\), the edge on the stream must be an insertion, and the algorithm deterministically inserts the edge in S. Assume now that the claim holds for all time steps up to (but excluding) some \(T\le T^*-1\). We now show that it also holds for T.
Assume the edge on the stream at time T is a deletion. The corresponding edge must be in S, from the inductive hypothesis. Hence RFES-FD removes it from S and increments the counter \(n_b\) by one. Thus it is still true that \(n^T_g=0\) and \(S=E^T\), and the thesis holds.
Assume now that the element on the stream at time T is an insertion. From the inductive hypothesis, we have that the current value of the counter \(n_g\) is zero.
If the counter \(n_b\) currently has value zero as well, then, because of the hypothesis that \(T<T^*\), it must be that \(\vert S\vert =M^{(T-1)}=s^{(T-1)}<M\). Therefore RFES-FD deterministically inserts the edge in S. Thus it is still true that \(n^T_g=0\) and \(S=E^T\), and the thesis holds.
If otherwise \(n_b > 0\), then RFES-FD flips a biased coin with a probability of heads equal to
\(\frac{n_b}{n_b+n_g}=\frac{n_b}{n_b+0}=1.\)
Therefore, RFES-FD always inserts the edge in S and decrements \(n_b\) by one. Thus it is still true that \(n^T_g=0\) and \(S=E^T\), and the thesis holds. \(\square \)
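The insertion/deletion bookkeeping that this induction reasons about mirrors the random-pairing scheme of Gemulla et al. [35]. The following is a minimal Python sketch of one maintenance step, under our reading of the counters: \(n_b\) counts uncompensated deletions of sampled edges and \(n_g\) those of unsampled edges; the names and details are ours, not the authors' code.

```python
import random

def fd_step(S, edge, op, M, n_b, n_g, s, rng=random):
    """One random-pairing maintenance step (cf. Gemulla et al. [35]).
    S: current edge sample; op: '+' for insertion, '-' for deletion;
    s: number of edges in the graph after this step (|E^T|)."""
    if op == '-':
        if edge in S:
            S.discard(edge); n_b += 1        # deletion hits the sample
        else:
            n_g += 1                          # deletion misses the sample
    elif n_b + n_g == 0:                      # no pending deletions: plain reservoir step
        if len(S) < M:
            S.add(edge)
        elif rng.random() < M / s:
            S.discard(rng.choice(sorted(S))); S.add(edge)
    elif rng.random() < n_b / (n_b + n_g):    # compensate a prior deletion
        S.add(edge); n_b -= 1                 # ... one that had hit the sample
    else:
        n_g -= 1                              # ... one that had missed the sample
    return S, n_b, n_g
```

In the regime of Lemma C.4 (\(T<T^*\)), \(n_g\) is always 0, so the coin comes up heads with probability \(n_b/(n_b+0)=1\) and every insertion enters S, exactly as the induction argues.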
The following Lemma C.5 is an immediate consequence of Lemma C.2 and Lemma C.4.
Lemma C.5
For all \(T<T^*\) such that \(s^T\ge 1\), we have \(\kappa ^T=1\).
That is to say, as long as the first edge of a triangle is in the sample set S, the neighbor node lists corresponding to the vertices of that first edge can be continuously updated with the subsequent edges in the graph stream.
Using the five lemmas presented above, we now give a detailed proof of the expectation of the RFES-FD algorithm.
Proof of Theorem 4.5
Proof. Assume for now that \(T<T^*\). From Lemma C.4, we have that \(M^T=s^T\). If \(M^T<1\), then it must be \(s^T<1\), hence \(\vert \Delta ^T\vert =0\) and indeed the algorithm returns \(\rho ^T=0\) in this case. If instead \(M^T=s^T\ge 1\), then we have
From Lemma C.5 we have that \(\kappa ^T=1\) for all \(T<T^*\); hence \(\rho ^T=\frac{\mu ^T}{\kappa ^T}=\mu ^T=\vert \Delta ^S\vert =\vert \Delta ^T\vert \) in these cases. The equality \(\vert \Delta ^S\vert =\vert \Delta ^T\vert \) follows from conclusion (2) of Lemma C.4. As described in this paper, \(\rho ^T=\mu ^T=\vert \Delta ^T\vert \) holds for all \(T\le T^*\).
Assume now that \(T\ge T^*\). Using the law of total expectation, we can write
\(E[\rho ^T]=\sum _{j=0}^{M}E[\rho ^T\vert M^T=j]Pr(M^T=j).\)   (C15)
Assume that \(\vert \Delta ^T\vert >0\); otherwise, the algorithm deterministically returns 0 as an estimation and the thesis follows. Let \(\lambda \in \Delta ^T\), and let \(\delta ^T_\lambda \) be a random variable that takes value
if the first edge of \(\lambda \) is in S at the end of the time instant T, and 0 otherwise. Thus, we can write
Then, using Lemma C.3 and Lemma C.5, we have, for \(1\le j\le min\{s^T, M\}\),
and
Plugging this into (C15), we can finally have,
The above computation proves Theorem 4.5 and further shows that the RFES-FD algorithm, like the previous two algorithms, is unbiased in estimating the number of triangles in the graph stream. \(\square \)
1.2 C.2: Variance
As in the analysis of the expectation of RFES-FD, we first provide some lemmas needed for a valid proof of this algorithm’s variance.
Lemma C.6
For any time \(T\ge T^*\), and any \(j(1\le j \le min\{s^T,M\})\), we have
Proof
The proof is analogous to that of Theorem 4.2, using j in place of M, \(s^T\) in place of T, \(\psi _{a, M^T,s^T}\) in place of \(\varphi _{a, T}\), and using Lemma C.3 instead of Lemma 4.1. The additional \((\kappa ^T)^{-2}\) multiplicative term comes from the \((\kappa ^T)^{-1}\) term used in the definition of \(\rho ^T\).
Lemma C.7
For any time \(T\ge T^*\) and any \(j(1<j\le min\{s^T,M\})\), if \(s^T\ge M\), we have
Proof
The proof follows by observing that the term \(\omega ^T(\dfrac{\psi _{1,j,s^T}}{\psi _{2,j,s^T}}-1)\) is non-positive, and that (C18) is a non-increasing function of the sample size.
The following lemma deals with properties of the r.v. \(M^T\), which is the conclusion of Lemma A.11 in [17].
Lemma C.8
[17, 31] Let \(T>T^*\), with \(s^T\ge M\). Let \(d^T=n^T_b+n^T_g\) denote the total number of unpaired deletions at time T. The sample size \(M^T\) follows the hypergeometric distribution.
We have
and for any \(0<c<1\),
Proof
Since \(T>T^*\), from the definition of \(T^*\), we have that \(M^T\) has reached size M at least once (at \(T^*\)). From this and the definition of \(d^T\) (the number of uncompensated deletions), we have that \(M^T\) cannot be less than \(M-d^T\). The rest of the proof for (C19) and (C20) follows the proof of Lemma A.11 in [17], and the concentration bound in (C21) follows from the properties of the hypergeometric distribution. \(\square \)
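For reference, the Hoeffding-type tail bounds for hypergeometric random variables surveyed by Skala [37] have the following shape (our restatement; this is the flavor of concentration invoked in (C21)): for X hypergeometric over n draws with success fraction p,

```latex
\Pr\bigl(X \le (p - t)\,n\bigr) \;\le\; e^{-2t^{2}n},
\qquad
\Pr\bigl(X \ge (p + t)\,n\bigr) \;\le\; e^{-2t^{2}n},
\qquad 0 < t < 1 .
```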
The following corollary is a consequence of Lemma A.11 in [17].
Corollary C.1
[17] Consider the execution of RFES-FD at time \(T>T^*\). Suppose we have \(d^T\le \alpha s^T\) \((0\le \alpha <1)\), with \(s^T\ge M\). If \(M\ge \frac{1}{2\sqrt{\alpha '-\alpha }}c'\ln s^T\) for \(\alpha<\alpha '<1\), we have:
Proof of Theorem 4.6
Proof. From the law of total variance, we have:
As shown in (C16) and (C17), for any \(j=0,1,...,M\) we have \(E[\rho ^T\vert M^T=j]\ge 0\). This, in turn, implies:
Let us consider separately the two main components of (C22). From Lemma C.7, we have:
By our hypothesis \(M\ge \dfrac{1}{2\sqrt{\alpha '-\alpha }}2\ln s^T\), we have, from Corollary C.1:
As \(\vert \Delta ^T\vert \le s^T\) and \(\zeta ^T\le \vert \Delta ^T\vert ^2\), we have:
We can therefore rewrite (C24) as:
Now, let us consider the term \(\sum _{j=0}^{M}E[\rho ^T\vert M^T=j]^2(1-Pr(M^T=j))Pr(M^T=j)\). Recall from (C16) and (C17) that \(E[\rho ^T\vert M^T=j]=\vert \Delta ^T\vert (\kappa ^T)^{-1}\) for \(j=1,2,...,M\) and \(E[\rho ^T\vert M^T=j]=0\) for \(j=0\). From Corollary C.1, we have that for \(j\le (1-\alpha ')M\) and \(M\ge \frac{1}{2\sqrt{\alpha '-\alpha }}2\ln s^T\),
and thus,
Let us now consider the values \(j>M(1-\alpha ')\); we have:
where the last steps of (C26) and (C27) follow since, by hypothesis, \(M\le s^T\).
The theorem follows from composing the upper bounds obtained in (C25), (C26), and (C27) according to (C22). \(\square \)
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yu, C., Liu, H., Wahab, F. et al. Global triangle estimation based on first edge sampling in large graph streams. J Supercomput 79, 14079–14116 (2023). https://doi.org/10.1007/s11227-023-05205-3