
Global triangle estimation based on first edge sampling in large graph streams

Published in The Journal of Supercomputing

Abstract

Approximate triangle counting has emerged as a prominent problem in graph stream research in recent years, with applications ranging from social network analysis to web topic mining and motif detection in bioinformatics. Many graph stream sampling and approximate triangle counting algorithms have been proposed, most of which guarantee unbiased estimation. However, they either cannot bound the memory overhead, or the uncertainty of their results is too great because they use an excessively large sampling space. In this article, we propose RFES, a set of one-pass stream algorithms for counting the global number of triangles in a fully dynamic graph stream in an unbiased, low-variance, and high-precision manner. RFES comprises three algorithms: RFES-BASE, RFES-IMPR, and RFES-FD, which represent the basic, improved, and fully dynamic versions, respectively. Each algorithm is based on our proposed first-edge reservoir sampling method, which shrinks the sampling space while reducing the uncertainty of the triangles in the sample. RFES can handle fully dynamic data with a lower theoretical estimation variance than state-of-the-art algorithms. Extensive experimental results demonstrate that RFES is more accurate and takes less time. The source code of RFES can be downloaded from https://github.com/BioLab310/RFES.


Availability of supporting data

All data generated or analyzed during this study are included in this published article.

References

  1. Newman MEJ (2003) The structure and function of complex networks. Siam Rev 45:167–256. https://doi.org/10.1137/S003614450342480

  2. Berry JW, Hendrickson B (2011) Tolerating the community detection resolution limit with edge weighting. Phys Rev E Stat Nonlinear Soft Matter Phys. 83:056119. https://doi.org/10.1103/PhysRevE.83.056119

  3. Suri S, Vassilvitskii S (2011) Counting triangles and the curse of the last reducer. In: Proceedings of the 20th International Conference on World Wide Web (WWW '11). ACM, Hyderabad

  4. Li ZJ, Lu YT, Zhang WP, Li RH, Guo J, Huang X, Mao R (2018) Discovering hierarchical subgraphs of k-core-truss. Data Sci Eng 3(2):136–149

  5. Eckmann JP, Moses E (2001) Curvature of co-links uncovers hidden thematic layers in the world wide web. Proc Nat Acad Sci US 99:5825–5829. https://doi.org/10.2307/3058584

  6. Zhi Y, Wilson C et al (2014) Uncovering social network sybils in the wild. Trans Knowl Dis Data 8:259–265. https://doi.org/10.1145/2556609

  7. Shin K, Eliassi-Rad T, Faloutsos C (2016) Corescope: graph mining using k-core analysis - patterns, anomalies and algorithms. In: 2016 IEEE 16th International Conference on Data Mining (ICDM), pp 469–478. https://doi.org/10.1109/ICDM.2016.0058

  8. Yang X, Song C, Yu M et al (2022) Distributed triangle approximately counting algorithms in simple graph stream. ACM Trans Knowl Dis Data 16(4):1–43. https://doi.org/10.1145/3494562

  9. Kavassery-Parakkat N, Hanjani KM, Pavan A (2018) Improved triangle counting in graph streams: power of multi-sampling. In: IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp 33–40

  10. Jayaram R, Kallaugher J (2021) An optimal algorithm for triangle counting in the stream. https://doi.org/10.4230/LIPICS.APPROX/RANDOM.2021.11

  11. Graham C, Hossein J (2019) Lp samplers and their applications: a survey. ACM Comput Surv 52(1):1–31. https://doi.org/10.1145/3297715

  12. Zhang LL, Jiang H et al (2020) Reservoir-based sampling over large graph streams to estimate triangle counts and node degrees. Future Generation Comput Syst 108:244–255. https://doi.org/10.1016/j.future.2020.02.077

  13. Watts D, Strogatz S (1998) Collective dynamics of small world networks. Nature 393:440–442. https://doi.org/10.1038/30918

  14. Pavan A, Tangwongsan K et al (2013) Counting and sampling triangles from a graph stream. Proc Vldb Endow 6(14):1870–1881. https://doi.org/10.14778/2556549.2556569

  15. Pinar A, Jha M, Seshadhri C (2013) A space-efficient streaming algorithm for estimating transitivity and triangle counts using the birthday paradox. ACM Trans Knowl Dis Data 9:1–21. https://doi.org/10.1145/2700395

  16. Lim Y, Jung M, Kang U (2018) Memory-efficient and accurate sampling for counting local triangles in graph streams: From simple to multigraphs. ACM Trans Knowl Dis Data 12:1–28. https://doi.org/10.1145/3022186

  17. Stefani LD, Epasto A, Riondato M, Upfal E (2016) TRIÈST: counting local and global triangles in fully-dynamic streams with fixed memory size. In: International Conference on Knowledge Discovery and Data Mining, pp 825–834. https://doi.org/10.1145/2939672.2939771

  18. Shin K, Kim J, Hooi B (2018) Think before you discard: accurate triangle counting in graph streams with deletions. Springer, Cham, pp 141–157. https://doi.org/10.1007/978-3-030-10928-8_9

  19. Singh P, Srinivasan V, Thomo A (2021) Fast and scalable triangle counting in graph streams: The hybrid approach. In: International Conference on Advanced Information Networking and Applications, pp 107–119. https://doi.org/10.1007/978-3-030-75075-6_9

  20. Jung MLY, Lee S (2019) FURL: fixed-memory and uncertainty reducing local triangle counting for graph streams. Data Min Knowl Dis 33:1225–1253

  21. Gou X, Zou L (2021) Sliding window-based approximate triangle counting over streaming graphs with duplicate edges. In: SIGMOD/PODS ’21: International Conference on Management of Data. https://doi.org/10.1145/3448016.3452800

  22. Han G, Sethu H (2017) Edge sample and discard: a new algorithm for counting triangles in large dynamic graphs. In: 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp 44–48

  23. Seshadhri C, Pinar A, Kolda TG (2014) Wedge sampling for computing clustering coefficients and triangle counts on large graphs. Stat Anal Data Min 7:294–307. https://doi.org/10.1002/sam.11224

  24. Turk A, Türkoğlu D (2019) Revisiting wedge sampling for triangle counting. In: Proceedings of the 2019 World Wide Web Conference (WWW’19). https://doi.org/10.1145/3308558.3313534

  25. Vitter JS (1985) Random sampling with a reservoir. ACM Trans Math Softw 11:37–57. https://doi.org/10.1145/3147.3165

  26. Al-Kateb M, Lee BS, Wang XS (2007) Adaptive-size reservoir sampling over data streams. In: International Conference on Scientific and Statistical Database Management, pp 1–22. https://doi.org/10.1109/ssdbm.2007.29

  27. Al-Kateb M, Lee BS (2014) Stratified reservoir sampling over heterogeneous data streams. Inf Syst 39:199–216. https://doi.org/10.1016/j.is.2012.03.005

  28. Wei H, Cao HW, Yan MY et al (2021) BSR-TC: adaptively sampling for accurate triangle counting over evolving graph streams. Int J Softw Eng Knowl Eng 31:1561–1581. https://doi.org/10.1142/S021819402140012X

  29. Gemulla R, Lehner W, Haas PJ (2008) Maintaining bounded-size sample synopses of evolving datasets. VLDB J 17:173–202. https://doi.org/10.1007/s00778-007-0065-y

  30. Shin K (2017) WRS: waiting room sampling for accurate triangle counting in real graph streams. In: 2017 IEEE International Conference on Data Mining (ICDM), pp 1087–1092. https://doi.org/10.1109/ICDM.2017.143

  31. Skala M (2013) Hypergeometric tail inequalities: ending the insanity. arXiv preprint arXiv:1311.5939

Download references

Acknowledgements

Not applicable.

Funding

This work is supported by the National Natural Science Foundation of China (61772124).

Author information

Authors and Affiliations

Authors

Contributions

CY contributed to the evolution of research ideas and overall research goals and performed experiments; HL contributed in terms of theoretical proof, the writing, and modification of the initial draft; FW contributed to participate in the writing, translation and revision of the main content of this paper; ZL contributed in terms of data analysis and visualization and performed relevant comparison experiments; TR contributed to the collection and processing of data; HM contributed to the construction of the experimental platform and the verification of the experimental design; YZ contributed to the review of the first draft and the paper’s funding support.

Corresponding author

Correspondence to Yuhai Zhao.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests.

Ethical approval and consent to participate

Not applicable.

Human and animal ethics

Not applicable.

Consent for publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Theoretical results for RFES-B

In this section, we present the theoretical results (statements and proofs) not included in the main body.

1.1 A.1: Expectation

Before proving Lemma 4.1, we first introduce the following technical lemma, which restates Lemma A.1 of [17] (proved there) and illustrates the key property of the standard reservoir sampling approach.

Lemma A.1

[17] For any \(T>M\), let A be any subset of \(E^T\) of size \(|A |=M\). Then, at the end of time step T, we have:

$$\begin{aligned} Pr(S=A)=\frac{1}{\binom{\vert E^T \vert }{M}}=\frac{1}{\binom{T}{M}} \end{aligned}$$
(A1)

i.e., the set of edges in S at the end of time T is a size-M subset of \(E^T\) chosen uniformly at random from all subsets of \(E^T\) of that size. With this property of standard reservoir sampling in hand, we now proceed to a detailed proof of Lemma 4.1.

Proof of Lemma 4.1

Proof. If \(k>min\{M,T\}\), we have \(Pr(B\subseteq S)=0\), because it is impossible for B to be a subset of S in these cases. From now on, we therefore assume \(k \le min\{M,T\}\).

Case 1: If \(T \le M\), then all edges that have appeared in the graph stream so far are saved to the edge sample set S. Then, \(E^T\subseteq S\) and \(Pr(B\subseteq S)=1=\varphi ^{-1}_{k,T}\).

Case 2: Assume that \(T>M\), and let \(\mathcal {B}\) be the family of subsets of \(E^T\) that: 1. have size M, and 2. contain B. That is, \(\mathcal {B}=\{C\subseteq E^T: \vert C\vert =M, B\subseteq C\}\). Its cardinality is:

$$\begin{aligned} \vert \mathcal {B} \vert =\binom{\vert E^T \vert -k}{M-k}=\binom{T-k}{M-k}. \end{aligned}$$
(A2)

From this and Lemma A.1, we then have:

$$\begin{aligned} Pr(B\subseteq S)=Pr(S\in \mathcal {B})=\sum _{C\in \mathcal {B}}Pr(S=C)=\frac{\binom{T-k}{M-k}}{\binom{T}{M}}=\prod _{i=0}^{k-1}\frac{M-i}{T-i}=\varphi ^{-1}_{k,T}. \end{aligned}$$
(A3)

\(\square \)
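The inclusion probability just derived is easy to check empirically. The following minimal Python sketch (our illustration, not part of the paper's RFES implementation) runs standard reservoir sampling over a stream of T items and verifies that a fixed k-subset B survives in S with probability \(\prod _{i=0}^{k-1}\frac{M-i}{T-i}=\varphi ^{-1}_{k,T}\); the parameters M, T, k below are arbitrary:

```python
import random

def reservoir_sample(stream, M):
    """Standard reservoir sampling [25]: keep a uniform size-M subset."""
    S = []
    for t, item in enumerate(stream, start=1):
        if t <= M:
            S.append(item)
        elif random.random() < M / t:
            S[random.randrange(M)] = item  # evict a uniformly chosen victim
    return set(S)

M, T, k, trials = 20, 100, 2, 100_000
B = set(range(k))                          # a fixed k-subset of the stream
hits = sum(B <= reservoir_sample(range(T), M) for _ in range(trials))
pred = 1.0
for i in range(k):                         # 1 / phi_{k,T} = prod (M-i) / (T-i)
    pred *= (M - i) / (T - i)
print(f"empirical {hits / trials:.4f} vs. predicted {pred:.4f}")  # both ~0.0384
```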

Now, we can prove Theorem 4.1 on the unbiasedness of the estimation computed by RFES-BASE (and on its exactness for \(T \le M\)).

Proof of Theorem 4.1

Proof. If \(T\le M\), all the edges appearing in the graph stream are stored in S, so the counting result of RFES-BASE is exact under this condition. That is, when \(T \le M\), the value of the counter \(\mu ^T\) is the exact number of global triangles in the graph \(G^T\) at time T.

Next, we prove that the RFES-BASE algorithm is unbiased when \(T>M\). Assume now that \(T>M\) and that \(\vert \Delta ^T\vert >0\); otherwise, \(\mu ^T=\vert \Delta ^S\vert =0\) and the RFES-BASE estimation is deterministically correct. Let \(\lambda =(a,b,c)\in \Delta ^T\) be a triangle, where a, b, c are edges in \(E^T\); suppose a is the first edge of \(\lambda \) and c is the last edge of \(\lambda \). Let \(\delta ^T_{\lambda }\) be a random variable that takes value \(\varphi ^T\) if \(\lambda =(a,b,c) \in \Delta ^S\) (i.e., \(a \in S\)) at the end of time step T, and 0 otherwise.

From Lemma 4.1, we have that,

$$\begin{aligned} E[\delta ^T_{\lambda }]=\varphi ^TPr(\lambda =(a,b,c) \in \Delta ^S)=\varphi ^TPr(a\in S)=\varphi ^T\frac{1}{\varphi _{1,T}}=\varphi ^T\frac{1}{\varphi ^T}=1. \end{aligned}$$
(A4)

We can write \(\varphi ^T\mu ^T=\sum _{\lambda \in \Delta ^T}\delta ^T_{\lambda }\), and from this, (16), and linearity of expectation, we have:

$$\begin{aligned} E[\varphi ^T\mu ^T]=E[\sum _{\lambda \in \Delta ^T}\delta ^T_{\lambda }]=\sum _{\lambda \in \Delta ^T}E[\delta ^T_{\lambda }]=\vert \Delta ^T\vert . \end{aligned}$$
(A5)

\(\square \)
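To make the quantity analyzed above concrete, here is an offline Monte-Carlo re-implementation (ours, not the authors' one-pass code) of the RFES-BASE estimate \(\varphi ^T\mu ^T\): reservoir-sample the edge stream, count the triangles whose first edge ended up in S, and rescale by \(\varphi ^T=\varphi _{1,T}=T/M\). Averaged over many runs, the estimate converges to \(\vert \Delta ^T\vert \), as Theorem 4.1 guarantees:

```python
import random
from itertools import combinations

def first_edge_indices(edges):
    """For every triangle of the final graph, the arrival index of its first edge."""
    idx = {frozenset(e): i for i, e in enumerate(edges)}
    verts = {v for e in edges for v in e}
    firsts = []
    for a, b, c in combinations(sorted(verts), 3):
        sides = [frozenset((a, b)), frozenset((b, c)), frozenset((a, c))]
        if all(s in idx for s in sides):
            firsts.append(min(idx[s] for s in sides))
    return firsts

def rfes_base_estimate(edges, M):
    T = len(edges)
    S = set()                                   # sampled edge *indices*
    for t in range(1, T + 1):                   # standard reservoir step
        if t <= M:
            S.add(t - 1)
        elif random.random() < M / t:
            S.remove(random.choice(tuple(S)))
            S.add(t - 1)
    mu = sum(f in S for f in first_edge_indices(edges))   # mu^T = |Delta^S|
    return (T / M if T > M else 1.0) * mu                 # phi^T * mu^T

stream = [(1, 2), (2, 3), (1, 3), (3, 4), (2, 4)]         # two triangles
runs = 100_000
print(sum(rfes_base_estimate(stream, M=3) for _ in range(runs)) / runs)  # ~2.0
```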

1.2 A.2: Variance

We now analyze the variance of the estimation returned by RFES-BASE for \(T>M\) (the variance is 0 for \(T\le M\)).

Proof of Theorem 4.2

Proof. Assume \(\vert \Delta ^T \vert >0\); otherwise, the estimation is deterministically correct and has variance 0, and the thesis holds. Let \(\alpha \in \Delta ^T\) be any triangle and let \(\delta ^T_{\alpha }\) be as in the proof of Theorem 4.1. Since \(E[\delta ^T_{\alpha }]=1\) by Theorem 4.1, we have \(Var(\delta ^T_{\alpha })=E[(\delta ^T_{\alpha })^2]-E^2[\delta ^T_{\alpha }]=(\varphi ^T)^2\frac{1}{\varphi ^T}-1=\varphi ^T-1\). From this and the definitions of variance and covariance, we obtain:

$$\begin{aligned} \begin{aligned} Var[\varphi ^T\mu ^T]&=Var[\sum _{\alpha \in \Delta ^T}\delta ^T_{\alpha }]=\sum _{\alpha \in \Delta ^T}\sum _{\beta \in \Delta ^T}Cov[\delta ^T_{\alpha },\delta ^T_{\beta }] \\&=\sum _{\alpha \in \Delta ^T}Var[\delta ^T_{\alpha }]+\sum _{\alpha ,\beta \in \Delta ^T, \alpha \ne \beta }Cov[\delta ^T_{\alpha },\delta ^T_{\beta }]\\&=\vert \Delta ^T\vert (\varphi ^T-1)+\sum _{\alpha ,\beta \in \Delta ^T, \alpha \ne \beta }Cov[\delta ^T_{\alpha },\delta ^T_{\beta }]\\&=\vert \Delta ^T\vert (\varphi ^T-1)+\sum _{\alpha ,\beta \in \Delta ^T, \alpha \ne \beta }(E[\delta ^T_{\alpha }\delta ^T_{\beta }]-E[\delta ^T_{\alpha }]E[\delta ^T_{\beta }]). \end{aligned} \end{aligned}$$
(A6)

Assume now \(\vert \Delta ^T \vert \ge 2\); otherwise, we have \(\varrho ^T=\xi ^T=0\), and the thesis holds as the second term on the r.h.s. of (A6) is 0. Let \(\alpha =(p,q,r)\) and \(\beta =(x,y,z)\) be two distinct triangles in \(\Delta ^T\), where p is the first edge of \(\alpha \), r is the last edge of \(\alpha \), x is the first edge of \(\beta \), and z is the last edge of \(\beta \). If \(\alpha \) and \(\beta \) do not share the first edge p (or x), we have \(\delta ^T_{\alpha }\delta ^T_{\beta }=\varphi ^T\varphi ^T=\varphi ^2_{1,T}\) if the first edges of \(\alpha \) and \(\beta \) are both in S at the end of time T, and \(\delta ^T_{\alpha }\delta ^T_{\beta }=0\) otherwise. From Lemma 4.1, we have that:

$$\begin{aligned} E[\delta ^T_{\alpha }\delta ^T_{\beta }]=\varphi ^2_{1,T}Pr(\delta ^T_{\alpha }\delta ^T_{\beta }=\varphi ^2_{1,T})=\varphi ^2_{1,T}\frac{1}{\varphi _{2,T}}=\varphi _{1,T}\prod _{i=1}^{1}\frac{M-i}{T-i}=\varphi ^T\frac{M-1}{T-1}. \end{aligned}$$
(A7)

If instead \(\alpha \) and \(\beta \) share exactly the first edge p (\(=x\)), we have \(\delta ^T_{\alpha }\delta ^T_{\beta }=\varphi ^2_{1,T}\) if that shared first edge is in S at the end of time T, and \(\delta ^T_{\alpha }\delta ^T_{\beta }=0\) otherwise. From Lemma 4.1, we then have that:

$$\begin{aligned} E[\delta ^T_{\alpha }\delta ^T_{\beta }]=\varphi ^2_{1,T}Pr(\delta ^T_{\alpha }\delta ^T_{\beta }=\varphi ^2_{1,T})=\varphi ^2_{1,T}\frac{1}{\varphi _{1,T}}=\varphi _{1,T}=\varphi ^T. \end{aligned}$$
(A8)

The thesis follows by combining (A6), (A7), (A8), recalling the definitions of \(\varrho ^T\) and \(\xi ^T\), and slightly reorganizing the terms. \(\square \)
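For reference, carrying out that combination explicitly yields the closed form below, under our reading of the main-text definitions: \(\varrho ^T\) denotes the number of ordered pairs of distinct triangles in \(\Delta ^T\) that do not share their first edge, and \(\xi ^T\) the number of ordered pairs that share exactly their first edge. From (A7), each pair of the first kind contributes covariance \(\varphi ^T\frac{M-1}{T-1}-1\), and from (A8), each pair of the second kind contributes \(\varphi ^T-1\), so (A6) becomes:

$$\begin{aligned} Var[\varphi ^T\mu ^T]=\vert \Delta ^T\vert (\varphi ^T-1)+\varrho ^T\left( \varphi ^T\frac{M-1}{T-1}-1\right) +\xi ^T(\varphi ^T-1). \end{aligned}$$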

Appendix B: Theoretical results for RFES-I

1.1 B.1: Expectation

Proof of Theorem 4.3

Proof. If \(T\le M\), all the edges appearing in the graph stream are stored in S, so RFES-IMPR behaves exactly like RFES-BASE, and the statement follows from Theorem 4.1.

Assume now \(T>M\) and \(\vert \Delta ^T\vert >0\); otherwise, the algorithm deterministically returns 0 as an estimation, and the thesis follows. Let \(\lambda \in \Delta ^T\) be any triangle, denote with a, b, and c the edges of \(\lambda \), and assume, w.l.o.g., that they appear in this order (not necessarily consecutively) on the stream. Let \(T_{\lambda }\) be the time step at which c is on the stream. Let \(\delta _{\lambda }\) be a random variable that takes value \(\varphi _{1,T_{\lambda }}\) if the first edge a of \(\lambda \) is in S at the end of time step \(T_{\lambda }\), and 0 otherwise. Since it must be \(T_{\lambda } \ge 1\), from Lemma 4.1 we have that:

$$\begin{aligned} Pr(\delta _{\lambda }=\varphi _{1,T_{\lambda }})=\frac{1}{\varphi _{1,T_{\lambda }}}. \end{aligned}$$
(B9)

When the third edge c of triangle \(\lambda \) arrives, that is, at time \(T_\lambda \), the algorithm uses the two endpoints of c to update the neighbor node lists of the endpoints of the related edges sampled into S before time \(T_\lambda \), then calls the UpdateCounter function, and updates the counter by \(\vert \mathbb {N}^{S\_C}_{u^S\_{list},v^S\_{list}}\vert \varphi _{1, T_\lambda }\). Here \(\vert \mathbb {N}^{S\_C}_{u^S\_{list},v^S\_{list}}\vert \) denotes the number of common-neighbor lists that changed (\(\le 2\)) after the neighbor node lists of the related edges were updated with the two endpoints of c: a new edge c in the graph stream can add at most two triangles at a time. Once the common-neighbor lists of the two endpoints of an edge already in the sample set S change, the edge arriving at time \(T_\lambda \) forms new triangles with edges in S. With the arrival of edge c, all these triangles have corresponding random variables taking the same value \(\varphi _{1,T_\lambda }\). This means that the random variable \(\mu ^T\) can be expressed as:

$$\begin{aligned} \mu ^T=\sum _{\lambda \in \Delta ^T}\delta _\lambda . \end{aligned}$$
(B10)

From this, linearity of expectation, and (B9), we get:

$$\begin{aligned} E[\mu ^T]=E[\sum _{\lambda \in \Delta ^T}\delta _\lambda ]=\sum _{\lambda \in \Delta ^T}\varphi _{1,T_\lambda }Pr(a\in S)=\sum _{\lambda \in \Delta ^T}\varphi _{1,T_\lambda }\frac{1}{\varphi _{1,T_\lambda }}=\vert \Delta ^T\vert . \end{aligned}$$
(B11)

\(\square \)
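The mechanism this proof relies on can be sketched compactly. The following Python fragment is our simplified reading of RFES-IMPR (identifiers are ours, and the paper's optimized data structures are replaced by plain sets): edges are reservoir-sampled, each sampled edge carries neighbor lists for its two endpoints that are fed by every later stream edge, and when an arriving edge closes a triangle whose first edge is still sampled, the counter is increased by \(\varphi _{1,t}\) immediately:

```python
import random

def rfes_impr_estimate(stream, M):
    S = []            # reservoir of sampled edges
    nbr = {}          # sampled edge -> {endpoint: neighbors seen after sampling}
    counter = 0.0
    for t, (x, y) in enumerate(stream, start=1):
        # reservoir step for the arriving edge first, so that "in S" below
        # means "in S at the end of time step t", as in the analysis
        if t <= M:
            S.append((x, y)); nbr[(x, y)] = {x: set(), y: set()}
        elif random.random() < M / t:
            i = random.randrange(M)
            del nbr[S[i]]
            S[i] = (x, y); nbr[(x, y)] = {x: set(), y: set()}
        phi = max(1.0, t / M)                    # phi_{1,t}
        for (u, v) in S:                         # O(M) scan, for clarity only
            ns = nbr[(u, v)]
            for a, b in ((u, v), (v, u)):
                w = y if x == a else (x if y == a else None)
                if w is None or w in (u, v):
                    continue
                if w in ns[b]:                   # (b, w) arrived earlier, so the
                    counter += phi               # triangle closes: weight phi_{1,t}
                ns[a].add(w)
    return counter

random.seed(7)
stream = [(1, 2), (2, 3), (1, 3), (3, 4), (2, 4), (1, 4)]   # 4 triangles
runs = 50_000
print(sum(rfes_impr_estimate(stream, 4) for _ in range(runs)) / runs)  # ~4.0
```

Because each neighbor list only records edges that arrived after its edge was sampled, every triangle is credited at most once, via its first edge, which matches the definition of \(\delta _\lambda \) above.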

1.2 B.2: Variance

Proof of Lemma 4.2

Proof. Consider first the case where all edges of \(\alpha =(p,q,r)\) appear on the stream before any edge of \(\beta =(x,y,z)\), i.e., \(T_p<T_q<T_r<T_x<T_y<T_z\).

The presence or absence of p in S at the beginning of the time step \(T_\beta \) (i.e., whether \(D_\alpha \) happens or not) has no effect whatsoever on the probability that x is in the sample set S at the beginning of the time step \(T_z\). Hence in this case, \(Pr(D_\beta \vert D_\alpha )=Pr(D_\beta )\).

Consider now the case where the first edge of triangle \(\alpha \) appears in the graph stream before the third edge of triangle \(\beta \). Define now the events:

—A: “at time \(T_\alpha \), the edge x has been sampled and remains in the sample set S”, and

—B: “during the period from \(T_r\) to \(T_z\) (including time points \(T_r\) and \(T_z\)), the incoming edges in the graph stream do not replace and delete x from S; that is, during this time, x is always kept in the sample set S”.

Therefore, we can write \(D_\beta \) as \(D_\beta =A\cap B\).

Hence,

$$\begin{aligned} Pr(D_\beta \vert D_\alpha )=Pr(A\cap B \vert D_\alpha )=Pr(A\vert D_\alpha )Pr(B\vert A\cap D_\alpha ). \end{aligned}$$
(B12)

We now show that \(Pr(A\vert D_\alpha )\le Pr(A)\).

If we assume that \(T_r\le M\), then all the edges that appeared on the stream up until the beginning of \(T_r\) are in S. Therefore, \(Pr(A\vert D_\alpha )\le Pr(A)=1\).

Assume instead that \(T_r>M\). Among the \(\left( {\begin{array}{c}T_r\\ M\end{array}}\right) \) subsets of \(E^{T_r}\) of size M, there are \(\left( {\begin{array}{c}T_r-1\\ M-1\end{array}}\right) \) that contain edge p. At time \(T_r\), the storage of edge p in S affects the storage of edge x in S. Therefore, \(Pr(A\vert D_\alpha )=\frac{\left( {\begin{array}{c}T_r-1-1\\ M-1-1\end{array}}\right) }{\left( {\begin{array}{c}T_r-1\\ M-1\end{array}}\right) }=\frac{M-1}{T_r-1}\). From the initial assumptions and the fact that for any \(j\ge 0\) and any \(n\ge m > j\) it holds that \(\frac{m-j}{n-j}\le \frac{m}{n}\), we then have \(Pr(A\vert D_\alpha )\le Pr(A)\). This implies, from (B12), that:

$$\begin{aligned} Pr(D_\beta \vert D_\alpha )=Pr(A\cap B\vert D_\alpha )=Pr(A\vert D_\alpha )Pr(B\vert A\cap D_\alpha )\le Pr(A)Pr(B\vert A\cap D_\alpha ) \end{aligned}$$
(B13)

Consider now event B. When conditioned on A, event B is actually independent of event \(D_\alpha \). Thus, \(Pr(B\vert A\cap D_\alpha )=Pr(B\vert A)\). Putting this together with (B13), then we can obtain \(Pr(D_\beta \vert D_\alpha )=Pr(A\cap B\vert D_\alpha )=Pr(A\vert D_\alpha )Pr(B\vert A\cap D_\alpha )\le Pr(A)Pr(B\vert A\cap D_\alpha )\le Pr(A)Pr(B\vert A)\le Pr(A\cap B)\le Pr(D_\beta )\), where the last inequality follows from the fact that \(D_\beta =A\cap B\) by definition. \(\square \)

Proof of Theorem 4.4

Proof. Assume \(\vert \Delta ^T\vert >0\); otherwise, the RFES-IMPR estimation is deterministically correct and has variance 0, and the thesis holds. Let \(\alpha \in \Delta ^T \) and let \(\delta _\alpha \) be a random variable that takes value \(\varphi _{1, T_r}\) if the first edge of \(\alpha \) is in S at the end of the time step \(T_r\), and 0 otherwise. From the variance formula and its properties, we get \(Var[\delta _\alpha ]=E[(\delta _\alpha )^2]-E^2[\delta _\alpha ]=\varphi _{1, T_r}-1\le \varphi _{1, T}-1\). Therefore, the variance of the RFES-IMPR algorithm can be bounded as:

$$\begin{aligned} \begin{aligned} Var[\mu ^T]&=Var[\sum _{\alpha \in \Delta ^T}\delta _\alpha ]=\sum _{\alpha \in \Delta ^T}\sum _{\beta \in \Delta ^T}Cov[\delta _\alpha ,\delta _\beta ]\\&=\sum _{\alpha \in \Delta ^T}Var[\delta _\alpha ]+\sum _{\alpha ,\beta \in \Delta ^T, \alpha \ne \beta }Cov[\delta _\alpha ,\delta _\beta ] \\&\le \vert \Delta ^T\vert (\varphi _{1,T}-1)+\sum _{\alpha ,\beta \in \Delta ^T,\alpha \ne \beta }(E[\delta _\alpha \delta _\beta ]-E[\delta _\alpha ]E[\delta _\beta ])\\&=\vert \Delta ^T\vert (\varphi _{1,T}-1)+\sum _{\alpha ,\beta \in \Delta ^T, \alpha \ne \beta }(E[\delta _\alpha \delta _\beta ]-1) \end{aligned} \end{aligned}$$
(B14)

For any triangle \(\alpha \in \Delta ^T\), define \(q_\alpha =\varphi _{1,T_r}\). Assume now \(\vert \Delta ^T\vert \ge 2\); otherwise, we have \(\varrho ^T=\eta ^T=0\), and the thesis holds as the second term on the r.h.s. of (B14) is 0. Let now \(\alpha \) and \(\beta \) be two distinct triangles in \(\Delta ^T\) (since there may be shared edges between these two triangles, \(T\ge 1\)). When calculating \(E[\delta _\alpha \delta _\beta ]=q_\alpha q_\beta Pr(\delta _\alpha \delta _\beta =q_\alpha q_\beta )\), it is necessary that both first edges p and x are sampled and stored in S, while the remaining four edges of the two triangles update the neighbor node lists of the endpoints of those first edges. The event “\(\delta _\alpha \delta _\beta =q_\alpha q_\beta \)” is the intersection \(D_\alpha \cap D_\beta \), where \(D_\alpha \) is the event that the first edge of \(\alpha \) is in S at the end of the time step \(T_\alpha \), and similarly for \(D_\beta \). Next, we need to calculate \(Pr(D_\alpha \cap D_\beta )\) in the various possible cases.

(1) If \(\alpha \) and \(\beta \) share an edge l that occupies the same position in both (i.e., l is the first edge of both or the third edge of both), we analyze each possibility in turn.

\(\textcircled {1}\) \(\alpha \) and \(\beta \) share the first edge. We suppose \(T_\alpha < T_\beta \) and let event \(E_1\) be “at time \(T_\alpha \), the shared edge l is in S”, and event \(E_2\) be “in the time period from \(T_\alpha \) to \(T_\beta \), the edges replaced and deleted from S do not include l”. Then, \(Pr(D_\beta \vert D_\alpha )=Pr(E_1\vert D_\alpha )Pr(E_2\vert E_1 \cap D_\alpha )=Pr(E_1)Pr(E_2\vert E_1)\). From Lemma 4.1, we then have \(Pr(E_2\vert E_1)=\prod _{j=max\{T_\alpha ,M\}}^{T_\beta }((1-\frac{M}{j})+\frac{M}{j}\frac{M-1}{M})=\prod _{j=max\{T_\alpha ,M\}}^{T_\beta }\frac{j-1}{j}=\frac{max\{T_\alpha -1,M-1\}}{max\{M-1,T_\beta \}}\le \frac{1}{q_\beta }\frac{max\{M,T_\alpha -1\}}{M}\). Thus, \(Pr(D_\beta \vert D_\alpha )=Pr(E_1)Pr(E_2\vert E_1)\le \frac{1}{q_\alpha }\frac{1}{q_\beta }\frac{max\{M,T_\alpha -1\}}{M}\). From the conditional probability formula and Lemma 4.2, we have \(Pr(D_\beta \cap D_\alpha )=Pr(D_\beta \vert D_\alpha )Pr(D_\alpha )\le \frac{1}{q_\alpha }\frac{1}{q_\beta }\frac{max\{M,T_\alpha -1\}}{M}\frac{1}{q_\alpha }\le \frac{1}{q^2_\alpha q_\beta }\frac{max\{M, T_\alpha -1\}}{M}\).

\(\textcircled {2}\) \(\alpha \) and \(\beta \) share the third edge. Let \(T_l\) be the time when l appears in the graph stream, then \(T_l=T_\alpha =T_\beta \). The event “\(D_\alpha \cap D_\beta \)” occurs if and only if both edges r and z are in S at the end of time \(T_l\). From Lemma 4.1, we have \(Pr(D_\beta \cap D_\alpha )=\frac{1}{\varphi _{2,T_l}}=\frac{1}{\varphi _{2,T_\alpha }}=\frac{1}{\varphi _{1,T_\alpha }}\frac{\varphi _{1,T_\alpha }}{\varphi _{2,T_\alpha }}\le \frac{1}{q_\alpha }\frac{\varphi _{1,T_\alpha }}{\varphi _{2,T_\alpha }}\le \frac{1}{q_\alpha }\frac{M-1}{T_\alpha -1}\le \frac{1}{q_\alpha }\frac{M-1}{T_\beta -1}\le \frac{1}{q_\alpha }\frac{M}{T_\beta }\le \frac{1}{q_\alpha }\frac{1}{q_\beta }\le \frac{1}{q_\alpha q_\beta }\).

(2) If \(\alpha \) and \(\beta \) do not share an edge occupying the same position (neither the first edge of both nor the third edge of both), we can discuss the following cases.

\(\textcircled {1}\) \(\alpha \) and \(\beta \) have a shared edge l, and l is the third edge of \(\alpha \) and the first edge of \(\beta \). Under this condition, both events \(D_\alpha \) and \(D_\beta \) occur independently of each other. Thus, \(Pr(D_\alpha \cap D_\beta )=Pr(D_\alpha )Pr(D_\beta )=\frac{1}{q_\alpha }\frac{1}{q_\beta }=\frac{1}{q_\alpha q_\beta }\).

\(\textcircled {2}\) \(\alpha \) and \(\beta \) do not share any edges and \(T_\alpha < T_x\). Both events \(D_\alpha \) and \(D_\beta \) occur independently of each other. Thus, \(Pr(D_\alpha \cap D_\beta )=Pr(D_\alpha )Pr(D_\beta )=\frac{1}{q_\alpha }\frac{1}{q_\beta }=\frac{1}{q_\alpha q_\beta }\).

\(\textcircled {3}\) \(\alpha \) and \(\beta \) may share the second edge, and \(T_p<T_x<T_\alpha <T_\beta \). Let event \(E_3\) be “at time \(T_\alpha \), edges p and x are in S”, and event \(E_4\) be “in the time period from \(T_\alpha \) to \(T_\beta \), the edges replaced and deleted from S do not include x”. Then, \(Pr(D_\beta \vert D_\alpha )=Pr(E_3\vert D_\alpha )Pr(E_4\vert E_3\cap D_\alpha )\). If \(T_\alpha \le M\), edges p and x must be in the sample set S. Consider instead the case \(T_\alpha >M\), where the event \(D_\alpha \) holds, meaning that p must be in S at that moment. In this case, all subsets of \(E^{T_\alpha }\) of size M containing edge p have an equal probability of being S, from Lemma A.1. There are \(\left( {\begin{array}{c}T_\alpha -1\\ M-1\end{array}}\right) \) such sets. Among these, there are \(\left( {\begin{array}{c}T_\alpha -2\\ M-2\end{array}}\right) \) sets that also contain x. Therefore, if \(T_\alpha >M\), we have \(Pr(E_3\vert D_\alpha )=\frac{\left( {\begin{array}{c}T_\alpha -2\\ M-2\end{array}}\right) }{\left( {\begin{array}{c}T_\alpha -1\\ M-1\end{array}}\right) }=\frac{M-1}{T_\alpha -1}\). Considering what we said before for the case \(T_\alpha \le M\), we then have \(Pr(E_3\vert D_\alpha )=min\{1,\frac{M-1}{T_\alpha -1}\}\). We also have \(Pr(E_4\vert E_3\cap D_\alpha )=\prod _{j=max\{T_\alpha ,M\}}^{T_\beta }((1-\frac{M}{j})+\frac{M}{j}\frac{M-1}{M})=\prod _{j=max\{T_\alpha ,M\}}^{T_\beta }\frac{j-1}{j}=\frac{max\{T_\alpha -1,M-1\}}{max\{M-1,T_\beta \}}\). Therefore, \(Pr(D_\beta \vert D_\alpha )=Pr(E_3\vert D_\alpha )Pr(E_4\vert E_3\cap D_\alpha )=min\{1,\frac{M-1}{T_\alpha -1}\}\frac{max\{T_\alpha -1,M-1\}}{max\{M-1,T_\beta \}}\). With a case analysis, one can show that \(Pr(D_\alpha \cap D_\beta )\le \frac{1}{q_\alpha q_\beta }\frac{max\{M,T_\alpha -1\}}{M} min\{1,\frac{M-1}{T_\alpha -1}\}\).

\(\textcircled {4}\) \(\alpha \) and \(\beta \) may share the second edge, and \(T_p<T_x<T_\beta <T_\alpha \). Let event \(E_5\) be “at time \(T_\beta \), edges p and x are in S”, and event \(E_6\) be “in the time period from \(T_\beta \) to \(T_\alpha \), the edges replaced and deleted from S do not include p”. Then, \(Pr(D_\alpha \vert D_\beta )=Pr(E_5\vert D_\beta )Pr(E_6\vert E_5\cap D_\beta )\). If \(T_\beta \le M\), edges p and x must be in the sample set S. Consider instead the case \(T_\beta >M\), where the event \(D_\beta \) holds, meaning that x must be in S at that moment. In this case, all subsets of \(E^{T_\beta }\) of size M containing edge x have an equal probability of being S, from Lemma A.1. There are \(\left( {\begin{array}{c}T_\beta -1\\ M-1\end{array}}\right) \) such sets. Among these, there are \(\left( {\begin{array}{c}T_\beta -2\\ M-2\end{array}}\right) \) sets that also contain p. Therefore, if \(T_\beta >M\), we have \(Pr(E_5\vert D_\beta )=\frac{\left( {\begin{array}{c}T_\beta -2\\ M-2\end{array}}\right) }{\left( {\begin{array}{c}T_\beta -1\\ M-1\end{array}}\right) }=\frac{M-1}{T_\beta -1}\). Considering what we said before for the case \(T_\beta \le M\), we then have \(Pr(E_5\vert D_\beta )=min\{1,\frac{M-1}{T_\beta -1}\}\). We also have \(Pr(E_6\vert E_5\cap D_\beta )=\prod _{j=max\{T_\beta ,M\}}^{T_\alpha }((1-\frac{M}{j})+\frac{M}{j}\frac{M-1}{M})=\prod _{j=max\{T_\beta ,M\}}^{T_\alpha }\frac{j-1}{j}=\frac{max\{T_\beta -1,M-1\}}{max\{M-1,T_\alpha \}}\). Therefore, \(Pr(D_\alpha \vert D_\beta )=Pr(E_5\vert D_\beta )Pr(E_6\vert E_5\cap D_\beta )=min\{1,\frac{M-1}{T_\beta -1}\}\frac{max\{T_\beta -1,M-1\}}{max\{M-1,T_\alpha \}}\). With a case analysis, one can show that \(Pr(D_\alpha \cap D_\beta )\le \frac{1}{q_\alpha q_\beta }\frac{max\{M,T_\beta -1\}}{M} min\{1,\frac{M-1}{T_\beta -1}\}\).

To recap, when considering two distinct triangles \(\alpha \) and \(\beta \), we have the several cases discussed above. Therefore, we can finally express the upper bound on the variance of the RFES-IMPR algorithm as:

$$\begin{aligned} \begin{aligned} Var[\mu ^T]&\le \vert \Delta ^T \vert (\varphi _{1,T}-1)+\varrho ^T\left( \frac{1}{q_\alpha }\frac{max\{M,T_\alpha -1\}}{M}-1\right) \\&\quad +\eta ^T max\left\{ \frac{max\{M,T_\alpha -1\}}{M}\, min\left\{ 1,\frac{M-1}{T_\alpha -1}\right\} -1,\ \frac{max\{M,T_\beta -1\}}{M}\, min\left\{ 1,\frac{M-1}{T_\beta -1}\right\} -1\right\} . \end{aligned} \end{aligned}$$

\(\square \)

Appendix C: Theoretical results for RFES-FD

1.1 C.1: Expectation

Before proving Theorem 4.5, we need the following lemmas. The first restates the conclusion of Lemma A.4 in [17].

Lemma C.1

[17] For any \(T>0\) and any \(j(0\le j\le s^T)\), let \(\mathcal {B}^T\) be the collection of subsets of \(E^T\) of size j. For any \(B \in \mathcal {B}^T\), it holds that

$$\begin{aligned} Pr(S=B\vert M^T=j)=\frac{1}{\left( {\begin{array}{c}\vert E^T\vert \\ j\end{array}}\right) }. \end{aligned}$$

That is, conditioned on its size at the end of time step T, S is equally likely to be, at the end of time step T, any of the subsets of \(E^T\) of that size.

Lemma C.2

Recall the definition of \(\kappa ^T\) from (8) in the text. We have

$$\begin{aligned} \kappa ^T=Pr(M^T \ge 1). \end{aligned}$$

Our algorithm only needs to ensure that the first edge of any triangle appears in the sample set S. Therefore, at time T, the minimum required size of the sample set S is 1, which also differs from the TRIÈST algorithm.

The next lemma follows from Lemma C.1 in the same way as Lemma 4.1 follows from Lemma A.1.

Lemma C.3

For any time step \(T(T\ge 0)\) and any \(j(0\le j \le s^T)\), let B be any subset of \(E^T\) of size \(\vert B \vert =k\le s^T\). Then, at the end of time step T,

$$\begin{aligned} Pr(B\subseteq S\vert M^T=j)=\left\{ \begin{aligned} 0,&\ if \ k>j \\ \frac{1}{\psi _{k,j,s^T}},&\ if\ k\le j \\ \end{aligned} \right. \end{aligned}$$

The next two lemmas discuss properties of RFES-FD for \(T<T^*\), where \(T^*\) is defined as above: the first time step at which \(\vert E^T\vert \) reaches size \(M+1\) (so \(T^*\ge M+1\)). Lemma C.4 restates the conclusion of Lemma A.7 in [17], which has been proven there.

Lemma C.4

[17] For all \(T<T^*\), we have:

(1) \(n^T_g=0\); and

(2) \(S=E^T\); and

(3) \(M^T=s^T\).

Proof

The third conclusion of Lemma C.4 follows from the first two. Therefore, we concentrate on proving the first two conclusions, from which the third can be inferred.

The proof is by induction on T. In the base case \(T=1\): the edge on the stream must be an insertion, and the algorithm deterministically inserts the edge in S. Assume now that the claim is true for all time steps up to (but excluding) some \(T\le T^*-1\). We now show that it is also true for T.

Assume the edge on the stream at time T is a deletion. The corresponding edge must be in S, from the inductive hypothesis. Hence RFES-FD removes it from S and increments the counter \(n_b\) by one. Thus it is still true that \(n^T_g=0\) and \(S=E^T\), and the thesis holds.

Assume now that the element on the stream at time T is an insertion. From the inductive hypothesis, we have that the current value of the counter \(n_g\) is zero.

If the counter \(n_b\) currently has value zero as well, then, because of the hypothesis that \(T<T^*\), it must be that \(\vert S\vert =M^{(T-1)}=s^{(T-1)}<M\). Therefore, RFES-FD always inserts the edge in S. Thus, it is still true that \(n^T_g=0\) and \(S=E^T\), and the thesis holds.

If otherwise \(n_b > 0\), then RFES-FD flips a biased coin with a probability of heads equal to

$$\begin{aligned} \frac{n_b}{n_b+n_g}=\frac{n_b}{n_b}=1. \end{aligned}$$

Therefore, RFES-FD always inserts the edge in S and decrements \(n_b\) by one. Thus, it is still true that \(n^T_g=0\) and \(S=E^T\), and the thesis holds. \(\square \)
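The case analysis in this induction mirrors the random-pairing scheme of Gemulla et al. [29], on which the sample maintenance of RFES-FD is built. Below is a minimal sketch of that maintenance step (our own condensation; the triangle-counting bookkeeping of RFES-FD is omitted):

```python
import random

def rp_step(S, M, c, op, edge):
    """One random-pairing update [29]: c = {'n_b': ..., 'n_g': ..., 's': ...},
    where n_b / n_g count unpaired deletions of sampled / unsampled edges
    and s is the current |E^t|."""
    if op == '-':                                  # deletion
        c['s'] -= 1
        if edge in S:
            S.discard(edge)
            c['n_b'] += 1                          # "bad": the edge was sampled
        else:
            c['n_g'] += 1                          # "good": it was not
        return
    c['s'] += 1                                    # insertion
    d = c['n_b'] + c['n_g']
    if d > 0:                                      # pair it with a past deletion:
        if random.random() < c['n_b'] / d:         # heads prob n_b / (n_b + n_g)
            S.add(edge); c['n_b'] -= 1
        else:
            c['n_g'] -= 1
    elif len(S) < M:                               # no unpaired deletions:
        S.add(edge)                                # plain reservoir behaviour
    elif random.random() < M / c['s']:
        S.discard(random.choice(tuple(S)))         # evict a uniform sampled edge
        S.add(edge)
```

On a stream with \(T<T^*\), this code never reaches the eviction branch and keeps \(n_g=0\), which is exactly the invariant the induction above establishes.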

The following Lemma C.5 is an immediate consequence of Lemma C.2 and Lemma C.4.

Lemma C.5

For all \(T<T^*\) such that \(s^T\ge 1\), we have \(\kappa ^T=1\).

That is to say, as long as the first edge of a triangle is in the sample set S, the neighbor node lists corresponding to the vertices of that first edge can be continuously updated with the subsequent edges in the graph stream.

Based on the five lemmas presented above, we can now give a detailed proof of the expectation of the RFES-FD algorithm.

Proof of Theorem 4.5

Proof. Assume for now that \(T<T^*\). From Lemma C.4, we have that \(M^T=s^T\). If \(M^T<1\), then it must be \(s^T<1\), hence \(\vert \Delta ^T\vert =0\) and indeed the algorithm returns \(\rho ^T=0\) in this case. If instead \(M^T=s^T\ge 1\), then we have

$$\begin{aligned} \rho ^T=\frac{\mu ^T}{\kappa ^T}. \end{aligned}$$

From Lemma C.5, we have that \(\kappa ^T=1\) for all \(T<T^*\), hence \(\rho ^T=\frac{\mu ^T}{\kappa ^T}=\mu ^T=\vert \Delta ^S\vert =\vert \Delta ^T\vert \) in these cases. The equality \(\vert \Delta ^S\vert =\vert \Delta ^T\vert \) follows from conclusion (2) of Lemma C.4. As described in this paper, \(\rho ^T=\mu ^T=\vert \Delta ^T\vert \) is true for all \(T\le T^*\).

Assume now that \(T\ge T^*\). Using the law of total expectation, we can write

$$\begin{aligned} E[\rho ^T]=\sum _{j=0}^{min\{s^T,M\}}E[\rho ^T\vert M^T=j]Pr(M^T=j). \end{aligned}$$
(C15)

Assume that \(\vert \Delta ^T\vert >0\); otherwise, the algorithm deterministically returns 0 as an estimation and the thesis follows. Let \(\lambda \in \Delta ^T\), and let \(\delta ^T_\lambda \) be a random variable that takes value

$$\begin{aligned} \frac{\psi _{1,M^T,s^T}}{\kappa ^T}=\frac{s^T}{M^T}\frac{1}{\kappa ^T}, \end{aligned}$$

if the first edge of \(\lambda \) is in S at the end of the time instant T, and 0 otherwise. Thus, we can write

$$\begin{aligned} \rho ^T=\sum _{\lambda \in \Delta ^T}\delta ^T_\lambda . \end{aligned}$$

Then, using Lemma C.3, we have, for \(1\le j\le min\{s^T, M\}\),

$$\begin{aligned} \begin{aligned} E[\rho ^T\vert M^T=j]&=\sum _{\lambda \in \Delta ^T}\frac{\psi _{1,j,s^T}}{\kappa ^T}Pr(\delta ^T_\lambda =\frac{\psi _{1,j,s^T}}{\kappa ^T}\vert M^T=j)\\&=\vert \Delta ^T\vert \frac{\psi _{1,j,s^T}}{\kappa ^T}\frac{1}{\psi _{1,j,s^T}}=\frac{1}{\kappa ^T}\vert \Delta ^T\vert . \end{aligned} \end{aligned}$$
(C16)

and

$$\begin{aligned} E[\rho ^T\vert M^T=j]=0,\ if\ 0\le j<1. \end{aligned}$$
(C17)

Plugging this into (C15), we can finally have,

$$\begin{aligned} \begin{aligned} E[\rho ^T]&=\sum _{j=0}^{min\{s^T,M\}}E[\rho ^T\vert M^T=j]Pr(M^T=j)\\&=\dfrac{1}{\kappa ^T}\vert \Delta ^T\vert \sum _{j=1}^{min\{s^T,M\}}Pr(M^T=j)=\dfrac{1}{\kappa ^T}\vert \Delta ^T\vert Pr(M^T\ge 1)=\vert \Delta ^T\vert , \end{aligned} \end{aligned}$$

where the last equality uses \(\kappa ^T=Pr(M^T\ge 1)\) from Lemma C.2.

This proves the conclusion of Theorem 4.5: the RFES-FD algorithm, like the previous two algorithms, is an unbiased estimator of the number of triangles in the graph stream. \(\square \)

1.2 C.2: Variance

As in the expectation analysis of RFES-FD, we first provide some lemmas needed for the proof of this algorithm's variance.

Lemma C.6

For any time \(T\ge T^*\), and any \(j(1\le j \le min\{s^T,M\})\), we have

$$\begin{aligned} \begin{aligned} Var[\rho ^T\vert M^T=j]&=(\kappa ^T)^{-2} Var[\mu ^T\vert M^T=j]\\&=(\kappa ^T)^{-2}\left( \vert \Delta ^T\vert (\psi _{1,j,s^T}-1)+\zeta ^T \left( \dfrac{\psi ^2_{1,j,s^T}}{\psi _{1,j,s^T}}-1\right) +\omega ^T\left( \dfrac{\psi ^2_{1,j,s^T}}{\psi _{2,j,s^T}}-1\right) \right) \\&=(\kappa ^T)^{-2}\left( \vert \Delta ^T\vert (\psi _{1,j,s^T}-1)+\zeta ^T(\psi _{1,j,s^T}-1)+\omega ^T\left( \dfrac{\psi _{1,j,s^T}}{\psi _{2,j,s^T}}-1\right) \right) . \end{aligned} \end{aligned}$$
(C18)

Proof

The proof is analogous to that of Theorem 4.2, using j in place of M, \(s^T\) in place of T, \(\psi _{a, M^T,s^T}\) in place of \(\varphi _{a, T}\), and using Lemma C.3 instead of Lemma 4.1. The additional \((\kappa ^T)^{-2}\) multiplicative term comes from the \((\kappa ^T)^{-1}\) term used in the definition of \(\rho ^T\).

Lemma C.7

For any time \(T\ge T^*\) and any \(j(1<j\le min\{s^T,M\})\), if \(s^T\ge M\), we have

$$\begin{aligned} \begin{aligned} Var[\rho ^T\vert M^T=i]\le (\kappa ^T)^{-2}(\vert \Delta ^T\vert (\psi _{1,j,s^T}-1)+\zeta ^T(\psi ^2_{1,j,s^T}\psi ^{-1}_{1,j,s^T}-1)),\ for \ i\ge j;\\ Var[\rho ^T\vert M^T=i]\le (\kappa ^T)^{-2}(\vert \Delta ^T\vert (\psi _{1,1,s^T}-1)+\zeta ^T(\psi ^2_{1,2,s^T}\psi ^{-1}_{1,2,s^T}-1)),\ for \ i < j. \end{aligned} \end{aligned}$$

Proof

The proof follows by observing that the term \(\omega ^T(\dfrac{\psi _{1,j,s^T}}{\psi _{2,j,s^T}}-1)\) is non-positive, and that (C18) is a non-increasing function of the sample size.

The following lemma, which restates the conclusion of Lemma A.11 in [17], deals with properties of the random variable \(M^T\).

Lemma C.8

[17, 31] Let \(T>T^*\), with \(s^T\ge M\). Let \(d^T=n^T_b+n^T_g\) denote the total number of unpaired deletions at time T. The sample size \(M^T\) follows the hypergeometric distribution:

$$\begin{aligned} Pr(M^T=j)=\left\{ \begin{aligned} \dfrac{\left( {\begin{array}{c}s^T\\ j\end{array}}\right) \left( {\begin{array}{c}d^T\\ M-j\end{array}}\right) }{\left( {\begin{array}{c}s^T+d^T\\ M\end{array}}\right) },&\ for \ max\{M-d^T,0\} \le j \le M\\ 0,&\ otherwise\\ \end{aligned} \right. \end{aligned}$$
(C19)

We have

$$\begin{aligned} E[M^T]=\dfrac{Ms^T}{s^T+d^T}, \end{aligned}$$
(C20)

and for any \(0<c<1\),

$$\begin{aligned} Pr(M^T>E[M^T]-cM)\ge 1-\dfrac{1}{e^{2c^2M}}. \end{aligned}$$
(C21)

Proof

Since \(T>T^*\), from the definition of \(T^*\), we have that \(M^T\) has reached size M at least once (at \(T^*\)). From this and the definition of \(d^T\) (the number of uncompensated deletions), we have that \(M^T\) cannot be less than \(M-d^T\). The rest of the proof for (C19) and (C20) follows the proof of Lemma A.11 in [17], and the concentration bound in (C21) follows from the properties of the hypergeometric distribution. \(\square \)
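As a quick numerical illustration of (C19) and (C20), the sample-size law can be checked against a library hypergeometric distribution; the values of M, \(s^T\), \(d^T\) below are arbitrary:

```python
from scipy.stats import hypergeom

M, s, d = 50, 400, 100          # capacity M, |E^T| = s^T, unpaired deletions d^T
rv = hypergeom(s + d, s, M)     # population s+d, successes s, draws M, per (C19)
print(rv.pmf(40))                  # Pr(M^T = 40)
print(rv.mean(), M * s / (s + d))  # both equal E[M^T] = M s^T/(s^T + d^T) = 40.0
```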

The following corollary is a consequence of Lemma A.11 in [17].

Corollary C.1

[17] Consider the execution of RFES-FD at time \(T>T^*\). Suppose we have \(d^T\le \alpha s^T\) \((0\le \alpha <1)\), with \(s^T\ge M\). If \(M\ge \frac{1}{2\sqrt{\alpha '-\alpha }}c'\ln s^T\) for \(\alpha<\alpha '<1\), we have:

$$\begin{aligned} Pr(M^T\ge M(1-\alpha '))>1-\frac{1}{(s^T)^{c'}} \end{aligned}$$

Proof of Theorem 4.6

Proof. From the law of total variance, we have:

$$\begin{aligned} \begin{aligned} Var[\rho ^T]&=\sum _{j=0}^{M}Var[\rho ^T\vert M^T=j] Pr(M^T=j)\\&\quad +\sum _{j=0}^{M}E[\rho ^T\vert M^T=j]^2(1-Pr(M^T=j)) Pr(M^T=j)\\&\quad -2\sum _{j=1}^{M}\sum _{i=0}^{j-1}E[\rho ^T\vert M^T=j]Pr(M^T=j)E[\rho ^T\vert M^T=i]Pr(M^T=i). \end{aligned} \end{aligned}$$

As shown in (C16) and (C17), for any \(j=0,1,...,M\) we have \(E[\rho ^T\vert M^T=j]\ge 0\). This, in turn, implies:

$$\begin{aligned} \begin{aligned} Var[\rho ^T]&\le \sum _{j=0}^{M}Var[\rho ^T\vert M^T=j] Pr(M^T=j)\\&\quad +\sum _{j=0}^{M}E[\rho ^T\vert M^T=j]^2(1-Pr(M^T=j)) Pr(M^T=j). \end{aligned} \end{aligned}$$
(C22)

Let us consider separately the two main components of (C22). From Lemma C.7, we have:

$$\begin{aligned}{} & {} \sum _{j=0}^{M}Var[\rho ^T\vert M^T=j]Pr(M^T=j) \end{aligned}$$
(C23)
$$\begin{aligned}{} & {} \quad =\sum _{j>M(1-\alpha ')}^{M}Var[\rho ^T\vert M^T=j]Pr(M^T=j)+\sum _{j=0}^{M(1-\alpha ')}Var[\rho ^T\vert M^T=j]Pr(M^T=j)\nonumber \\{} & {} \quad \le (\kappa ^T)^{-2}\left( \vert \Delta ^T\vert (\psi _{1,M(1-\alpha '),s^T}-1)+\zeta ^T(\psi ^2_{1,M(1-\alpha '),s^T}\psi ^{-1}_{2,M(1-\alpha '),s^T}-1)\right) Pr(M^T>M(1-\alpha '))\nonumber \\{} & {} \qquad +(\kappa ^T)^{-2}(\vert \Delta ^T\vert s^T+2\zeta ^T)Pr(M^T\le M(1-\alpha ')). \end{aligned}$$
(C24)

According to our hypothesis \(M\ge \dfrac{1}{2\sqrt{\alpha '-\alpha }}2\ln s^T\) (i.e., \(c'=2\)), we thus have, from Corollary C.1:

$$\begin{aligned} Pr(M^T\le M(1-\alpha '))\le \dfrac{1}{(s^T)^2}. \end{aligned}$$

As \(\vert \Delta ^T\vert \le s^T\) and \(\zeta ^T\le \vert \Delta ^T\vert ^2\), we have:

$$\begin{aligned} (\kappa ^T)^{-2}(\vert \Delta ^T\vert s^T+2\zeta ^T)Pr(M^T\le M(1-\alpha '))\le 3(\kappa ^T)^{-2}. \end{aligned}$$

We can therefore rewrite (C24) as:

$$\begin{aligned} \begin{aligned}&\sum _{j=0}^{M}Var[\rho ^T\vert M^T=j]Pr(M^T=j)\le (\kappa ^T)^{-2}(\vert \Delta ^T\vert (\psi _{1,M(1-\alpha '),s^T}-1))\\&\quad +(\kappa ^T)^{-2}(\zeta ^T(\psi ^2_{1,M(1-\alpha '),s^T}\psi ^{-1}_{2,M(1-\alpha '),s^T}-1)+3). \end{aligned} \end{aligned}$$
(C25)

Now, let us consider the term \(\sum _{j=0}^{M}E[\rho ^T\vert M^T=j]^2(1-Pr(M^T=j))Pr(M^T=j)\). Recall from (C16) and (C17) that \(E[\rho ^T\vert M^T=j]=\vert \Delta ^T\vert (\kappa ^T)^{-1}\) for \(j=1,2,...,M\) and \(E[\rho ^T\vert M^T=j]=0\) for \(j=0\). From Corollary C.1, we have that for \(j\le (1-\alpha ')M\) and \(M\ge \frac{1}{2\sqrt{\alpha '-\alpha }}2\ln s^T\),

$$\begin{aligned} Pr(M^T=j)\le Pr(M^T\le M(1-\alpha '))\le \frac{1}{(s^T)^2}, \end{aligned}$$

and thus,

$$\begin{aligned} \begin{aligned} \sum _{j=0}^{M(1-\alpha ')}E[\rho ^T\vert M^T=j]^2(1-Pr(M^T=j))Pr(M^T=j)&\le \frac{(1-\alpha ')M\vert \Delta ^T\vert (\kappa ^T)^{-2}}{(s^T)^2}\\&\le (1-\alpha ')(\kappa ^T)^{-2}. \end{aligned} \end{aligned}$$
(C26)

Let us now consider the values \(j>M(1-\alpha ')\), we have:

$$\begin{aligned} \begin{aligned}&\sum _{j>(1-\alpha ')M}^{M}E[\rho ^T\vert M^T=j]^2(1-Pr(M^T=j))Pr(M^T=j)\\&\quad \le \alpha ' M\vert \Delta ^T\vert (\kappa ^T)^{-2}(1-\sum _{j>(1-\alpha ')M}^{M}Pr(M^T=j))\\&\quad \le \alpha ' M\vert \Delta ^T\vert (\kappa ^T)^{-2}(1-Pr(M^T>(1-\alpha ')M))\\&\quad \le \dfrac{\alpha ' M\vert \Delta ^T\vert (\kappa ^T)^{-2}}{(s^T)^2}\le \alpha ' (\kappa ^T)^{-2}, \end{aligned} \end{aligned}$$
(C27)

where the last steps of (C26) and (C27) follow since, by hypothesis, \(M\le s^T\).

The theorem follows from composing the upper bounds obtained in (C25), (C26), and (C27) according to (C22). \(\square \)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yu, C., Liu, H., Wahab, F. et al. Global triangle estimation based on first edge sampling in large graph streams. J Supercomput 79, 14079–14116 (2023). https://doi.org/10.1007/s11227-023-05205-3

