Streaming Algorithm: Filtering & Counting Distinct Elements
CompSci 590.02, Instructor: Ashwin Machanavajjhala
Can’t hope to process a query on the entire data, but only on a small working set.
• Continuous Monitoring
– 6 million surveillance cameras in London
– Video feeds from these cameras must be processed in real time
• Weather monitoring
• …
• Window Queries
– Queries over a recent k size window of the stream
– Types of queries: alert if there is a burst of traffic in the last 1 minute, denial of service identification, alert if stock price > 100, etc.
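A window query such as "alert on a burst of traffic in the last minute" can be sketched as follows. This is a minimal illustration, not something taken from the slides: the class name `BurstDetector`, the `threshold` parameter, and the assumption that timestamps arrive in non-decreasing order are all hypothetical choices.

```python
from collections import deque


class BurstDetector:
    """Alert when more than `threshold` events fall in the last `window_seconds`."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.timestamps = deque()  # arrival times currently inside the window

    def observe(self, t):
        """Record an event at time t (seconds, non-decreasing); return True on a burst."""
        self.timestamps.append(t)
        # Evict events that have fallen out of the window (t - window_seconds, t].
        while self.timestamps and self.timestamps[0] <= t - self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps) > self.threshold


detector = BurstDetector(window_seconds=60, threshold=3)
alerts = [detector.observe(t) for t in (0, 10, 20, 30, 100)]
print(alerts)
```

Only the events inside the current window are kept, so the working set is bounded by the window contents rather than by the length of the stream.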
Initialization:
• Set all n bits in the memory to 0.
If a not in S:
• Bloom filter returns TRUE (a false positive) if every bit h_i(a) has been set to 1 by some other element
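The Bloom filter described here (n bits initialized to 0, k hash functions h_i) might be sketched as below. The slides do not specify the hash functions, so deriving them from salted SHA-256 digests is an assumption made for illustration, as are the names `BloomFilter`, `add`, and `_positions`. One byte per bit is used for readability rather than compactness.

```python
import hashlib


class BloomFilter:
    def __init__(self, n, k):
        self.n = n                 # number of bits in memory
        self.k = k                 # number of hash functions
        self.bits = bytearray(n)   # initialization: all n bits set to 0

    def _positions(self, item):
        # Assumed hash family: h_i(a) = SHA-256("i:a") mod n, for i = 0..k-1.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, item):
        # Insertion: set bit h_i(a) to 1 for every hash function.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # Lookup: TRUE only if every bit h_i(a) is 1.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(n=1024, k=3)
bf.add("alice")
bf.add("bob")
print("alice" in bf, "bob" in bf)
```

An inserted element is always reported as present; an element that was never inserted is reported as present only when all of its k bits happen to have been set by other elements.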
Pr[bit j is 1 after m insertions] = 1 – Pr[bit j is 0 after m insertions]
= 1 – Pr[bit j was not set by any of the k × m hash evaluations]
= $1 - (1 - 1/n)^{km}$
Pr[Bloom filter returns TRUE] = $\{1 - (1 - 1/n)^{km}\}^{k} \approx (1 - e^{-km/n})^{k}$
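The approximation in the last step rests on the standard limit $(1 - 1/n)^n \to e^{-1}$ for large n. Written out with the same symbols n, m, and k (and treating the k probed bits as independent, which is itself an approximation):

```latex
\left(1 - \frac{1}{n}\right)^{km}
  = \left[\left(1 - \frac{1}{n}\right)^{n}\right]^{km/n}
  \approx \left(e^{-1}\right)^{km/n}
  = e^{-km/n}
\qquad\Longrightarrow\qquad
\left\{1 - \left(1 - \frac{1}{n}\right)^{km}\right\}^{k}
  \approx \left(1 - e^{-km/n}\right)^{k}
```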
k=1
• Pr[Bloom filter returns TRUE | a not in S] = $1 - e^{-m/n}$
= $1 - e^{-1/8} \approx 0.1175$
k=2
• Pr[Bloom filter returns TRUE | a not in S] = $(1 - e^{-2m/n})^{2}$
= $(1 - e^{-1/4})^{2} \approx 0.0489$
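The two cases can be checked numerically. The sketch below evaluates both the exact expression and the exponential approximation; the concrete sizes m = 1000 and n = 8000 are assumptions chosen only to match the ratio m/n = 1/8 used in these examples.

```python
import math


def false_positive_exact(n, m, k):
    """{1 - (1 - 1/n)^(k*m)}^k for n bits, m inserted elements, k hash functions."""
    return (1.0 - (1.0 - 1.0 / n) ** (k * m)) ** k


def false_positive_approx(n, m, k):
    """(1 - e^(-k*m/n))^k, the large-n approximation of the expression above."""
    return (1.0 - math.exp(-k * m / n)) ** k


m, n = 1000, 8000  # assumed sizes with m/n = 1/8
for k in (1, 2):
    exact = false_positive_exact(n, m, k)
    approx = false_positive_approx(n, m, k)
    print(f"k={k}: exact={exact:.4f} approx={approx:.4f}")
```

Going from one hash function to two reduces the false positive rate at this ratio, because the extra probe outweighs the extra bits set per insertion.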
Exercise: What is the optimal number of hash functions, given m = |S| and n?
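One way to explore the exercise numerically is to scan integer values of k and keep the one with the smallest approximate false positive rate $(1 - e^{-km/n})^k$. The helper names and the cap `k_max` are hypothetical, and m = 1000, n = 8000 are assumed sizes with m/n = 1/8.

```python
import math


def false_positive_rate(n, m, k):
    """Approximate false positive rate (1 - e^(-k*m/n))^k."""
    return (1.0 - math.exp(-k * m / n)) ** k


def best_integer_k(n, m, k_max=32):
    """Integer k in [1, k_max] that minimizes the approximate rate."""
    return min(range(1, k_max + 1), key=lambda k: false_positive_rate(n, m, k))


n, m = 8000, 1000  # assumed sizes with n/m = 8
k_star = best_integer_k(n, m)
print(k_star, false_positive_rate(n, m, k_star))
```

The scan can be compared against the closed form from the standard analysis, k ≈ (n/m) ln 2, which is about 5.5 when n/m = 8.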
OUTPUT
• An estimate of the number of distinct elements in the stream
– Number of distinct users logging in to the website
– Number of distinct URLs browsed by the user
Proof:
• Tail0(h(x)) ≥ j implies at least the last j bits are 0
$\le 2^{k_2}/F \le 1/c$
Theorem: If the FM-sketch returns F’, then for all c > 2, F/c ≤ F’ ≤ cF with probability at least 1 – 2/c.
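A minimal sketch of an FM-style estimator, assuming the usual construction in which the estimate F’ is 2 raised to the largest Tail0(h(x)) seen in the stream. The 64-bit hash built from salted SHA-256, and the names `hash64` and `fm_estimate`, are assumptions for illustration and are not taken from the slides.

```python
import hashlib


def tail0(value, width=64):
    """Number of trailing zero bits of value (width if value is 0)."""
    if value == 0:
        return width
    return (value & -value).bit_length() - 1


def hash64(item, seed=0):
    """Assumed hash function: first 64 bits of SHA-256 over 'seed:item'."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def fm_estimate(stream, seed=0):
    """One pass over the stream, keeping only the maximum tail length."""
    max_tail = 0
    for item in stream:
        max_tail = max(max_tail, tail0(hash64(item, seed)))
    return 2 ** max_tail


stream = ["u1", "u2", "u1", "u3", "u2", "u1"]
print(fm_estimate(stream))
```

Repeated elements hash to the same value, so they never change the maximum; this is why the sketch tracks distinct elements using a single small counter. On an empty stream this sketch returns 1.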
Q: For any δ, what is the value of s s.t. $P[F/c \le F'_{\mathrm{med}} \le cF] > 1 - \delta$?
• Let $X = \sum_i x_i$. Then $E(X) = s\rho$.
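The median estimate over s sketches could be computed in one pass as sketched below. It assumes s independent FM-style sketches obtained by salting a single hash with s different seeds, which is an illustrative stand-in for s independent hash functions; the names `median_fm_estimate`, `hash64`, and `tail0` are hypothetical. The code illustrates the construction only and does not answer how large s must be for a given δ.

```python
import hashlib
import statistics


def tail0(value, width=64):
    """Number of trailing zero bits of value (width if value is 0)."""
    if value == 0:
        return width
    return (value & -value).bit_length() - 1


def hash64(item, seed):
    """Assumed hash family: first 64 bits of SHA-256 over 'seed:item'."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def median_fm_estimate(stream, s):
    """Maintain s sketches in a single pass and return the median estimate."""
    max_tails = [0] * s
    for item in stream:
        for seed in range(s):
            max_tails[seed] = max(max_tails[seed], tail0(hash64(item, seed)))
    # median_low returns an actual sketch value, so the result stays a power of 2.
    return statistics.median_low([2 ** r for r in max_tails])


print(median_fm_estimate(range(1000), s=9))
```

Taking the median rather than the mean keeps a few badly overshooting sketches from dominating the result: the median is off only if at least half of the s estimates are off.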