
Streaming Algorithm: Filtering & Counting Distinct Elements

CompSci 590.02
Instructor: Ashwin Machanavajjhala



Streaming Databases
Continuous/Standing Queries: every time a new data item enters the system, (conceptually) re-evaluate the answer to the query.

We can’t hope to process a query on the entire data, but only on a small working set.



Examples of Streaming Data
• Internet & Web traffic
– Search/browsing history of users: want to predict which ads/content to show the user based on their history. Can’t look at the entire history at runtime.

• Continuous Monitoring
– 6 million surveillance cameras in London
– Video feeds from these cameras must be processed in real time

• Weather monitoring
• …



Processing Streams
• Summarization
– Maintain a small sketch (or summary) of the stream
– Answer queries using the sketch
– E.g., a random sample
– Later in the course: AMS, Count-Min sketch, etc.
– Types of queries: # distinct elements, most frequent elements in the stream, aggregates like sum, min, max, etc.

• Window Queries
– Queries over a window of the k most recent elements of the stream
– Types of queries: alert if there is a burst of traffic in the last 1 minute, denial-of-service identification, alert if stock price > 100, etc.



Streaming Algorithms
• Sampling
– We have already seen this.
• Filtering (this class)
– “… does the incoming email address appear in a set of whitelisted addresses …”
• Counting Distinct Elements (this class)
– “… how many unique users visit cnn.com …”
• Heavy Hitters
– “… news articles contributing to >1% of all traffic …”
• Online Aggregation
– “… Based on seeing 50% of the data the answer is in [25,35] …”


FILTERING



Problem
• A set S containing m values
– A whitelist of a billion non-spam email addresses
• Memory with n bits
– Say 1 GB of memory
• Goal: Construct a data structure that can efficiently check whether a new element is in S
– Returns TRUE with probability 1 when the element is in S
– Returns FALSE with high probability (1 − ε) when the element is not in S



Bloom Filter
• Consider a set of hash functions {h1, h2, …, hk}, hi: S → [1, n]

Initialization:
• Set all n bits in the memory to 0.

Insert a new element ‘a’:
• Compute h1(a), h2(a), …, hk(a). Set the corresponding bits to 1.

Check whether an element ‘a’ is in S:
• Compute h1(a), h2(a), …, hk(a).
If all the bits are 1, return TRUE.
Else, return FALSE.
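
A minimal Python sketch of the insert/check procedure above (my own illustration, not the lecture's implementation). The k hash functions are simulated by salting SHA-256; the class name, n, k, and the example addresses are assumptions made for the example.

import hashlib

class BloomFilter:
    def __init__(self, n_bits, k_hashes):
        self.n = n_bits
        self.k = k_hashes
        self.bits = bytearray(n_bits)   # one byte per bit, for readability only

    def _positions(self, item):
        # Derive k pseudo-independent positions in [0, n) by salting one hash.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1          # set bits h1(item), ..., hk(item)

    def maybe_contains(self, item):
        # TRUE can be a false positive; FALSE is always correct.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(n_bits=8_000, k_hashes=2)
bf.insert("alice@example.com")
print(bf.maybe_contains("alice@example.com"))   # True
print(bf.maybe_contains("spam@example.com"))    # almost certainly False

A real filter would pack the n bits tightly; one byte per bit is used here only to keep the sketch short.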



Analysis
If a is in S:
• Then bits h1(a), h2(a), …, hk(a) are all set to 1.
• Therefore, the Bloom filter returns TRUE with probability 1.

If a is not in S:
• The Bloom filter returns TRUE only if each hi(a) is 1 due to some other element.
Pr[bit j is 1 after m insertions] = 1 − Pr[bit j is 0 after m insertions]
= 1 − Pr[bit j was not set by any of the k × m hash evaluations]
= 1 − (1 − 1/n)^(km)
Pr[Bloom filter returns TRUE] = (1 − (1 − 1/n)^(km))^k ≈ (1 − e^(−km/n))^k



Example
• Suppose there are m = 10^9 emails in the whitelist.
• Suppose a memory size of 1 GB (n = 8 × 10^9 bits).

k = 1
• Pr[Bloom filter returns TRUE | a not in S] = 1 − e^(−m/n)
= 1 − e^(−1/8) ≈ 0.1175
k = 2
• Pr[Bloom filter returns TRUE | a not in S] = (1 − e^(−2m/n))^2
= (1 − e^(−1/4))^2 ≈ 0.049



Example
• Suppose there are m = 10^9 emails in the whitelist.
• Suppose a memory size of 1 GB (n = 8 × 10^9 bits).

[Plot: false positive probability vs. number of hash functions]

Exercise: What is the optimal number of hash functions, given m = |S| and n?
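
As a numeric illustration (my own sketch, not from the slides), the false positive formula from the analysis can be evaluated for the example m and n over a range of k:

import math

m = 10**9        # |S|: emails in the whitelist
n = 8 * 10**9    # bits of memory (1 GB)

def false_positive_rate(k, m, n):
    # Approximate false positive probability (1 - e^(-km/n))^k from the analysis.
    return (1 - math.exp(-k * m / n)) ** k

for k in range(1, 13):
    print(k, round(false_positive_rate(k, m, n), 4))
# k = 1 gives about 0.1175 and k = 2 about 0.049, matching the example;
# the scan shows the rate first drops as k grows and then rises again.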



Summary of Bloom Filters
• Given a large set of elements S, efficiently check whether a new element is in the set.

• Bloom filters use hash functions to check membership
– If a is in S, return TRUE with probability 1
– If a is not in S, return FALSE with high probability
– The false positive rate depends on |S|, the number of bits of memory, and the number of hash functions



COUNTING DISTINCT ELEMENTS



Distinct Elements
INPUT:
• A stream S of elements from a domain D
– A stream of logins to a website
– A stream of URLs browsed by a user
• Memory with n bits

OUTPUT:
• An estimate of the number of distinct elements in the stream
– Number of distinct users logging in to the website
– Number of distinct URLs browsed by the user



FM-sketch
• Consider a hash function h: D → {0,1}^L which uniformly hashes elements in the stream to L-bit values

• IDEA: The more distinct elements in S, the more distinct hash values are observed.

• Define: Tail0(h(x)) = number of trailing consecutive 0s
– Tail0(101001) = 0
– Tail0(101010) = 1
– Tail0(001100) = 2
– Tail0(101000) = 3
– Tail0(000000) = 6 (= L)



FM-sketch
Algorithm
• For all x ∈ S,
– Compute k(x) = Tail0(h(x))
• Let K = max over x ∈ S of k(x)
• Return F’ = 2^K
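
A small Python sketch of this algorithm (an illustration, not the lecture's code); the uniform hash h is simulated with SHA-256 truncated to L bits, and L and the helper names are assumptions.

import hashlib

L = 32   # hash values are L-bit strings

def h(x):
    # Simulate a uniform hash h: D -> {0,1}^L.
    return int(hashlib.sha256(str(x).encode()).hexdigest(), 16) % (2 ** L)

def tail0(v):
    # Tail0(v): number of trailing consecutive 0 bits (Tail0(0) = L, as on the slides).
    if v == 0:
        return L
    count = 0
    while v & 1 == 0:
        v >>= 1
        count += 1
    return count

def fm_estimate(stream):
    # K = max over the stream of Tail0(h(x)); the estimate is F' = 2^K.
    K = max(tail0(h(x)) for x in stream)
    return 2 ** K

# Duplicates do not change the estimate, since h(x) is the same on every repeat.
print(fm_estimate(["u1", "u2", "u3", "u1", "u2"]))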



Analysis
Lemma: Pr[ Tail0(h(x)) ≥ j ] = 2^(−j)

Proof:
• Tail0(h(x)) ≥ j means at least the last j bits of h(x) are 0.
• Since elements are hashed to L-bit strings uniformly at random, each of those j bits is 0 independently with probability 1/2, so the probability is (1/2)^j = 2^(−j).
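
A quick empirical check of the lemma (my own illustration, reusing L and tail0 from the FM-sketch sketch above): draw uniform L-bit values and measure how often Tail0 is at least j.

import random

random.seed(0)
trials = 100_000
for j in (1, 2, 3, 4):
    hits = sum(tail0(random.getrandbits(L)) >= j for _ in range(trials))
    print(j, hits / trials, 2 ** -j)   # empirical frequency vs. 2^(-j)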



Analysis
• Let F be the true count of distinct elements, and let c > 2 be some integer.

• Let k1 be the largest k such that 2^k < cF
• Let k2 be the smallest k such that 2^k > F/c

• If K (returned by the FM-sketch) is between k2 and k1, then F/c ≤ F’ ≤ cF



Analysis
• Let z_x(k) = 1 if Tail0(h(x)) ≥ k, and 0 otherwise
• E[z_x(k)] = 2^(−k)    Var(z_x(k)) = 2^(−k)(1 − 2^(−k))

• Let X(k) = Σ_{x ∈ S} z_x(k)

• We are done if we show that, with high probability, X(k1) = 0 and X(k2) ≠ 0



Analysis
Lemma: Pr[X(k1) ≥ 1] ≤ 1/c
Proof: Pr[X(k1) ≥ 1] ≤ E(X(k1))            (Markov inequality)
= F · 2^(−k1) ≤ 1/c

Lemma: Pr[X(k2) = 0] ≤ 1/c
Proof: Pr[X(k2) = 0] = Pr[X(k2) − E(X(k2)) = −E(X(k2))]
≤ Pr[|X(k2) − E(X(k2))| ≥ E(X(k2))]
≤ Var(X(k2)) / E(X(k2))^2                  (Chebyshev inequality)
≤ 2^(k2)/F ≤ 1/c

Theorem: If the FM-sketch returns F’, then for all c > 2,
F/c ≤ F’ ≤ cF with probability at least 1 − 2/c



Boosting the success probability
• Construct s independent FM-sketches (F’1, F’2, …, F’s)
• Return the median F’med

Q: For any δ, what is the value of s s.t. P[F/c ≤ F’med ≤ cF] > 1 - δ ?



Analysis
• Let c > 4, and let x_i = 0 if F/c ≤ F’_i ≤ cF, and 1 otherwise
• ρ = E[x_i] = 1 − Pr[F/c ≤ F’_i ≤ cF] ≤ 2/c < 1/2

• Let X = Σ_i x_i, so E(X) = sρ

Lemma: If X < s/2, then F/c ≤ F’med ≤ cF (Exercise)

We are done if we show that Pr[X ≥ s/2] is small.



Analysis
Pr[ X ≥ s/2 ] = Pr[ X − E(X) ≥ s/2 − E(X) ]
≤ Pr[ |X − E(X)| ≥ s/2 − sρ ]
= Pr[ |X − E(X)| ≥ (1/(2ρ) − 1) sρ ]
≤ 2 exp( −(1/(2ρ) − 1)^2 sρ/3 )            (Chernoff bound)

Thus, to bound this probability by δ, it suffices to take s ≥ 3 ln(2/δ) / (ρ (1/(2ρ) − 1)^2), i.e., s = O(log(1/δ)).
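
As a concrete check (my own illustration, assuming the bound above), solving the Chernoff bound for s:

import math

def sketches_needed(rho, delta):
    # Smallest s with 2*exp(-(1/(2*rho) - 1)**2 * s * rho / 3) <= delta.
    return math.ceil(3 * math.log(2 / delta) / (rho * (1 / (2 * rho) - 1) ** 2))

# e.g. c = 8 gives rho <= 2/c = 0.25; for delta = 0.05:
print(sketches_needed(0.25, 0.05))   # about 45 independent FM-sketches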



Boosting the success probability
In practice,
• Construct sk independent FM sketches
• Divide the sketches into s groups of k each
• Compute the mean estimate in each group
• Return the median of the means.
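
A hedged Python sketch of this median-of-means combination (the helper name and the example numbers are illustrative assumptions, not from the slides):

import statistics

def median_of_means(estimates, s, k):
    # estimates: a flat list of s*k independent FM-sketch estimates.
    assert len(estimates) == s * k
    group_means = [sum(estimates[i * k:(i + 1) * k]) / k for i in range(s)]
    return statistics.median(group_means)

# Usage sketch with made-up estimates (powers of two, as an FM-sketch returns):
estimates = [8, 16, 8, 32, 16, 16, 8, 64, 16, 16, 8, 16]   # s*k = 3*4 values
print(median_of_means(estimates, s=3, k=4))                # median of the group means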



Summary
• Counting the number of distinct elements exactly takes O(N) space and Ω(N) time, where N is the number of distinct elements

• The FM-sketch estimates the number of distinct elements in O(log N) space and Θ(N) time

• FM-sketch: maximum number of trailing 0s in any hash value

• Can get good estimates with high probability by computing the median of many independent FM-sketches.

