
Streaming Algorithm: Filtering & Counting Distinct Elements

CompSci 590.02
Instructor: Ashwin Machanavajjhala



Streaming Databases
Continuous/Standing Queries: every time a new data item enters the system, (conceptually) re-evaluate the answer to the query.

We can’t hope to process a query on the entire data, but only on a small working set.



Examples of Streaming Data
• Internet & Web traffic
– Search/browsing history of users: want to predict which ads/content to show the user based on their history. Can’t look at the entire history at runtime.

• Continuous Monitoring
– 6 million surveillance cameras in London
– Video feeds from these cameras must be processed in real time

• Weather monitoring
• …



Processing Streams
• Summarization
– Maintain a small sketch (or summary) of the stream
– Answer queries using the sketch
– E.g., a random sample
– Later in the course: AMS, Count-Min sketch, etc.
– Types of queries: # distinct elements, most frequent elements in the stream, aggregates like sum, min, max, etc.

• Window Queries
– Queries over a window of the k most recent elements of the stream
– Types of queries: alert if there is a burst of traffic in the last 1 minute, denial-of-service identification, alert if stock price > 100, etc.



Streaming Algorithms
• Sampling
– We have already seen this.
• Filtering (this class)
– “… does the incoming email address appear in a set of whitelisted addresses …”
• Counting Distinct Elements (this class)
– “… how many unique users visit cnn.com …”
• Heavy Hitters
– “… news articles contributing to >1% of all traffic …”
• Online Aggregation
– “… Based on seeing 50% of the data the answer is in [25,35] …”


FILTERING



Problem
• A set S containing m values
– A whitelist of a billion non-spam email addresses
• Memory with n bits
– Say 1 GB of memory
• Goal: Construct a data structure that can efficiently check whether a new element is in S
– Returns TRUE with probability 1 when the element is in S
– Returns FALSE with high probability (1 − ε) when the element is not in S



Bloom Filter
• Consider a set of hash functions {h1, h2, …, hk}, hi: S → [1, n]

Initialization:
• Set all n bits in the memory to 0.

Insert a new element ‘a’:
• Compute h1(a), h2(a), …, hk(a). Set the corresponding bits to 1.

Check whether an element ‘a’ is in S:
• Compute h1(a), h2(a), …, hk(a).
If all the bits are 1, return TRUE.
Else, return FALSE.
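
A minimal Python sketch of the insert/check procedure above (my own illustration, not the lecture's implementation). The k hash functions are simulated by salting SHA-256; the class name, n, k, and the example addresses are assumptions made for the example.

import hashlib

class BloomFilter:
    def __init__(self, n_bits, k_hashes):
        self.n = n_bits
        self.k = k_hashes
        self.bits = bytearray(n_bits)   # one byte per bit, for readability only

    def _positions(self, item):
        # Derive k pseudo-independent positions in [0, n) by salting one hash.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1          # set bits h1(item), ..., hk(item)

    def maybe_contains(self, item):
        # TRUE can be a false positive; FALSE is always correct.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(n_bits=8_000, k_hashes=2)
bf.insert("alice@example.com")
print(bf.maybe_contains("alice@example.com"))   # True
print(bf.maybe_contains("spam@example.com"))    # almost certainly False

A real filter would pack the n bits tightly; one byte per bit is used here only to keep the sketch short.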



Analysis
If a is in S:
• Then bits h1(a), h2(a), …, hk(a) are all set to 1.
• Therefore, the Bloom filter returns TRUE with probability 1.

If a is not in S:
• The Bloom filter returns TRUE only if each hi(a) is 1 due to some other element.
Pr[bit j is 1 after m insertions] = 1 − Pr[bit j is 0 after m insertions]
= 1 − Pr[bit j was not set by any of the k × m hash evaluations]
= 1 − (1 − 1/n)^(km)
Pr[Bloom filter returns TRUE] = (1 − (1 − 1/n)^(km))^k ≈ (1 − e^(−km/n))^k



Example
• Suppose there are m = 10^9 emails in the whitelist.
• Suppose a memory size of 1 GB (n = 8 × 10^9 bits).

k = 1
• Pr[Bloom filter returns TRUE | a not in S] = 1 − e^(−m/n)
= 1 − e^(−1/8) ≈ 0.1175
k = 2
• Pr[Bloom filter returns TRUE | a not in S] = (1 − e^(−2m/n))^2
= (1 − e^(−1/4))^2 ≈ 0.049



Example
• Suppose there are m = 10^9 emails in the whitelist.
• Suppose a memory size of 1 GB (n = 8 × 10^9 bits).

[Plot: false positive probability vs. number of hash functions]

Exercise: What is the optimal number of hash functions, given m = |S| and n?
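
As a numeric illustration (my own sketch, not from the slides), the false positive formula from the analysis can be evaluated for the example m and n over a range of k:

import math

m = 10**9        # |S|: emails in the whitelist
n = 8 * 10**9    # bits of memory (1 GB)

def false_positive_rate(k, m, n):
    # Approximate false positive probability (1 - e^(-km/n))^k from the analysis.
    return (1 - math.exp(-k * m / n)) ** k

for k in range(1, 13):
    print(k, round(false_positive_rate(k, m, n), 4))
# k = 1 gives about 0.1175 and k = 2 about 0.049, matching the example;
# the scan shows the rate first drops as k grows and then rises again.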



Summary of Bloom Filters
• Given a large set of elements S, efficiently check whether a new element is in the set.

• Bloom filters use hash functions to check membership
– If a is in S, return TRUE with probability 1
– If a is not in S, return FALSE with high probability
– The false positive rate depends on |S|, the number of bits of memory, and the number of hash functions



COUNTING DISTINCT ELEMENTS



Distinct Elements
INPUT:
• A stream S of elements from a domain D
– A stream of logins to a website
– A stream of URLs browsed by a user
• Memory with n bits

OUTPUT:
• An estimate of the number of distinct elements in the stream
– Number of distinct users logging in to the website
– Number of distinct URLs browsed by the user



FM-sketch
• Consider a hash function h: D → {0,1}^L which uniformly hashes elements in the stream to L-bit values

• IDEA: The more distinct elements in S, the more distinct hash values are observed.

• Define: Tail0(h(x)) = number of trailing consecutive 0s
– Tail0(101001) = 0
– Tail0(101010) = 1
– Tail0(001100) = 2
– Tail0(101000) = 3
– Tail0(000000) = 6 (= L)



FM-sketch
Algorithm
• For all x ∈ S,
– Compute k(x) = Tail0(h(x))
• Let K = max over x ∈ S of k(x)
• Return F’ = 2^K
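
A small Python sketch of this algorithm (an illustration, not the lecture's code); the uniform hash h is simulated with SHA-256 truncated to L bits, and L and the helper names are assumptions.

import hashlib

L = 32   # hash values are L-bit strings

def h(x):
    # Simulate a uniform hash h: D -> {0,1}^L.
    return int(hashlib.sha256(str(x).encode()).hexdigest(), 16) % (2 ** L)

def tail0(v):
    # Tail0(v): number of trailing consecutive 0 bits (Tail0(0) = L, as on the slides).
    if v == 0:
        return L
    count = 0
    while v & 1 == 0:
        v >>= 1
        count += 1
    return count

def fm_estimate(stream):
    # K = max over the stream of Tail0(h(x)); the estimate is F' = 2^K.
    K = max(tail0(h(x)) for x in stream)
    return 2 ** K

# Duplicates do not change the estimate, since h(x) is the same on every repeat.
print(fm_estimate(["u1", "u2", "u3", "u1", "u2"]))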



Analysis
Lemma: Pr[ Tail0(h(x)) ≥ j ] = 2^(−j)

Proof:
• Tail0(h(x)) ≥ j means at least the last j bits of h(x) are 0.
• Since elements are hashed to L-bit strings uniformly at random, each of those j bits is 0 independently with probability 1/2, so the probability is (1/2)^j = 2^(−j).
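
A quick empirical check of the lemma (my own illustration, reusing L and tail0 from the FM-sketch sketch above): draw uniform L-bit values and measure how often Tail0 is at least j.

import random

random.seed(0)
trials = 100_000
for j in (1, 2, 3, 4):
    hits = sum(tail0(random.getrandbits(L)) >= j for _ in range(trials))
    print(j, hits / trials, 2 ** -j)   # empirical frequency vs. 2^(-j)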



Analysis
• Let F be the true count of distinct elements, and let c > 2 be some integer.

• Let k1 be the largest k such that 2^k < cF
• Let k2 be the smallest k such that 2^k > F/c

• If K (returned by the FM-sketch) is between k2 and k1, then F/c ≤ F’ ≤ cF



Analysis
• Let z_x(k) = 1 if Tail0(h(x)) ≥ k, and 0 otherwise
• E[z_x(k)] = 2^(−k)    Var(z_x(k)) = 2^(−k)(1 − 2^(−k))

• Let X(k) = Σ_{x ∈ S} z_x(k)

• We are done if we show that, with high probability, X(k1) = 0 and X(k2) ≠ 0



Analysis
Lemma: Pr[X(k1) ≥ 1] ≤ 1/c
Proof: Pr[X(k1) ≥ 1] ≤ E(X(k1))            (Markov inequality)
= F · 2^(−k1) ≤ 1/c

Lemma: Pr[X(k2) = 0] ≤ 1/c
Proof: Pr[X(k2) = 0] = Pr[X(k2) − E(X(k2)) = −E(X(k2))]
≤ Pr[|X(k2) − E(X(k2))| ≥ E(X(k2))]
≤ Var(X(k2)) / E(X(k2))^2                  (Chebyshev inequality)
≤ 2^(k2)/F ≤ 1/c

Theorem: If the FM-sketch returns F’, then for all c > 2,
F/c ≤ F’ ≤ cF with probability at least 1 − 2/c



Boosting the success probability
• Construct s independent FM-sketches (F’1, F’2, …, F’s)
• Return the median F’med

Q: For any δ, what is the value of s s.t. P[F/c ≤ F’med ≤ cF] > 1 - δ ?



Analysis
• Let c > 4, and let x_i = 0 if F/c ≤ F’_i ≤ cF, and 1 otherwise
• ρ = E[x_i] = 1 − Pr[F/c ≤ F’_i ≤ cF] ≤ 2/c < 1/2

• Let X = Σ_i x_i, so E(X) = sρ

Lemma: If X < s/2, then F/c ≤ F’med ≤ cF (Exercise)

We are done if we show that Pr[X ≥ s/2] is small.



Analysis
Pr[ X ≥ s/2 ] = Pr[ X − E(X) ≥ s/2 − E(X) ]
≤ Pr[ |X − E(X)| ≥ s/2 − sρ ]
= Pr[ |X − E(X)| ≥ (1/(2ρ) − 1) sρ ]
≤ 2 exp( −(1/(2ρ) − 1)^2 sρ/3 )            (Chernoff bound)

Thus, to bound this probability by δ, it suffices to take s ≥ 3 ln(2/δ) / (ρ (1/(2ρ) − 1)^2), i.e., s = O(log(1/δ)).
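
As a concrete check (my own illustration, assuming the bound above), solving the Chernoff bound for s:

import math

def sketches_needed(rho, delta):
    # Smallest s with 2*exp(-(1/(2*rho) - 1)**2 * s * rho / 3) <= delta.
    return math.ceil(3 * math.log(2 / delta) / (rho * (1 / (2 * rho) - 1) ** 2))

# e.g. c = 8 gives rho <= 2/c = 0.25; for delta = 0.05:
print(sketches_needed(0.25, 0.05))   # about 45 independent FM-sketches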



Boosting the success probability
In practice,
• Construct sk independent FM sketches
• Divide the sketches into s groups of k each
• Compute the mean estimate in each group
• Return the median of the means.
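
A hedged Python sketch of this median-of-means combination (the helper name and the example numbers are illustrative assumptions, not from the slides):

import statistics

def median_of_means(estimates, s, k):
    # estimates: a flat list of s*k independent FM-sketch estimates.
    assert len(estimates) == s * k
    group_means = [sum(estimates[i * k:(i + 1) * k]) / k for i in range(s)]
    return statistics.median(group_means)

# Usage sketch with made-up estimates (powers of two, as an FM-sketch returns):
estimates = [8, 16, 8, 32, 16, 16, 8, 64, 16, 16, 8, 16]   # s*k = 3*4 values
print(median_of_means(estimates, s=3, k=4))                # median of the group means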



Summary
• Counting the number of distinct elements exactly takes O(N) space and Ω(N) time, where N is the number of distinct elements

• The FM-sketch estimates the number of distinct elements in O(log N) space and Θ(N) time

• FM-sketch: maximum number of trailing 0s in any hash value

• Can get good estimates with high probability by computing the median of many independent FM-sketches.

