
Crash Course on Data Stream Algorithms

Part I: Basic Definitions and Numerical Streams

Andrew McGregor
University of Massachusetts Amherst

1/24

Goals of the Crash Course


Goal: Give a flavor for the theoretical results and techniques from the 100s of papers on the design and analysis of stream algorithms. When we abstract away the application-specific details, what are the basic algorithmic ideas and challenges in stream processing? What is and isn't possible?

Disclaimer: Talks will be theoretical/mathematical but shouldn't require much in the way of prerequisites.

Request:
If you get bored, ask questions. . .
If you get lost, ask questions. . .
If you'd like to ask questions, ask questions. . .

2/24

Outline

Basic Definitions
Sampling
Sketching
Counting Distinct Items
Summary of Some Other Results

3/24

Outline

Basic Definitions
Sampling
Sketching
Counting Distinct Items
Summary of Some Other Results

4/24

Data Stream Model

Stream: m elements from a universe of size n, e.g.,
x1, x2, ..., xm = 3, 5, 3, 7, 5, 4, ...

Goal: Compute a function of the stream, e.g., median, number of distinct elements, longest increasing sequence.
Catch:
1. Limited working memory, sublinear in n and m
2. Access data sequentially
3. Process each element quickly

Origins in the 70s, but the model has become popular in the last ten years because of a growing theory and wide applicability.

5/24

Why's it become popular?

Practical Appeal:
Faster networks, cheaper data storage, and ubiquitous data-logging result in massive amounts of data to be processed. Applications to network monitoring, query planning, I/O efficiency for massive data, sensor network aggregation, . . .

Theoretical Appeal:
Easy-to-state problems that are hard to solve. Links to communication complexity, compressed sensing, embeddings, pseudo-random generators, approximation, . . .

6/24

Outline

Basic Definitions
Sampling
Sketching
Counting Distinct Items
Summary of Some Other Results

7/24

Sampling and Statistics

Sampling is a general technique for tackling massive amounts of data.
Example: To compute the median packet size of some IP packets, we could just sample some and use the median of the sample as an estimate for the true median. Statistical arguments relate the size of the sample to the accuracy of the estimate.
Challenge: But how do you take a sample from a stream of unknown length, or from a sliding window?

8/24

Reservoir Sampling

Problem: Find a uniform sample s from a stream of unknown length
Algorithm:

Initially s = x1
On seeing the t-th element, set s ← xt with probability 1/t

Analysis:
What's the probability that s = xi at some time t ≥ i?
P[s = xi] = (1/i) × (1 − 1/(i+1)) × ... × (1 − 1/t) = 1/t
To get k samples we use O(k log n) bits of space.

9/24
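The reservoir algorithm above is only a few lines of code. A minimal Python sketch (the function name and the single-element sample are choices for this illustration; k samples would keep k independent reservoirs):

```python
import random

def reservoir_sample(stream):
    """Maintain a uniform sample from a stream of unknown length.

    On the t-th element (1-indexed), the newcomer replaces the current
    sample with probability 1/t; as on the slide, this gives
    P[sample = x_i] = 1/t for every i <= t.
    """
    sample = None
    for t, x in enumerate(stream, start=1):
        if random.random() < 1.0 / t:
            sample = x  # with probability 1/t the newcomer replaces s
    return sample
```

Running it repeatedly over the same stream shows each element is returned about equally often.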

Priority Sampling for Sliding Windows


Problem: Maintain a uniform sample from the last w items
Algorithm:
1. For each xi, pick a random value vi ∈ (0, 1)
2. In a window x_{j−w+1}, ..., xj, return the value xi with the smallest vi
3. To do this, maintain the set S of all elements in the sliding window whose v value is minimal among all subsequent v values

Analysis:
The probability that the j-th oldest element is in S is 1/j, so the expected number of items in S is 1/w + 1/(w − 1) + ... + 1/1 = O(log w)
Hence, the algorithm only uses O(log w · log n) bits of memory.

10/24
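The scheme above is the classic min-of-window bookkeeping. A Python sketch (names are illustrative; the deque holds exactly the candidate set S from step 3, so its expected size is O(log w)):

```python
import random
from collections import deque

def priority_sampler(stream, w):
    """Yield, after each arrival, a uniform sample from the last w items.

    Each item gets a random priority in (0, 1). We keep a deque of
    (index, value, priority) triples with strictly increasing priorities:
    each kept item has smaller priority than everything after it, so the
    front of the deque is the window's minimum-priority item.
    """
    cands = deque()  # candidate set S
    for i, x in enumerate(stream):
        v = random.random()
        # Drop tail items whose priority exceeds the newcomer's: they
        # can never again be the window minimum.
        while cands and cands[-1][2] > v:
            cands.pop()
        cands.append((i, x, v))
        # Expire the front if it fell out of the window of size w.
        if cands[0][0] <= i - w:
            cands.popleft()
        yield cands[0][1]  # current sample
```

After each arrival the front of the deque is the minimum-priority, hence uniformly sampled, element of the current window.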

Other Types of Sampling


Universe sampling: For a random i ∈R [n], compute fi = |{j : xj = i}|
Minwise hashing: Sample i ∈R {i : there exists j such that xj = i}
AMS sampling: Sample xj for j ∈R [m] and compute r = |{j′ ≥ j : xj′ = xj}|
Handy when estimating quantities like ∑_i g(fi), because
E[m(g(r) − g(r − 1))] = ∑_i g(fi)

11/24
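The identity E[m(g(r) − g(r − 1))] = ∑_i g(fi) behind AMS sampling can be checked numerically. This offline simulation stores the stream to compute r exactly, so it illustrates the estimator rather than the streaming implementation (the function name is a choice for this sketch):

```python
import random

def ams_estimate(stream, g, trials=1000):
    """Estimate sum_i g(f_i) via AMS sampling.

    Each trial picks a uniform position j, sets r to the number of
    occurrences of x_j at or after position j, and contributes
    m * (g(r) - g(r - 1)); the average over trials is unbiased.
    """
    xs = list(stream)
    m = len(xs)
    total = 0.0
    for _ in range(trials):
        j = random.randrange(m)
        r = sum(1 for k in range(j, m) if xs[k] == xs[j])
        total += m * (g(r) - g(r - 1))
    return total / trials
```

With g(z) = z², the estimate converges to the second frequency moment ∑_i fi².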

Outline

Basic Definitions
Sampling
Sketching
Counting Distinct Items
Summary of Some Other Results

12/24

Sketching

Sketching is another general technique for processing streams.
Basic idea: Apply a linear projection on the fly that takes high-dimensional data to a lower-dimensional space. Post-process the lower-dimensional image to estimate the quantities of interest.

13/24

Estimating the difference between two streams

Input: Stream from two sources (each element colored red or blue), x1, x2, ..., xm
Goal: Estimate the difference between the distribution of red values and blue values, e.g.,
∑_{i∈[n]} |fi − gi|
where fi = |{k : xk = i is red}| and gi = |{k : xk = i is blue}|

14/24

p-Stable Distributions and Algorithm


Defn: A p-stable distribution has the following property: for X, Y, Z drawn from the distribution and a, b ∈ R:
aX + bY ∼ (|a|^p + |b|^p)^{1/p} Z

e.g., the Gaussian is 2-stable and the Cauchy distribution is 1-stable

Algorithm:

Generate a random matrix A ∈ R^{k×n} where Aij ∼ Cauchy, k = O(ε^{-2})
Compute sketches Af and Ag incrementally
Return median(|t1|, ..., |tk|) where t = Af − Ag

Analysis:
By the 1-stability property, for Zi ∼ Cauchy,
|ti| = |∑_j Ai,j (fj − gj)| ∼ |Zi| ∑_j |fj − gj|

For k = O(ε^{-2}), since median(|Zi|) = 1, with high probability,
(1 − ε) ∑_j |fj − gj| ≤ median(|t1|, ..., |tk|) ≤ (1 + ε) ∑_j |fj − gj|

15/24
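A rough simulation of the 1-stable sketch, under the assumptions that Cauchy entries are sampled by inverse CDF, tan(π(U − 1/2)), and that storing A explicitly is acceptable for illustration (a real streaming implementation would generate A's entries pseudo-randomly on demand rather than store the matrix):

```python
import math
import random

def cauchy_sketch(freqs, A):
    """Apply the linear sketch: return A times the frequency vector."""
    return [sum(row[i] * f for i, f in enumerate(freqs)) for row in A]

def l1_estimate(f, g, k=200, n=None, seed=0):
    """Estimate sum_i |f_i - g_i| as the median of |Af - Ag|.

    By 1-stability each coordinate of Af - Ag is distributed as the L1
    distance times a Cauchy variable, and median(|Cauchy|) = 1, so the
    median of the absolute coordinates concentrates on the L1 distance.
    """
    n = n or len(f)
    rng = random.Random(seed)
    # Cauchy samples via inverse CDF: tan(pi * (U - 1/2)) for uniform U.
    A = [[math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
         for _ in range(k)]
    t = [abs(a - b) for a, b in zip(cauchy_sketch(f, A), cauchy_sketch(g, A))]
    return sorted(t)[k // 2]
```

For example, l1_estimate([3, 0, 1], [1, 1, 1]) concentrates around |3 − 1| + |0 − 1| + |1 − 1| = 3 as k grows.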

A Useful Multi-Purpose Sketch: Count-Min Sketch


Heavy Hitters: Find all i such that fi m Range Sums: Estimate ikj fk when i, j arent known in advance Find k-Quantiles: Find values q0 , . . . , qk such that q0 = 0, qk = n, and
iqj 1

fi <

jm k

fi
iqj

Algorithm: Count-Min Sketch


Maintain an array of counters ci,j for i [d] and j [w ] Construct d random hash functions h1 , h2 , . . . hd : [n] [w ] Update counters: On seeing value v , increment ci,hi (v ) for i [d] To get an estimate of fk , return fk = min ci,hi (k)
i

Analysis: For d = O(log 1/) and w = O(1/ 2 ) P fk m fk fk 1


16/24
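A compact Count-Min sketch following the update/query rules above (salting Python's built-in hash per row stands in for the d random hash functions; class and parameter names are choices for this sketch):

```python
import random

class CountMin:
    """Count-Min sketch: d rows of w counters, one hash per row.

    A point query returns the minimum counter over the rows, which never
    underestimates the true frequency and overestimates by at most
    eps * m with probability 1 - delta for w = O(1/eps), d = O(log 1/delta).
    """

    def __init__(self, w, d, seed=0):
        self.w, self.d = w, d
        self.counts = [[0] * w for _ in range(d)]
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(64) for _ in range(d)]

    def _h(self, i, v):
        # Row i's hash function: built-in hash salted by a random value.
        return hash((self.salts[i], v)) % self.w

    def update(self, v):
        for i in range(self.d):
            self.counts[i][self._h(i, v)] += 1

    def estimate(self, v):
        return min(self.counts[i][self._h(i, v)] for i in range(self.d))
```

estimate(v) never underestimates fv; each row's counter exceeds fv only by colliding items, and taking the minimum over rows makes a large excess unlikely.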

Outline

Basic Definitions
Sampling
Sketching
Counting Distinct Items
Summary of Some Other Results

17/24

Counting Distinct Elements

Input: Stream x1, x2, ..., xm ∈ [n]^m
Goal: Estimate the number of distinct values in the stream up to a multiplicative factor (1 + ε), with high probability.

18/24

Algorithm

Algorithm:
1. Apply a random hash function h : [n] → [0, 1] to each element
2. Compute v, the t-th smallest hash value seen, where t = 21/ε²
3. Return r̂ = t/v as an estimate for r, the number of distinct items

Analysis:
1. The algorithm uses O(ε^{-2} log n) bits of space.
2. We'll show the estimate has good accuracy with reasonable probability:
P[|r̂ − r| ≤ εr] ≥ 9/10

19/24
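A simulation of the t-th-smallest-hash estimator (a cached random number per element stands in for the hash function h, so this stores more state than the real algorithm, which keeps only the t smallest hash values, e.g., in a heap; the function name is a choice for this sketch):

```python
import math
import random

def distinct_estimate(stream, eps=0.1, seed=0):
    """Estimate the number of distinct items in a stream.

    Hash each element to (0, 1), track v, the t-th smallest hash value
    with t = ceil(21 / eps^2), and return t / v. If fewer than t
    distinct hashes are ever seen, the count is exact.
    """
    t = math.ceil(21 / eps**2)
    rng = random.Random(seed)
    hashes = {}  # element -> hash value (simulates a random hash h)
    for x in stream:
        if x not in hashes:
            hashes[x] = rng.random()
    vals = sorted(hashes.values())
    if len(vals) < t:
        return len(vals)  # fewer than t distinct items: exact count
    return t / vals[t - 1]  # t-th smallest value is v
```

When fewer than t distinct elements appear, the count is exact; otherwise t/v is a (1 ± ε)-estimate with reasonable probability.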

Accuracy Analysis
1. Suppose the distinct items are a1, ..., ar
2. Over-estimation:
P[r̂ ≥ (1 + ε)r] = P[t/v ≥ (1 + ε)r] = P[v ≤ t/((1 + ε)r)]
3. Let Xi = 1[h(ai) ≤ t/((1 + ε)r)] and X = ∑ Xi. Then
P[v ≤ t/((1 + ε)r)] = P[X ≥ t] = P[X ≥ (1 + ε)E[X]]
4. By a Chebyshev analysis,
P[X ≥ (1 + ε)E[X]] ≤ 1/(ε² E[X]) ≤ 1/20
5. Under-estimation: A similar analysis shows P[r̂ ≤ (1 − ε)r] ≤ 1/20

20/24

Outline

Basic Definitions
Sampling
Sketching
Counting Distinct Items
Summary of Some Other Results

21/24

Some Other Results


Correlations:
Input: (x1, y1), (x2, y2), ..., (xm, ym)
Goal: Estimate the strength of correlation between x and y via the distance between the joint distribution and the product of the marginals.
Result: (1 + ε) approx in O(ε^{-O(1)}) space.

Linear Regression:
Input: Stream defines a matrix A ∈ R^{n×d} and a vector b ∈ R^n
Goal: Find x such that ‖Ax − b‖2 is minimized.
Result: (1 + ε) estimation in O(d² ε^{-1}) space.

22/24

Some More Other Results


Histograms:
Input: x1, x2, ..., xm ∈ [n]^m
Goal: Determine the B-bucket histogram H : [m] → R minimizing
∑_{i∈[m]} (xi − H(i))²
Result: (1 + ε) estimation in O(B² ε^{-1}) space

Transpositions and Increasing Subsequences:
Input: x1, x2, ..., xm ∈ [n]^m
Goal: Estimate the number of transpositions |{i < j : xi > xj}|
Goal: Estimate the length of the longest increasing subsequence
Results: (1 + ε) approx in O(ε^{-1}) and O(ε^{-1}√n) space respectively

23/24

Thanks!

Blog: http://polylogblog.wordpress.com
Lectures: Piotr Indyk, MIT, http://stellar.mit.edu/S/course/6/fa07/6.895/
Books:
Data Streams: Algorithms and Applications, S. Muthukrishnan (2005)
Algorithms and Complexity of Stream Processing, A. McGregor and S. Muthukrishnan (forthcoming)

24/24
