Crash Course On Data Stream Algorithms: Part I: Basic Definitions and Numerical Streams
Andrew McGregor
University of Massachusetts Amherst
Outline

- Basic Definitions
- Sampling
- Sketching
- Counting Distinct Items
- Summary of Some Other Results
Basic Definitions
Goal: Compute a function of the stream, e.g., the median, the number of distinct elements, or the length of the longest increasing subsequence.

Catch:
1. Limited working memory, sublinear in n and m
2. Access the data sequentially
3. Process each element quickly

The area has its origins in the 1970s but has become popular in the last ten years because of a growing theory and wide applicability.
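As a minimal illustration of the model (the function names here are my own, not from the deck), the following sketch makes one sequential pass and uses O(1) working memory regardless of the stream's length:

```python
def one_pass_stats(stream):
    """One sequential pass, O(1) working memory: length, sum, and max.
    By contrast, computing the exact median in one pass provably needs
    much more than constant space."""
    count, total, largest = 0, 0, None
    for x in stream:
        count += 1
        total += x
        largest = x if largest is None else max(largest, x)
    return count, total, largest
```

Sums and maxima fit the model trivially; the interesting problems (median, distinct elements, longest increasing subsequence) are those where sublinear space forces approximation or randomization.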
Practical Appeal: Faster networks, cheaper data storage, and ubiquitous data-logging result in massive amounts of data to be processed. Applications include network monitoring, query planning, I/O efficiency for massive data, sensor-network aggregation, ...

Theoretical Appeal: The problems are easy to state but hard to solve. There are links to communication complexity, compressed sensing, embeddings, pseudo-random generators, approximation, ...
Sampling
Sampling is a general technique for tackling massive amounts of data.

Example: To compute the median packet size of a stream of IP packets, we could sample some of the packets and use the median of the sample as an estimate of the true median. Statistical arguments relate the size of the sample to the accuracy of the estimate.

Challenge: But how do you take a sample from a stream of unknown length, or from a sliding window?
Reservoir Sampling

Algorithm: Maintain a single sample s. When the t-th element x_t arrives, set s = x_t with probability 1/t.

Analysis: What's the probability that s = x_i at some time t ≥ i?

P[s = x_i] = (1/i) · (1 − 1/(i+1)) · ... · (1 − 1/t) = 1/t

since the product telescopes. To maintain k samples we use O(k log n) bits of space.
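A sketch of reservoir sampling for k samples (a hypothetical implementation, not the deck's own code); the single-sample analysis above is the case k = 1:

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Uniform sample of k items (without replacement) from a stream of
    unknown length, storing only the k-item reservoir."""
    reservoir = []
    for t, x in enumerate(stream, start=1):
        if t <= k:
            reservoir.append(x)        # fill the reservoir first
        else:
            j = rng.randrange(t)       # uniform in {0, ..., t-1}
            if j < k:                  # happens with probability k/t
                reservoir[j] = x       # evict a uniformly chosen item
    return reservoir
```

Each item x_t enters the reservoir with probability k/t and, conditioned on entering, evicts a uniformly random occupant, which is exactly the replacement rule the analysis assumes.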
Sampling from a Sliding Window

Algorithm: Assign each arriving element a random priority in [0, 1]. Maintain the set S of elements in the window whose priority is smaller than the priority of every later element; the sample is the minimum-priority element of S.

Analysis: The probability that the j-th most recent element is in S is 1/j (it must have the smallest priority among itself and the j − 1 more recent elements), so the expected number of items in S is 1/1 + 1/2 + ... + 1/w = O(log w).

Hence, the algorithm only uses O(log w · log n) bits of memory in expectation.
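A sketch of the priority-based window sampling just described (my own rendering; variable names are assumptions). Elements whose priority is beaten by a later arrival can never again be the window minimum, so they are discarded immediately:

```python
import random
from collections import deque

def sliding_window_samples(stream, w, rng=random):
    """After each arrival, yield a uniform sample from the last w elements.
    Each element gets a random priority; S keeps only elements whose
    priority is smaller than every later priority, so |S| = O(log w) in
    expectation, and the window's minimum-priority element is S[0]."""
    S = deque()  # (index, priority, value), priorities strictly increasing
    for t, x in enumerate(stream):
        p = rng.random()
        while S and S[-1][1] >= p:     # x outlives these elements: evict
            S.pop()
        S.append((t, p, x))
        while S[0][0] <= t - w:        # expire items outside the window
            S.popleft()
        yield S[0][2]                  # min-priority element in the window
```

Because the minimum priority among the last w elements is equally likely to land on any of them, each yielded value is a uniform sample from the current window.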
A Basic Estimator

For a function g with g(0) = 0, suppose we want to estimate Σ_{i∈[n]} g(f_i), where f_i is the frequency of value i. Pick J uniformly at random from [m] and let r = |{j ≥ J : x_j = x_J}| be the number of occurrences of x_J at or after position J. Then

E[m(g(r) − g(r − 1))] = Σ_{i∈[n]} g(f_i)

since, for each value i, its f_i occurrences contribute g(1) − g(0), g(2) − g(1), ..., g(f_i) − g(f_i − 1), and the sum telescopes.
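To sanity-check the estimator's unbiasedness, the following (illustrative code of my own, not from the deck) averages m(g(r) − g(r − 1)) over every choice of J and recovers Σ_i g(f_i) exactly:

```python
def estimator_mean(xs, g):
    """Average of the basic estimator m*(g(r) - g(r-1)) over all start
    positions J; the telescoping argument makes this exactly sum_i g(f_i)."""
    m = len(xs)
    total = 0
    for J in range(m):
        # occurrences of x_J at or after position J
        r = sum(1 for j in range(J, m) if xs[j] == xs[J])
        total += m * (g(r) - g(r - 1))
    return total / m
```

With g(x) = x², this recovers the second frequency moment F_2; a streaming algorithm keeps only a random J and the running count r rather than the whole stream.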
Sketching
Sketching is another general technique for processing streams.

Basic idea: Apply a linear projection on the fly that maps the high-dimensional data to a lower-dimensional space. Post-process the lower-dimensional image to estimate the quantities of interest.
Input: A stream x_1, x_2, ..., x_m of values in [n], each arriving from one of two sources (say red and blue).
Goal: Estimate the difference between the distributions of the red values and the blue values, e.g., Σ_{i∈[n]} |f_i − g_i| where f_i and g_i are the numbers of red and blue occurrences of i.
Algorithm: Maintain the sketch t = A(f − g), where A is a k × n matrix whose entries A_{i,j} are i.i.d. samples from the Cauchy distribution; since the sketch is linear, it can be updated as each element arrives.

Analysis: By the 1-stability property of the Cauchy distribution,

|t_i| = |Σ_j A_{i,j}(f_j − g_j)| ∼ |Z_i| Σ_j |f_j − g_j|

where Z_i is a standard Cauchy random variable.

For k = O(1/ε²), since median(|Z_i|) = 1, with high probability

(1 − ε) Σ_j |f_j − g_j| ≤ median(|t_1|, ..., |t_k|) ≤ (1 + ε) Σ_j |f_j − g_j|
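A sketch of the Cauchy-based L1 estimator (a hypothetical implementation of my own: in practice the entries of A would be generated pseudorandomly rather than stored, and here we sketch a single vector v, which by linearity could be f − g):

```python
import math
import random
import statistics

def l1_sketch(v, k, rng=random):
    """Compute t = A v for a k x n matrix A of i.i.d. standard Cauchy
    entries, each sampled as tan(pi*(U - 1/2)) for uniform U."""
    t = []
    for _ in range(k):                 # one sketch coordinate per row of A
        row_dot = 0.0
        for vj in v:
            a = math.tan(math.pi * (rng.random() - 0.5))  # standard Cauchy
            row_dot += a * vj
        t.append(row_dot)
    return t

def estimate_l1(t):
    """median(|t_i|) estimates ||v||_1, since median(|Cauchy|) = 1."""
    return statistics.median(abs(x) for x in t)
```

Because the sketch is linear, two parties can sketch f and g separately and subtract the sketches to estimate Σ_j |f_j − g_j|.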
Quantiles

Goal: Find approximate quantiles of the stream, i.e., for j = 1, ..., k − 1, values q_j such that

Σ_{i < q_j} f_i < jm/k ≤ Σ_{i ≤ q_j} f_i

where f_i is the frequency of value i.
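For reference, exact quantiles under this definition can be computed offline from the full frequency vector (a non-streaming illustration of my own; streaming algorithms approximate these q_j in small space):

```python
from collections import Counter

def exact_quantiles(stream, k):
    """Return q_1, ..., q_{k-1}: q_j is the smallest value whose
    cumulative frequency reaches j*m/k, so that
    sum_{i < q_j} f_i < j*m/k <= sum_{i <= q_j} f_i."""
    f = Counter(stream)
    m = sum(f.values())
    qs, cum, j = [], 0, 1
    for v in sorted(f):
        cum += f[v]
        while j < k and cum >= j * m / k:
            qs.append(v)
            j += 1
    return qs
```

With k = 2 this returns the median; the streaming challenge is achieving the same guarantee approximately without storing all n frequencies.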
Counting Distinct Items
Input: A stream x_1, x_2, ..., x_m ∈ [n]^m
Goal: Estimate the number of distinct values in the stream up to a multiplicative factor (1 + ε), with high probability.
Algorithm:
1. Apply a random hash function h : [n] → [0, 1] to each element.
2. Compute v, the t-th smallest hash value seen, where t = 21/ε².
3. Return r̂ = t/v as the estimate of r, the number of distinct items.

Analysis:
1. The algorithm uses O(ε⁻² log n) bits of space.
2. We'll show the estimate has good accuracy with reasonable probability:
P[|r̂ − r| ≤ εr] ≥ 9/10
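A sketch of the algorithm (a hypothetical implementation of my own: a salted SHA-256 stands in for the random hash function, and a heap keeps the t smallest distinct hash values):

```python
import hashlib
import heapq
import math

def count_distinct(stream, eps, salt=b"seed"):
    """Estimate the number of distinct items: hash each item to [0, 1),
    keep the t smallest distinct hash values (t = ceil(21/eps^2)), and
    return t divided by the t-th smallest."""
    t = math.ceil(21 / eps ** 2)
    heap, in_heap = [], set()          # max-heap of the t smallest, via negation
    for x in stream:
        digest = hashlib.sha256(salt + repr(x).encode()).digest()
        h = int.from_bytes(digest[:8], "big") / 2.0 ** 64
        if h in in_heap or (len(heap) == t and h >= -heap[0]):
            continue                   # duplicate hash, or too large to matter
        heapq.heappush(heap, -h)
        in_heap.add(h)
        if len(heap) > t:
            in_heap.discard(-heapq.heappop(heap))
    if len(heap) < t:                  # fewer than t distinct items: exact count
        return len(heap)
    return t / -heap[0]
```

The heap holds at most t values, matching the O(ε⁻² log n) space bound up to the bits used per stored value.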
Accuracy Analysis
1. Suppose the distinct items are a_1, ..., a_r.
2. Over-estimation:
P[r̂ ≥ (1 + ε)r] = P[t/v ≥ (1 + ε)r] = P[v ≤ t/((1 + ε)r)]
3. Let X_i = 1[h(a_i) ≤ t/((1 + ε)r)] and X = Σ_i X_i. Then
P[v ≤ t/((1 + ε)r)] = P[X ≥ t] = P[X ≥ (1 + ε)E[X]] ≤ 1/20
where E[X] = r · t/((1 + ε)r) = t/(1 + ε), and the last inequality is a Chernoff bound using t = 21/ε².
Summary of Some Other Results
Transpositions and Increasing Subsequences:
Input: x_1, x_2, ..., x_m ∈ [n]^m
Goal: Estimate the number of transpositions, |{i < j : x_i > x_j}|
Goal: Estimate the length of the longest increasing subsequence
Results: (1 + ε) approximation in O(1/ε) space and O(√n/ε) space, respectively
Thanks!

Blog: http://polylogblog.wordpress.com
Lectures: Piotr Indyk, MIT. http://stellar.mit.edu/S/course/6/fa07/6.895/
Books:
- Data Streams: Algorithms and Applications, S. Muthukrishnan (2005)
- Algorithms and Complexity of Stream Processing, A. McGregor and S. Muthukrishnan (forthcoming)