01 Streaming PDF

This document summarizes key concepts relating to streaming models and sampling from data streams. It discusses the basic streaming model, in which a processor with limited memory receives a stream of unknown length drawn from a large universe. Common queries on the stream include the mean, the maximum, and representative samples. The document then covers reservoir sampling for selecting a uniform random sample without replacement from a stream using memory proportional only to the sample size. It also discusses counting distinct elements in a stream using only O(log m) bits of memory, where m is the size of the universe.


Data Mining (CS6720): Introduction to the Streaming Model
John Augustine
Jan 16, 2020

Streaming: Sampling from a Stream

Basic Model

• Processor (unlimited power. Why?)
• Limited memory. Goal is to minimize the amount of memory.
• Stream: a_1, a_2, …, a_n (typically, n is unknown to the processor).
• Each a_i ∈ U, the universe, which can be {1, 2, …, m} (a set of m symbols), ℝ, ℤ, ℕ, etc.
• Query: some question about the items seen so far. (Sample queries below.)
• At time step i, item a_i is provided to the processor.
• The processor uses its memory words and a_i, and updates the memory such that the answer to the query is kept up to date.
• Several other variants (sliding window, cash register model, graph streaming, etc.)

Sample Queries

• Mean, max, etc.
• Representative candidate element(s). (Formally: sample(s) with/without replacement.)
• Suppose U is a collection of m items and f_j is the frequency (number of occurrences) of the jth item.
• The pth frequency moment is Σ_j f_j^p.
• The 0th frequency moment is the number of distinct items.
• The 2nd frequency moment (aka the "surprise number") is useful in computing variance.
• The quantity (Σ_j f_j^p)^{1/p} approaches the frequency of the most frequent item as p → ∞.

Reservoir Sampling

Sampling without replacement in a stream. Vitter, 1985.

Sampling Formulation

• Formulation attempt: Pick each item with probability, say, 1/100.
• Stream of (unknown) n elements from a universe of m symbols. Large n and m.
• Query: a sample of k ≪ min(n, m) elements chosen uniformly at random without replacement.
• Goal: memory ≤ k symbols.

• Strawman solution 1: First k or last k.
• Strawman solution 2: Choose some k random indices and collect the items at those indices.

Simple Case: k = 1

• Only one memory cell.
• Item a_i is placed in the cell with probability 1/i.
• Event E_i: the item a_i is the final sample iff (i) a_i was placed in the cell and (ii) subsequent items were NOT placed in the cell.

  Pr[E_i] = (1/i)(1 − 1/(i+1))(1 − 1/(i+2)) ⋯ (1 − 1/n)
          = (1/i) · (i/(i+1)) · ((i+1)/(i+2)) ⋯ ((n−1)/n)
          = 1/n.

Case: Arbitrary k

• Now we are allowed k cells. Place the first k items in the cells.
• When item a_i, i > k, arrives, pick a random number r from 1 to i. If r ≤ k, place a_i in cell r. Else, ignore a_i.
• Event E_i: a_i is a sample iff (i) a_i is placed in one of the k cells and (ii) no subsequent item is placed in that cell.
• Thus,

  Pr[E_i] = (k/i)(1 − 1/(i+1))(1 − 1/(i+2)) ⋯ (1 − 1/n) = k/n.
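A minimal Python sketch of the k-cell reservoir procedure described above (the function name and stream interface are mine, not from the slides):

```python
import random

def reservoir_sample(stream, k):
    """Uniform random sample of k items, without replacement, from a
    stream of unknown length, using only k memory cells."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)       # the first k items fill the cells
        else:
            r = random.randint(1, i)     # uniform over {1, ..., i}
            if r <= k:
                reservoir[r - 1] = item  # place a_i in cell r
            # else: ignore a_i
    return reservoir
```

Each item ends up in the final sample with probability k/n, matching the analysis above.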

Counting Distinct Elements in a Stream

Problem Formulation

• Stream of n numbers from {1, 2, …, m}.
• Query: how many different numbers?
• Goal: memory O(log m) bits.

• Deterministic solutions?
• n = m + 1 stream elements require m bits.
• Proof:
  • Suppose that after m stream items the memory holds only m − 1 bits.
  • 2^m − 1 possible (nonempty) subsets could be seen, but there are only 2^(m−1) states for the memory.
  • Thus, some two subsets S_1 and S_2 are represented by the same state.
  • If those subsets have different sizes, error.
  • Otherwise, if they have the same size, then what if the (m+1)th item is from S_1 \ S_2?

[Side note: Material not covered in class but required to know (and therefore asked in tests/quizzes/exams) will be mentioned in a cloud like this. You are free to search other sources (Internet, textbook).]
Problem Formulation (Updated)

• Stream of n numbers from {1, 2, …, m}.
• Query: Let d be the number of distinct items. Output an estimate d̂ such that Pr[d/6 ≤ d̂ ≤ 6d] > 1/2.
• Goal: memory O(log m).

Why 6? Answer: it is convenient to prove.

Intuition

[Figure omitted. This and other subsequent figures & screen clips are from Blum, Hopcroft, and Kannan unless specified otherwise.]

Algorithm

• Assume the availability of a two-universal hash function h: {1, 2, …, m} → {1, 2, …, M}, where M > m, i.e.,
  • Pr[h(a) = x] = 1/M, and
  • Pr[h(a) = x ∧ h(b) = y] = 1/M² for a ≠ b.
• Algorithm: hash each value in the stream and only remember the smallest hash value s. On being queried, report d̂ = M/s.

[Side note: How to get 2-universal hash functions? Why is it 2-universal?]

Claim:

Let b_1, b_2, …, b_d be the d distinct items encountered in the stream. Then,

  Pr[d/6 ≤ d̂ ≤ 6d] ≥ 2/3 − d/M > 1/2   (when M ≫ d).
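A sketch of this estimator in Python. The slides do not fix a hash family, so the code below uses the standard ((a·x + b) mod p) mod M construction as a stand-in 2-universal hash; the function name and the choice of prime are mine.

```python
import random

def distinct_count_estimate(stream, M):
    """Estimate the number of distinct items: hash every item with a
    2-universal hash into {1, ..., M}, remember only the smallest hash
    value s, and report M/s."""
    p = 2**61 - 1                        # a prime assumed larger than m and M
    a = random.randint(1, p - 1)
    b = random.randint(0, p - 1)
    h = lambda x: ((a * x + b) % p) % M + 1
    s = M                                # smallest hash value seen so far
    for item in stream:
        s = min(s, h(item))
    return M / s
```

Only s and the two hash coefficients are stored, i.e., O(log M) = O(log m) bits.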

Proof.

First, we focus on the event s ≤ M/(6d), i.e., the event that the output d̂ = M/s is at least 6d. For each i, Pr[h(b_i) ≤ M/(6d)] ≤ 1/(6d), so by the union bound over the d distinct items,

  Pr[s ≤ M/(6d)] ≤ d · 1/(6d) = 1/6.

Brief Detour into Elementary Tail Bounds

Markov's Inequality: If X is a random variable that only takes nonnegative values, then, for all a > 0,

  Pr[X ≥ a] ≤ E[X]/a.

Chebyshev's Inequality: For any random variable X (not necessarily nonnegative) and for all a > 0,

  Pr[|X − E[X]| ≥ a] ≤ Var[X]/a².

Also, for any t > 1,

  Pr[|X − E[X]| ≥ t·E[X]] ≤ Var[X]/(t² · (E[X])²).

[Side note: Prove these tail inequalities. Reference: Mitzenmacher and Upfal.]
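For the cloud exercise, one standard derivation (along the lines of Mitzenmacher and Upfal) of Markov's inequality for a discrete nonnegative X, with Chebyshev as a corollary:

```latex
% Markov: for nonnegative (discrete) X and a > 0,
\mathbb{E}[X] = \sum_{x} x \Pr[X = x]
            \ge \sum_{x \ge a} x \Pr[X = x]
            \ge a \sum_{x \ge a} \Pr[X = x]
            = a \Pr[X \ge a],
% hence Pr[X >= a] <= E[X]/a.
% Chebyshev: apply Markov to the nonnegative variable (X - E[X])^2
% with threshold a^2:
\Pr\big[\,|X - \mathbb{E}[X]| \ge a\,\big]
  = \Pr\big[(X - \mathbb{E}[X])^2 \ge a^2\big]
  \le \frac{\operatorname{Var}[X]}{a^2}.
```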

Back to our Proof.

Next, focus on the event s ≥ 6M/d and bound its probability to within 1/6. Recall:

  Pr[s ≥ 6M/d] = Pr[∀i, h(b_i) ≥ 6M/d].

Let

  Y_i = 0, if h(b_i) ≥ 6M/d,
  Y_i = 1, otherwise,

and let Y = Σ_i Y_i. Then:

  E[Y_i] = 1 × Pr[Y_i = 1] = Pr[h(b_i) < 6M/d] ≈ 6/d.

  E[Y] = Σ_i E[Y_i] ≈ d · (6/d) = 6.

  Var[Y_i] = E[Y_i²] − E[Y_i]² = E[Y_i] − E[Y_i]² ≤ E[Y_i].

  Var[Y] = Σ_i Var[Y_i] ≤ E[Y].

[Side note: The last equality holds due to the 2-way independence of the Y_i's. Find out why?]

  Pr[s ≥ 6M/d] = Pr[∀i, h(b_i) ≥ 6M/d]
             = Pr[Y = 0]
             ≤ Pr[|Y − E[Y]| ≥ E[Y]]
             ≤ Var[Y]/(E[Y])²
             ≤ 1/E[Y]
             ≤ 1/6.

[Side note: The final result follows by the union bound, which says: Pr[E_1 ∪ E_2 ∪ ⋯] ≤ Σ_i Pr[E_i].]

Homework

How to boost the probability and achieve the following claim?

For any c > 0,

  Pr[d/6 ≤ d̂ ≤ 6d] ≥ 1 − 1/m^c.

Hint: Use an appropriate number of repetitions and use the median value. For the analysis (i.e., proving the above claim), use Chernoff bounds. (You will only need Equation (5).)

Chernoff Bounds (Mitzenmacher and Upfal)

Let X_1, X_2, …, X_n be independent Bernoulli trials with Pr[X_i = 1] = p. Let X = X_1 + X_2 + ⋯ + X_n and μ = E[X] = np. Then, the following hold.

1. For any δ > 0,

   Pr[X ≥ (1 + δ)μ] ≤ (e^δ / (1 + δ)^(1+δ))^μ.   (1)

2. For 0 < δ ≤ 1,

   Pr[X ≥ (1 + δ)μ] ≤ e^(−μδ²/3).   (2)

3. For R ≥ 6μ,

   Pr[X ≥ R] ≤ 2^(−R).   (3)

Furthermore, for 0 < δ < 1,

   Pr[X ≤ (1 − δ)μ] ≤ (e^(−δ) / (1 − δ)^(1−δ))^μ,   (4)

   Pr[X ≤ (1 − δ)μ] ≤ e^(−μδ²/2),   (5)

and putting (2) and (5) together,

   Pr[|X − μ| ≥ δμ] ≤ 2e^(−μδ²/3).   (6)

Frequency Moments

Generalization of Distinct Elements

The pth Frequency Moment

Let f_s be the number of occurrences of s ∈ {1, 2, …, m}. The pth frequency moment is given by

  Σ_{s=1}^{m} f_s^p.
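As a concrete reference point, the pth frequency moment can be computed exactly offline (ignoring the streaming memory constraint) in a few lines; the streaming algorithms that follow approximate this quantity in small memory:

```python
from collections import Counter

def frequency_moment(stream, p):
    """Exact pth frequency moment: the sum of f_s**p over all symbols s."""
    counts = Counter(stream)            # f_s for every symbol s in the stream
    return sum(f ** p for f in counts.values())

# Example: frequency_moment([1, 1, 2, 3], 2) == 2**2 + 1**2 + 1**2 == 6.
```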

Commonly Studied Frequency Moments

• p = 0 captures the number of distinct elements (assuming 0⁰ = 0).
• p = 1 is simply the stream length n.
• p = 2 is useful in computing the variance. It is often called the surprise number. Can you guess why? Hint: Consider a random stream vs. a stream with just one item.
• p = ∞ captures the frequency of the most frequent element.

Estimating the Second Moment

Alon, Matias, and Szegedy (AMS)
The AMS Algorithm

• For each s ∈ {1, 2, …, m}, let x_s be ±1 with probability ½ each.
• For the theory to go through, x_s can be the hash of s into {−1, +1} such that the outcomes are 4-way independent.
• Algorithm (unbiased, but weak): Maintain a counter sum that is incremented by x_s when a symbol s arrives. Report a = sum².
• Thus, at the end, sum = Σ_s x_s f_s and a = (Σ_s x_s f_s)².

Claim:

  a = (Σ_s x_s f_s)²

is an unbiased estimator of the second frequency moment.
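A minimal sketch of one AMS estimator in Python. For illustration it draws and stores each sign x_s lazily in a dictionary; a genuinely small-memory implementation would instead compute x_s with a 4-wise independent hash, as the slide notes.

```python
import random

def ams_second_moment(stream):
    """One AMS estimate of F2 = sum_s f_s^2: maintain the counter
    sum = sum_s x_s * f_s and report its square."""
    signs = {}                    # lazily drawn x_s in {-1, +1}
    total = 0
    for s in stream:
        if s not in signs:
            signs[s] = random.choice((-1, 1))
        total += signs[s]         # increment the counter by x_s
    return total * total          # a = sum^2
```

In expectation the cross terms x_s x_t (s ≠ t) vanish, leaving E[a] = Σ_s f_s², which is the unbiasedness claim above.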

Claim: Var[a] = E[a²] − E²[a] ≤ 2E²[a].

[Side note: In the proof (a screen clip from Blum, Hopcroft, and Kannan, omitted here), E[a²] expands into terms indexed by s, t, u, v. If one of s, t, u, or v is different from the others, the expectation of that term vanishes. Why?]

Reducing the Error Probability

Algorithm: Repeat the algorithm r = 2/(ε²δ) times, yielding estimates a_1, a_2, …, a_r, and report the average X = (1/r) Σ_i a_i.

Claim:

  Pr[|X − E[X]| > εE[X]] ≤ Var[X]/(ε²E²[X]) ≤ δ.
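A sketch of the averaging step, with the same snapshot caveat as before (a one-pass version would maintain all r counters in parallel over the stream); the repetition count follows from the claim above, since Var[X] = Var[a]/r ≤ 2E²[X]/r:

```python
import math

def ams_average(stream_snapshot, eps, delta):
    """Average r = ceil(2/(eps^2 * delta)) independent AMS estimates, so the
    relative error exceeds eps with probability at most delta (Chebyshev)."""
    r = math.ceil(2 / (eps ** 2 * delta))
    return sum(ams_second_moment(stream_snapshot) for _ in range(r)) / r
```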
