01 Streaming PDF

This document summarizes key concepts relating to streaming models and sampling from data streams. It discusses the basic streaming model, in which a processor with limited memory receives a stream of unknown length drawn from a large universe. Common queries on the stream include the mean, the maximum, and representative samples. The document then covers reservoir sampling for selecting a uniform random sample without replacement from a stream using memory proportional only to the sample size. It also discusses counting distinct elements in a stream using only O(log m) bits of memory, where m is the size of the universe.


Data Mining (CS6720): Introduction to the Streaming Model
John Augustine
Jan 16, 2020

Streaming: Sampling from a Stream

Basic Model

• Processor (unlimited power. Why?)
• Limited memory. Goal is to minimize the amount of memory.
• Stream: a_1, a_2, …, a_n (typically, n is unknown to the processor).
• Each a_i ∈ U, the universe, which can be {1, 2, …, m} (a set of m symbols), ℝ, ℤ, ℕ, etc.
• Query: some question about the items seen so far. (Sample queries below.)
• At time step i, item a_i is provided to the processor.
• The processor uses its memory words and a_i, and updates the memory such that the answer to the query is kept up to date.
• Several other variants (sliding window, cash register model, graph streaming, etc.)

Sample Queries

• Mean, max, etc.
• Representative candidate element(s). (Formally: sample(s) with/without replacement.)
• Suppose U is a collection of m items and f_j is the frequency (number of occurrences) of the jth item.
• The pth frequency moment is Σ_j f_j^p.
• The 0th frequency moment is the number of distinct items.
• The 2nd frequency moment (aka the "surprise number") is useful in computing variance.
• The quantity (Σ_j f_j^p)^{1/p} approaches the frequency of the most frequent item as p → ∞.

Reservoir Sampling

Sampling without replacement in a stream. Vitter, 1985.

Sampling Formulation

• Formulation attempt: Pick each item with probability, say, 1/100.
• Stream of (unknown) n elements from a universe of m symbols. Large n and m.
• Query: a sample of k ≪ min(n, m) elements chosen uniformly at random without replacement.
• Goal: memory ≤ k symbols.

• Strawman solution 1: First k or last k.
• Strawman solution 2: Choose some k random indices and collect the items at those indices.

Simple Case: k = 1

• Only one memory cell.
• Item a_i is placed in the cell with probability 1/i.
• Event E_i: the item a_i is the final sample iff (i) a_i was placed in the cell and (ii) subsequent items were NOT placed in the cell.

  Pr[E_i] = (1/i)(1 − 1/(i+1))(1 − 1/(i+2)) ⋯ (1 − 1/n)
          = (1/i) · (i/(i+1)) · ((i+1)/(i+2)) ⋯ ((n−1)/n)
          = 1/n.

Case: Arbitrary k

• Now we are allowed k cells. Place the first k items in the cells.
• When item a_i, i > k, arrives, pick a random number r from 1 to i. If r ≤ k, place a_i in cell r. Else, ignore a_i.
• Event E_i: a_i is a sample iff (i) a_i is placed in one of the k cells and (ii) no subsequent item is placed in that cell.
• Thus,

  Pr[E_i] = (k/i)(1 − 1/(i+1))(1 − 1/(i+2)) ⋯ (1 − 1/n) = k/n.
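A minimal Python sketch of the k-cell reservoir procedure described above (the function name and stream interface are mine, not from the slides):

```python
import random

def reservoir_sample(stream, k):
    """Uniform random sample of k items, without replacement, from a
    stream of unknown length, using only k memory cells."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)       # the first k items fill the cells
        else:
            r = random.randint(1, i)     # uniform over {1, ..., i}
            if r <= k:
                reservoir[r - 1] = item  # place a_i in cell r
            # else: ignore a_i
    return reservoir
```

Each item ends up in the final sample with probability k/n, matching the analysis above.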

Counting Distinct Elements in a Stream

Problem Formulation

• Stream of n numbers from {1, 2, …, m}.
• Query: how many different numbers?
• Goal: memory O(log m) bits.

• Deterministic solutions?
• n = m + 1 stream elements require m bits.
• Proof:
  • Suppose that after m stream items the memory holds only m − 1 bits.
  • 2^m − 1 possible (nonempty) subsets could be seen, but there are only 2^(m−1) states for the memory.
  • Thus, some two subsets S_1 and S_2 are represented by the same state.
  • If those subsets have different sizes, error.
  • Otherwise, if they have the same size, then what if the (m+1)th item is from S_1 \ S_2?

[Side note: Material not covered in class but required to know (and therefore asked in tests/quizzes/exams) will be mentioned in a cloud like this. You are free to search other sources (Internet, textbook).]
Problem Formulation (Updated)

• Stream of n numbers from {1, 2, …, m}.
• Query: Let d be the number of distinct items. Output an estimate d̂ such that Pr[d/6 ≤ d̂ ≤ 6d] > 1/2.
• Goal: memory O(log m).

Why 6? Answer: it is convenient to prove.

Intuition

[Figure omitted. This and other subsequent figures & screen clips are from Blum, Hopcroft, and Kannan unless specified otherwise.]

Algorithm

• Assume the availability of a two-universal hash function h: {1, 2, …, m} → {1, 2, …, M}, where M > m, i.e.,
  • Pr[h(a) = x] = 1/M, and
  • Pr[h(a) = x ∧ h(b) = y] = 1/M² for a ≠ b.
• Algorithm: hash each value in the stream and only remember the smallest hash value s. On being queried, report d̂ = M/s.

[Side note: How to get 2-universal hash functions? Why is it 2-universal?]

Claim:

Let b_1, b_2, …, b_d be the d distinct items encountered in the stream. Then,

  Pr[d/6 ≤ d̂ ≤ 6d] ≥ 2/3 − d/M > 1/2   (when M ≫ d).
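A sketch of this estimator in Python. The slides do not fix a hash family, so the code below uses the standard ((a·x + b) mod p) mod M construction as a stand-in 2-universal hash; the function name and the choice of prime are mine.

```python
import random

def distinct_count_estimate(stream, M):
    """Estimate the number of distinct items: hash every item with a
    2-universal hash into {1, ..., M}, remember only the smallest hash
    value s, and report M/s."""
    p = 2**61 - 1                        # a prime assumed larger than m and M
    a = random.randint(1, p - 1)
    b = random.randint(0, p - 1)
    h = lambda x: ((a * x + b) % p) % M + 1
    s = M                                # smallest hash value seen so far
    for item in stream:
        s = min(s, h(item))
    return M / s
```

Only s and the two hash coefficients are stored, i.e., O(log M) = O(log m) bits.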

Proof.

First, we focus on the event s ≤ M/(6d), i.e., the event that the output d̂ = M/s is at least 6d. For each i, Pr[h(b_i) ≤ M/(6d)] ≤ 1/(6d), so by the union bound over the d distinct items,

  Pr[s ≤ M/(6d)] ≤ d · 1/(6d) = 1/6.

Brief Detour into Elementary Tail Bounds

Markov's Inequality: If X is a random variable that only takes nonnegative values, then, for all a > 0,

  Pr[X ≥ a] ≤ E[X]/a.

Chebyshev's Inequality: For any random variable X (not necessarily nonnegative) and for all a > 0,

  Pr[|X − E[X]| ≥ a] ≤ Var[X]/a².

Also, for any t > 1,

  Pr[|X − E[X]| ≥ t·E[X]] ≤ Var[X]/(t² · (E[X])²).

[Side note: Prove these tail inequalities. Reference: Mitzenmacher and Upfal.]
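For the cloud exercise, one standard derivation (along the lines of Mitzenmacher and Upfal) of Markov's inequality for a discrete nonnegative X, with Chebyshev as a corollary:

```latex
% Markov: for nonnegative (discrete) X and a > 0,
\mathbb{E}[X] = \sum_{x} x \Pr[X = x]
            \ge \sum_{x \ge a} x \Pr[X = x]
            \ge a \sum_{x \ge a} \Pr[X = x]
            = a \Pr[X \ge a],
% hence Pr[X >= a] <= E[X]/a.
% Chebyshev: apply Markov to the nonnegative variable (X - E[X])^2
% with threshold a^2:
\Pr\big[\,|X - \mathbb{E}[X]| \ge a\,\big]
  = \Pr\big[(X - \mathbb{E}[X])^2 \ge a^2\big]
  \le \frac{\operatorname{Var}[X]}{a^2}.
```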

Back to our Proof.

Next, focus on the event s ≥ 6M/d and bound its probability to within 1/6. Recall:

  Pr[s ≥ 6M/d] = Pr[∀i, h(b_i) ≥ 6M/d].

Let

  Y_i = 0, if h(b_i) ≥ 6M/d,
  Y_i = 1, otherwise,

and let Y = Σ_i Y_i. Then:

  E[Y_i] = 1 × Pr[Y_i = 1] = Pr[h(b_i) < 6M/d] ≈ 6/d.

  E[Y] = Σ_i E[Y_i] ≈ d · (6/d) = 6.

  Var[Y_i] = E[Y_i²] − E[Y_i]² = E[Y_i] − E[Y_i]² ≤ E[Y_i].

  Var[Y] = Σ_i Var[Y_i] ≤ E[Y].

[Side note: The last equality holds due to the 2-way independence of the Y_i's. Find out why?]

  Pr[s ≥ 6M/d] = Pr[∀i, h(b_i) ≥ 6M/d]
             = Pr[Y = 0]
             ≤ Pr[|Y − E[Y]| ≥ E[Y]]
             ≤ Var[Y]/(E[Y])²
             ≤ 1/E[Y]
             ≤ 1/6.

[Side note: The final result follows by the union bound, which says: Pr[E_1 ∪ E_2 ∪ ⋯] ≤ Σ_i Pr[E_i].]

Homework

How to boost the probability and achieve the following claim?

For any c > 0,

  Pr[d/6 ≤ d̂ ≤ 6d] ≥ 1 − 1/m^c.

Hint: Use an appropriate number of repetitions and use the median value. For the analysis (i.e., proving the above claim), use Chernoff bounds. (You will only need Equation (5).)

Chernoff Bounds (Mitzenmacher and Upfal)

Let X_1, X_2, …, X_n be independent Bernoulli trials with Pr[X_i = 1] = p. Let X = X_1 + X_2 + ⋯ + X_n and μ = E[X] = np. Then, the following hold.

1. For any δ > 0,

   Pr[X ≥ (1 + δ)μ] ≤ (e^δ / (1 + δ)^(1+δ))^μ.   (1)

2. For 0 < δ ≤ 1,

   Pr[X ≥ (1 + δ)μ] ≤ e^(−μδ²/3).   (2)

3. For R ≥ 6μ,

   Pr[X ≥ R] ≤ 2^(−R).   (3)

Furthermore, for 0 < δ < 1,

   Pr[X ≤ (1 − δ)μ] ≤ (e^(−δ) / (1 − δ)^(1−δ))^μ,   (4)

   Pr[X ≤ (1 − δ)μ] ≤ e^(−μδ²/2),   (5)

and putting (2) and (5) together,

   Pr[|X − μ| ≥ δμ] ≤ 2e^(−μδ²/3).   (6)

Frequency Moments

Generalization of Distinct Elements

The pth Frequency Moment

Let f_s be the number of occurrences of s ∈ {1, 2, …, m}. The pth frequency moment is given by

  Σ_{s=1}^{m} f_s^p.
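As a concrete reference point, the pth frequency moment can be computed exactly offline (ignoring the streaming memory constraint) in a few lines; the streaming algorithms that follow approximate this quantity in small memory:

```python
from collections import Counter

def frequency_moment(stream, p):
    """Exact pth frequency moment: the sum of f_s**p over all symbols s."""
    counts = Counter(stream)            # f_s for every symbol s in the stream
    return sum(f ** p for f in counts.values())

# Example: frequency_moment([1, 1, 2, 3], 2) == 2**2 + 1**2 + 1**2 == 6.
```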

Commonly Studied Frequency Moments

• p = 0 captures the number of distinct elements (assuming 0⁰ = 0).
• p = 1 is simply the stream length n.
• p = 2 is useful in computing the variance. It is often called the surprise number. Can you guess why? Hint: Consider a random stream vs. a stream with just one item.
• p = ∞ captures the frequency of the most frequent element.

Estimating the Second Moment

Alon, Matias, and Szegedy (AMS)
The AMS Algorithm

• For each s ∈ {1, 2, …, m}, let x_s be ±1 with probability ½ each.
• For the theory to go through, x_s can be the hash of s into {−1, +1} such that the outcomes are 4-way independent.
• Algorithm (unbiased, but weak): Maintain a counter sum that is incremented by x_s when a symbol s arrives. Report a = sum².
• Thus, at the end, sum = Σ_s x_s f_s and a = (Σ_s x_s f_s)².

Claim:

  a = (Σ_s x_s f_s)²

is an unbiased estimator of the second frequency moment.
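A minimal sketch of one AMS estimator in Python. For illustration it draws and stores each sign x_s lazily in a dictionary; a genuinely small-memory implementation would instead compute x_s with a 4-wise independent hash, as the slide notes.

```python
import random

def ams_second_moment(stream):
    """One AMS estimate of F2 = sum_s f_s^2: maintain the counter
    sum = sum_s x_s * f_s and report its square."""
    signs = {}                    # lazily drawn x_s in {-1, +1}
    total = 0
    for s in stream:
        if s not in signs:
            signs[s] = random.choice((-1, 1))
        total += signs[s]         # increment the counter by x_s
    return total * total          # a = sum^2
```

In expectation the cross terms x_s x_t (s ≠ t) vanish, leaving E[a] = Σ_s f_s², which is the unbiasedness claim above.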

Claim: Var[a] = E[a²] − E²[a] ≤ 2E²[a].

[Side note: In the proof (a screen clip from Blum, Hopcroft, and Kannan, omitted here), E[a²] expands into terms indexed by s, t, u, v. If one of s, t, u, or v is different from the others, the expectation of that term vanishes. Why?]

Reducing the Error Probability

Algorithm: Repeat the algorithm r = 2/(ε²δ) times, yielding estimates a_1, a_2, …, a_r, and report the average X = (1/r) Σ_i a_i.

Claim:

  Pr[|X − E[X]| > εE[X]] ≤ Var[X]/(ε²E²[X]) ≤ δ.
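A sketch of the averaging step, with the same snapshot caveat as before (a one-pass version would maintain all r counters in parallel over the stream); the repetition count follows from the claim above, since Var[X] = Var[a]/r ≤ 2E²[X]/r:

```python
import math

def ams_average(stream_snapshot, eps, delta):
    """Average r = ceil(2/(eps^2 * delta)) independent AMS estimates, so the
    relative error exceeds eps with probability at most delta (Chebyshev)."""
    r = math.ceil(2 / (eps ** 2 * delta))
    return sum(ams_second_moment(stream_snapshot) for _ in range(r)) / r
```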
