Algorithms for Massive Data Problems
[Figure: the streaming model. The data (e.g., a matrix A) streams past an algorithm with limited RAM, which sees entries such as A_ij only as they arrive.]
This chapter deals with massive data problems where the input data (a graph, a ma-
trix or some other object) is too large to be stored in random access memory. One model
for such problems is the streaming model, where the data can be seen only once. In
the streaming model, the natural technique to deal with the massive data is sampling.
Sampling is done “on the fly”. As each piece of data is seen, based on a coin toss, one
decides whether to include the data in the sample. Typically, the probability of including
the data point in the sample may depend on its value. Models allowing multiple passes
through the data are also useful; but the number of passes needs to be small. We always
assume that random access memory (RAM) is limited, so the entire data cannot be stored
in RAM.
To introduce the basic flavor of sampling on the fly, consider the following simple
primitive. We have a stream of n elements, where, n is not known at the start. We wish
to sample uniformly at random one of the n elements. If we have to pick a sample and
then never change it, we are out of luck. [You can convince yourself of this.] So we have to
allow ourselves to “change our mind”: After perhaps drawing one element as a sample, we
must allow ourselves to reject it and instead draw the current element as the new sample.
With this hint, the reader may want to think of the algorithm. Instead of describing the
algorithm for this problem, we will solve a more general problem:
From a stream of n positive real numbers a1 , a2 , . . . , an , draw a sample element ai so
that the probability of picking an element is proportional to its value.
It is easy to see that the following sampling method works. Upon seeing $a_1, a_2, \ldots, a_i$, keep track of the sum $a = a_1 + a_2 + \cdots + a_i$ and a sample $a_j$, $j \le i$, drawn with probability proportional to its value. On seeing $a_{i+1}$, replace the current sample by $a_{i+1}$ with probability $\frac{a_{i+1}}{a + a_{i+1}}$ and update $a$.
We can prove by induction that this algorithm does in fact sample element aj with
probability aj /(a1 + a2 + · · · + an ) : Suppose after we have read a1 , a2 , . . . , ai , we have
picked a sample with probability of aj being the sample equal to aj /(a1 + a2 + · · · + ai ).
The probability that we will keep this aj after reading ai+1 is
$$1 - \frac{a_{i+1}}{a_1 + a_2 + \cdots + a_{i+1}} = \frac{a_1 + a_2 + \cdots + a_i}{a_1 + a_2 + \cdots + a_{i+1}}.$$
Combining the two, we see that the probability that aj is the sample after reading ai+1
is precisely $a_j/(a_1 + a_2 + \cdots + a_{i+1})$, completing the inductive proof.
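To make the sampling rule concrete, here is a short sketch in Python; the stream is modeled as any iterable of positive numbers, and the function name and use of the random module are our own choices rather than anything prescribed by the text.

```python
import random

def weighted_stream_sample(stream):
    """Return one element of the stream, chosen with probability
    proportional to its value, using O(1) memory."""
    total = 0.0      # a = a_1 + ... + a_i seen so far
    sample = None
    for x in stream:
        total += x
        # Replace the current sample by x with probability x / total,
        # which equals a_{i+1} / (a + a_{i+1}) in the text's notation.
        if random.random() < x / total:
            sample = x
    return sample

# Example: over many runs, 3.0 is returned about 3/6 of the time.
print(weighted_stream_sample([1.0, 2.0, 3.0]))
```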
For a stream of symbols from $\{1, 2, \ldots, m\}$, let $f_s$ denote the number of occurrences of symbol $s$; the $p^{\text{th}}$ frequency moment of the stream is $\sum_s f_s^p$. Note that the $p = 0$ frequency moment corresponds to the number of distinct symbols occurring in the stream. The first frequency moment is just $n$, the length of the string. The second frequency moment, $\sum_s f_s^2$, is useful in computing the variance of the stream:
$$\frac{1}{m}\sum_{s=1}^{m}\left(f_s - \frac{n}{m}\right)^2 = \frac{1}{m}\sum_{s=1}^{m} f_s^2 - \frac{2n}{m}\cdot\frac{1}{m}\sum_{s=1}^{m} f_s + \frac{n^2}{m^2} = \frac{1}{m}\sum_{s=1}^{m} f_s^2 - \frac{n^2}{m^2}.$$
In the limit as $p$ becomes large, $\left(\sum_{s=1}^{m} f_s^p\right)^{1/p}$ is the frequency of the most frequent element(s).
We will describe sampling based algorithms to compute these quantities for streaming
data shortly. But first a note on the motivation for these various problems. The identity
and frequency of the most frequent item, or more generally of items whose frequency exceeds a given fraction of n, are clearly important in many applications. If the items are packets on a network with source and destination addresses, the high-frequency items identify the
heavy bandwidth users. If the data is purchase records in a supermarket, the high fre-
quency items are the best-selling items. Determining the number of distinct symbols is
the abstract version of determining such things as the number of accounts, web users, or
credit card holders. The second moment and variance are useful in networking as well as
in database and other applications. Large amounts of network log data are generated by
routers that can record for all the messages passing through them, the source address,
destination address, and the number of packets. This massive data cannot be easily sorted
or aggregated into totals for each source/destination. But it is important to know whether some popular source-destination pairs carry an unusually large amount of traffic, for which the variance is the natural measure.
Suppose we have seen the first k symbols of the stream and k > m. The set of distinct
symbols seen so far could be any of the 2m subsets of {1, 2, . . . , m}. Each subset must
result in a different state for our algorithm and hence m bits of memory are required. To
see this, suppose first that two different size subsets of distinct symbols lead to the same
internal state. Then our algorithm would produce the same count of distinct symbols for
both inputs, clearly an error for one of the input sequences. If two sequences with the
same number of distinct elements but different subsets lead to the same state, then on
next seeing a symbol that appeared in one sequence but not the other, we would make an error on at least one of them.
Figure 1.2: Estimating the size of S from its minimum element, which has value approximately m/(|S|+1). The elements of S partition the set {1, 2, . . . , m} into |S| + 1 subsets, each of size approximately m/(|S|+1).
If the elements of S are picked uniformly at random from {1, 2, . . . , m}, they partition {1, 2, . . . , m} into |S| + 1 pieces of roughly equal size, as illustrated in Figure 1.2. Thus, the minimum element of S should have value close to m/(|S|+1). Solving min = m/(|S|+1) yields |S| = m/min − 1. Since we can determine min, this gives us an estimate of |S|.
The above analysis required that the elements of S were picked uniformly at random
from {1, 2, . . . , m}. This is generally not the case when we have a sequence a1 , a2 , . . . , an
of elements from {1, 2, . . . , m}. Clearly if the elements of S were obtained by selecting the
|S| smallest elements of {1, 2, . . . , m}, the above technique would give the wrong answer.
If the elements are not picked uniformly at random, can we estimate the number of distinct
elements? The way to solve this problem is to use a hash function h where
h : {1, 2, . . . , m} → {0, 1, 2, . . . , M − 1}
To count the number of distinct elements in the input, count the number of elements
in the mapped set {h (a1 ) , h (a2 ) , . . .}. The point being that {h(a1 ), h (a2 ) , . . .} behaves
like a random subset and so the above heuristic argument using the minimum to estimate
the number of elements may apply. If we needed h(1), h(2), . . . to be completely independent, the space needed to store the hash function would be too high. Fortunately, only
2-way independence is needed. We recall the formal definition of 2-way independence
below. But first recall that a hash function is always chosen at random from a family of
hash functions and phrases like “probability of collision” refer to the probability over the
choice of hash function.
A family of hash functions mapping {1, 2, . . . , m} to {0, 1, 2, . . . , M − 1} is 2-universal if for all x and y in {1, 2, . . . , m} with x ≠ y, and for all z and w in {0, 1, 2, . . . , M − 1},
$$\text{Prob}\left[h(x) = z \text{ and } h(y) = w\right] = \frac{1}{M^2}$$
for a randomly chosen h. The concept of a 2-universal family of hash functions is that
given x, h (x) is equally likely to be any element of {0, 1, 2, . . . , M − 1} (the reader should
prove this from the definition of 2-universal) and for x ≠ y, h(x) and h(y) are indepen-
dent.
We now give an example of a 2-universal family of hash functions. For simplicity let
M be a prime, with M > m. For each pair of integers a and b in the range [0,M -1], define
a hash function
$$h_{ab}(x) = (ax + b) \bmod M.$$
To store the hash function $h_{ab}$, store the two integers $a$ and $b$. This requires only $O(\log M)$ space. To see that the family is 2-universal, note that $h(x) = z$ and $h(y) = w$ if and only if
$$\begin{pmatrix} x & 1 \\ y & 1 \end{pmatrix}\begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} z \\ w \end{pmatrix} \pmod{M}.$$
If $x \neq y$, the matrix $\begin{pmatrix} x & 1 \\ y & 1 \end{pmatrix}$ is invertible modulo $M$ and there is only one solution for $a$ and $b$. Thus, for $a$ and $b$ chosen uniformly at random, the probability of the equation holding is exactly $1/M^2$. [We assumed $M > m$. What goes wrong if this does not hold?]
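As an illustration, the following sketch draws a random h_ab from the family above and uses the minimum hashed value to estimate the number of distinct elements, following the heuristic min = M/(|S|+1). The fixed prime M, the helper names, and the +1 guard against a zero minimum are our own choices; a single hash gives a high-variance estimate, which is what the analysis below addresses.

```python
import random

M = 2_147_483_647          # a prime, assumed larger than the universe size m

def random_hash():
    """Draw h_ab(x) = (a*x + b) mod M from the 2-universal family."""
    a = random.randrange(M)
    b = random.randrange(M)
    return lambda x: (a * x + b) % M

def estimate_distinct(stream):
    """Estimate the number of distinct elements from the minimum hashed value."""
    h = random_hash()
    m_min = min(h(x) for x in stream)
    # The minimum of |S| roughly-random values in {0,...,M-1} is about
    # M/(|S|+1); solve min = M/(|S|+1) for |S| (the +1 avoids division by zero).
    return M / (m_min + 1) - 1

stream = [random.randrange(10_000) for _ in range(100_000)]
print(len(set(stream)), round(estimate_distinct(stream)))
```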
Let $b_1, b_2, \ldots, b_d$ be the distinct values that appear in the stream, let $z_i$ be the indicator variable of the event that $h(b_i) < \frac{M}{6d}$, and let $z = \sum_{i=1}^{d} z_i$. If $h(b_i)$ is chosen randomly from $\{0, 1, 2, \ldots, M-1\}$, then $\text{Prob}[z_i = 1] = \frac{1}{6d}$. Thus, $E(z_i) = \frac{1}{6d}$ and $E(z) = \frac{1}{6}$. Now
$$\text{Prob}\left(\frac{M}{\min} > 6d\right) = \text{Prob}\left(\min < \frac{M}{6d}\right) = \text{Prob}\left(\exists k\;\; h(b_k) < \frac{M}{6d}\right) = \text{Prob}(z \ge 1) = \text{Prob}\left[z \ge 6E(z)\right].$$
By Markov's inequality, this probability is at most 1/6. For the other direction, let $y_i$ be the indicator variable of the event that $h(b_i) < \frac{6M}{d}$ and let $y = \sum_{i=1}^{d} y_i$. Now $\text{Prob}(y_i = 1) = \frac{6}{d}$, $E(y_i) = \frac{6}{d}$, and $E(y) = 6$. For 2-way independent random variables, the variance of their sum is the sum of their variances. So $\text{Var}(y) = d\,\text{Var}(y_1)$. Further, since $y_1$ takes only the values 0 and 1, it is easy to see that $\text{Var}(y_1) \le E(y_1^2) = E(y_1) = \frac{6}{d}$, and hence $\text{Var}(y) \le 6$.
Consider a string of 0’s and 1’s of length n in which we wish to count the number
of occurrences of 1’s. Clearly if we had log n bits of memory we could keep track of the
exact number of 1’s. However, we can approximate the number with only log log n bits.
Let m be the number of 1’s that occur in the sequence. Keep a value k such that 2k
is approximately the number of occurrences m. Storing k requires only log log n bits of
memory. The algorithm works as follows. Start with k = 0. For each occurrence of a 1, add one to k with probability 1/2^k. At the end of the string, the quantity 2^k − 1 is the estimate of m. To obtain a coin that comes down heads with probability 1/2^k, flip a fair coin, one that comes down heads with probability 1/2, k times and report heads if the fair coin comes down heads in all k flips.

Given k, on average it will take 2^k ones before k is incremented. Thus, the expected number of 1's to produce the current value of k is 1 + 2 + 4 + · · · + 2^(k−1) = 2^k − 1.
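A minimal sketch of this probabilistic counter; the function name is ours, and the coin of bias 1/2^k is simulated with a single call to random.random() rather than with k fair flips.

```python
import random

def approximate_count(bits):
    """Morris-style counter: keep k with 2^k roughly the number of 1's seen."""
    k = 0
    for b in bits:
        if b == 1 and random.random() < 1.0 / (2 ** k):
            k += 1
    return 2 ** k - 1        # estimate of the number of 1's

stream = [1] * 100_000
print(approximate_count(stream))   # typically the right order of magnitude
```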
First consider the very simple problem of n people voting. There are m candidates,
{1, 2, . . . , m}. We want to determine if one candidate gets a majority vote and if so
who. Formally, we are given a stream of integers a1 , a2 , . . . , an , each ai belonging to
{1, 2, . . . , m}, and want to determine whether there is some s ∈ {1, 2, . . . , m} which oc-
curs more than n/2 times and if so which s. It is easy to see that to solve the problem
exactly on read only once streaming data with a deterministic algorithm, requires Ω(n)
space. Suppose n is even and the first n/2 items are all distinct and the last n/2 items are
identical. After reading the first n/2 items, we need to remember exactly which elements
of {1, 2, . . . , m} have occurred. If for two different sets of elements occurring in the first
half of the stream, the contents of the memory are the same, then a mistake would occur
if the second half of the stream consists solely of an element in one set, but not the other.
Thus, $\log_2 \binom{m}{n/2}$ bits of memory, which if m > n is Ω(n), are needed.
Now let's allow the algorithm a random number generator. The reader may want to
think about simple sampling schemes first - like picking an element as a sample and then
checking how many times it occurs and modifications of this. The following is a simple
low-space algorithm that always finds the majority vote if there is one. If there is no
majority vote, the output may be arbitrary. That is, there may be “false positives”, but
no “false negatives”.
Majority Algorithm
Store a1 and initialize a counter to one. For each subsequent ai , if ai is the same as
the currently stored item, increment the counter by one. If it differs, decrement the
counter by one provided the counter is nonzero. If the counter is zero, then store ai
and set the counter to one.
To analyze the algorithm, it is convenient to view the decrement counter step as “elim-
inating” two items, the new one and the one that caused the last increment in the counter.
It is easy to see that if there is a majority element s, it must be stored at the end. If not,
each occurrence of s was eliminated; but each such elimination also causes another item
to be eliminated and so for a majority item not to be stored at the end, we must have
eliminated more than n items, a contradiction.
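A direct transcription of the Majority Algorithm into code; recall that when no majority exists the returned item is arbitrary (a possible false positive).

```python
def majority_candidate(stream):
    """Boyer-Moore style majority vote: one stored item and one counter."""
    item, count = None, 0
    for a in stream:
        if count == 0:
            item, count = a, 1       # store the new item
        elif a == item:
            count += 1               # same as stored item: increment
        else:
            count -= 1               # different item: decrement
    return item   # guaranteed to be the majority element if one exists

print(majority_candidate([3, 1, 3, 2, 3, 3, 2, 3]))  # 3 occurs 5 of 8 times
```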
Next we modify the above algorithm so that not just the majority, but also items
with frequency above some threshold are detected. We will also ensure (approximately)
that there are no false positives as well as no false negatives. Indeed the algorithm below
will find the frequency (number of occurrences) of each element of {1, 2, . . . , m} to within an additive term of n/(k+1), using O(k log n) space by keeping k counters instead of just one counter.
Algorithm Frequent
Maintain a list of items being counted. Initially the list is empty. For each item, if
it is the same as some item on the list, increment its counter by one. If it differs
from all the items on the list, then if there are less than k items on the list, add the
item to the list with its counter set to one. If there are already k items on the list, decrement each of the current counters by one, deleting an element from the list if its count becomes zero.
Theorem 1.2 At the end of Algorithm Frequent, for each s ∈ {1, 2, . . . , m}, its counter
on the list is at least the number of occurrences of s in the stream minus n/(k+1). In
particular, if some s does not occur on the list, its counter is zero and the theorem asserts
that it occurs fewer than n/(k+1) times in the stream.
Proof: View each decrement counter step as eliminating some items. An item is elimi-
nated if it is the current ai being read and there are already k symbols different from it
on the list in which case it and k other items are simultaneously eliminated. Thus, the
elimination of each occurrence of an s ∈ {1, 2, . . . , m} is really the elimination of k + 1
items. Thus, no more than n/(k+1) occurrences of any symbol can be eliminated. Now,
it is clear that if an item is not eliminated, then it must still be on the list at the end.
This proves the theorem.
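A sketch of Algorithm Frequent with k counters; using a dictionary for the list of items and counters is our own data-structure choice.

```python
def frequent(stream, k):
    """Misra-Gries style counters: every item occurring more than
    n/(k+1) times is guaranteed to survive on the list."""
    counters = {}
    for a in stream:
        if a in counters:
            counters[a] += 1
        elif len(counters) < k:
            counters[a] = 1
        else:
            # Decrement every counter; drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

print(frequent([1, 1, 2, 1, 3, 1, 4, 1, 5], k=2))   # 1 survives with a large count
```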
The second moment of the stream is given by $\sum_{s=1}^{m} f_s^2$. To calculate the second moment, for each symbol s, 1 ≤ s ≤ m, independently set a random variable $x_s$ to ±1 with probability 1/2. Maintain a sum by adding $x_s$ to the sum each time the symbol s occurs in the stream.
At the end of the stream, the sum will equal $\sum_{s=1}^{m} x_s f_s$. The expected value of the sum will be zero, where the expectation is over the choice of the ±1 values for the $x_s$:
$$E\left(\sum_{s=1}^{m} x_s f_s\right) = 0.$$
Although the expected value of the sum is zero, its actual value is a random variable and the expected value of the square of the sum is given by
$$E\left(\sum_{s=1}^{m} x_s f_s\right)^2 = E\left(\sum_{s=1}^{m} x_s^2 f_s^2\right) + 2E\left(\sum_{s \ne t} x_s x_t f_s f_t\right) = \sum_{s=1}^{m} f_s^2,$$
so $a = \left(\sum_{s=1}^{m} x_s f_s\right)^2$ is an estimator of $\sum_{s=1}^{m} f_s^2$. One difficulty, which we will come back to, is that to store the $x_s$ requires space m, and we want to do the calculation in log m space.
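A sketch of the estimator as described so far. For clarity it stores one truly random ±1 sign per symbol, i.e., it uses space m — exactly the difficulty just noted, which the pseudo-random vectors of the next section remove — and it averages r independent trials as discussed below.

```python
import random
from collections import Counter

def second_moment_estimate(stream, symbols, r=100):
    """Average of r independent copies of (sum_s x_s f_s)^2."""
    estimates = []
    for _ in range(r):
        sign = {s: random.choice((-1, 1)) for s in symbols}  # x_s = +/-1
        total = 0
        for a in stream:          # stream is a list here, so it can be re-read
            total += sign[a]      # add x_s each time symbol s occurs
        estimates.append(total * total)
    return sum(estimates) / r

stream = [random.randrange(20) for _ in range(10_000)]
exact = sum(f * f for f in Counter(stream).values())
print(exact, round(second_moment_estimate(stream, range(20))))
```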
A second issue is the variance of this estimator, which we now compute.
$$\mathrm{Var}(a) \le E\left(\sum_{s=1}^{m} x_s f_s\right)^4 = E\left(\sum_{1 \le s,t,u,v \le m} x_s x_t x_u x_v f_s f_t f_u f_v\right).$$
The first inequality holds because the variance is at most the second moment, and the equality is by expansion. In the second sum, since the $x_s$ are independent, if any one of s, t, u, or v is distinct from the others, then the expectation of the whole term is zero. Thus, we need to deal only with terms of the form $x_s^2 x_t^2$ for $t \ne s$ and terms of the form $x_s^4$. Note that this does not need the full power of mutual independence of all the $x_s$; it only needs 4-way independence, that any four of the $x_s$'s are mutually independent. In the above sum, there are four indices s, t, u, v and there are $\binom{4}{2}$ ways of choosing two of them that have the same x value. Thus,
$$\mathrm{Var}(a) \le \binom{4}{2}\, E\left(\sum_{s=1}^{m}\sum_{t=s+1}^{m} x_s^2 x_t^2 f_s^2 f_t^2\right) + E\left(\sum_{s=1}^{m} x_s^4 f_s^4\right) = 6\sum_{s=1}^{m}\sum_{t=s+1}^{m} f_s^2 f_t^2 + \sum_{s=1}^{m} f_s^4 \le 3\left(\sum_{s=1}^{m} f_s^2\right)^2 + \left(\sum_{s=1}^{m} f_s^2\right)^2 = 4E^2(a).$$
The variance can be reduced by a factor of r by taking the average of r independent trials.
With r independent trials the variance would be at most $\frac{4}{r}E^2(a)$, so to achieve relative error ε in the estimate of $\sum_{s=1}^{m} f_s^2$, we need $O(1/\varepsilon^2)$ independent trials.
We will briefly discuss the independent trials here, so as to understand exactly the amount of independence needed. Instead of computing a using the running sum $\sum_{s=1}^{m} x_s f_s$ for one random vector x, we independently generate r m-vectors x1, x2, . . . , xr at the outset and compute r running sums
$$\sum_{s=1}^{m} x^1_s f_s,\qquad \sum_{s=1}^{m} x^2_s f_s,\qquad \ldots,\qquad \sum_{s=1}^{m} x^r_s f_s.$$
Let $a_1 = \left(\sum_{s=1}^{m} x^1_s f_s\right)^2$, $a_2 = \left(\sum_{s=1}^{m} x^2_s f_s\right)^2$, . . . , $a_r = \left(\sum_{s=1}^{m} x^r_s f_s\right)^2$. Our estimate is $\frac{1}{r}(a_1 + a_2 + \cdots + a_r)$. The variance of this estimator is
$$\mathrm{Var}\left(\frac{1}{r}(a_1 + a_2 + \cdots + a_r)\right) = \frac{1}{r^2}\left[\mathrm{Var}(a_1) + \mathrm{Var}(a_2) + \cdots + \mathrm{Var}(a_r)\right] = \frac{1}{r}\mathrm{Var}(a_1),$$
where we have assumed that the a1 , a2 , . . . , ar are mutually independent. Now we com-
pute the variance of a1 as we have done for the variance of a. Note that this calculation
assumes only 4-way independence between the coordinates of x1 . We summarize the as-
sumptions here for future reference:
To get an estimate of $\sum_{s=1}^{m} f_s^2$ within relative error ε with probability close to one, say at least 0.9999, it suffices to have $r = O(1/\varepsilon^2)$ vectors x1, x2, . . . , xr, each with m coordinates of ±1, with
1. E (xs ) = 0 for all s.
2. x1 , x2 , . . . , xr are mutually independent. That is for any r vectors v1 , v2 , . . . , vr with
±1 coordinates, Prob(x1 = v1 , x2 = v2 , . . . , xr = vr ) = Prob(x1 = v1 )Prob(x2 =
v2 ) · · · Prob(xr = vr ).
3. Any four coordinates of x1 are independent. Same for x2 , x3 , . . . , and xr .
[Caution: (2) does not assume that Prob(x1 = v1) = $\frac{1}{2^m}$ for all v1; such an assumption would mean the coordinates of x1 are mutually independent.] The only drawback with the
algorithm as we have described it so far is that we need to keep the r vectors x1 , x2 , . . . , xr
in memory so that we can do the running sums. This is too space-expensive. We need to
do the problem in space dependent upon the logarithm of the size of the alphabet m, not
m itself. If ε is in Ω(1), then r is in O(1), so it is not the number of trials r which is the
problem. It is the m.
In the next section, we will see that the computation can be done in O(log m) space
by using pseudo-random vectors x1 , x2 , . . . , xr instead of truly random ones. The pseudo-
random vectors will satisfy (1), (2), and (3) and so they will suffice. This pseudo-
randomness and limited independence has deep connections, so we will go into the con-
nections as well.
Consider the problem of generating a random m-vector x of ±1’s so that any subset of
four coordinates is mutually independent, i.e., for any distinct s, t, u, and v in {1, 2, . . . , m}
and any a, b, c, and d in {-1, +1},
$$\text{Prob}(x_s = a,\; x_t = b,\; x_u = c,\; x_v = d) = \frac{1}{16}.$$
We will see that such an m-dimensional vector may be generated from a truly random “seed” of only O(log m) mutually independent bits. Thus, we need only store the O(log m) bits and can generate any of the m coordinates when needed. This allows us to store the 4-way independent random m-vector using only O(log m) bits. The first fact needed
for this is that for any k, there is a finite field F with exactly 2^k elements, each of which can be represented with k bits, and arithmetic operations in the field can be carried out in O(k²) time. [In fact, F can be taken to be the set of polynomials of degree at most k − 1 with mod 2 coefficients, where multiplication is done modulo an irreducible polynomial of degree k, again with mod 2 coefficients.] Here, k will be the ceiling of log₂ m. We also assume another
basic fact about polynomial interpolation which says that a polynomial of degree at most
three is uniquely determined by its value over any field F at four points. More precisely,
for any four distinct points $a_1, a_2, a_3, a_4 \in F$ and any four possibly not distinct values $b_1, b_2, b_3, b_4 \in F$, there is a unique polynomial $f(x) = f_0 + f_1 x + f_2 x^2 + f_3 x^3$ of degree at most three such that, with computations done over F, $f(a_1) = b_1$, $f(a_2) = b_2$, $f(a_3) = b_3$, and $f(a_4) = b_4$.
Pick $f_0, f_1, f_2, f_3$ uniformly at random from F, let $f(x) = f_0 + f_1 x + f_2 x^2 + f_3 x^3$, and let $x_s$ be the leading bit of the k-bit representation of $f(s)$. Thus, the m-dimensional vector x can be computed using only the 4k bits in $f_0, f_1, f_2, f_3$. (Here $k = \lceil \log_2 m \rceil$.) The following proof shows that the resulting x is 4-way independent.
Proof: Assume that the elements of F are represented in binary using ± instead of the traditional 0 and 1. Let s, t, u, and v be any four coordinates of x and let α, β, γ, δ ∈ {−1, 1}. There are exactly $2^{k-1}$ elements of F whose leading bit is α, and similarly for β, γ, and δ. So, there are exactly $2^{4(k-1)}$ 4-tuples of elements $b_1, b_2, b_3, b_4 \in F$ so that the leading bit of $b_1$ is α, the leading bit of $b_2$ is β, the leading bit of $b_3$ is γ, and the leading bit of $b_4$ is δ. For each such $b_1, b_2, b_3$, and $b_4$, there is precisely one polynomial f with $f(s) = b_1$, $f(t) = b_2$, $f(u) = b_3$, and $f(v) = b_4$. Hence, the probability that
$$x_s = \alpha,\quad x_t = \beta,\quad x_u = \gamma,\quad \text{and } x_v = \delta$$
is precisely
$$\frac{2^{4(k-1)}}{\text{total number of } f} = \frac{2^{4(k-1)}}{2^{4k}} = \frac{1}{16},$$
as asserted.
The lemma states how to get one vector x with 4-way independence. However, we
need r = O(1/ε2 ) vectors. Also the vectors must be mutually independent. But this is
easy, just choose r polynomials at the outset.
To implement the algorithm with low space, store only the polynomials in memory.
This requires 4k = O(log m) bits per polynomial, for a total of O(log m/ε²) bits. When a symbol s in the stream is read, compute $x^1_s, x^2_s, \ldots, x^r_s$ and update the running sums. Note that $x^1_s$ is just the leading bit of the first polynomial evaluated at s; this calculation is in O(log m) time. Thus, we repeatedly compute the $x_s$ from the “seeds”, namely the
coefficients of the polynomials.
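The following sketch shows how a coordinate x_s can be computed on demand from a four-coefficient seed. To keep the arithmetic simple it works over a prime field Z_p with p > m instead of the GF(2^k) field used in the text; the values f(s) at distinct points are still 4-way independent, but mapping them to ±1 by comparing with p/2 is only almost unbiased (bias O(1/p)), unlike the exactly unbiased leading-bit construction above. The constant p and the function names are our own choices.

```python
import random

P = 2_147_483_647      # a prime assumed to be larger than the alphabet size m

def random_seed():
    """The 'seed': four random coefficients of a degree-3 polynomial."""
    return [random.randrange(P) for _ in range(4)]

def x_coord(seed, s):
    """Coordinate x_s, computed on demand from the O(log m)-bit seed."""
    f0, f1, f2, f3 = seed
    value = (f0 + f1 * s + f2 * s * s + f3 * s ** 3) % P
    return 1 if value < P // 2 else -1   # nearly unbiased sign of f(s)

seed = random_seed()
print([x_coord(seed, s) for s in range(10)])   # any 4 coordinates ~ independent
```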
1.2 Sketch of a Large Matrix
We will see how to find a good approximation of an m×n matrix A, which is read from
external memory, where m and n are large. The approximation consists of a sample of s
columns and r rows of A along with an s × r multiplier matrix. The schematic diagram
is given in Figure 1.3.
The crucial point is that uniform sampling will not always work as is seen from simple
examples. We will see that if we sample rows or columns with probabilities proportional
to the squared length of the row or column, then indeed we can get a good approximation.
One may recall that the top k singular vectors of the SVD of A give a similar picture;
but the SVD takes more time to compute, requires all of A to be stored in RAM, and
does not have the property that the rows and columns are directly from A. However, the
SVD does yield the best approximation. Error bounds for our approximation are weaker,
though it can be found faster.
We briefly touch upon two motivations for such a sketch. Suppose A is the document-
term matrix of a large collection of documents. We are to “read” the collection at the
outset and store a sketch so that later, when a query (represented by a vector with one
entry per term) arrives, we can find its similarity to each document in the collection.
Similarity is defined by the dot product. In Figure 1.3 it is clear that the matrix-vector
product of a query with the right hand side can be done in time O(ns + sr + rm) which
would be linear in n and m if s and r are O(1). To bound errors for this process, we need
to show that the difference between A and the sketch of A has small 2-norm. Recall that the 2-norm $\|A\|_2$ of a matrix A is $\max_{|x|=1} |Ax|$.
We first tackle a simpler problem, that of multiplying two matrices. The matrix
multiplication also uses length squared sampling and is a building block to the sketch.
[Figure 1.3: the sketch of A. The n × m matrix A is approximated by the product of a matrix of s sampled columns (n × s), an s × r multiplier matrix, and a matrix of r sampled rows (r × m).]
Recall that the product AB can be written as a sum of outer products, $AB = \sum_{k=1}^{n} A(:,k)B(k,:)$. [This is easy to verify since the (i, j)th entry of the outer product of the kth column of A and the kth row of B is just $a_{ik}b_{kj}$.] Note that for each value of k, A(:, k)B(k, :) is an
m×p matrix each element of which is a single product of elements of A and B. An obvious
use of sampling suggests itself. Sample some values for k and compute A (:, k) B (k, :) for
the sampled k’s using their suitably scaled sum as the estimate of AB. It turns out that
nonuniform sampling probabilities are useful. Define a random variable z that takes on
values in {1, 2, . . . , n}. Let pk denote the probability that z assumes the value k. The pk
are nonnegative and sum to one. Define an associated random matrix variable that has
value
$$X = \frac{1}{p_z}\, A(:, z)B(z, :),$$
which takes on value $\frac{1}{p_k} A(:, k)B(k, :)$ with probability $p_k$ for k = 1, 2, . . . , n. Let E(X)
denote the entry-wise expectation.
$$E(X) = \sum_{k=1}^{n} \text{Prob}(z = k)\,\frac{1}{p_k}\, A(:, k)B(k, :) = \sum_{k=1}^{n} A(:, k)B(k, :) = AB.$$
This explains the scaling by $\frac{1}{p_z}$ in X. [In general, if we wish to estimate the sum of n
real numbers a1 , a2 , . . . , an by drawing a sample z from {1, 2, . . . , n} with probabilities
p1 , p2 , . . . , pn , then the unbiased estimator of the sum a1 + a2 + · · · + an is az /pz .]
We wish to bound the error in this estimate for which the usual quantity of interest is
the variance. But here each entry xij of the matrix random variable may have a different
variance. So, we define the variance of X as the sum of the variances of all its entries. This
natural device of just taking the sum of the variances greatly simplifies the calculations
as we will see.
$$\mathrm{Var}(X) = \sum_{i=1}^{m}\sum_{j=1}^{p} \mathrm{Var}(x_{ij}) = \sum_{ij} E\left(x_{ij}^2\right) - \sum_{ij}\left(E(x_{ij})\right)^2 = \sum_{i,j}\sum_{k} p_k \frac{1}{p_k^2}\, a_{ik}^2 b_{kj}^2 - \sum_{ij} (AB)_{ij}^2.$$
If $p_k$ is proportional to $|A(:, k)|^2$, i.e., $p_k = \frac{|A(:,k)|^2}{\|A\|_F^2}$, we get a simple bound for Var(X).
Algorithm: Input: A and B, m × n and n × p matrices respectively, and a positive integer s. The algorithm finds an approximation A ⊗s B to the matrix product AB.
Since ⊗s is a randomized algorithm, it does not give us exactly the same output every
time, so A ⊗s B is not strictly speaking a function of A, B.
Using probabilities that are proportional to length squared of columns of A turns out
to be useful in other contexts as well and sampling according to them is called “length-
squared sampling”.
The Lemma below shows that the error between the correct AB and the estimate A ⊗s B goes down as s increases. The error is in terms of $\|A\|_F \|B\|_F$, which is natural.
The basic step in the algorithm is computing the s outer products, each of which takes time O(mp). In addition, we need the time to compute the length-squared of the columns of A (which can be done by reading A once) and drawing the sample.
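Here is a sketch of one natural reading of the algorithm A ⊗s B, consistent with the estimator X defined above: draw s column indices with the length-squared probabilities and average the scaled outer products. The numpy interface, function name, and the in-RAM (non-streaming) implementation are our own choices.

```python
import numpy as np

def approx_product(A, B, s, rng=np.random.default_rng()):
    """Length-squared sampled estimate of AB using s sampled columns/rows."""
    # p_k proportional to |A(:,k)|^2
    p = (A ** 2).sum(axis=0)
    p = p / p.sum()
    ks = rng.choice(A.shape[1], size=s, p=p)
    est = np.zeros((A.shape[0], B.shape[1]))
    for k in ks:
        est += np.outer(A[:, k], B[k, :]) / p[k]   # unbiased single-sample term
    return est / s                                 # average of the s samples

rng = np.random.default_rng(0)
A, B = rng.normal(size=(50, 200)), rng.normal(size=(200, 30))
err = np.linalg.norm(A @ B - approx_product(A, B, s=100, rng=rng))
print(err / np.linalg.norm(A @ B))   # relative error shrinks as s grows
```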
Lemma 1.5 If A is in column order and B is in row order, then as the data streams we
can draw samples and implement the algorithm A ⊗s B with O(s(m + p)) RAM space in
the streaming model.
Proof: As the columns of A stream by, we can draw a length-squared sample on the fly: keep a running sampled column, replacing it by the (i + 1)st column with probability (length squared of the (i + 1)st column)/(sum of length squared of columns 1, 2, . . . , i + 1), much as we did earlier with real numbers (details left to the reader), so at the end we draw one length-squared sampled column. We simultaneously draw s samples. Then we pick the corresponding rows of B (no sampling) and finally carry out the multiplication algorithm ⊗s with the sample in RAM.
An important special case is the multiplication AAT where the length-squared distri-
bution can be shown to be the optimal distribution. The variance of X defined above for
this case is
$$\sum_{i,j} E\left(x_{ij}^2\right) - \sum_{ij} E^2(x_{ij}) = \sum_{k} \frac{1}{p_k}\,|A(:, k)|^4 - \|AA^T\|_F^2.$$
The second term is independent of the pk . The first term is minimized when the pk
are proportional to |A(:, k)|2 , hence length-squared distribution minimizes the variance
of this estimator. [Show that for any positive real numbers $a_1, a_2, \ldots, a_n$, the minimum value of $\frac{1}{p_1}a_1^2 + \frac{1}{p_2}a_2^2 + \cdots + \frac{1}{p_n}a_n^2$ over all $p_1, p_2, \ldots, p_n \ge 0$ summing to 1 is attained when $p_k \propto a_k$.]
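One way to verify the bracketed claim (a sketch of an argument, not necessarily the intended one) is the Cauchy-Schwarz inequality:
$$\Big(\sum_k a_k\Big)^2 = \Big(\sum_k \frac{a_k}{\sqrt{p_k}}\cdot \sqrt{p_k}\Big)^2 \le \Big(\sum_k \frac{a_k^2}{p_k}\Big)\Big(\sum_k p_k\Big) = \sum_k \frac{a_k^2}{p_k},$$
with equality exactly when $a_k/\sqrt{p_k}$ is proportional to $\sqrt{p_k}$, i.e., when $p_k \propto a_k$; so the minimum value $\big(\sum_k a_k\big)^2$ is attained at $p_k = a_k / \sum_l a_l$.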
• Pick s columns of A using length-squared sampling (here lengths are column lengths).
• Pick r rows of A using length-squared sampling (here lengths are row lengths).
We will see that from these sampled columns and rows, we can construct a succinct
approximation to A. For this, we first start with what seems to be a strange idea.
Observe that A = AI, where I is the n × n identity matrix. Now, lets approximate AI
by A ⊗s I. The error is given by Lemma (1.4):
$$\|A - A \otimes_s I\|_F^2 \le \frac{1}{s}\,\|A\|_F^2\,\|I\|_F^2.$$
Our aim would be to make the error at most $\varepsilon\|A\|_F^2$, for which we will need to make $s > \|I\|_F^2/\varepsilon = n/\varepsilon$, which of course is ridiculous since we want s ≪ n. But this was
caused by the fact that I had large (n) rank. A smaller rank “identity-like” matrix might
work better. For this, we will find the SVD of R, where, R is the r × n matrix consisting
of the sampled rows of A.
$$\text{SVD of } R: \qquad R = \sum_{i=1}^{t} \sigma_i u_i v_i^T.$$
Claim 1 The matrix $W = \sum_{i=1}^{t} v_i v_i^T$ acts as the identity when applied to any vector v in the row space of R (from the left). I.e., for such v, Wv = v.
We will approximate A by
A ≈ AW ≈ A ⊗s W.
The following Lemma shows that these approximations are valid.
Proof: ||A − AW ||2 = |(A − AW )x| for some unit length vector x by definition of || · ||2 .
Let y be the component of x in the space spanned by v1 , v2 , . . . , vt and let v = x − y
be the component of x orthogonal to this space. By Claim (1), AW y = Ay, so that
(A − AW )y = 0. Hence, without loss of generality, we may assume x has no component
in the space spanned by the vi . So, W x = 0 and thus (A − AW )x = Ax. Also, Rx = 0.
Now imagine approximating the matrix multiplication AT A by AT ⊗r A. Since the columns
of AT are rows of A, this can be done using the sampled rows of A. So, it follows from
Rx = 0 that $A^T \otimes_r A\, x = 0$. Using this, we get
$$|(A - AW)x|^2 = |Ax|^2 = x^T A^T A\, x = x^T\left(A^T A - A^T \otimes_r A\right)x \le \|A^T A - A^T \otimes_r A\|_2.$$
So, $\|AW - A\|_2^2 \le \|A^T A - A^T \otimes_r A\|_2$. Thus, we have
$$E\left(\|A - AW\|_2\right) \le \left(E\left(\|A - AW\|_2^2\right)\right)^{1/2} \le \left(E\|A^T A - A^T \otimes_r A\|_2\right)^{1/2} \le \left(\frac{1}{\sqrt{r}}\,\|A\|_F^2\right)^{1/2},$$
by Lemma (1.4), proving (1).
For (2), Lemma (1.4) implies that $E\left(\|AW - A \otimes_s W\|_2\right) \le \frac{1}{\sqrt{s}}\,\|A\|_F\,\|W\|_F$. Now, $\|W\|_F^2$ is the sum of the squared lengths of the rows of W, which equals the trace of $WW^T$, and since $WW^T = \sum_{i=1}^{t} v_i v_i^T$, we get that the trace of $WW^T$ is t, which proves (2).
The proof of (3) is left to the reader. Also left to the reader are choices of s, r to make
the error bounds small.
Given an n × d matrix A, we would like a smaller matrix B, consisting of a few scaled rows of A, such that for every d-vector v,
$$|Bv| \approx_\varepsilon |Av|,$$
where, for two non-negative real numbers a, b, we say b ≈ε a iff b ∈ [(1 − ε)a, (1 + ε)a]. [In
words, b is an approximation to a with relative error at most ε.] So, we wish the length
of Bv to be a relative error approximation to length of Av for every v. We give two
motivations for this problem before solving it by showing that with r = O(d log d/ε⁴), we
can indeed solve the problem. [In words, essentially O(d log d) rows of A suffice, however
large n is.]
Start with the familiar example of a document term matrix. Suppose we have n
documents and d terms, where, n >> d. Each document is represented as before as a
d− vector with one component per term. A very general problem we can ask is for a
summary of A so that for every new document (which is itself represented as a d−vector
v), we should be able to compute an approximation to the vector of similarities of v to
each existing document - i.e., the vector Av to relative error ε, namely, we want a vector
z approximating Av in the sense zi ≈ε (Av)i for every i. We do not know a solution
of this problem at the current state of knowledge. A simpler question is to ask that the
summary be able to approximate |Av|2 which is the sum of squared similarities of the
new document to every document in the collection and this is the problem we solve here.
A second important example is that of graphs. Suppose we are given a graph G(V, E)
with d vertices and n edges; n could be as large as $\binom{d}{2}$ (and no larger). Represent the
graph by a signed edge-vertex incidence matrix A. A has one column per vertex and one
row per edge. Each row has exactly two non-zero entries: a +1 for one end vertex of the
edge and a -1 for the other end vertex. [We arbitrarily choose which end vertex gets a 1
and which a -1.] For S ⊆ V , the cut (S, S̄) is the set of edges between S and S̄. It is easy
to show that
The number of edges in the cut (S, S̄) = |A1S |2 ,
where 1_S is the vector with a 1 for each i ∈ S and a 0 for each i ∉ S. [We leave this to the reader.] So the problem here is to “sparsify” the graph, i.e., produce a subset of
O(d log d) edges to form a sub-graph H and weight them so that for every cut, the
weight of the cut in H is a relative error ε approximation to the number of edges in the
cut in G.
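A quick numerical check of this identity on a small graph (a 4-cycle); the incidence-matrix construction, the orientation choices, and the particular cut are ours.

```python
import numpy as np

# Signed edge-vertex incidence matrix of a 4-cycle on vertices 0,1,2,3.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
A = np.zeros((len(edges), 4))
for row, (u, v) in enumerate(edges):
    A[row, u], A[row, v] = 1, -1        # arbitrary orientation of each edge

one_S = np.array([1, 1, 0, 0])           # S = {0, 1}, so the cut has 2 edges
print(int(np.dot(A @ one_S, A @ one_S)))  # prints 2 = |A 1_S|^2
```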
Theorem 1.7 Suppose A is any n × d matrix. We can find (in polynomial time) a subset of r = Ω(d log d/ε⁴) rows of A and multiply each by a scalar to form an r × d matrix B so that
$$|Bv| \approx_\varepsilon |Av| \quad \text{for all } d\text{-vectors } v.$$
Proof: The proof will use length-squared sampling on a matrix A0 obtained from A by
scaling the rows of A suitably. First, we argue that we cannot do without this scaling.
Consider the case of graphs, where, A is the edge-vertex incidence matrix as above. The
graph might have one (or more) edge(s) which must be included in the sparsifier. A simple
example is the case when G consists of two cliques of size n/2 each connected by a single
edge. So the cut with one of the clique-vertices forming S has one edge and unless we
pick this edge in our subset, we would have 0 weight across the cut which is not correct.
But if we just use uniform sampling, we are quite unlikely to pick this edge among our
sample of d log d edges since there are in total Ω(d²) edges.
Now to the proof of the theorem: First observe that $|Av|^2 = v^T(A^TA)v$ and similarly for B. Also observe that for any two non-negative real numbers a, b, $b^2 \approx_\varepsilon a^2$ implies that $b \approx_\varepsilon a$. So intuitively, it suffices to approximate $A^TA$. For this, $A^T \otimes_r A$ suggests itself as
a sampling based method, since, as we saw earlier, AT ⊗ A may be written as B T B where
B consists of some rows of A scaled.
We start with an important fact about length-squared sampling which is stated below.
[The proof is beyond the scope of the book and so we assume the theorem without proof.]
Theorem 1.8 For any n × d matrix A, if r ∈ Ω(d log d/ε⁴), then (with high probability) $\|A^TA - A^T \otimes_r A\|_2 \le \varepsilon\|A\|_2^2$.
In other words, O(d log d) sampled rows of A are sufficient to approximate $A^TA$ in spectral norm to error $\varepsilon\|A\|_2^2$. Recalling that $A^T \otimes_r A$ can be written as $B^TB$, where B is an r × d matrix consisting of the r sampled rows, scaled (see Section (1.2.1)), we get that $v^T(B^TB)v = |Bv|^2$ is within $\varepsilon\|A\|_2^2$ of $|Av|^2$ for unit vectors v. But for relative error, we need the error to be at most $\varepsilon|Av|^2$, and if $\|A\|_2^2 > |Av|^2$, which is the case in general, we do not get a relative error result directly.
To achieve relative error, more work is needed. Find the SVD of A and suppose it is
$$A = \sum_{i=1}^{t} \sigma_i u_i v_i^T, \qquad t = \text{Rank}(A).$$
Let $P = \sum_{i=1}^{t} u_i v_i^T$. P has all singular values equal to 1, so that |Pv| = |v| for all vectors v in the row space of P.
Apply the theorem now to P to get that for every non-zero vector w in the row space of P:
$$w^T A^T \otimes_r A\, w \approx_\varepsilon w^T A^T A\, w.$$
Suppose one wished to store all the web pages from the WWW. Since there are billions
of web pages, one might store just a sketch of each page where a sketch is a few hundred
bits that capture sufficient information to do whatever task one had in mind. A web page
or a document is a sequence. We begin this section by showing how to sample a set and
then how to convert the problem of sampling a sequence into a problem of sampling a set.
Consider subsets of size 1000 of the integers from 1 to 10⁶. Suppose one wished to compute the resemblance of two subsets A and B by the formula
$$\text{resemblance}(A, B) = \frac{|A \cap B|}{|A \cup B|}.$$
Suppose that instead of using the sets A and B, one sampled the sets and compared ran-
dom subsets of size ten. How accurate would the estimate be? One way to sample would
be to select ten elements uniformly at random from A and B. However, this method is
unlikely to produce overlapping samples. Another way would be to select the ten smallest
elements from each of A and B. If the sets A and B overlapped significantly one might
expect the sets of ten smallest elements from each of A and B to also overlap. One dif-
ficulty that might arise is that the small integers might be used for some special purpose
and appear in essentially all sets and thus distort the results. To overcome this potential
problem, rename all elements using a random permutation.
Suppose two subsets of size 1000 overlapped by 900 elements. What would the over-
lap be of the 10 smallest elements from each subset? One would expect the nine smallest
elements from the 900 common elements to be in each of the two subsets for an overlap
of 90%. The resemblance(A, B) for the size ten sample would be 9/11=0.81.
Another method would be to select the elements equal to zero mod m for some inte-
ger m. If one samples mod m the size of the sample becomes a function of n. Sampling
mod m allows us to also handle containment.
In another version of the problem one has a sequence rather than a set. Here one con-
verts the sequence into a set by replacing the sequence by the set of all short subsequences
of some length k. Corresponding to each sequence is a set of length k subsequences. If
k is sufficiently large, then two sequences are highly unlikely to give rise to the same set
of subsequences. Thus, we have converted the problem of sampling a sequence to that of
sampling a set. Instead of storing all the subsequences, we need only store a small subset
of the set of length k subsequences.
Suppose you wish to be able to determine if two web pages are minor modifications
of one another or to determine if one is a fragment of the other. Extract the sequence
of words occurring on the page. Then define the set of subsequences of k consecutive
words from the sequence. Let S(D) be the set of all subsequences of length k occurring
in document D. Define the resemblance of A and B by
$$\text{resemblance}(A, B) = \frac{|S(A) \cap S(B)|}{|S(A) \cup S(B)|}.$$
Then, writing F(D) for the sample of S(D) consisting of its smallest elements (after hashing or random renaming) and V(D) for the subset of S(D) whose elements are 0 mod m, the corresponding estimates of the resemblance are
$$\frac{|F(A) \cap F(B)|}{|F(A) \cup F(B)|} \qquad \text{and} \qquad \frac{|V(A) \cap V(B)|}{|V(A) \cup V(B)|}.$$
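Putting the pieces together, here is a sketch that shingles each document into length-k word sequences and compares the ten smallest hashed shingles from each; the value of k, the sample size, and the use of Python's built-in hash (whose per-process randomization plays the role of the random renaming) are illustrative choices, not prescribed by the text.

```python
def shingles(text, k=5):
    """Set of all length-k sequences of consecutive words."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def sketch(text, k=5, size=10):
    """The 'size' smallest hashed shingles stand in for the whole set."""
    return set(sorted(hash(s) for s in shingles(text, k))[:size])

def estimated_resemblance(a, b):
    sa, sb = sketch(a), sketch(b)
    return len(sa & sb) / len(sa | sb)

doc1 = "when you walk through the storm hold your head up high " * 3
doc2 = "when you walk through the storm hold your head up very high " * 3
print(estimated_resemblance(doc1, doc2))
```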
1.5 Exercises
Algorithms for Massive Data Problems
Exercise 1.1 Given a stream of n positive real numbers a1 , a2 , . . . , an , upon seeing
a1 , a2 , . . . , ai keep track of the sum a = a1 + a2 + · · · + ai and a sample aj , j ≤ i drawn
with probability proportional to its value. On reading $a_{i+1}$, with probability $\frac{a_{i+1}}{a + a_{i+1}}$ replace
the current sample with ai+1 and update a. Prove that the algorithm selects an ai from
the stream with the probability of picking ai being proportional to its value.
Exercise 1.2 Given a stream of symbols a1 , a2 , . . . , an , give an algorithm that will select
one symbol uniformly at random from the stream. How much memory does your algorithm
require?
Exercise 1.4 How would one pick a random word from a very large book where the prob-
ability of picking a word is proportional to the number of occurrences of the word in the
book?
Solution: Put your finger on a random word. The probabilities will be automatically
proportional to the number of occurrences, since each occurrence is equally likely to be
picked.
Exercise 1.5 For the streaming model give an algorithm to draw s independent samples
each with the probability proportional to its value. Justify that your algorithm works
correctly.
Exercise 1.6 Show that for a 2-universal hash family, Prob(h(x) = z) = $\frac{1}{M+1}$ for all x ∈ {1, 2, . . . , m} and z ∈ {0, 1, 2, . . . , M}.
Exercise 1.7 (a) Is the set {hab (x) = ax + b mod p | 0 ≤ a, b < p} of hash functions 3-universal?
Solution: No. h (x) = u and h (y) = v determine a and b and hence h (z).
Exercise 1.8 Give an example of a set of hash functions that is not 2-universal.
Solution: Consider, for each i, the hash function h_i that maps all x's to the integer i. The family {h_1, h_2, . . . , h_m} is a set of hash functions where every x is equally likely to be mapped to any i in the range 1 to m by a hash function chosen at random. However, h(x) and h(y) are clearly not independent, so the family is not 2-universal.
Exercise 1.9
(a) What is the variance of the method in Section 1.1.2 of counting the number of occur-
rences of a 1 with log log n memory?
(b) Can the algorithm be iterated to use only log log log n memory? What happens to the
variance?
Exercise 1.10 Consider a coin that comes down heads with probability p. Prove that the
expected number of flips before a head occurs is 1/p.
Solution: Let f be the number of flips. Then
$$E[f] = p + 2(1-p)p + 3(1-p)^2 p + \cdots = \frac{p}{1-p}\left[(1-p) + 2(1-p)^2 + 3(1-p)^3 + \cdots\right] = \frac{p}{1-p}\cdot\frac{1-p}{p^2} = \frac{1}{p}.$$
Exercise 1.11 Randomly generate a string $x_1 x_2 \cdots x_n$ of $10^6$ 0's and 1's with probability 1/2 of each $x_i$ being a 1. Count the number of ones in the string and also estimate the number of ones by the approximate counting algorithm. Repeat the process for p = 1/4, 1/8, and 1/16. How close is the approximation?
Counting Frequent Elements
The Majority and Frequent Algorithms
The Second Moment
Exercise 1.12 Construct an example in which the majority algorithm gives a false posi-
tive, i.e., stores a nonmajority element at the end.
Exercise 1.13 Construct examples where the frequent algorithm in fact does as badly as in the theorem, i.e., it “undercounts” some item by n/(k+1).
Exercise 1.14 Recall basic statistics on how an average of independent trials cuts down variance and complete the argument for a relative error ε estimate of $\sum_{s=1}^{m} f_s^2$.
Exercise 1.16 Suppose we want to pick a row of a matrix at random where the probability
of picking row i is proportional to the sum of squares of the entries of that row. How would
we do this in the streaming model? Do not assume that the elements of the matrix are
given in row order.
(a) Do the problem when the matrix is given in column order.
(b) Do the problem when the matrix is represented in sparse notation: it is just presented
as a list of triples (i, j, aij ), in arbitrary order.
Solution: Pick an element aij with probability proportional to a2ij . We have already
described the algorithm for this. Then return the i (row number) of the element picked.
The probability of picking one particular i is just the sum of the probabilities of picking each element of the row, which is $\frac{\sum_j a_{ij}^2}{\sum_{k,l} a_{kl}^2}$, exactly as desired. Note that this works even in the sparse representation.
Exercise 1.17 Suppose A and B are two matrices. Show that $AB = \sum_{k=1}^{n} A(:, k)B(k, :)$.
Exercise 1.18 Generate two 100 by 100 matrices A and B with integer values between
1 and 100. Compute the product AB both directly and by sampling. Plot the difference
in L2 norm between the results as a function of the number of samples. In generating
the matrices make sure that they are skewed. One method would be the following. First
generate two 100 dimensional vectors a and b with integer values between 1 and 100. Next
generate the ith row of A with integer values between 1 and ai and the ith column of B
with integer values between 1 and bi .
Exercise 1.20 Suppose $a_1, a_2, \ldots, a_m$ are nonnegative reals. Show that the minimum of $\sum_{k=1}^{m} \frac{a_k}{x_k}$ subject to the constraints $x_k \ge 0$ and $\sum_k x_k = 1$ is attained when the $x_k$ are proportional to $\sqrt{a_k}$.
Sketches of Documents
Exercise 1.21 Consider random sequences of length n composed of the integers 0 through
9. Represent a sequence by its set of length k-subsequences. What is the resemblance of
the sets of length k-subsequences from two random sequences of length n for various values
of k as n goes to infinity?
Exercise 1.22 What if the sequences in the Exercise 1.21 were not random? Suppose the
sequences were strings of letters and that there was some nonzero probability of a given
letter of the alphabet following another. Would the result get better or worse?
Exercise 1.23 Consider a random sequence of length 10,000 over an alphabet of size
100.
(a) For k = 3, what is the probability that two possible successor subsequences for a given
subsequence are in the set of subsequences of the sequence?
Solution: 1.23(a) A subsequence has 100 possible successor subsequences. The probability that a given subsequence is in the set of $10^4$ subsequences of the sequence is 1/100. Thus the answer is $1 - \left(1 - \frac{1}{100}\right)^{100} \approx 1 - \frac{1}{e} > 0.6$.
(b) For k = 5 the universe of possible subsequences has grown to $10^{10}$ and the probability that a given subsequence is in the set drops to $10^{-6}$, so for a single subsequence the probability that one of its 100 possible successors is in the set is $1 - (1 - 10^{-6})^{100} \approx 10^{-4}$, essentially zero. However, since the sequence is of length $10^4$ and each of its subsequences has 100 possible successors, there are about $10^6$ candidates in all, and the probability that none of them lands in the set is about $(1 - 10^{-6})^{10^6} \approx \frac{1}{e}$.
Exercise 1.24 How would you go about detecting plagiarism in term papers?
Exercise 1.25 (Finding duplicate web pages) Suppose you had one billion web pages
and you wished to remove duplicates. How would you do this?
Solution: Suppose web pages are in html. Create a window consisting of each sequence of 10 consecutive html commands, including text. Hash the contents of each window to an integer and save the lowest 10 integers for each page. To detect exact duplicates, hash the 10 integers to a single integer and look for collisions. If we want almost exact duplicates we might ask for eight of the ten integers to be the same. In this case hash the three lowest integers, the next three lowest, and the last four to integers and look for collisions. (Note, however, that eight of the ten integers agreeing does not guarantee that one of these three hashes agrees, since the agreeing integers need not occupy the same positions in the two sorted lists.)
Exercise 1.26 Construct two sequences of 0’s and 1’s having the same set of subsequences
of width w.
Solution: 1.26 For w = 3, 11101111 and 11110111.
Exercise 1.27 Consider the following lyrics:
When you walk through the storm hold your head up high and don’t be afraid of the
dark. At the end of the storm there’s a golden sky and the sweet silver song of the
lark.
Walk on, through the wind, walk on through the rain though your dreams be tossed
and blown. Walk on, walk on, with hope in your heart and you’ll never walk alone,
you’ll never walk alone.
How large must k be to uniquely recover the lyric from the set of all subsequences of
symbols of length k? Treat the blank as a symbol.
Exercise 1.28 Blast: Given a long sequence a, say of length $10^9$, and a shorter sequence b, say of length $10^5$, how do we find a position in a which is the start of a subsequence b′ that is close to
b? This problem can be solved by dynamic programming but not in reasonable time. Find
a time efficient algorithm to solve this problem.
Hint: (Shingling approach) One possible approach would be to fix a small length, say
seven, and consider the shingles of a and b of length seven. If a close approximation to b
is a substring of a, then a number of shingles of b must be shingles of a. This should allow
us to find the approximate location in a of the approximation of b. Some final algorithm
should then be able to find the best match.