Algorithms for Massive Data Problems
[Figure: the streaming model. The data (e.g., a matrix A) streams past an algorithm with limited RAM, which sees entries such as A_ij only as they arrive.]
This chapter deals with massive data problems where the input data (a graph, a ma-
trix or some other object) is too large to be stored in random access memory. One model
for such problems is the streaming model, where the data can be seen only once. In
the streaming model, the natural technique to deal with the massive data is sampling.
Sampling is done “on the fly”. As each piece of data is seen, based on a coin toss, one
decides whether to include the data in the sample. Typically, the probability of including
the data point in the sample may depend on its value. Models allowing multiple passes
through the data are also useful; but the number of passes needs to be small. We always
assume that random access memory (RAM) is limited, so the entire data cannot be stored
in RAM.
To introduce the basic flavor of sampling on the fly, consider the following simple
primitive. We have a stream of n elements, where, n is not known at the start. We wish
to sample uniformly at random one of the n elements. If we have to pick a sample and
then never change it, we are out of luck. [You can convince yourself of this.] So we have to
allow ourselves to “change our mind”: After perhaps drawing one element as a sample, we
must allow ourselves to reject it and instead draw the current element as the new sample.
With this hint, the reader may want to think of the algorithm. Instead of describing the
algorithm for this problem, we will solve a more general problem:
From a stream of n positive real numbers a1 , a2 , . . . , an , draw a sample element ai so
that the probability of picking an element is proportional to its value.
It is easy to see that the following sampling method works. Upon seeing $a_1, a_2, \ldots, a_i$, keep track of the sum $a = a_1 + a_2 + \cdots + a_i$ and a sample $a_j$, $j \le i$, drawn with probability proportional to its value. On seeing $a_{i+1}$, replace the current sample by $a_{i+1}$ with probability $\frac{a_{i+1}}{a + a_{i+1}}$ and update $a$.
We can prove by induction that this algorithm does in fact sample element aj with
probability aj /(a1 + a2 + · · · + an ) : Suppose after we have read a1 , a2 , . . . , ai , we have
picked a sample with probability of aj being the sample equal to aj /(a1 + a2 + · · · + ai ).
The probability that we will keep this aj after reading ai+1 is
$$1 - \frac{a_{i+1}}{a_1 + a_2 + \cdots + a_{i+1}} = \frac{a_1 + a_2 + \cdots + a_i}{a_1 + a_2 + \cdots + a_{i+1}}.$$
Combining the two, we see that the probability that aj is the sample after reading ai+1
is precisely $a_j/(a_1 + a_2 + \cdots + a_{i+1})$, completing the inductive proof.
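To make the sampling rule concrete, here is a short sketch in Python; the stream is modeled as any iterable of positive numbers, and the function name and use of the random module are our own choices rather than anything prescribed by the text.

```python
import random

def weighted_stream_sample(stream):
    """Return one element of the stream, chosen with probability
    proportional to its value, using O(1) memory."""
    total = 0.0      # a = a_1 + ... + a_i seen so far
    sample = None
    for x in stream:
        total += x
        # Replace the current sample by x with probability x / total,
        # which equals a_{i+1} / (a + a_{i+1}) in the text's notation.
        if random.random() < x / total:
            sample = x
    return sample

# Example: over many runs, 3.0 is returned about 3/6 of the time.
print(weighted_stream_sample([1.0, 2.0, 3.0]))
```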
For a stream of symbols from $\{1, 2, \ldots, m\}$, let $f_s$ denote the number of occurrences of symbol $s$; the $p^{\text{th}}$ frequency moment of the stream is $\sum_s f_s^p$. Note that the $p = 0$ frequency moment corresponds to the number of distinct symbols occurring in the stream. The first frequency moment is just $n$, the length of the string. The second frequency moment, $\sum_s f_s^2$, is useful in computing the variance of the stream:
$$\frac{1}{m}\sum_{s=1}^{m}\left(f_s - \frac{n}{m}\right)^2 = \frac{1}{m}\sum_{s=1}^{m} f_s^2 - \frac{2n}{m}\cdot\frac{1}{m}\sum_{s=1}^{m} f_s + \frac{n^2}{m^2} = \frac{1}{m}\sum_{s=1}^{m} f_s^2 - \frac{n^2}{m^2}.$$
In the limit as $p$ becomes large, $\left(\sum_{s=1}^{m} f_s^p\right)^{1/p}$ is the frequency of the most frequent element(s).
We will describe sampling based algorithms to compute these quantities for streaming
data shortly. But first a note on the motivation for these various problems. The identity
and frequency of the most frequent item, or more generally of items whose frequency exceeds a given fraction of n, are clearly important in many applications. If the items are packets on a network with source and destination addresses, the high-frequency items identify the
heavy bandwidth users. If the data is purchase records in a supermarket, the high fre-
quency items are the best-selling items. Determining the number of distinct symbols is
the abstract version of determining such things as the number of accounts, web users, or
credit card holders. The second moment and variance are useful in networking as well as
in database and other applications. Large amounts of network log data are generated by
routers that can record for all the messages passing through them, the source address,
destination address, and the number of packets. This massive data cannot be easily sorted
or aggregated into totals for each source/destination. But it is important to know whether some popular source-destination pairs carry an unusually large amount of traffic, for which the variance is the natural measure.
Suppose we have seen the first k symbols of the stream and k > m. The set of distinct
symbols seen so far could be any of the 2m subsets of {1, 2, . . . , m}. Each subset must
result in a different state for our algorithm and hence m bits of memory are required. To
see this, suppose first that two different size subsets of distinct symbols lead to the same
internal state. Then our algorithm would produce the same count of distinct symbols for
both inputs, clearly an error for one of the input sequences. If two sequences with the
same number of distinct elements but different subsets lead to the same state, then on
next seeing a symbol that appeared in one sequence but not the other, we would make an error on at least one of them.
Figure 1.2: Estimating the size of S from its minimum element, which has value approximately m/(|S|+1). The elements of S partition the set {1, 2, . . . , m} into |S| + 1 subsets, each of size approximately m/(|S|+1).
If the elements of S are picked uniformly at random from {1, 2, . . . , m}, they partition {1, 2, . . . , m} into |S| + 1 pieces of roughly equal size, as illustrated in Figure 1.2. Thus, the minimum element of S should have value close to m/(|S|+1). Solving min = m/(|S|+1) yields |S| = m/min − 1. Since we can determine min, this gives us an estimate of |S|.
The above analysis required that the elements of S were picked uniformly at random
from {1, 2, . . . , m}. This is generally not the case when we have a sequence a1 , a2 , . . . , an
of elements from {1, 2, . . . , m}. Clearly if the elements of S were obtained by selecting the
|S| smallest elements of {1, 2, . . . , m}, the above technique would give the wrong answer.
If the elements are not picked uniformly at random, can we estimate the number of distinct
elements? The way to solve this problem is to use a hash function h where
h : {1, 2, . . . , m} → {0, 1, 2, . . . , M − 1}
To count the number of distinct elements in the input, count the number of elements
in the mapped set {h (a1 ) , h (a2 ) , . . .}. The point being that {h(a1 ), h (a2 ) , . . .} behaves
like a random subset and so the above heuristic argument using the minimum to estimate
the number of elements may apply. If we needed h(1), h(2), . . . to be completely independent, the space needed to store the hash function would be too high. Fortunately, only
2-way independence is needed. We recall the formal definition of 2-way independence
below. But first recall that a hash function is always chosen at random from a family of
hash functions and phrases like “probability of collision” refer to the probability over the
choice of hash function.
A family of hash functions mapping {1, 2, . . . , m} to {0, 1, 2, . . . , M − 1} is 2-universal if for all x and y in {1, 2, . . . , m} with x ≠ y, and for all z and w in {0, 1, 2, . . . , M − 1},
$$\text{Prob}\left[h(x) = z \text{ and } h(y) = w\right] = \frac{1}{M^2}$$
for a randomly chosen h. The concept of a 2-universal family of hash functions is that
given x, h (x) is equally likely to be any element of {0, 1, 2, . . . , M − 1} (the reader should
prove this from the definition of 2-universal) and for x ≠ y, h(x) and h(y) are indepen-
dent.
We now give an example of a 2-universal family of hash functions. For simplicity let
M be a prime, with M > m. For each pair of integers a and b in the range [0,M -1], define
a hash function
$$h_{ab}(x) = (ax + b) \bmod M.$$
To store the hash function $h_{ab}$, store the two integers $a$ and $b$. This requires only $O(\log M)$ space. To see that the family is 2-universal, note that $h(x) = z$ and $h(y) = w$ if and only if
$$\begin{pmatrix} x & 1 \\ y & 1 \end{pmatrix}\begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} z \\ w \end{pmatrix} \pmod{M}.$$
If $x \neq y$, the matrix $\begin{pmatrix} x & 1 \\ y & 1 \end{pmatrix}$ is invertible modulo $M$ and there is only one solution for $a$ and $b$. Thus, for $a$ and $b$ chosen uniformly at random, the probability of the equation holding is exactly $1/M^2$. [We assumed $M > m$. What goes wrong if this does not hold?]
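As an illustration, the following sketch draws a random h_ab from the family above and uses the minimum hashed value to estimate the number of distinct elements, following the heuristic min = M/(|S|+1). The fixed prime M, the helper names, and the +1 guard against a zero minimum are our own choices; a single hash gives a high-variance estimate, which is what the analysis below addresses.

```python
import random

M = 2_147_483_647          # a prime, assumed larger than the universe size m

def random_hash():
    """Draw h_ab(x) = (a*x + b) mod M from the 2-universal family."""
    a = random.randrange(M)
    b = random.randrange(M)
    return lambda x: (a * x + b) % M

def estimate_distinct(stream):
    """Estimate the number of distinct elements from the minimum hashed value."""
    h = random_hash()
    m_min = min(h(x) for x in stream)
    # The minimum of |S| roughly-random values in {0,...,M-1} is about
    # M/(|S|+1); solve min = M/(|S|+1) for |S| (the +1 avoids division by zero).
    return M / (m_min + 1) - 1

stream = [random.randrange(10_000) for _ in range(100_000)]
print(len(set(stream)), round(estimate_distinct(stream)))
```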
Let $b_1, b_2, \ldots, b_d$ be the distinct values that appear in the stream, let $z_i$ be the indicator variable of the event that $h(b_i) < \frac{M}{6d}$, and let $z = \sum_{i=1}^{d} z_i$. If $h(b_i)$ is chosen randomly from $\{0, 1, 2, \ldots, M-1\}$, then $\text{Prob}[z_i = 1] = \frac{1}{6d}$. Thus, $E(z_i) = \frac{1}{6d}$ and $E(z) = \frac{1}{6}$. Now
$$\text{Prob}\left(\frac{M}{\min} > 6d\right) = \text{Prob}\left(\min < \frac{M}{6d}\right) = \text{Prob}\left(\exists k\;\; h(b_k) < \frac{M}{6d}\right) = \text{Prob}(z \ge 1) = \text{Prob}\left[z \ge 6E(z)\right].$$
By Markov's inequality, this probability is at most 1/6. For the other direction, let $y_i$ be the indicator variable of the event that $h(b_i) < \frac{6M}{d}$ and let $y = \sum_{i=1}^{d} y_i$. Now $\text{Prob}(y_i = 1) = \frac{6}{d}$, $E(y_i) = \frac{6}{d}$, and $E(y) = 6$. For 2-way independent random variables, the variance of their sum is the sum of their variances. So $\text{Var}(y) = d\,\text{Var}(y_1)$. Further, since $y_1$ takes only the values 0 and 1, it is easy to see that $\text{Var}(y_1) \le E(y_1^2) = E(y_1) = \frac{6}{d}$, and hence $\text{Var}(y) \le 6$.
Consider a string of 0’s and 1’s of length n in which we wish to count the number
of occurrences of 1’s. Clearly if we had log n bits of memory we could keep track of the
exact number of 1’s. However, we can approximate the number with only log log n bits.
Let m be the number of 1’s that occur in the sequence. Keep a value k such that 2k
is approximately the number of occurrences m. Storing k requires only log log n bits of
memory. The algorithm works as follows. Start with k = 0. For each occurrence of a 1, add one to k with probability 1/2^k. At the end of the string, the quantity 2^k − 1 is the estimate of m. To obtain a coin that comes down heads with probability 1/2^k, flip a fair coin, one that comes down heads with probability 1/2, k times and report heads if the fair coin comes down heads in all k flips.

Given k, on average it will take 2^k ones before k is incremented. Thus, the expected number of 1's to produce the current value of k is 1 + 2 + 4 + · · · + 2^(k−1) = 2^k − 1.
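A minimal sketch of this probabilistic counter; the function name is ours, and the coin of bias 1/2^k is simulated with a single call to random.random() rather than with k fair flips.

```python
import random

def approximate_count(bits):
    """Morris-style counter: keep k with 2^k roughly the number of 1's seen."""
    k = 0
    for b in bits:
        if b == 1 and random.random() < 1.0 / (2 ** k):
            k += 1
    return 2 ** k - 1        # estimate of the number of 1's

stream = [1] * 100_000
print(approximate_count(stream))   # typically the right order of magnitude
```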
First consider the very simple problem of n people voting. There are m candidates,
{1, 2, . . . , m}. We want to determine if one candidate gets a majority vote and if so
who. Formally, we are given a stream of integers a1 , a2 , . . . , an , each ai belonging to
{1, 2, . . . , m}, and want to determine whether there is some s ∈ {1, 2, . . . , m} which oc-
curs more than n/2 times and if so which s. It is easy to see that to solve the problem
exactly on read only once streaming data with a deterministic algorithm, requires Ω(n)
space. Suppose n is even and the first n/2 items are all distinct and the last n/2 items are
identical. After reading the first n/2 items, we need to remember exactly which elements
of {1, 2, . . . , m} have occurred. If for two different sets of elements occurring in the first
half of the stream, the contents of the memory are the same, then a mistake would occur
if the second half of the stream consists solely of an element in one set, but not the other.
Thus, $\log_2 \binom{m}{n/2}$ bits of memory, which if m > n is Ω(n), are needed.
Now let's allow the algorithm a random number generator. The reader may want to
think about simple sampling schemes first - like picking an element as a sample and then
checking how many times it occurs and modifications of this. The following is a simple
low-space algorithm that always finds the majority vote if there is one. If there is no
majority vote, the output may be arbitrary. That is, there may be “false positives”, but
no “false negatives”.
Majority Algorithm
Store a1 and initialize a counter to one. For each subsequent ai , if ai is the same as
the currently stored item, increment the counter by one. If it differs, decrement the
counter by one provided the counter is nonzero. If the counter is zero, then store ai
and set the counter to one.
To analyze the algorithm, it is convenient to view the decrement counter step as “elim-
inating” two items, the new one and the one that caused the last increment in the counter.
It is easy to see that if there is a majority element s, it must be stored at the end. If not,
each occurrence of s was eliminated; but each such elimination also causes another item
to be eliminated and so for a majority item not to be stored at the end, we must have
eliminated more than n items, a contradiction.
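A direct transcription of the Majority Algorithm into code; recall that when no majority exists the returned item is arbitrary (a possible false positive).

```python
def majority_candidate(stream):
    """Boyer-Moore style majority vote: one stored item and one counter."""
    item, count = None, 0
    for a in stream:
        if count == 0:
            item, count = a, 1       # store the new item
        elif a == item:
            count += 1               # same as stored item: increment
        else:
            count -= 1               # different item: decrement
    return item   # guaranteed to be the majority element if one exists

print(majority_candidate([3, 1, 3, 2, 3, 3, 2, 3]))  # 3 occurs 5 of 8 times
```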
Next we modify the above algorithm so that not just the majority, but also items
with frequency above some threshold are detected. We will also ensure (approximately)
that there are no false positives as well as no false negatives. Indeed the algorithm below
will find the frequency (number of occurrences) of each element of {1, 2, . . . , m} to within an additive term of n/(k+1), using O(k log n) space by keeping k counters instead of just one counter.
Algorithm Frequent
Maintain a list of items being counted. Initially the list is empty. For each item, if
it is the same as some item on the list, increment its counter by one. If it differs
from all the items on the list, then if there are less than k items on the list, add the
item to the list with its counter set to one. If there are already k items on the list, decrement each of the current counters by one, deleting an element from the list if its count becomes zero.
Theorem 1.2 At the end of Algorithm Frequent, for each s ∈ {1, 2, . . . , m}, its counter
on the list is at least the number of occurrences of s in the stream minus n/(k+1). In
particular, if some s does not occur on the list, its counter is zero and the theorem asserts
that it occurs fewer than n/(k+1) times in the stream.
Proof: View each decrement counter step as eliminating some items. An item is elimi-
nated if it is the current ai being read and there are already k symbols different from it
on the list in which case it and k other items are simultaneously eliminated. Thus, the
elimination of each occurrence of an s ∈ {1, 2, . . . , m} is really the elimination of k + 1
items. Thus, no more than n/(k+1) occurrences of any symbol can be eliminated. Now,
it is clear that if an item is not eliminated, then it must still be on the list at the end.
This proves the theorem.
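A sketch of Algorithm Frequent with k counters; using a dictionary for the list of items and counters is our own data-structure choice.

```python
def frequent(stream, k):
    """Misra-Gries style counters: every item occurring more than
    n/(k+1) times is guaranteed to survive on the list."""
    counters = {}
    for a in stream:
        if a in counters:
            counters[a] += 1
        elif len(counters) < k:
            counters[a] = 1
        else:
            # Decrement every counter; drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

print(frequent([1, 1, 2, 1, 3, 1, 4, 1, 5], k=2))   # 1 survives with a large count
```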
The second moment of the stream is given by $\sum_{s=1}^{m} f_s^2$. To calculate the second moment, for each symbol s, 1 ≤ s ≤ m, independently set a random variable $x_s$ to ±1 with probability 1/2. Maintain a sum by adding $x_s$ to the sum each time the symbol s occurs in the stream.
At the end of the stream, the sum will equal $\sum_{s=1}^{m} x_s f_s$. The expected value of the sum will be zero, where the expectation is over the choice of the ±1 values for the $x_s$:
$$E\left(\sum_{s=1}^{m} x_s f_s\right) = 0.$$
Although the expected value of the sum is zero, its actual value is a random variable and the expected value of the square of the sum is given by
$$E\left(\sum_{s=1}^{m} x_s f_s\right)^2 = E\left(\sum_{s=1}^{m} x_s^2 f_s^2\right) + 2E\left(\sum_{s \ne t} x_s x_t f_s f_t\right) = \sum_{s=1}^{m} f_s^2,$$
so $a = \left(\sum_{s=1}^{m} x_s f_s\right)^2$ is an estimator of $\sum_{s=1}^{m} f_s^2$. One difficulty, which we will come back to, is that to store the $x_s$ requires space m, and we want to do the calculation in log m space.
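A sketch of the estimator as described so far. For clarity it stores one truly random ±1 sign per symbol, i.e., it uses space m — exactly the difficulty just noted, which the pseudo-random vectors of the next section remove — and it averages r independent trials as discussed below.

```python
import random
from collections import Counter

def second_moment_estimate(stream, symbols, r=100):
    """Average of r independent copies of (sum_s x_s f_s)^2."""
    estimates = []
    for _ in range(r):
        sign = {s: random.choice((-1, 1)) for s in symbols}  # x_s = +/-1
        total = 0
        for a in stream:          # stream is a list here, so it can be re-read
            total += sign[a]      # add x_s each time symbol s occurs
        estimates.append(total * total)
    return sum(estimates) / r

stream = [random.randrange(20) for _ in range(10_000)]
exact = sum(f * f for f in Counter(stream).values())
print(exact, round(second_moment_estimate(stream, range(20))))
```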
A second issue is the variance of this estimator, which we now compute.
$$\mathrm{Var}(a) \le E\left(\sum_{s=1}^{m} x_s f_s\right)^4 = E\left(\sum_{1 \le s,t,u,v \le m} x_s x_t x_u x_v f_s f_t f_u f_v\right).$$
The first inequality holds because the variance is at most the second moment, and the equality is by expansion. In the second sum, since the $x_s$ are independent, if any one of s, t, u, or v is distinct from the others, then the expectation of the whole term is zero. Thus, we need to deal only with terms of the form $x_s^2 x_t^2$ for $t \ne s$ and terms of the form $x_s^4$. Note that this does not need the full power of mutual independence of all the $x_s$; it only needs 4-way independence, that any four of the $x_s$'s are mutually independent. In the above sum, there are four indices s, t, u, v and there are $\binom{4}{2}$ ways of choosing two of them that have the same x value. Thus,
$$\mathrm{Var}(a) \le \binom{4}{2}\, E\left(\sum_{s=1}^{m}\sum_{t=s+1}^{m} x_s^2 x_t^2 f_s^2 f_t^2\right) + E\left(\sum_{s=1}^{m} x_s^4 f_s^4\right) = 6\sum_{s=1}^{m}\sum_{t=s+1}^{m} f_s^2 f_t^2 + \sum_{s=1}^{m} f_s^4 \le 3\left(\sum_{s=1}^{m} f_s^2\right)^2 + \left(\sum_{s=1}^{m} f_s^2\right)^2 = 4E^2(a).$$
The variance can be reduced by a factor of r by taking the average of r independent trials.
With r independent trials the variance would be at most $\frac{4}{r}E^2(a)$, so to achieve relative error ε in the estimate of $\sum_{s=1}^{m} f_s^2$, we need $O(1/\varepsilon^2)$ independent trials.
We will briefly discuss the independent trials here, so as to understand exactly the amount of independence needed. Instead of computing a using the running sum $\sum_{s=1}^{m} x_s f_s$ for one random vector x, we independently generate r m-vectors x1, x2, . . . , xr at the outset and compute r running sums
$$\sum_{s=1}^{m} x^1_s f_s,\qquad \sum_{s=1}^{m} x^2_s f_s,\qquad \ldots,\qquad \sum_{s=1}^{m} x^r_s f_s.$$
Let $a_1 = \left(\sum_{s=1}^{m} x^1_s f_s\right)^2$, $a_2 = \left(\sum_{s=1}^{m} x^2_s f_s\right)^2$, . . . , $a_r = \left(\sum_{s=1}^{m} x^r_s f_s\right)^2$. Our estimate is $\frac{1}{r}(a_1 + a_2 + \cdots + a_r)$. The variance of this estimator is
$$\mathrm{Var}\left(\frac{1}{r}(a_1 + a_2 + \cdots + a_r)\right) = \frac{1}{r^2}\left[\mathrm{Var}(a_1) + \mathrm{Var}(a_2) + \cdots + \mathrm{Var}(a_r)\right] = \frac{1}{r}\mathrm{Var}(a_1),$$
where we have assumed that the a1 , a2 , . . . , ar are mutually independent. Now we com-
pute the variance of a1 as we have done for the variance of a. Note that this calculation
assumes only 4-way independence between the coordinates of x1 . We summarize the as-
sumptions here for future reference:
To get an estimate of $\sum_{s=1}^{m} f_s^2$ within relative error ε with probability close to one, say at least 0.9999, it suffices to have $r = O(1/\varepsilon^2)$ vectors x1, x2, . . . , xr, each with m coordinates of ±1, with
1. E (xs ) = 0 for all s.
2. x1 , x2 , . . . , xr are mutually independent. That is for any r vectors v1 , v2 , . . . , vr with
±1 coordinates, Prob(x1 = v1 , x2 = v2 , . . . , xr = vr ) = Prob(x1 = v1 )Prob(x2 =
v2 ) · · · Prob(xr = vr ).
3. Any four coordinates of x1 are independent. Same for x2 , x3 , . . . , and xr .
[Caution: (2) does not assume that Prob(x1 = v1) = $\frac{1}{2^m}$ for all v1; such an assumption would mean the coordinates of x1 are mutually independent.] The only drawback with the
algorithm as we have described it so far is that we need to keep the r vectors x1 , x2 , . . . , xr
in memory so that we can do the running sums. This is too space-expensive. We need to
do the problem in space dependent upon the logarithm of the size of the alphabet m, not
m itself. If ε is in Ω(1), then r is in O(1), so it is not the number of trials r which is the
problem. It is the m.
In the next section, we will see that the computation can be done in O(log m) space
by using pseudo-random vectors x1 , x2 , . . . , xr instead of truly random ones. The pseudo-
random vectors will satisfy (1), (2), and (3) and so they will suffice. This pseudo-
randomness and limited independence has deep connections, so we will go into the con-
nections as well.
Consider the problem of generating a random m-vector x of ±1’s so that any subset of
four coordinates is mutually independent, i.e., for any distinct s, t, u, and v in {1, 2, . . . , m}
and any a, b, c, and d in {-1, +1},
$$\text{Prob}(x_s = a,\; x_t = b,\; x_u = c,\; x_v = d) = \frac{1}{16}.$$
We will see that such an m-dimensional vector may be generated from a truly random “seed” of only O(log m) mutually independent bits. Thus, we need only store the O(log m) bits and can generate any of the m coordinates when needed. This allows us to store the 4-way independent random m-vector using only O(log m) bits. The first fact needed
for this is that for any k, there is a finite field F with exactly 2^k elements, each of which can be represented with k bits, and arithmetic operations in the field can be carried out in O(k²) time. [In fact, F can be taken to be the set of polynomials of degree at most k − 1 with mod 2 coefficients, where multiplication is done modulo an irreducible polynomial of degree k, again with mod 2 coefficients.] Here, k will be the ceiling of log₂ m. We also assume another
basic fact about polynomial interpolation which says that a polynomial of degree at most
three is uniquely determined by its value over any field F at four points. More precisely,
for any four distinct points $a_1, a_2, a_3, a_4 \in F$ and any four possibly not distinct values $b_1, b_2, b_3, b_4 \in F$, there is a unique polynomial $f(x) = f_0 + f_1 x + f_2 x^2 + f_3 x^3$ of degree at most three such that, with computations done over F, $f(a_1) = b_1$, $f(a_2) = b_2$, $f(a_3) = b_3$, and $f(a_4) = b_4$.
Pick $f_0, f_1, f_2, f_3$ uniformly at random from F, let $f(x) = f_0 + f_1 x + f_2 x^2 + f_3 x^3$, and let $x_s$ be the leading bit of the k-bit representation of $f(s)$. Thus, the m-dimensional vector x can be computed using only the 4k bits in $f_0, f_1, f_2, f_3$. (Here $k = \lceil \log_2 m \rceil$.) The following proof shows that the resulting x is 4-way independent.
Proof: Assume that the elements of F are represented in binary using ± instead of the traditional 0 and 1. Let s, t, u, and v be any four coordinates of x and let α, β, γ, δ ∈ {−1, 1}. There are exactly $2^{k-1}$ elements of F whose leading bit is α, and similarly for β, γ, and δ. So, there are exactly $2^{4(k-1)}$ 4-tuples of elements $b_1, b_2, b_3, b_4 \in F$ so that the leading bit of $b_1$ is α, the leading bit of $b_2$ is β, the leading bit of $b_3$ is γ, and the leading bit of $b_4$ is δ. For each such $b_1, b_2, b_3$, and $b_4$, there is precisely one polynomial f with $f(s) = b_1$, $f(t) = b_2$, $f(u) = b_3$, and $f(v) = b_4$. Hence, the probability that
$$x_s = \alpha,\quad x_t = \beta,\quad x_u = \gamma,\quad \text{and } x_v = \delta$$
is precisely
$$\frac{2^{4(k-1)}}{\text{total number of } f} = \frac{2^{4(k-1)}}{2^{4k}} = \frac{1}{16},$$
as asserted.
The lemma states how to get one vector x with 4-way independence. However, we
need r = O(1/ε2 ) vectors. Also the vectors must be mutually independent. But this is
easy, just choose r polynomials at the outset.
To implement the algorithm with low space, store only the polynomials in memory.
This requires 4k = O(log m) bits per polynomial, for a total of O(log m/ε²) bits. When a symbol s in the stream is read, compute $x^1_s, x^2_s, \ldots, x^r_s$ and update the running sums. Note that $x^1_s$ is just the leading bit of the first polynomial evaluated at s; this calculation is in O(log m) time. Thus, we repeatedly compute the $x_s$ from the “seeds”, namely the
coefficients of the polynomials.
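The following sketch shows how a coordinate x_s can be computed on demand from a four-coefficient seed. To keep the arithmetic simple it works over a prime field Z_p with p > m instead of the GF(2^k) field used in the text; the values f(s) at distinct points are still 4-way independent, but mapping them to ±1 by comparing with p/2 is only almost unbiased (bias O(1/p)), unlike the exactly unbiased leading-bit construction above. The constant p and the function names are our own choices.

```python
import random

P = 2_147_483_647      # a prime assumed to be larger than the alphabet size m

def random_seed():
    """The 'seed': four random coefficients of a degree-3 polynomial."""
    return [random.randrange(P) for _ in range(4)]

def x_coord(seed, s):
    """Coordinate x_s, computed on demand from the O(log m)-bit seed."""
    f0, f1, f2, f3 = seed
    value = (f0 + f1 * s + f2 * s * s + f3 * s ** 3) % P
    return 1 if value < P // 2 else -1   # nearly unbiased sign of f(s)

seed = random_seed()
print([x_coord(seed, s) for s in range(10)])   # any 4 coordinates ~ independent
```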
1.2 Sketch of a Large Matrix
We will see how to find a good approximation of an m×n matrix A, which is read from
external memory, where m and n are large. The approximation consists of a sample of s
columns and r rows of A along with an s × r multiplier matrix. The schematic diagram
is given in Figure 1.3.
The crucial point is that uniform sampling will not always work as is seen from simple
examples. We will see that if we sample rows or columns with probabilities proportional
to the squared length of the row or column, then indeed we can get a good approximation.
One may recall that the top k singular vectors of the SVD of A give a similar picture;
but the SVD takes more time to compute, requires all of A to be stored in RAM, and
does not have the property that the rows and columns are directly from A. However, the
SVD does yield the best approximation. Error bounds for our approximation are weaker,
though it can be found faster.
We briefly touch upon two motivations for such a sketch. Suppose A is the document-
term matrix of a large collection of documents. We are to “read” the collection at the
outset and store a sketch so that later, when a query (represented by a vector with one
entry per term) arrives, we can find its similarity to each document in the collection.
Similarity is defined by the dot product. In Figure 1.3 it is clear that the matrix-vector
product of a query with the right hand side can be done in time O(ns + sr + rm) which
would be linear in n and m if s and r are O(1). To bound errors for this process, we need
to show that the difference between A and the sketch of A has small 2-norm. Recall that the 2-norm $\|A\|_2$ of a matrix A is $\max_{|x|=1} |Ax|$.
We first tackle a simpler problem, that of multiplying two matrices. The matrix
multiplication also uses length squared sampling and is a building block to the sketch.
[Figure 1.3: the sketch of A. The n × m matrix A is approximated by the product of a matrix of s sampled columns (n × s), an s × r multiplier matrix, and a matrix of r sampled rows (r × m).]
Recall that the product AB can be written as a sum of outer products, $AB = \sum_{k=1}^{n} A(:,k)B(k,:)$. [This is easy to verify since the (i, j)th entry of the outer product of the kth column of A and the kth row of B is just $a_{ik}b_{kj}$.] Note that for each value of k, A(:, k)B(k, :) is an
m×p matrix each element of which is a single product of elements of A and B. An obvious
use of sampling suggests itself. Sample some values for k and compute A (:, k) B (k, :) for
the sampled k’s using their suitably scaled sum as the estimate of AB. It turns out that
nonuniform sampling probabilities are useful. Define a random variable z that takes on
values in {1, 2, . . . , n}. Let pk denote the probability that z assumes the value k. The pk
are nonnegative and sum to one. Define an associated random matrix variable that has
value
$$X = \frac{1}{p_z}\, A(:, z)B(z, :),$$
which takes on value $\frac{1}{p_k} A(:, k)B(k, :)$ with probability $p_k$ for k = 1, 2, . . . , n. Let E(X)
denote the entry-wise expectation.
$$E(X) = \sum_{k=1}^{n} \text{Prob}(z = k)\,\frac{1}{p_k}\, A(:, k)B(k, :) = \sum_{k=1}^{n} A(:, k)B(k, :) = AB.$$
This explains the scaling by $\frac{1}{p_z}$ in X. [In general, if we wish to estimate the sum of n
real numbers a1 , a2 , . . . , an by drawing a sample z from {1, 2, . . . , n} with probabilities
p1 , p2 , . . . , pn , then the unbiased estimator of the sum a1 + a2 + · · · + an is az /pz .]
We wish to bound the error in this estimate for which the usual quantity of interest is
the variance. But here each entry xij of the matrix random variable may have a different
variance. So, we define the variance of X as the sum of the variances of all its entries. This
natural device of just taking the sum of the variances greatly simplifies the calculations
as we will see.
$$\mathrm{Var}(X) = \sum_{i=1}^{m}\sum_{j=1}^{p} \mathrm{Var}(x_{ij}) = \sum_{ij} E\left(x_{ij}^2\right) - \sum_{ij}\left(E(x_{ij})\right)^2 = \sum_{i,j}\sum_{k} p_k \frac{1}{p_k^2}\, a_{ik}^2 b_{kj}^2 - \sum_{ij} (AB)_{ij}^2.$$
If $p_k$ is proportional to $|A(:, k)|^2$, i.e., $p_k = \frac{|A(:,k)|^2}{\|A\|_F^2}$, we get a simple bound for Var(X).
Algorithm: Input: A and B, m × n and n × p matrices respectively, and a positive integer s. The algorithm finds an approximation A ⊗s B to the matrix product AB.
Since ⊗s is a randomized algorithm, it does not give us exactly the same output every
time, so A ⊗s B is not strictly speaking a function of A, B.
Using probabilities that are proportional to length squared of columns of A turns out
to be useful in other contexts as well and sampling according to them is called “length-
squared sampling”.
The Lemma below shows that the error between the correct AB and the estimate A ⊗s B goes down as s increases. The error is in terms of $\|A\|_F \|B\|_F$, which is natural.
The basic step in the algorithm is computing the s outer products, each of which takes time O(mp). In addition, we need the time to compute the length-squared of the columns of A (which can be done by reading A once) and drawing the sample.
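Here is a sketch of one natural reading of the algorithm A ⊗s B, consistent with the estimator X defined above: draw s column indices with the length-squared probabilities and average the scaled outer products. The numpy interface, function name, and the in-RAM (non-streaming) implementation are our own choices.

```python
import numpy as np

def approx_product(A, B, s, rng=np.random.default_rng()):
    """Length-squared sampled estimate of AB using s sampled columns/rows."""
    # p_k proportional to |A(:,k)|^2
    p = (A ** 2).sum(axis=0)
    p = p / p.sum()
    ks = rng.choice(A.shape[1], size=s, p=p)
    est = np.zeros((A.shape[0], B.shape[1]))
    for k in ks:
        est += np.outer(A[:, k], B[k, :]) / p[k]   # unbiased single-sample term
    return est / s                                 # average of the s samples

rng = np.random.default_rng(0)
A, B = rng.normal(size=(50, 200)), rng.normal(size=(200, 30))
err = np.linalg.norm(A @ B - approx_product(A, B, s=100, rng=rng))
print(err / np.linalg.norm(A @ B))   # relative error shrinks as s grows
```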
Lemma 1.5 If A is in column order and B is in row order, then as the data streams we
can draw samples and implement the algorithm A ⊗s B with O(s(m + p)) RAM space in
the streaming model.
Proof: As the columns of A stream by, we can draw a length-squared sample on the fly: keep a running sampled column, replacing it by the (i + 1)st column with probability (length squared of the (i + 1)st column)/(sum of length squared of columns 1, 2, . . . , i + 1), much as we did earlier with real numbers (details left to the reader), so at the end we draw one length-squared sampled column. We simultaneously draw s samples. Then we pick the corresponding rows of B (no sampling) and finally carry out the multiplication algorithm ⊗s with the sample in RAM.
An important special case is the multiplication AAT where the length-squared distri-
bution can be shown to be the optimal distribution. The variance of X defined above for
this case is
$$\sum_{i,j} E\left(x_{ij}^2\right) - \sum_{ij} E^2(x_{ij}) = \sum_{k} \frac{1}{p_k}\,|A(:, k)|^4 - \|AA^T\|_F^2.$$
The second term is independent of the pk . The first term is minimized when the pk
are proportional to |A(:, k)|2 , hence length-squared distribution minimizes the variance
of this estimator. [Show that for any positive real numbers $a_1, a_2, \ldots, a_n$, the minimum value of $\frac{1}{p_1}a_1^2 + \frac{1}{p_2}a_2^2 + \cdots + \frac{1}{p_n}a_n^2$ over all $p_1, p_2, \ldots, p_n \ge 0$ summing to 1 is attained when $p_k \propto a_k$.]
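One way to verify the bracketed claim (a sketch of an argument, not necessarily the intended one) is the Cauchy-Schwarz inequality:
$$\Big(\sum_k a_k\Big)^2 = \Big(\sum_k \frac{a_k}{\sqrt{p_k}}\cdot \sqrt{p_k}\Big)^2 \le \Big(\sum_k \frac{a_k^2}{p_k}\Big)\Big(\sum_k p_k\Big) = \sum_k \frac{a_k^2}{p_k},$$
with equality exactly when $a_k/\sqrt{p_k}$ is proportional to $\sqrt{p_k}$, i.e., when $p_k \propto a_k$; so the minimum value $\big(\sum_k a_k\big)^2$ is attained at $p_k = a_k / \sum_l a_l$.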
• Pick s columns of A using length-squared sampling (here lengths are column lengths).
• Pick r rows of A using length-squared sampling (here lengths are row lengths).
We will see that from these sampled columns and rows, we can construct a succinct
approximation to A. For this, we first start with what seems to be a strange idea.
Observe that A = AI, where I is the n × n identity matrix. Now, lets approximate AI
by A ⊗s I. The error is given by Lemma (1.4):
$$\|A - A \otimes_s I\|_F^2 \le \frac{1}{s}\,\|A\|_F^2\,\|I\|_F^2.$$
Our aim would be to make the error at most $\varepsilon\|A\|_F^2$, for which we will need to make $s > \|I\|_F^2/\varepsilon = n/\varepsilon$, which of course is ridiculous since we want s ≪ n. But this was
caused by the fact that I had large (n) rank. A smaller rank “identity-like” matrix might
work better. For this, we will find the SVD of R, where, R is the r × n matrix consisting
of the sampled rows of A.
$$\text{SVD of } R: \qquad R = \sum_{i=1}^{t} \sigma_i u_i v_i^T.$$
Claim 1 The matrix $W = \sum_{i=1}^{t} v_i v_i^T$ acts as the identity when applied to any vector v in the row space of R (from the left). I.e., for such v, Wv = v.
We will approximate A by
A ≈ AW ≈ A ⊗s W.
The following Lemma shows that these approximations are valid.
Proof: ||A − AW ||2 = |(A − AW )x| for some unit length vector x by definition of || · ||2 .
Let y be the component of x in the space spanned by v1 , v2 , . . . , vt and let v = x − y
be the component of x orthogonal to this space. By Claim (1), AW y = Ay, so that
(A − AW )y = 0. Hence, without loss of generality, we may assume x has no component
in the space spanned by the vi . So, W x = 0 and thus (A − AW )x = Ax. Also, Rx = 0.
Now imagine approximating the matrix multiplication AT A by AT ⊗r A. Since the columns
of AT are rows of A, this can be done using the sampled rows of A. So, it follows from
Rx = 0 that $A^T \otimes_r A\, x = 0$. Using this, we get
$$|(A - AW)x|^2 = |Ax|^2 = x^T A^T A\, x = x^T\left(A^T A - A^T \otimes_r A\right)x \le \|A^T A - A^T \otimes_r A\|_2.$$
So, $\|AW - A\|_2^2 \le \|A^T A - A^T \otimes_r A\|_2$. Thus, we have
$$E\left(\|A - AW\|_2\right) \le \left(E\left(\|A - AW\|_2^2\right)\right)^{1/2} \le \left(E\|A^T A - A^T \otimes_r A\|_2\right)^{1/2} \le \left(\frac{1}{\sqrt{r}}\,\|A\|_F^2\right)^{1/2},$$
by Lemma (1.4), proving (1).
For (2), Lemma (1.4) implies that $E\left(\|AW - A \otimes_s W\|_2\right) \le \frac{1}{\sqrt{s}}\,\|A\|_F\,\|W\|_F$. Now, $\|W\|_F^2$ is the sum of the squared lengths of the rows of W, which equals the trace of $WW^T$, and since $WW^T = \sum_{i=1}^{t} v_i v_i^T$, we get that the trace of $WW^T$ is t, which proves (2).
The proof of (3) is left to the reader. Also left to the reader are choices of s, r to make
the error bounds small.
Given an n × d matrix A, we would like a smaller matrix B, consisting of a few scaled rows of A, such that for every d-vector v,
$$|Bv| \approx_\varepsilon |Av|,$$
where, for two non-negative real numbers a, b, we say b ≈ε a iff b ∈ [(1 − ε)a, (1 + ε)a]. [In
words, b is an approximation to a with relative error at most ε.] So, we wish the length
of Bv to be a relative error approximation to length of Av for every v. We give two
motivations for this problem before solving it by showing that with r = O(d log d/ε⁴), we
can indeed solve the problem. [In words, essentially O(d log d) rows of A suffice, however
large n is.]
Start with the familiar example of a document term matrix. Suppose we have n
documents and d terms, where, n >> d. Each document is represented as before as a
d− vector with one component per term. A very general problem we can ask is for a
summary of A so that for every new document (which is itself represented as a d−vector
v), we should be able to compute an approximation to the vector of similarities of v to
each existing document - i.e., the vector Av to relative error ε, namely, we want a vector
z approximating Av in the sense zi ≈ε (Av)i for every i. We do not know a solution
of this problem at the current state of knowledge. A simpler question is to ask that the
summary be able to approximate |Av|2 which is the sum of squared similarities of the
new document to every document in the collection and this is the problem we solve here.
A second important example is that of graphs. Suppose we are given a graph G(V, E)
with d vertices and n edges; n could be as large as $\binom{d}{2}$ (and no larger). Represent the
graph by a signed edge-vertex incidence matrix A. A has one column per vertex and one
row per edge. Each row has exactly two non-zero entries: a +1 for one end vertex of the
edge and a -1 for the other end vertex. [We arbitrarily choose which end vertex gets a 1
and which a -1.] For S ⊆ V , the cut (S, S̄) is the set of edges between S and S̄. It is easy
to show that
The number of edges in the cut (S, S̄) = |A1S |2 ,
where 1_S is the vector with a 1 for each i ∈ S and a 0 for each i ∉ S. [We leave this to the reader.] So the problem here is to “sparsify” the graph, i.e., produce a subset of
O(d log d) edges to form a sub-graph H and weight them so that for every cut, the
weight of the cut in H is a relative error ε approximation to the number of edges in the
cut in G.
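A quick numerical check of this identity on a small graph (a 4-cycle); the incidence-matrix construction, the orientation choices, and the particular cut are ours.

```python
import numpy as np

# Signed edge-vertex incidence matrix of a 4-cycle on vertices 0,1,2,3.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
A = np.zeros((len(edges), 4))
for row, (u, v) in enumerate(edges):
    A[row, u], A[row, v] = 1, -1        # arbitrary orientation of each edge

one_S = np.array([1, 1, 0, 0])           # S = {0, 1}, so the cut has 2 edges
print(int(np.dot(A @ one_S, A @ one_S)))  # prints 2 = |A 1_S|^2
```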
Theorem 1.7 Suppose A is any n × d matrix. We can find (in polynomial time) a subset of r = Ω(d log d/ε⁴) rows of A and multiply each by a scalar to form an r × d matrix B so that
$$|Bv| \approx_\varepsilon |Av| \quad \text{for all } d\text{-vectors } v.$$
Proof: The proof will use length-squared sampling on a matrix A0 obtained from A by
scaling the rows of A suitably. First, we argue that we cannot do without this scaling.
Consider the case of graphs, where, A is the edge-vertex incidence matrix as above. The
graph might have one (or more) edge(s) which must be included in the sparsifier. A simple
example is the case when G consists of two cliques of size n/2 each connected by a single
edge. So the cut with one of the clique-vertices forming S has one edge and unless we
pick this edge in our subset, we would have 0 weight across the cut which is not correct.
But if we just use uniform sampling, we are quite unlikely to pick this edge among our
sample of d log d edges since there are in total Ω(d²) edges.
Now to the proof of the theorem: First observe that $|Av|^2 = v^T(A^TA)v$ and similarly for B. Also observe that for any two non-negative real numbers a, b, $b^2 \approx_\varepsilon a^2$ implies that $b \approx_\varepsilon a$. So intuitively, it suffices to approximate $A^TA$. For this, $A^T \otimes_r A$ suggests itself as
a sampling based method, since, as we saw earlier, AT ⊗ A may be written as B T B where
B consists of some rows of A scaled.
We start with an important fact about length-squared sampling which is stated below.
[The proof is beyond the scope of the book and so we assume the theorem without proof.]
Theorem 1.8 For any n × d matrix A, if r ∈ Ω(d log d/ε⁴), then (with high probability) $\|A^TA - A^T \otimes_r A\|_2 \le \varepsilon\|A\|_2^2$.
In other words, O(d log d) sampled rows of A are sufficient to approximate $A^TA$ in spectral norm to error $\varepsilon\|A\|_2^2$. Recalling that $A^T \otimes_r A$ can be written as $B^TB$, where B is an r × d matrix consisting of the r sampled rows, scaled (see Section (1.2.1)), we get that $v^T(B^TB)v = |Bv|^2$ is within $\varepsilon\|A\|_2^2$ of $|Av|^2$ for unit vectors v. But for relative error, we need the error to be at most $\varepsilon|Av|^2$, and if $\|A\|_2^2 > |Av|^2$, which is the case in general, we do not get a relative error result directly.
To achieve relative error, more work is needed. Find the SVD of A and suppose it is
$$A = \sum_{i=1}^{t} \sigma_i u_i v_i^T, \qquad t = \text{Rank}(A).$$
Let $P = \sum_{i=1}^{t} u_i v_i^T$. P has all singular values equal to 1, so that |Pv| = |v| for all vectors v in the row space of P.
Apply the theorem now to P to get that for every non-zero vector w in the row space of P:
$$w^T A^T \otimes_r A\, w \approx_\varepsilon w^T A^T A\, w.$$
Suppose one wished to store all the web pages from the WWW. Since there are billions
of web pages, one might store just a sketch of each page where a sketch is a few hundred
bits that capture sufficient information to do whatever task one had in mind. A web page
or a document is a sequence. We begin this section by showing how to sample a set and
then how to convert the problem of sampling a sequence into a problem of sampling a set.
Consider subsets of size 1000 of the integers from 1 to 10⁶. Suppose one wished to compute the resemblance of two subsets A and B by the formula
$$\text{resemblance}(A, B) = \frac{|A \cap B|}{|A \cup B|}.$$
Suppose that instead of using the sets A and B, one sampled the sets and compared ran-
dom subsets of size ten. How accurate would the estimate be? One way to sample would
be to select ten elements uniformly at random from A and B. However, this method is
unlikely to produce overlapping samples. Another way would be to select the ten smallest
elements from each of A and B. If the sets A and B overlapped significantly one might
expect the sets of ten smallest elements from each of A and B to also overlap. One dif-
ficulty that might arise is that the small integers might be used for some special purpose
and appear in essentially all sets and thus distort the results. To overcome this potential
problem, rename all elements using a random permutation.
Suppose two subsets of size 1000 overlapped by 900 elements. What would the over-
lap be of the 10 smallest elements from each subset? One would expect the nine smallest
elements from the 900 common elements to be in each of the two subsets for an overlap
of 90%. The resemblance(A, B) for the size ten sample would be 9/11=0.81.
Another method would be to select the elements equal to zero mod m for some inte-
ger m. If one samples mod m the size of the sample becomes a function of n. Sampling
mod m allows us to also handle containment.
In another version of the problem one has a sequence rather than a set. Here one con-
verts the sequence into a set by replacing the sequence by the set of all short subsequences
of some length k. Corresponding to each sequence is a set of length k subsequences. If
k is sufficiently large, then two sequences are highly unlikely to give rise to the same set
of subsequences. Thus, we have converted the problem of sampling a sequence to that of
sampling a set. Instead of storing all the subsequences, we need only store a small subset
of the set of length k subsequences.
Suppose you wish to be able to determine if two web pages are minor modifications
of one another or to determine if one is a fragment of the other. Extract the sequence
of words occurring on the page. Then define the set of subsequences of k consecutive
words from the sequence. Let S(D) be the set of all subsequences of length k occurring
in document D. Define the resemblance of A and B by
$$\text{resemblance}(A, B) = \frac{|S(A) \cap S(B)|}{|S(A) \cup S(B)|}.$$
Then, writing F(D) for the sample of S(D) consisting of its smallest elements (after hashing or random renaming) and V(D) for the subset of S(D) whose elements are 0 mod m, the corresponding estimates of the resemblance are
$$\frac{|F(A) \cap F(B)|}{|F(A) \cup F(B)|} \qquad \text{and} \qquad \frac{|V(A) \cap V(B)|}{|V(A) \cup V(B)|}.$$
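Putting the pieces together, here is a sketch that shingles each document into length-k word sequences and compares the ten smallest hashed shingles from each; the value of k, the sample size, and the use of Python's built-in hash (whose per-process randomization plays the role of the random renaming) are illustrative choices, not prescribed by the text.

```python
def shingles(text, k=5):
    """Set of all length-k sequences of consecutive words."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def sketch(text, k=5, size=10):
    """The 'size' smallest hashed shingles stand in for the whole set."""
    return set(sorted(hash(s) for s in shingles(text, k))[:size])

def estimated_resemblance(a, b):
    sa, sb = sketch(a), sketch(b)
    return len(sa & sb) / len(sa | sb)

doc1 = "when you walk through the storm hold your head up high " * 3
doc2 = "when you walk through the storm hold your head up very high " * 3
print(estimated_resemblance(doc1, doc2))
```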
1.5 Exercises
Algorithms for Massive Data Problems
Exercise 1.1 Given a stream of n positive real numbers a1 , a2 , . . . , an , upon seeing
a1 , a2 , . . . , ai keep track of the sum a = a1 + a2 + · · · + ai and a sample aj , j ≤ i drawn
with probability proportional to its value. On reading $a_{i+1}$, with probability $\frac{a_{i+1}}{a + a_{i+1}}$ replace
the current sample with ai+1 and update a. Prove that the algorithm selects an ai from
the stream with the probability of picking ai being proportional to its value.
Exercise 1.2 Given a stream of symbols a1 , a2 , . . . , an , give an algorithm that will select
one symbol uniformly at random from the stream. How much memory does your algorithm
require?
Exercise 1.4 How would one pick a random word from a very large book where the prob-
ability of picking a word is proportional to the number of occurrences of the word in the
book?
Solution: Put your finger on a random word. The probabilities will be automatically
proportional to the number of occurrences, since each occurrence is equally likely to be
picked.
Exercise 1.5 For the streaming model give an algorithm to draw s independent samples
each with the probability proportional to its value. Justify that your algorithm works
correctly.
Exercise 1.6 Show that for a 2-universal hash family, Prob(h(x) = z) = $\frac{1}{M+1}$ for all x ∈ {1, 2, . . . , m} and z ∈ {0, 1, 2, . . . , M}.
Exercise 1.7 (a) Is the set {hab (x) = ax + b mod p | 0 ≤ a, b < p} of hash functions 3-universal?
Solution: No. h (x) = u and h (y) = v determine a and b and hence h (z).
Exercise 1.8 Give an example of a set of hash functions that is not 2-universal.
Solution: Consider, for each i, the hash function h_i that maps all x's to the integer i. The family {h_1, h_2, . . . , h_m} is a set of hash functions where every x is equally likely to be mapped to any i in the range 1 to m by a hash function chosen at random. However, h(x) and h(y) are clearly not independent, so the family is not 2-universal.
Exercise 1.9
(a) What is the variance of the method in Section 1.1.2 of counting the number of occur-
rences of a 1 with log log n memory?
(b) Can the algorithm be iterated to use only log log log n memory? What happens to the
variance?
Exercise 1.10 Consider a coin that comes down heads with probability p. Prove that the
expected number of flips before a head occurs is 1/p.
Solution: Let f be the number of flips. Then
$$E[f] = p + 2(1-p)p + 3(1-p)^2 p + \cdots = \frac{p}{1-p}\left[(1-p) + 2(1-p)^2 + 3(1-p)^3 + \cdots\right] = \frac{p}{1-p}\cdot\frac{1-p}{p^2} = \frac{1}{p}.$$
Exercise 1.11 Randomly generate a string $x_1 x_2 \cdots x_n$ of $10^6$ 0's and 1's with probability 1/2 of each $x_i$ being a 1. Count the number of ones in the string and also estimate the number of ones by the approximate counting algorithm. Repeat the process for p = 1/4, 1/8, and 1/16. How close is the approximation?
Counting Frequent Elements
The Majority and Frequent Algorithms
The Second Moment
Exercise 1.12 Construct an example in which the majority algorithm gives a false posi-
tive, i.e., stores a nonmajority element at the end.
Exercise 1.13 Construct examples where the frequent algorithm in fact does as badly as in the theorem, i.e., it “undercounts” some item by n/(k+1).
Exercise 1.14 Recall basic statistics on how an average of independent trials cuts down variance and complete the argument for a relative error ε estimate of $\sum_{s=1}^{m} f_s^2$.
Exercise 1.16 Suppose we want to pick a row of a matrix at random where the probability
of picking row i is proportional to the sum of squares of the entries of that row. How would
we do this in the streaming model? Do not assume that the elements of the matrix are
given in row order.
(a) Do the problem when the matrix is given in column order.
(b) Do the problem when the matrix is represented in sparse notation: it is just presented
as a list of triples (i, j, aij ), in arbitrary order.
Solution: Pick an element aij with probability proportional to a2ij . We have already
described the algorithm for this. Then return the i (row number) of the element picked.
The probability of picking one particular i is just the sum of the probabilities of picking each element of the row, which is $\frac{\sum_j a_{ij}^2}{\sum_{k,l} a_{kl}^2}$, exactly as desired. Note that this works even in the sparse representation.
Exercise 1.17 Suppose A and B are two matrices. Show that $AB = \sum_{k=1}^{n} A(:, k)B(k, :)$.
Exercise 1.18 Generate two 100 by 100 matrices A and B with integer values between
1 and 100. Compute the product AB both directly and by sampling. Plot the difference
in L2 norm between the results as a function of the number of samples. In generating
the matrices make sure that they are skewed. One method would be the following. First
generate two 100 dimensional vectors a and b with integer values between 1 and 100. Next
generate the ith row of A with integer values between 1 and ai and the ith column of B
with integer values between 1 and bi .
Exercise 1.20 Suppose $a_1, a_2, \ldots, a_m$ are nonnegative reals. Show that the minimum of $\sum_{k=1}^{m} \frac{a_k}{x_k}$ subject to the constraints $x_k \ge 0$ and $\sum_k x_k = 1$ is attained when the $x_k$ are proportional to $\sqrt{a_k}$.
Sketches of Documents
Exercise 1.21 Consider random sequences of length n composed of the integers 0 through
9. Represent a sequence by its set of length k-subsequences. What is the resemblance of
the sets of length k-subsequences from two random sequences of length n for various values
of k as n goes to infinity?
Exercise 1.22 What if the sequences in the Exercise 1.21 were not random? Suppose the
sequences were strings of letters and that there was some nonzero probability of a given
letter of the alphabet following another. Would the result get better or worse?
Exercise 1.23 Consider a random sequence of length 10,000 over an alphabet of size
100.
(a) For k = 3, what is the probability that two possible successor subsequences for a given
subsequence are in the set of subsequences of the sequence?
Solution: 1.23(a) A subsequence has 100 possible successor subsequences. The probability that a given subsequence is in the set of $10^4$ subsequences of the sequence is 1/100. Thus the answer is $1 - \left(1 - \frac{1}{100}\right)^{100} \approx 1 - \frac{1}{e} > 0.6$.
(b) For k = 5 the universe of possible subsequences has grown to $10^{10}$ and the probability that a given subsequence is in the set drops to $10^{-6}$, so for a single subsequence the probability that one of its 100 possible successors is in the set is $1 - (1 - 10^{-6})^{100} \approx 10^{-4}$, essentially zero. However, since the sequence is of length $10^4$ and each of its subsequences has 100 possible successors, there are about $10^6$ candidates in all, and the probability that none of them lands in the set is about $(1 - 10^{-6})^{10^6} \approx \frac{1}{e}$.
Exercise 1.24 How would you go about detecting plagiarism in term papers?
Exercise 1.25 (Finding duplicate web pages) Suppose you had one billion web pages
and you wished to remove duplicates. How would you do this?
Solution: Suppose web pages are in html. Create a window consisting of each sequence of 10 consecutive html commands, including text. Hash the contents of each window to an integer and save the lowest 10 integers for each page. To detect exact duplicates, hash the 10 integers to a single integer and look for collisions. If we want almost exact duplicates we might ask for eight of the ten integers to be the same. In this case hash the three lowest integers, the next three lowest, and the last four to integers and look for collisions. (Note, however, that eight of the ten integers agreeing does not guarantee that one of these three hashes agrees, since the agreeing integers need not occupy the same positions in the two sorted lists.)
Exercise 1.26 Construct two sequences of 0’s and 1’s having the same set of subsequences
of width w.
Solution: 1.26 For w = 3, 11101111 and 11110111.
Exercise 1.27 Consider the following lyrics:
When you walk through the storm hold your head up high and don’t be afraid of the
dark. At the end of the storm there’s a golden sky and the sweet silver song of the
lark.
Walk on, through the wind, walk on through the rain though your dreams be tossed
and blown. Walk on, walk on, with hope in your heart and you’ll never walk alone,
you’ll never walk alone.
How large must k be to uniquely recover the lyric from the set of all subsequences of
symbols of length k? Treat the blank as a symbol.
Exercise 1.28 Blast: Given a long sequence a, say of length $10^9$, and a shorter sequence b, say of length $10^5$, how do we find a position in a which is the start of a subsequence b′ that is close to
b? This problem can be solved by dynamic programming but not in reasonable time. Find
a time efficient algorithm to solve this problem.
Hint: (Shingling approach) One possible approach would be to fix a small length, say
seven, and consider the shingles of a and b of length seven. If a close approximation to b
is a substring of a, then a number of shingles of b must be shingles of a. This should allow
us to find the approximate location in a of the approximation of b. Some final algorithm
should then be able to find the best match.