A Bootstrap Method For Error Estimation in Randomized Matrix Multiplication
Abstract
In recent years, randomized methods for numerical linear algebra have received growing
interest as a general approach to large-scale problems. Typically, the essential ingredient
of these methods is some form of randomized dimension reduction, which accelerates
computations, but also creates random approximation error. In this way, the dimension
reduction step encodes a tradeoff between cost and accuracy. However, the exact numerical
relationship between cost and accuracy is typically unknown, and consequently, it may be
difficult for the user to precisely know (1) how accurate a given solution is, or (2) how much
computation is needed to achieve a given level of accuracy. In the current paper, we study
randomized matrix multiplication (sketching) as a prototype setting for addressing these
general problems. As a solution, we develop a bootstrap method for directly estimating the
accuracy as a function of the reduced dimension (as opposed to deriving worst-case bounds
on the accuracy in terms of the reduced dimension). From a computational standpoint, the
proposed method does not substantially increase the cost of standard sketching methods,
and this is made possible by an “extrapolation” technique. In addition, we provide both
theoretical and empirical results to demonstrate the effectiveness of the proposed method.
Keywords: matrix sketching, randomized matrix multiplication, bootstrap methods
1. Introduction
The development of randomized numerical linear algebra (RNLA or RandNLA) has led
to a variety of efficient methods for solving large-scale matrix problems, such as matrix
multiplication, least-squares approximation, and low-rank matrix factorization, among
others (Halko et al., 2011; Mahoney, 2011; Woodruff, 2014; Drineas and Mahoney, 2016).
A general feature of these methods is that they apply some form of randomized dimension
reduction to an input matrix, which reduces the cost of subsequent computations. In
exchange for the reduced cost, the randomization leads to some error in the resulting
solution, and consequently, there is a tradeoff between cost and accuracy.
For many canonical matrix problems, the relationship between cost and accuracy has
been the focus of a growing body of theoretical work, and the literature provides many
performance guarantees for RNLA methods. In general, these guarantees offer a good
qualitative description of how the accuracy depends on factors such as problem size, number
of iterations, condition numbers, and so on. Yet, it is also the case that such guarantees
tend to be overly pessimistic for any particular problem instance — often because the
guarantees are formulated to hold in the worst case among a large class of possible inputs.
Likewise, it is often impractical to use such guarantees to determine precisely how accurate
a given solution is, or precisely how much computation is needed to achieve a desired level
of accuracy.
In light of this situation, it is of interest to develop efficient methods for estimating the
exact relationship between the cost and accuracy of RNLA methods on a problem-specific
basis. Since the literature has been somewhat quiet on this general question, the aim of this
paper is to analyze randomized matrix multiplication as a prototype setting, and propose
an approach that may be pursued more broadly. (Extensions are discussed at the end of
the paper in Section 6.)
E[ST S] = In , (2)
with In being the identity matrix. In particular, the relation (2) implies that the sketched
product is an unbiased estimate, E[ÃT B̃] = AT B. Most commonly, the matrix S can be
interpreted as acting on A and B by sampling their rows, or by randomly projecting their
columns. In Section 2, we describe some popular examples of sketching matrices to be
considered in our analysis.
2
A Bootstrap Method for Error Estimation in Randomized Matrix Multiplication
Figure 1: Left panel: The curve shows how $\varepsilon_t$ fluctuates with varying sketch size $t$, as rows are added to $S$, with $A$ and $B$ held fixed. (Each row of $A \in \mathbb{R}^{8124\times 112}$ is a feature vector of the Mushroom dataset (Frank and Asuncion, 2010), and we set $B = A$.) The rows of $S$ were generated randomly from a Gaussian distribution (see Section 2), and the matrix $A$ was scaled so that $\|A^T A\|_\infty = 1$. Right panel: There are 1,000 colored curves, each arising from a repetition of the simulation in the left panel. The thick black curve represents $q_{0.99}(t)$.
For example, the quantity q0.99 (t) is the tightest upper bound on εt that holds with
probability at least 0.99. Hence, for any fixed α, the function q1−α (t) represents a precise
tradeoff curve for relating cost and accuracy. Moreover, the function q1−α (t) is specific to
the input matrices A and B.
To clarify the interpretation of q1−α (t), it is helpful to plot the fluctuations of εt . In
the left panel of Figure 1, we illustrate a simulation where randomly generated rows are
incrementally added to a sketching matrix S, with A and B held fixed. Each time a row is
added to S, the sketch size t increases by 1, and we plot the corresponding value of εt as t
ranges from 100 to 1,700. (Note that the user is typically unable to observe such a curve
in practice.) In the right panel, we display 1,000 repetitions of the simulation, with each
colored curve corresponding to one repetition. (The variation is due only to the different
draws of S.) In particular, the function q0.99 (t) is represented by the thick black curve,
delineating the top 1% of the colored curves at each value of t.
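To make the simulation concrete, the following is a minimal NumPy sketch of how curves like those in Figure 1 can be generated for a Gaussian sketching matrix. The matrix A below is a synthetic stand-in for the Mushroom feature matrix used in the figure, the problem sizes are reduced for speed, and the function names are ours.

```python
import numpy as np

def error_curve(A, t_max, rng):
    """Track eps_t = ||A^T S^T S A - A^T A||_inf as rows are added to a Gaussian sketch S."""
    n, d = A.shape
    AtA = A.T @ A
    running_sum = np.zeros((d, d))       # accumulates sum_i (A^T g_i)(A^T g_i)^T
    errors = np.empty(t_max)
    for t in range(1, t_max + 1):
        g = rng.standard_normal(n)       # one new row of sqrt(t) * S
        Ag = A.T @ g
        running_sum += np.outer(Ag, Ag)
        # With S = G / sqrt(t), we have A^T S^T S A = running_sum / t.
        errors[t - 1] = np.max(np.abs(running_sum / t - AtA))
    return errors

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 30))      # stand-in for a real data matrix
A /= np.sqrt(np.max(np.abs(A.T @ A)))    # scale so that ||A^T A||_inf = 1
curves = np.stack([error_curve(A, 800, np.random.default_rng(s)) for s in range(20)])
q99 = np.quantile(curves, 0.99, axis=0)  # empirical analogue of q_0.99(t)
```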
In essence, the right panel of Figure 1 shows that if the user had knowledge of the
(unknown) function q1−α (t), then two important purposes could be served. First, for any
fixed value $t$, the user would have a sharp problem-specific bound on $\varepsilon_t$. Second, for any fixed error tolerance $\epsilon$, the user could select $t$ so that “just enough” computation is spent in order to achieve $\varepsilon_t \le \epsilon$ with probability at least $1-\alpha$.
The estimation problem. The challenge we face is that a naive computation of q1−α (t)
by generating samples of εt would defeat the purpose of sketching. Indeed, generating
samples of εt by brute force would require running the sketching method many times,
and it would also require computing the entire product AT B. Consequently, the technical
problem of interest is to develop an efficient way to estimate q1−α (t), without adding much
cost to a single run of the sketching method.
1.3. Contributions
From a conceptual standpoint, the main novelty of our work is that it bridges two sets of
ideas that are ordinarily studied in distinct communities. Namely, we apply the statistical
technique of bootstrapping to enhance algorithms for numerical linear algebra. To some
extent, this pairing of ideas might seem counterintuitive, since bootstrap methods are
sometimes labeled as “computationally intensive”, but it will turn out that the cost of
bootstrapping can be managed in our context. Another reason our approach is novel is
that we use the bootstrap to quantify error in the output of a randomized algorithm, rather
than for the usual purpose of quantifying uncertainty arising from data. In this way, our
approach harnesses the versatility of bootstrap methods, and we hope that our results in
the “use case” of matrix multiplication will encourage broader applications of bootstrap
methods in randomized computations. (See also Section 6, and note that in concurrent
work, we have pursued similar approaches in the contexts of randomized least-squares and
classification algorithms (Lopes et al., 2018b; Lopes, 2019).)
From a technical standpoint, our main contributions are a method for estimating the
function q1−α (t), as well as theoretical performance guarantees. Computationally, the
proposed method is efficient in the sense that its cost is comparable to a single run
of standard sketching methods (see Section 2). This efficiency is made possible by an
“extrapolation” technique, which allows us to bootstrap small “initial” sketches with $t_0$ rows, and inexpensively estimate $q_{1-\alpha}(t)$ at larger values $t \gg t_0$. The empirical performance of
the extrapolation technique is also quite encouraging, as discussed in Section 5. Lastly, with
regard to theoretical analysis, our proofs circumvent some technical restrictions occurring
in the analysis of related bootstrap methods in the statistics literature.
At a more technical level, the ability to avoid restrictions on A and B comes from our
use of the Lévy-Prohorov metric for distributional approximations — which differs from
the Kolmogorov metric that has been predominantly used in previous works on multiplier
bootstrap methods. More specifically, analyses based on the Kolmogorov metric typically
rely on “anti-concentration inequalities” (Chernozhukov et al., 2013, 2015), which ultimately
lead to the mentioned variance assumptions. On the other hand, our approach based on the
Lévy-Prohorov metric does not require the use of anti-concentration inequalities. Finally, it should be mentioned that the techniques used to control the LP metric are related to
those that have been developed for bootstrap approximations via coupling inequalities as
in Chernozhukov et al. (2016).
Outline. This paper is organized as follows. Section 2 introduces some technical
background. Section 3 describes the proposed bootstrap algorithm. Section 4 establishes
the main theoretical results, and then numerical performance is illustrated in Section 5.
Lastly, conclusions and extensions of the method are presented in Section 6, and all proofs
are given in the appendices.
2. Preliminaries
Notation and terminology. The set $\{1,\ldots,n\}$ is denoted as $[n]$. The $i$th standard basis vector is denoted as $e_i$. If $C = [c_{ij}]$ is a real matrix, then $\|C\|_F = (\sum_{i,j} c_{ij}^2)^{1/2}$ is the Frobenius norm, and $\|C\|_2$ is the spectral norm (maximum singular value). If $X$ is a random variable and $p \ge 1$, we write $\|X\|_p = (\mathbb{E}[|X|^p])^{1/p}$ for the usual $L^p$ norm. If $\psi : [0,\infty)\to[0,\infty)$ is a non-decreasing convex function with $\psi(0)=0$, then the $\psi$-Orlicz norm of $X$ is defined as $\|X\|_\psi := \inf\{r>0 \mid \mathbb{E}[\psi(|X|/r)] \le 1\}$. In particular, we define $\psi_p(x) := \exp(x^p) - 1$ for $p\ge 1$, and we say that $X$ is sub-Gaussian when $\|X\|_{\psi_2} < \infty$, or sub-exponential when $\|X\|_{\psi_1} < \infty$. In Appendix F, Lemma 9 summarizes the facts about Orlicz norms that will be used.
We will use $c$ to denote a positive absolute constant that may change from line to line. The matrices $A$, $B$, and $S$ are viewed as lying in a sequence of matrices indexed by the tuple $(d, d', t, n)$. For a pair of generic functions $f$ and $g$, we write $f(d,d',t,n) \lesssim g(d,d',t,n)$ when there is a positive absolute constant $c$ so that $f(d,d',t,n) \le c\, g(d,d',t,n)$ holds for all large values of $d$, $d'$, $t$, and $n$. Furthermore, if $a$ and $b$ are two quantities that satisfy both $a\lesssim b$ and $b\lesssim a$, then we write $a\asymp b$. Lastly, we do not use the symbols $\lesssim$ or $\asymp$ when relating random variables.
Examples of sketching matrices. Our theoretical results will deal with three common
types of sketching matrices, reviewed below.
• Row sampling. If $(p_1,\ldots,p_n)$ is a probability vector, then $S \in \mathbb{R}^{t\times n}$ can be constructed by sampling its rows i.i.d. from the set $\{\tfrac{1}{\sqrt{t p_1}}e_1,\ldots,\tfrac{1}{\sqrt{t p_n}}e_n\} \subset \mathbb{R}^n$, where the vector $\tfrac{1}{\sqrt{t p_i}}e_i$ is selected with probability $p_i$. Some of the most well known choices for the sampling probabilities include uniform sampling, with $p_i \equiv 1/n$, length sampling (Drineas et al., 2006a; Magen and Zouzias, 2011), with
$$p_i \;=\; \frac{\|e_i^T A\|_2\,\|e_i^T B\|_2}{\sum_{j=1}^n \|e_j^T A\|_2\,\|e_j^T B\|_2}, \qquad (5)$$
and leverage score sampling, for which further background may be found in the papers
(Drineas et al., 2006b, 2008, 2012).
• Sub-Gaussian projection. Gaussian projection is the most well-known random
projection method, and is sometimes referred to as the Johnson-Lindenstrauss (JL)
transform (Johnson and Lindenstrauss, 1984). In detail, if $G\in\mathbb{R}^{t\times n}$ is a standard Gaussian matrix, with entries that are i.i.d. samples from $N(0,1)$, then $S = \tfrac{1}{\sqrt{t}}G$ is a Gaussian projection matrix. More generally, the entries of $G$ can be drawn i.i.d. from
a zero-mean sub-Gaussian distribution, which often leads to similar performance
characteristics in RNLA applications.
• Subsampled randomized Hadamard transform (SRHT). Let $n$ be a power of 2, and define the Walsh-Hadamard matrix $H_n$ recursively∗
$$H_n := \begin{bmatrix} H_{n/2} & H_{n/2}\\ H_{n/2} & -H_{n/2}\end{bmatrix} \qquad\text{with}\qquad H_2 := \begin{bmatrix} 1 & 1\\ 1 & -1\end{bmatrix}.$$
Next, let $D_n^\circ \in \mathbb{R}^{n\times n}$ be a random diagonal matrix with independent ±1 Rademacher variables along the diagonal, and let $P\in\mathbb{R}^{t\times n}$ have rows sampled uniformly from $\{\sqrt{n/t}\;e_1,\ldots,\sqrt{n/t}\;e_n\}$. Then, the $t\times n$ matrix
$$S := P\,\tfrac{1}{\sqrt{n}}\,H_n D_n^\circ \qquad (6)$$
is called an SRHT matrix. This type of sketching matrix was introduced in the seminal
paper (Ailon and Chazelle, 2006), and additional details regarding implementation
may be found in the papers (Drineas et al., 2011; Wang, 2015). (The factor $\tfrac{1}{\sqrt{n}}$ is used so that $\tfrac{1}{\sqrt{n}}H_n$ is an orthogonal matrix.) An important property of SRHT matrices is
that they can be multiplied with any n × d matrix in O(n · d · log t) time (Ailon and
Liberty, 2009), which is faster than the O(n · d · t) time usually required for a dense
sketching matrix.
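For concreteness, the following is a minimal NumPy sketch of one way to construct each of the three types of sketching matrices reviewed above. The helper names are ours; the SRHT version builds the Hadamard matrix densely (via SciPy) for clarity rather than using the fast O(n · d · log t) transform, and it assumes n is a power of 2.

```python
import numpy as np
from scipy.linalg import hadamard

def length_sampling_sketch(A, B, t, rng):
    """Row-sampling sketch S with the length-sampling probabilities p_i from equation (5)."""
    n = A.shape[0]
    w = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    p = w / w.sum()
    idx = rng.choice(n, size=t, p=p)
    S = np.zeros((t, n))
    S[np.arange(t), idx] = 1.0 / np.sqrt(t * p[idx])
    return S

def gaussian_sketch(n, t, rng):
    """Gaussian projection: S = G / sqrt(t) with i.i.d. N(0,1) entries."""
    return rng.standard_normal((t, n)) / np.sqrt(t)

def srht_sketch(n, t, rng):
    """SRHT: S = P (1/sqrt(n)) H_n D_n, with uniform row sampling (n must be a power of 2)."""
    D = rng.choice([-1.0, 1.0], size=n)            # Rademacher diagonal of D_n
    H = hadamard(n) / np.sqrt(n)                   # orthogonal Walsh-Hadamard matrix
    idx = rng.choice(n, size=t)                    # uniform sampling with replacement
    P = np.zeros((t, n))
    P[np.arange(t), idx] = np.sqrt(n / t)          # rows sqrt(n/t) * e_i
    return P @ (H * D)                             # H * D applies D_n on the right
```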
3. Methodology
Before presenting our method in algorithmic form, we first explain the underlying intuition.
To this end, if $s_i^T \in \mathbb{R}^n$ denotes the $i$th row of the matrix $\sqrt{t}\,S$, and we define
$$D_i := A^T s_i\, s_i^T B, \qquad (8)$$
∗ The restriction that n is a power of 2 can be relaxed with variants of SRHT matrices (Avron et al., 2010; Boutsidis and Gittens, 2013).
then $\mathbb{E}[D_i] = A^T B$, and it follows that the difference between the sketched and unsketched products can be viewed as a sample average of zero-mean random matrices,
$$A^T S^T S B - A^T B \;=\; \frac{1}{t}\sum_{i=1}^t \big(D_i - A^T B\big). \qquad (9)$$
Furthermore, in the cases of length sampling and Gaussian projection, the matrices
D1 , . . . , Dt are independent, and in the case of SRHT sketches, these matrices are “nearly”
independent. So, in light of the central limit theorem, it is natural to suspect that
the random matrix (9) will be well-approximated (in distribution) by a matrix with
Gaussian entries. In particular, if we examine the $(j_1,j_2)$ entry, then we may expect that $e_{j_1}^T\big(A^T S^T S B - A^T B\big)e_{j_2}$ will approximately follow the distribution $N\big(0, \tfrac{1}{t}\sigma_{j_1,j_2}^2\big)$, where the unknown parameter $\sigma_{j_1,j_2}^2$ can be estimated with
$$\hat\sigma_{j_1,j_2}^2 \;:=\; \frac{1}{t}\sum_{i=1}^t \Big(e_{j_1}^T\big(D_i - A^T S^T S B\big)e_{j_2}\Big)^2.$$
Based on these considerations, the idea of the proposed bootstrap method is to generate a random matrix whose $(j_1,j_2)$ entry is sampled from $N\big(0,\tfrac{1}{t}\hat\sigma_{j_1,j_2}^2\big)$. It turns out that an efficient way of generating such a matrix is to sample i.i.d. random variables $\xi_1,\ldots,\xi_t\sim N(0,1)$, independent of $S$, and then compute
$$\frac{1}{t}\sum_{i=1}^t \xi_i\big(D_i - A^T S^T S B\big). \qquad (10)$$
In other words, if $S$ is conditioned upon, then the distribution of the $(j_1,j_2)$ entry of the above matrix is exactly $N\big(0,\tfrac{1}{t}\hat\sigma_{j_1,j_2}^2\big)$.† Hence, if the matrix (10) is viewed as an “approximate sample” of $A^T S^T S B - A^T B$, then it is natural to use the $\ell_\infty$-norm of the matrix (10) as an approximate sample of $\varepsilon_t = \|A^T S^T S B - A^T B\|_\infty$. Likewise, if we define the bootstrap sample
$$\varepsilon_t^\star \;:=\; \Big\|\frac{1}{t}\sum_{i=1}^t \xi_i\big(D_i - A^T S^T S B\big)\Big\|_\infty, \qquad (11)$$
then the bootstrap algorithm will generate i.i.d. samples of ε?t , conditionally on S. In turn,
the (1 − α)-quantile of the bootstrap samples, say q̂1−α (t), can be used to estimate q1−α (t).
† It is also possible to show that the joint distribution of the entries in the matrix (10) mimics that of $A^T S^T S B - A^T B$, but we omit such details to simplify the discussion.
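To illustrate the procedure just described (Algorithm 1 gives the formal statement), here is a minimal NumPy sketch of the multiplier bootstrap. The function name and interface are ours; the computation relies on the identity Ã^T B̃ = (1/t) Σ_i D_i, so that only the sketches Ã = SA and B̃ = SB are needed, and the quantile is taken from the bootstrap samples by interpolation, as in the experiments of Section 5.

```python
import numpy as np

def multiplier_bootstrap_quantile(A_sk, B_sk, num_boot=20, alpha=0.01, rng=None):
    """Estimate q_{1-alpha}(t) from the sketches A_sk = S A and B_sk = S B alone.

    With s_i^T the rows of sqrt(t) S, the matrices D_i = A^T s_i s_i^T B average to
    A_sk^T B_sk, and each bootstrap sample is the l_inf norm of
    (1/t) * sum_i xi_i (D_i - A_sk^T B_sk), as in equations (10)-(11).
    """
    rng = np.random.default_rng() if rng is None else rng
    t = A_sk.shape[0]
    sketched_product = A_sk.T @ B_sk                 # equals (1/t) sum_i D_i
    samples = np.empty(num_boot)
    for b in range(num_boot):
        xi = rng.standard_normal(t)
        # (1/t) sum_i xi_i D_i = A_sk^T diag(xi) B_sk, since row i of A_sk is s_i^T A / sqrt(t)
        weighted = (A_sk * xi[:, None]).T @ B_sk
        boot_matrix = weighted - xi.mean() * sketched_product
        samples[b] = np.max(np.abs(boot_matrix))
    return np.quantile(samples, 1 - alpha)
```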
Hence, if the user would like to determine a sketch size $t$ so that $q_{1-\alpha}(t) \le \epsilon$, for some tolerance $\epsilon$, then $t$ should be selected so that $\hat q^{\,\text{ext}}_{1-\alpha}(t) \le \epsilon$, which is equivalent to
$$t \;\ge\; \bigg(\frac{\sqrt{t_0}\,\hat q_{1-\alpha}(t_0)}{\epsilon}\bigg)^{2}. \qquad (13)$$
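As a small illustration, the sketch below computes the extrapolated estimate and the corresponding sketch size. It assumes that the extrapolation rule (12) has the square-root scaling form implied by the equivalence with (13).

```python
import numpy as np

def extrapolate_quantile(q_hat_t0, t0, t):
    """Extrapolated estimate, assuming the sqrt(t0/t) scaling implied by (13)."""
    return np.sqrt(t0 / t) * q_hat_t0

def smallest_sketch_size(q_hat_t0, t0, eps):
    """Smallest t with extrapolated quantile <= eps, i.e. the condition in (13)."""
    return int(np.ceil(t0 * (q_hat_t0 / eps) ** 2))

# Example: if the bootstrap at t0 = 500 gives q_hat = 0.2 and the tolerance is 0.05,
# then t >= 500 * (0.2 / 0.05)^2 = 8000 rows should suffice with probability ~ 1 - alpha.
```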
and in fact, this could be improved further if parallelization of Algorithm 1 is taken into account. It is also important to note that rather small values of $B$ are shown to work well in our experiments, such as $B = 20$. Hence, as long as $t_0$ remains fairly small compared to $t$, then the condition (14) may be expected to hold, and this is borne out in our experiments. The same reasoning also applies when $n\log(t)$ is large compared to $d\cdot t_0$, which conforms with the fact that sketching methods are intended to handle situations where $n$ is very large.
and likewise for B̃. Hence, if S is conditioned upon, then sampling with replacement from
the rows of à and B̃ imitates the random mechanism that originally generated à and B̃.
1. Draw a vector $(i_1,\ldots,i_t)$ by sampling $t$ numbers with replacement from $\{1,\ldots,t\}$.
2. Form matrices $\tilde A^* \in \mathbb{R}^{t\times d}$ and $\tilde B^* \in \mathbb{R}^{t\times d'}$ by selecting (respectively) the rows from $\tilde A$ and $\tilde B$ that are indexed by $(i_1,\ldots,i_t)$.
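The following is a minimal NumPy sketch of this resampling scheme. Since the remaining steps of Algorithm 2 are not reproduced above, the recomputation of the sketched product and the quantile step below follow the plug-in interpretation described in the text, and the function name is ours.

```python
import numpy as np

def nonparametric_bootstrap_quantile(A_sk, B_sk, num_boot=20, alpha=0.01, rng=None):
    """Plug-in bootstrap: resample the rows of the sketches A_sk = S A and B_sk = S B.

    Each bootstrap sample replaces A_sk^T B_sk with (A*)^T (B*), where the starred
    matrices are formed from rows drawn with replacement (steps 1-2 above).
    """
    rng = np.random.default_rng() if rng is None else rng
    t = A_sk.shape[0]
    sketched_product = A_sk.T @ B_sk
    samples = np.empty(num_boot)
    for b in range(num_boot):
        idx = rng.integers(0, t, size=t)        # step 1: sample indices with replacement
        A_star, B_star = A_sk[idx], B_sk[idx]   # step 2: select the resampled rows
        samples[b] = np.max(np.abs(A_star.T @ B_star - sketched_product))
    return np.quantile(samples, 1 - alpha)
```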
4. Main results
Our main results quantify how well the estimate q̂1−α (t) from Algorithm 1 approximates
the true value q1−α (t), and this will be done by analyzing how well the distribution of a
bootstrap sample ε?t,1 approximates the distribution of εt . For the purposes of comparing
distributions, we will use the Lévy-Prohorov metric, defined below.
Lévy-Prohorov (LP) metric. Let $\mathcal{L}(U)$ denote the distribution of a random variable $U$, and let $\mathscr{B}$ denote the collection of Borel subsets of $\mathbb{R}$. For any $A\in\mathscr{B}$, and $\delta>0$, define the $\delta$-neighborhood $A^\delta := \{x\in\mathbb{R} \mid \inf_{y\in A}|x-y|\le\delta\}$. Then, for any two random variables $U$ and $V$, the $d_{LP}$ metric between their distributions is given by
$$d_{LP}\big(\mathcal{L}(U),\mathcal{L}(V)\big) := \inf\Big\{\delta>0 \;\Big|\; P(U\in A)\le P(V\in A^\delta)+\delta \ \text{ for all } A\in\mathscr{B}\Big\}.$$
The dLP metric is a standard tool for comparing distributions, due to the fact that
convergence with respect to dLP is equivalent to convergence in distribution (Huber and
Ronchetti, 2009, Theorem 2.9).
Approximating quantiles. An important property of the dLP metric is that if two
distributions are close in this metric, then their quantiles are close in the following sense.
Recall that if $F_U$ is the distribution function of a random variable $U$, then the $(1-\alpha)$-quantile of $U$ is the same as the generalized inverse $F_U^{-1}(1-\alpha) := \inf\{q\in[0,\infty) \mid F_U(q)\ge 1-\alpha\}$. Next, suppose that two random variables $U$ and $V$ satisfy
$$d_{LP}\big(\mathcal{L}(U),\mathcal{L}(V)\big) \;\le\; \epsilon,$$
for some $\epsilon\in(0,\alpha)$ with $\alpha\in(0,1/2)$. Then, the quantiles of $U$ and $V$ are close in the sense that
$$\big|F_U^{-1}(1-\alpha) - F_V^{-1}(1-\alpha)\big| \;\le\; \psi_\alpha(\epsilon), \qquad (15)$$
where the function $\psi_\alpha(\epsilon) := F_U^{-1}(1-\alpha+\epsilon) - F_U^{-1}(1-\alpha-\epsilon) + \epsilon$ is strictly monotone, and satisfies $\psi_\alpha(0)=0$. (For a proof, see Lemma 15 of Appendix F.) In light of this fact, it will
be more convenient to express our results for approximating q1−α (t) in terms of the dLP
metric.
(a) (Sub-Gaussian case). The entries of the matrix $S = [S_{i,j}]$ are zero-mean i.i.d. sub-Gaussian random variables, with $\mathbb{E}[S_{i,j}^2] = \tfrac{1}{t}$, and $\max_{i,j}\|\sqrt{t}\,S_{i,j}\|_{\psi_2} \lesssim 1$. Furthermore, $t \gtrsim \nu(A,B)^{2/3}(\log d)^5$.
(b) (Length sampling case). The matrix $S$ is generated by length sampling, with the probabilities in equation (5), and also, $t \gtrsim (\|A\|_F\|B\|_F)^{2/3}(\log d)^5$.
(c) (SRHT case). The matrix $S$ is an SRHT matrix as defined in equation (6), and also, $t \gtrsim \nu(A,B)^{2/3}(\log n)^2(\log d)^5$.
Theorem 1 Let $h(x) = x^{1/2} + x^{3/4}$ for $x\ge 0$. If Assumption 1(a) holds, then there is an absolute constant $c>0$ such that the following bound holds with probability at least $1 - \tfrac{1}{t} - \tfrac{1}{dd'}$,
$$d_{LP}\big(\mathcal{L}(Z_t),\ \mathcal{L}(Z_t^\star\,|\,S)\big) \;\le\; \frac{c\cdot h(\nu(A,B))\cdot\sqrt{\log(d)}}{t^{1/8}}.$$
If Assumption 1(b) holds, then there is an absolute constant $c>0$ such that the following bound holds with probability at least $1 - \tfrac{1}{t} - \tfrac{1}{dd'}$,
$$d_{LP}\big(\mathcal{L}(Z_t),\ \mathcal{L}(Z_t^\star\,|\,S)\big) \;\le\; \frac{c\cdot h(\|A\|_F\|B\|_F)\cdot\sqrt{\log(d)}}{t^{1/8}}.$$
Remarks. A noteworthy property of the bounds is that they are dimension-free with
respect to the large dimension n. Also, they have a very mild logarithmic dependence on d.
With regard to the dependence on t, there are two other important factors to keep in mind.
First, the practical performance of the bootstrap method (shown in Section 5) is much
better than what the t−1/8 rate suggests. Second, the problem of finding the optimal rates
of approximation for multiplier bootstrap methods is a largely open problem — even in
the simpler setting of bootstrapping the coordinate-wise maximum of vectors (rather than
matrices). In the vector context, the literature has focused primarily on the Kolmogorov
metric (rather than the LP metric), and some quite recent improvements beyond the t−1/8
rate have been developed in Chernozhukov et al. (2017) and Lopes et al. (2018a). However,
these works also rely on model assumptions that would lead to additional restrictions on the
matrices A and B in our setup. Likewise, the problem of extending our results to achieve
faster rates or handle other metrics is a natural direction for future work.
The SRHT case. For the case of SRHT matrices, the analogue of Theorem 1 needs to
be stated in a slightly different way for technical reasons. From a qualitative standpoint,
the results for SRHT and sub-Gaussian matrices turn out to be similar.
The technical issue to be handled is that the rows of an SRHT matrix are not
independent, due to their common dependence on the matrix D◦n . Fortunately, this
inconvenience can be addressed by conditioning on D◦n . Theoretically, this simplifies the
analysis of the bootstrap, since it “decouples” the rows of the SRHT matrix. Meanwhile, if
we let $\tilde q_{1-\alpha}(t)$ denote the $(1-\alpha)$-quantile of the distribution $\mathcal{L}(\varepsilon_t\,|\,D_n^\circ)$,
$$\tilde q_{1-\alpha}(t) := \inf\big\{q\in[0,\infty) \;\big|\; P(\varepsilon_t\le q\,|\,D_n^\circ)\ge 1-\alpha\big\},$$
then it is simple to check that $\tilde q_{1-\alpha}(t)$ acts as a “surrogate” for $q_{1-\alpha}(t)$, since‡
$$P\big(\varepsilon_t \le \tilde q_{1-\alpha}(t)\big) \;=\; \mathbb{E}\Big[P\big(\varepsilon_t \le \tilde q_{1-\alpha}(t)\,\big|\,D_n^\circ\big)\Big] \;\ge\; \mathbb{E}[1-\alpha] \;=\; 1-\alpha. \qquad (18)$$
‡ It is also possible to show that $\tilde q_{1-\alpha}(t)$ fluctuates around $q_{1-\alpha}(t)$. Indeed, if we define the random
variable V := P(εt ≤ q1−α (t)|D◦n ), it can be checked that the event V ≥ 1 − α is equivalent to the
event q̃1−α (t) ≤ q1−α (t). Furthermore, if we suppose that 1 − α lies in the range of the c.d.f. of εt , then
E[V ] = 1 − α. In turn, it follows that the event q̃1−α (t) ≤ q1−α (t) occurs when V ≥ E[V ], and conversely,
the event q̃1−α (t) > q1−α (t) occurs when V < E[V ].
For this reason, we will view q̃1−α (t) as the new parameter to estimate (instead of q1−α (t)),
and accordingly, the aim of the following result is to quantify how well the bootstrap
distribution L(Zt? |S) approximates the conditional distribution L(Zt |D◦n ).
Theorem 2 Let $h(x) = x^{1/2} + x^{3/4}$ for $x\ge 0$. If Assumption 1(c) holds, then there is an absolute constant $c>0$ such that the following bound holds with probability at least $1 - \tfrac{1}{t} - \tfrac{1}{dd'} - \tfrac{c}{n}$,
$$d_{LP}\big(\mathcal{L}(Z_t\,|\,D_n^\circ),\ \mathcal{L}(Z_t^\star\,|\,S)\big) \;\le\; \frac{c\cdot h(\nu(A,B)\log(n))\cdot\sqrt{\log(d)}}{t^{1/8}}.$$
Remarks. Up to a factor involving log(n), the bound for SRHT matrices matches that
for sub-Gaussian matrices. Meanwhile, from a more practical standpoint, our empirical
results will show that the bootstrap’s performance for SRHT matrices is generally similar
to that for both sub-Gaussian and length-sampling matrices.
Further discussion of results. To comment on the role of ν(A, B) and kAkF kBkF in
Theorems 1 and 2, it is possible to interpret them as problem-specific “scale parameters”.
Indeed, it is natural that the bounds on dLP should increase with the scale of A and B for
the following reason. Namely, if A or B is multiplied by a scale factor κ > 0, then it can
be checked that the quantile error |q̂1−α (t) − q1−α (t)| will also change by a factor of κ, and
furthermore, the inequality (15) demonstrates a monotone relationship between the sizes of
the quantile error and the dLP error. For this reason, the bootstrap may still perform well
in relation to the scale of the problem when the magnitudes of the parameters ν(A, B) and
kAkF kBkF are large. Alternatively, this idea can be seen by noting that the dLP bounds
can be made arbitrarily small by simply changing the units used to measure the entries of
A and B.
Beyond these considerations, it is still of interest to compare the results for different
sketching matrices once a particular scaling has been fixed. For concreteness, consider a
scaling where the spectral norms of $A$ and $B$ satisfy $\|A\|_2 \asymp \|B\|_2 \asymp 1$. (As an example, if we view $A^T A$ as a sample covariance matrix, then the condition $\|A\|_2\asymp 1$ simply means that the largest principal component score is of order 1.) Under this scaling, it is simple to check that $\nu(A,B) = O(1)$, and $\|A\|_F\|B\|_F = O\big(\sqrt{r(A)r(B)}\big)$, where $r(A) := \|A\|_F^2/\|A\|_2^2$ is the “stable rank”. In particular, note that if $A$ and $B$ are approximately low rank, as is common in applications, then $r(A)\ll d$, and $r(B)\ll d'$. Accordingly, we may conclude
that if the conditions of Theorems 1 and 2 hold, then bootstrap consistency occurs under
the following limits
$$\sqrt{\log(d)}/t^{1/8} = o(1) \quad\text{in the sub-Gaussian case}, \qquad (19)$$
$$(r(A)r(B))^{3/8}\sqrt{\log(d)}/t^{1/8} = o(1) \quad\text{in the length-sampling case}, \qquad (20)$$
$$\log(n)^{3/4}\sqrt{\log(d)}/t^{1/8} = o(1) \quad\text{in the SRHT case}, \qquad (21)$$
5. Experiments
This section outlines a set of experiments for evaluating the performance of Algorithm 1
with the extrapolation speed-up described in Section 3.3. The experiments involved both
synthetic and natural matrices, as described below.
Synthetic matrices. In order to generate the matrix $A\in\mathbb{R}^{n\times d}$ synthetically, we selected the factors of its singular value decomposition $A = U\,\mathrm{diag}(\sigma)V^T$ in the following ways, fixing $n = 30{,}000$ and $d = 1{,}000$. In previous work, a number of other experiments in randomized matrix computations have been designed along these lines (Ma et al., 2014; Yang et al., 2016).
The factor $U\in\mathbb{R}^{n\times d}$ was selected as the Q factor from the reduced QR factorization of a random matrix $X\in\mathbb{R}^{n\times d}$. The rows of $X$ were sampled i.i.d. from a multivariate t-distribution, $t_2(\mu, C)$, with 2 degrees of freedom, mean $\mu = 0$, and covariance $c_{ij} = 2\times 0.5^{|i-j|}$, where $C = [c_{ij}]$. (This choice causes the matrix $A$ to have high row-coherence, which is of interest, since this is a challenging case for sampling-based sketching matrices.) Next, the factor $V\in\mathbb{R}^{d\times d}$ was selected as the Q factor from a QR factorization of a $d\times d$ matrix with i.i.d. $N(0,1)$ entries. For the singular values $\sigma\in\mathbb{R}_+^d$, we chose two options, leading to either a low or high stable rank $r(A) = \|A\|_F^2/\|A\|_2^2$. In the low stable rank case, we put $\sigma_i = 10^{\kappa_i}$ for a set of equally spaced values $\kappa_i$ between 0 and $-6$, yielding $r(A) = 36.7$. Alternatively, in the high stable rank case, the entries of $\sigma$ were equally spaced between 0.1 and 1, yielding $r(A) = 370.1$. Finally, to make all numerical comparisons on a common scale, we normalized $A$ so that $\|A^T A\|_\infty = 1$.
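For reference, here is a NumPy sketch of the synthetic construction just described. The multivariate t sampler (a Gaussian vector divided by a scaled chi-square variable) is one standard choice and is an assumption on our part, as are the function name and default arguments.

```python
import numpy as np

def synthetic_matrix(n=30_000, d=1_000, low_stable_rank=True, rng=None):
    """Build A = U diag(sigma) V^T as described above (high-coherence U, random V)."""
    rng = np.random.default_rng() if rng is None else rng
    # Scale matrix C with c_ij = 2 * 0.5**|i-j| (the "covariance" parameter in the text).
    idx = np.arange(d)
    C = 2.0 * 0.5 ** np.abs(idx[:, None] - idx[None, :])
    # Rows of X ~ multivariate t with 2 degrees of freedom: Gaussian / sqrt(chi2_2 / 2).
    Z = rng.multivariate_normal(np.zeros(d), C, size=n)
    X = Z / np.sqrt(rng.chisquare(2, size=(n, 1)) / 2)
    U, _ = np.linalg.qr(X)                               # n x d orthonormal factor
    V, _ = np.linalg.qr(rng.standard_normal((d, d)))     # d x d orthonormal factor
    if low_stable_rank:
        sigma = 10.0 ** np.linspace(0, -6, d)            # stable rank ~ 37
    else:
        sigma = np.linspace(0.1, 1.0, d)                 # stable rank ~ 370
    A = (U * sigma) @ V.T
    return A / np.sqrt(np.max(np.abs(A.T @ A)))          # normalize ||A^T A||_inf = 1
```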
Natural matrices. We also conducted experiments on five natural data matrices A
from the LIBSVM repository Chang and Lin (2011), named ‘Connect’, ‘DNA’, ‘MNIST’,
‘Mushrooms’, and ‘Protein’, with the same normalization that was used for the synthetic
matrices. These datasets are briefly summarized in Table 1.
of the random variable εt . In turn, the 0.99 sample quantile of the 1,000 realizations of εt
was treated as the true value of q0.99 (t), and this appears as the black curve in all plots.
Extrapolated estimates. With regard to the bootstrap extrapolation method in
Section 3.3, we fixed the value t0 = d/2 as the initial sketch size to extrapolate from.
For each A, and each type of sketching matrix, we applied Algorithm 1 to each of the 1,000
realizations of à = SA ∈ Rt0 ×d generated previously. Each time Algorithm 1 was run,
we used the modest choice of B = 20 for the number of bootstrap samples. From each
set of 20 bootstrap samples, we used the 0.99 sample quantile as the estimate q̂0.99 (t0 ).§
Hence, there were 1,000 realizations of q̂0.99 (t0 ) altogether. Next, we used the scaling rule
in equation (12) to obtain 1,000 realizations of the extrapolated estimate $\hat q^{\,\text{ext}}_{0.99}(t)$ for values $t \ge t_0$.
In order to illustrate the variability of the estimate $\hat q^{\,\text{ext}}_{0.99}(t)$ over the 1,000 realizations, we plot three different curves as a function of $t$. The blue curve represents the average value of $\hat q^{\,\text{ext}}_{0.99}(t)$, while the green and yellow curves respectively correspond to the estimates ranking 100th and 900th out of the 1,000 realizations.
becomes larger.
With attention to the extrapolation rule (12), there are two main points to note. First,
the plots show that the extrapolation may be initiated at fairly low values of t0 , which are
much less than the sketch sizes needed to achieve a small sketching error $\varepsilon_t$. Second, we see that $\hat q^{\,\text{ext}}_{0.99}(t)$ remains accurate for $t$ much larger than $t_0$, well up to $t = 10{,}000$ and perhaps even farther. Consequently, the results show that the extrapolation technique is capable of
saving quite a bit of computation without much detriment to statistical performance.
To consider the relationship between theory and practice, one basic observation is that
all three types of sketching matrices obey roughly similar bounds in Theorems 1 and 2,
and indeed, we also see generally similar numerical performance among the three types.
At a more fine-grained level however, the Gaussian and SRHT sketching matrices tend to produce estimates $\hat q^{\,\text{ext}}_{0.99}(t)$ with somewhat higher variance than in the case of length
sampling. Another difference between theory and simulation, is that the actual performance
of the method seems to be better than what the theory suggests — since the estimates are
accurate at values of t0 that are much smaller than what would be expected from the rates
in Theorems 1 and 2.
§ Note that since 19/20 = 0.95 and 20/20 = 1, the 0.99 quantile was obtained by an interpolation rule.
Figure 2: Results for synthetic matrices. The black line represents $q_{0.99}(t)$ as a function of $t$. The blue star is the average bootstrap estimate at the initial sketch size $t_0 = d/2 = 500$, and the blue line represents the average extrapolated estimate $\mathbb{E}[\hat q^{\,\text{ext}}_{0.99}(t)]$ derived from the starting value $t_0$. To display the variability of the estimates, the green and yellow curves correspond to the 100th and 900th largest among the 1,000 realizations of $\hat q^{\,\text{ext}}_{0.99}(t)$ at each $t$.
Figure 3: Results for natural matrices. The results for the natural matrices are plotted in the same way as described in the caption of Figure 2 for the results on the synthetic matrices.
At a high level, each of the applications below deals with an object, say $\Theta$, that is difficult to compute, as well as a randomized approximation, say $\tilde\Theta$, that is built from a sketching matrix $S$ with $t$ rows. Next, if we consider the random error variable
$$\varepsilon_t = \|\tilde\Theta - \Theta\|,$$
for an unspecified norm $\|\cdot\|$, then the problem of estimating the relationship between
accuracy and computation can again be viewed as the problem of estimating the quantile
function q1−α (t) associated with εt . In turn, this leads to the question of how to develop a
new bootstrap procedure that can generate approximate samples of εt , yielding an estimate
$\hat q_{1-\alpha}(t)$. However, instead of starting from the multiplier bootstrap (Algorithm 1) as before, it may be conceptually easier to extend the non-parametric bootstrap (Algorithm 2), because the latter bootstrap can be viewed as a “plug-in” procedure that replaces $A^T B$ with $\tilde A^T\tilde B$, and replaces $\tilde A^T\tilde B$ with $(\tilde A^*)^T(\tilde B^*)$.
• Linear regression. Consider a multi-response linear regression problem, where the rows of $B\in\mathbb{R}^{n\times d'}$ are response vectors, and the rows of $A\in\mathbb{R}^{n\times d}$ are input observations. The optimal solution to $\ell_2$-regression is given by
$$W_{\text{opt}} \;=\; \operatorname*{argmin}_{W\in\mathbb{R}^{d\times d'}} \|AW - B\|_F^2 \;=\; (A^T A)^\dagger A^T B,$$
which has $O(nd^2 + ndd')$ cost. In the case where $\max\{d,d'\}\ll n$, the matrix multiplications are a computational bottleneck, and an approximate solution can be obtained via
$$\widetilde W_{\text{opt}} = (\tilde A^T\tilde A)^\dagger(\tilde A^T\tilde B),$$
which has a cost $O(td^2 + tdd') + C_{\text{sketch}}$, where $C_{\text{sketch}}$ is the cost of matrix sketching (Drineas et al., 2006b, 2011, 2012; Clarkson and Woodruff, 2013). In order to estimate the quantile function associated with the error variable $\varepsilon_t = \|\widetilde W_{\text{opt}} - W_{\text{opt}}\|$, we could consider generating bootstrap samples of the form $\varepsilon_t^* = \|\widetilde W^*_{\text{opt}} - \widetilde W_{\text{opt}}\|$, where $\widetilde W^*_{\text{opt}} = ((\tilde A^*)^T(\tilde A^*))^\dagger(\tilde A^*)^T(\tilde B^*)$. (A small code sketch of this recipe appears after this list.) For recent results in the case where $W$ is a vector, we refer to the paper (Lopes et al., 2018b).
• Functions of covariance matrices. If the rows of the matrix $A$ are viewed as a sample of observations, then inferences on the population covariance structure are often based on functions of the form $\psi(A^T A)$. For instance, the function $\psi(A^T A)$ could be the top eigenvector, a set of eigenvalues, the condition number, or a test statistic. In any of these cases, if $\psi(\tilde A^T\tilde A)$ is used as a fast approximation (Dasarathy et al., 2015), then the sketching error $\varepsilon_t = \|\psi(\tilde A^T\tilde A) - \psi(A^T A)\|$ might be bootstrapped using $\varepsilon_t^* = \|\psi((\tilde A^*)^T(\tilde A^*)) - \psi(\tilde A^T\tilde A)\|$.
• Newton's method with sketched Hessians. As a further example, consider ridge-regularized logistic regression, which is based on minimizing the objective function $f(w) = \sum_{i=1}^n \log\big(1 + e^{-y_i w^T x_i}\big) + \tfrac{\gamma}{2}\|w\|_2^2$ over coefficient vectors $w$ in $\mathbb{R}^d$. Newton-type iterations update $w$ according to
$$w \leftarrow w - \kappa\, H^{-1}\nabla f,$$
with a step size $\kappa$. If a sketched matrix $\tilde H$ is used in place of $H$, then the error in the resulting update direction is
$$\varepsilon_t = \|\tilde H^{-1}\nabla f - H^{-1}\nabla f\|,$$
and in turn, this might be bootstrapped using $\varepsilon_t^* = \|(\tilde H^*)^{-1}\nabla f - \tilde H^{-1}\nabla f\|$, where $\tilde H^* = (\tilde A^*)^T(\tilde A^*) + \gamma I_d$.
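To make the plug-in recipe concrete for the least-squares application in the first item above, the sketch below resamples the rows of the sketched matrices and recomputes the solution. The function name is ours, and the Frobenius norm is chosen only for concreteness, since the norm is left unspecified in the text.

```python
import numpy as np

def sketched_regression_error_quantile(A_sk, B_sk, num_boot=20, alpha=0.01, rng=None):
    """Bootstrap the error of W_tilde = (A_sk^T A_sk)^+ A_sk^T B_sk, as in Section 6.

    Each sample resamples the rows of (A_sk, B_sk) with replacement, recomputes the
    sketched solution, and measures ||W* - W_tilde||_F (the choice of norm is ours).
    """
    rng = np.random.default_rng() if rng is None else rng
    t = A_sk.shape[0]
    W_tilde = np.linalg.pinv(A_sk.T @ A_sk) @ (A_sk.T @ B_sk)
    samples = np.empty(num_boot)
    for b in range(num_boot):
        idx = rng.integers(0, t, size=t)
        A_star, B_star = A_sk[idx], B_sk[idx]
        W_star = np.linalg.pinv(A_star.T @ A_star) @ (A_star.T @ B_star)
        samples[b] = np.linalg.norm(W_star - W_tilde, 'fro')
    return np.quantile(samples, 1 - alpha)
```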
Acknowledgments
We thank the anonymous reviewers for their helpful suggestions. MEL thanks the National
Science Foundation for partial support under grant DMS-1613218. MWM would like
to thank the National Science Foundation, the Army Research Office, and the Defense
Advanced Research Projects Agency for providing partial support of this work.
Appendices
Outline of appendices. Appendix A explains the main conceptual ideas underlying the
proofs of Theorems 1 and 2. In particular, the proofs of these theorems will be decomposed
into two main results: Propositions 3 and 4, which are given in Appendix A.
Appendix B will prove the sub-Gaussian case of Proposition 3, and Appendix C will prove the sub-Gaussian case of Proposition 4. Later on, Appendices D and E will explain how the arguments can be changed to handle the length-sampling and SRHT cases.
where $C_j := s\, e_{j_1} e_{j_2}^T \in \mathbb{R}^{d\times d'}$, and $e_{j_1}\in\mathbb{R}^{d}$, $e_{j_2}\in\mathbb{R}^{d'}$ are standard basis vectors. In words, the function $f_j$ merely picks out the $(j_1,j_2)$ entry of $W$, and multiplies by a sign $s$. Likewise, let $J$ be the collection of all the triples $j$, and define the class of linear functions
$$\mathcal{F} := \{f_j \mid j\in J\}.$$
Clearly, $\mathrm{card}(\mathcal{F}) = 2dd'$. Under this definition, it is simple to check that $Z_t$ and $Z_t^\star$, defined in equations (16) and (17), can be expressed as
$$Z_t = \max_{f_j\in\mathcal{F}} G_t(f_j), \quad\text{and}\quad Z_t^\star = \max_{f_j\in\mathcal{F}} G_t^\star(f_j).$$
for all $j, k\in J$. In turn, define the following random variable as the maximum of this Gaussian process,
$$Z := \max_{f_j\in\mathcal{F}} G(f_j).$$
In order to handle the case of SRHT matrices, define another zero-mean Gaussian process $\tilde G : \mathcal{F}\to\mathbb{R}$ (conditionally on a fixed realization of $D_n^\circ$) to have its covariance structure given by the corresponding conditional covariances, and define the associated maximum
$$\tilde Z := \max_{f_j\in\mathcal{F}} \tilde G(f_j).$$
The remainder of the proof consists in bounding each of these quantities, and we will
establish the following two bounds for all δ > 0,
so that the second term in line (24) is of order δ. The idea is that if δ satisfies both of the
conditions (30) and (31), then the definition of the dLP metric and line (24) imply
which clearly satisfies line (31). Furthermore, it can be checked that δ0 also satisfies the
constraint (30) under Assumption 1 (a). (The details of verifying this are somewhat tedious
and are given in Lemma 16 in Appendix F.)
To finish the proof, it remains to establish the bounds (28) and (29). To handle $L_t$, note that¶
$$\begin{aligned}
&= c\,\|B C_j^T A^T\|_F^{3} \qquad (\text{since } \|H\|_2 = \|H\|_F \text{ when } H \text{ is rank-1})\\
&= c\,\Big(\mathrm{tr}\big(B C_j^T A^T A\, C_j B^T\big)\Big)^{3/2}\\
&= c\,\Big(e_{j_1}^T A^T A\, e_{j_1}\cdot e_{j_2}^T B^T B\, e_{j_2}\Big)^{3/2}
\end{aligned}$$
It follows from Lemma 9 (part 4) and Lemma 13 in Appendix F that Kt (δ) can be bounded
in terms of the Orlicz norm kηkψ1 ,
√ 3 √
δ t δ t
Kt (δ) ≤ c log(card(F )) + kηk ψ1 · exp(− kηkψ log(card(F ) ).
1
Lastly, we turn to bounding Jt (δ). Fortunately, much of the argument for bounding
Kt (δ) can be carried over. Specifically, consider the random variable
¶ In this step, we use the assumption that $\|\sqrt{t}\,S_{i,j}\|_{\psi_2} \le c$ for all $i$ and $j$.
Proceeding in a way that is similar to the bound for Kt (δ), it follows from part (3) of
Lemma 9 that
where the last step follows from the bounds (32) through (33), and the fact that
|tr(BCTj A)| ≤ ν(A, B). Consequently, up to a constant factor, Jt (δ) satisfies the same
bound as Kt (δ) given in line (36), and this proves the claim in line (29).
$$P(S\in\mathcal S) \;\ge\; 1 - \tfrac{1}{t}. \qquad (39)$$
Second, whenever the event $\{S\in\mathcal S\}$ occurs, we have the following bound for any $\delta>0$ and any Borel set $A\subset\mathbb{R}$,
$$P\Big(\max_{f_j\in\mathcal F} G_t^\star(f_j)\in A \,\Big|\, S\Big) \;\le\; P\Big(\max_{f_j\in\mathcal F} G(f_j)\in A^{\delta}\Big) \;+\; \frac{c\,\nu(A,B)\cdot\log(\mathrm{card}(\mathcal F))}{\delta\, t^{1/4}}. \qquad (40)$$
If we set $\delta$ to the particular choice $\delta_0 := t^{-1/8}\sqrt{\nu(A,B)\cdot\log(\mathrm{card}(\mathcal F))}$, then $\delta_0$ solves the equation
$$\delta_0 \;=\; \frac{\nu(A,B)\cdot\log(\mathrm{card}(\mathcal F))}{\delta_0\, t^{1/4}}.$$
Consequently, by the definition of the dLP metric, this implies that whenever the event
{S ∈ S } occurs, we have
When referencing Theorem 3.2 from the paper Chernozhukov et al. (2016), note that
E[G(fj )] = 0 and E[G?t (fj )|S] = 0 for all fj ∈ F . To interpret ∆t (S), it may be viewed as
the `∞ -distance between the covariance matrices associated with G?t (conditionally on S)
and G.
Using the above notation, we define the set of sketching matrices $\mathcal S \subset \mathbb{R}^{t\times n}$ according to
$$S\in\mathcal S \quad\text{if and only if}\quad \Delta_t(S) \;\le\; \frac{c}{\sqrt{t}}\cdot\nu(A,B)^2\cdot\log(\mathrm{card}(\mathcal F)). \qquad (43)$$
Based on this definition, it is simple to check that the proof is reduced to showing that the event $\{S\in\mathcal S\}$ occurs with probability at least $1 - \tfrac{1}{t} - \tfrac{1}{dd'}$. This is guaranteed by the lemma below.
Proof We begin by bounding $\Delta_t(S)$ with two other quantities (to be denoted $\Delta_t'(S)$, $\Delta_t''(S)$) that are easier to bound. Using the fact that $\frac{1}{t}\sum_{i=1}^t D_i = A^T S^T S B$, it can be checked that
$$\mathbb{E}\big[G_t^\star(f_j)\,G_t^\star(f_k)\,\big|\,S\big] \;=\; \frac{1}{t}\sum_{i=1}^t f_j(D_i)f_k(D_i) \;-\; \Big(\frac{1}{t}\sum_{i=1}^t f_j(D_i)\Big)\cdot\Big(\frac{1}{t}\sum_{i=1}^t f_k(D_i)\Big).$$
From looking at the last two lines, it is natural to define the following zero-mean random variables for any triple $(i,j,k)$,‖
$$Q_{i,j,k} := f_j(D_i)f_k(D_i) - \mathbb{E}\big[f_j(D_i)f_k(D_i)\big],$$
and
$$R_{t,j} := \frac{1}{t}\sum_{i=1}^t \Big(f_j(D_i) - \mathbb{E}[f_j(D_i)]\Big).$$
Then, some algebra shows that
$$\mathbb{E}\big[G_t^\star(f_j)\,G_t^\star(f_k)\,\big|\,S\big] - \mathbb{E}\big[G(f_j)\,G(f_k)\big] \;=\; \frac{1}{t}\sum_{i=1}^t Q_{i,j,k} \;-\; R_{t,j}R_{t,k} \;-\; \mathbb{E}[f_j(D_1)]\cdot R_{t,k} \;-\; \mathbb{E}[f_k(D_1)]\cdot R_{t,j}.$$
In turn, if we define
$$\Delta_t'(S) := \max_{(j,k)\in J\times J}\Big|\frac{1}{t}\sum_{i=1}^t Q_{i,j,k}\Big| \quad\text{and}\quad \Delta_t''(S) := \max_{j\in J}\big|R_{t,j}\big|,$$
then
$$\Delta_t(S) \;\le\; \Delta_t'(S) + \Delta_t''(S)^2 + 2\nu(A,B)\cdot\Delta_t''(S),$$
where we have made use of the simple bound $|\mathbb{E}[f_j(D_1)]| \le \|A^T B\|_\infty \le \nu(A,B)$. The following lemma establishes tail bounds for $\Delta_t'(S)$ and $\Delta_t''(S)$, which lead to the statement of Proposition 4.
$$\Delta_t''(S) \;\le\; \frac{c}{\sqrt{t}}\cdot\nu(A,B)\cdot\sqrt{\log\,\mathrm{card}(\mathcal F)} \qquad\text{(ii)}$$
occurs with probability at least $1 - \tfrac{1}{dd'}$.
‖ Note that $Q_{i,j,k}$ is a multivariate polynomial of degree 4 in the variables $S_{i,j}$, and so techniques based on moment generating functions, like Chernoff bounds, are not generally applicable to controlling $Q_{i,j,k}$. For instance, if $X\sim N(0,1)$, then the variable $X^4$ does not have a moment generating function. Handling this obstacle is a notable aspect of our analysis.
Proof of Lemma 6 (i). Let $p > 2$. Due to part (3) of Lemma 9 in Appendix F, we have
$$\|\Delta_t'(S)\|_p \;\le\; \mathrm{card}(\mathcal F)^{2/p}\cdot\max_{(j,k)\in J\times J}\Big\|\tfrac{1}{t}\sum_{i=1}^t Q_{i,j,k}\Big\|_p. \qquad (44)$$
Note that each variable $Q_{i,j,k}$ has moments of all orders, and when $j$ and $k$ are held fixed, the sequence $\{Q_{i,j,k}\}_{1\le i\le t}$ is i.i.d. For this reason, it is natural to use Rosenthal's inequality to bound the $L^p$ norm of the right side of the previous line. Specifically, the version of Rosenthal's inequality∗∗ stated in Lemma 10 in Appendix F leads to
$$\Big\|\tfrac{1}{t}\sum_{i=1}^t Q_{i,j,k}\Big\|_p \;\le\; c\cdot\frac{p/\log(p)}{t}\cdot\max\bigg\{\Big\|\sum_{i=1}^t Q_{i,j,k}\Big\|_2,\ \Big(\sum_{i=1}^t \|Q_{i,j,k}\|_p^p\Big)^{1/p}\bigg\}. \qquad (45)$$
The $L^2$ norm on the right side of Rosenthal's inequality (45) satisfies the bound
$$\begin{aligned}
\Big\|\sum_{i=1}^t Q_{i,j,k}\Big\|_2 &= \sqrt{\mathrm{var}\Big(\sum_{i=1}^t Q_{i,j,k}\Big)}\\
&= \sqrt{t}\,\sqrt{\mathrm{var}(Q_{1,j,k})}\\
&= \sqrt{t}\,\sqrt{\mathrm{var}\big(f_j(D_1)f_k(D_1)\big)}\\
&\le \sqrt{t}\,\big\|f_j(D_1)f_k(D_1)\big\|_2\\
&\le \sqrt{t}\,\|f_j(D_1)\|_4\cdot\|f_k(D_1)\|_4 \qquad\text{(Cauchy-Schwarz)}\\
&\le c\sqrt{t}\,\|f_j(D_1)\|_{\psi_1}\cdot\|f_k(D_1)\|_{\psi_1} \qquad\text{(Lemma 9)}\\
&\le c\sqrt{t}\,\nu(A,B)^2,
\end{aligned}$$
$$\begin{aligned}
&\le 2\,\|f_j(D_1)\|_{2p}\cdot\|f_k(D_1)\|_{2p} \qquad\text{(Cauchy-Schwarz)}\\
&\le c\,p^2\,\|f_j(D_1)\|_{\psi_1}\cdot\|f_k(D_1)\|_{\psi_1} \qquad\text{(Lemma 9 in Appendix F)}
\end{aligned}$$
and as long as the first term in the Rosenthal bound dominates†† (the condition (47)), we have
$$\Big\|\tfrac{1}{t}\sum_{i=1}^t Q_{i,j,k}\Big\|_p \;\le\; \frac{c\cdot(p/\log(p))\cdot\nu(A,B)^2}{\sqrt{t}}.$$
Since the previous bound does not depend on $j$ or $k$, combining it with the first step in line (44) leads to
$$\big\|\Delta_t'(S)\big\|_p \;\le\; c\cdot(p/\log(p))\cdot\mathrm{card}(\mathcal F)^{2/p}\cdot\frac{\nu(A,B)^2}{\sqrt{t}}.$$
Next, we convert this norm bound into a tail bound. Specifically, if we consider the value
$$x_p := c\cdot(p/\log(p))\cdot\mathrm{card}(\mathcal F)^{2/p}\cdot\frac{\nu(A,B)^2}{\sqrt{t}}\cdot t^{1/p},$$
then part (4) of Lemma 9 gives $P\big(\Delta_t'(S)\ge x_p\big) \le \big(\|\Delta_t'(S)\|_p/x_p\big)^p \le \tfrac{1}{t}$. Next, choosing
$$p = \log(\mathrm{card}(\mathcal F)),$$
and noting that $\mathrm{card}(\mathcal F)^{1/p} = e$, it follows that under this choice of $p$,
$$x_p \;\le\; \frac{c\cdot\nu(A,B)^2\cdot\log(\mathrm{card}(\mathcal F))}{\sqrt{t}}\cdot\frac{t^{1/p}}{\log(p)}.$$
Moreover, as long as $t \lesssim \mathrm{card}(\mathcal F)^\kappa$ for some absolute constant $\kappa\ge 1$ (which holds under Assumption 1), then the last factor on the right satisfies
$$\frac{t^{1/p}}{\log(p)} \;\le\; \frac{(\mathrm{card}(\mathcal F)^{1/p})^\kappa}{\log(p)} \;=\; \frac{e^\kappa}{\log(\log(\mathrm{card}(\mathcal F)))} \;\lesssim\; 1.$$
So, combining the last few steps, there is an absolute constant $c$ such that
$$P\bigg(\Delta_t'(S) \;\ge\; \frac{c\cdot\nu(A,B)^2\cdot\log(\mathrm{card}(\mathcal F))}{\sqrt{t}}\bigg) \;\le\; \frac{1}{t},$$
as needed.
†† Under the choice of $p = \log(\mathrm{card}(\mathcal F)) = \log(2dd')$ that will be made at the end of this argument, it is straightforward to check that the condition (47) holds under Assumption 1.
Proof of Lemma 6 (ii). Note that for each i ∈ [t] and j ∈ J , we have
which is a centered sub-Gaussian quadratic form. Due to the bound (35), we have
Furthermore, this can be combined with a standard concentration bound for sums of
independent sub-exponential random variables (Lemma 12) to show that for any r ≥ 0,
$$P\bigg(\Big|\frac{1}{t}\sum_{i=1}^t \big(f_j(D_i) - \mathbb{E}[f_j(D_i)]\big)\Big| \;\ge\; r\,\nu(A,B)\bigg) \;\le\; 2\exp\big(-c\cdot t\cdot\min(r^2, r)\big). \qquad (50)$$
follows that there is a sufficiently large absolute constant $c_1>0$ such that if we put
$$r = \frac{c_1}{\sqrt{t}}\sqrt{\log(\mathrm{card}(\mathcal F))},$$
then
$$c\, t\, \min(r^2, r) \;\ge\; 2\log(\mathrm{card}(\mathcal F)),$$
where $c$ is the same as in the bound (51). In turn, this implies
$$P\bigg(\Delta_t''(S) \;\ge\; \frac{c_1}{\sqrt{t}}\sqrt{\log(\mathrm{card}(\mathcal F))}\cdot\nu(A,B)\bigg) \;\le\; 2\exp\big(-\log(\mathrm{card}(\mathcal F))\big) \;=\; \frac{1}{dd'}, \qquad (52)$$
as desired.
≤ kAkF kBkF .
Consequently,
$$\|f_j(D_1) - \mathbb{E}[f_j(D_1)]\|_{\psi_1} \;\le\; \|f_j(D_1)\|_{\psi_1} + \|\mathbb{E}[f_j(D_1)]\|_{\psi_1} \;\le\; \|f_j(D_1)\|_{\psi_1} + c\,\|A\|_F\|B\|_F. \qquad (53)$$
Hence, it remains to show that $\|f_j(D_1)\|_{\psi_1} \le c\,\|A\|_F\|B\|_F$, which is the content of Lemma 7 below.
Lemma 7 If S is generated by length sampling with the probabilities in line (5), then for any $j\in J$, we have the bound
$$\|f_j(D_1)\|_{\psi_1} \;\le\; 2\,\|A\|_F\|B\|_F. \qquad (54)$$
Proof By the definition of the $\psi_1$-Orlicz norm, it suffices to find a value of $r>0$ so that $\mathbb{E}\big[\exp\big(|f_j(D_1)|/r\big)\big]$ is at most 2. Due to the Cauchy-Schwarz inequality, the non-zero length-sampling probabilities $p_l$ satisfy
$$\frac{1}{p_l} \;\le\; \frac{\sqrt{\sum_{j=1}^n\|e_j^T A\|_2^2}\;\sqrt{\sum_{j=1}^n\|e_j^T B\|_2^2}}{\|e_l^T A\|_2\,\|e_l^T B\|_2} \;=\; \frac{\|A\|_F\|B\|_F}{\|e_l^T A\|_2\,\|e_l^T B\|_2}.$$
Consequently,
$$\mathbb{E}\big[\exp\big(|f_j(D_1)|/r\big)\big] \;\le\; \max_{l\in[n]}\,\exp\bigg(\tfrac{1}{r}\,\|A\|_F\|B\|_F\,\Big|\tfrac{e_l^T B}{\|e_l^T B\|_2}\,C_j^T\,\tfrac{A^T e_l}{\|A^T e_l\|_2}\Big|\bigg) \;\le\; \exp\Big(\tfrac{1}{r}\,\|A\|_F\|B\|_F\,\|C_j\|_2\Big) \;=\; \exp\Big(\tfrac{1}{r}\,\|A\|_F\|B\|_F\Big).$$
Hence, if we take $r = 2\|A\|_F\|B\|_F$, then the right hand side is at most $e^{1/2}\le 2$.
Since we are not aware of a standard notation for a conditional Orlicz norm, we define
$$\big\|f_j(D_1)\,\big|\,D_n^\circ\big\|_{\psi_1} := \inf\Big\{r>0 \;\Big|\; \mathbb{E}\big[\psi_1\big(|f_j(D_1)|/r\big)\,\big|\,D_n^\circ\big] \le 1\Big\},$$
which is a random variable, since it is a function of $D_n^\circ$. The following lemma provides a bound on this quantity, which turns out to be of order $\log(n)\,\nu(A,B)$. For this reason, the SRHT case (c) of Propositions 3 and 4 will have the same form as case (a), but with $\log(n)\,\nu(A,B)$ replacing $\nu(A,B)$.

Lemma 8 If S is an SRHT matrix, then the following bound holds with probability at least $1 - c/n$,
$$\big\|f_j(D_1)\,\big|\,D_n^\circ\big\|_{\psi_1} \;\le\; c\cdot\log(n)\cdot\nu(A,B). \qquad (55)$$
For an SRHT matrix $S = P\frac{1}{\sqrt{n}}H_n D_n^\circ$, recall that the rows of $\sqrt{t}\,P$ are sampled uniformly at random from the set $\{\sqrt{n}\,e_1,\ldots,\sqrt{n}\,e_n\}$. It follows that
$$\mathbb{E}\Big[\exp\Big(\tfrac{|f_j(D_1)|}{r}\Big)\,\Big|\,D_n^\circ\Big] \;=\; \mathbb{E}\Big[\exp\Big(\tfrac{1}{r}\big|\mathrm{tr}(C_j^T A^T s_1 s_1^T B)\big|\Big)\,\Big|\,D_n^\circ\Big] \;=\; \frac{1}{n}\sum_{l=1}^n \exp\Big(\tfrac{1}{r}\big|e_l^T H_n D_n^\circ\, B C_j^T A^T\, D_n^\circ H_n^T e_l\big|\Big).$$
Recalling that $D_n^\circ = \mathrm{diag}(\epsilon)$, where $\epsilon\in\mathbb{R}^n$ is a vector of i.i.d. Rademacher variables, and that all entries of $H_n$ are ±1, it follows that $\epsilon_l := D_n^\circ H_n^T e_l$ has the same distribution as $\epsilon$ for each $l\in[n]$. Consequently, each quadratic form $\epsilon_l^T(BC_j^T A^T)\epsilon_l$ concentrates around $\mathrm{tr}(BC_j^T A^T)$, and we can use a union bound to control the maximum of these quadratic forms. Note also that the matrix $BC_j^T A^T$ is rank-1, and so $\|BC_j^T A^T\|_2^2 = \|BC_j^T A^T\|_F^2$. Hence, by choosing the parameter $u$ to be proportional to $\log(n)\cdot\|BC_j^T A^T\|_F$ in the Hanson-Wright inequality (Lemma 11), and using a union bound, there is an absolute constant $c>0$ such that
$$P\Big(\max_{l\in[n]}\,\epsilon_l^T(BC_j^T A^T)\epsilon_l \;\ge\; \mathrm{tr}(BC_j^T A^T) + c\log(n)\,\|BC_j^T A^T\|_F\Big) \;\le\; \frac{c}{n}. \qquad (57)$$
Furthermore, noting that $\mathrm{tr}(BC_j^T A^T)$ and $\|BC_j^T A^T\|_F$ are both at most $\nu(A,B)$, we have
$$P\Big(\max_{l\in[n]}\,\epsilon_l^T(BC_j^T A^T)\epsilon_l \;\ge\; 2c\log(n)\,\nu(A,B)\Big) \;\le\; \frac{c}{n}. \qquad (58)$$
Finally, this means that if we take $r = 4c\log(n)\,\nu(A,B)$ in the bound (56), then the event $\big\{\,\|f_j(D_1)\,|\,D_n^\circ\|_{\psi_1} \le 4c\log(n)\,\nu(A,B)\,\big\}$ holds with probability at least $1 - \tfrac{c}{n}$, which completes the proof, since $e^{1/2}\le 2$.
$$\|X\|_p \;\le\; c\sqrt{p}\,\|X\|_{\psi_2}, \qquad (60)$$
and
$$\Big\|\max_{1\le j\le d} X_j\Big\|_{\psi_1} \;\le\; c\,\log(d)\,\max_{1\le j\le d}\|X_j\|_{\psi_1}. \qquad (61)$$
4. Let X be any random variable. Then, for any $x>0$ and $p\ge 1$, we have
$$P\big(|X|\ge x\big) \;\le\; \Big(\frac{\|X\|_p}{x}\Big)^{p},$$
and
$$P\big(|X|\ge x\big) \;\le\; c_1\, e^{-c_2 x/\|X\|_{\psi_1}}.$$
Proof In part 1, line (59) follows from line 5.11 of Vershynin (2012), line (60) follows
from definition 5.13 of Vershynin (2012), and line (61) follows from p.94 of van der Vaart
and Wellner (1996). Next, part 2 follows from the definition of the ψ2 -Orlicz norm and
the moment generating function for N (0, σ 2 ). Part 3 is due to Lemma 2.2.2 of van der
Vaart and Wellner (1996). Lastly, part 4 follows from Markov’s inequality and line 5.14
of Vershynin (2012).
Lemma 10 (Rosenthal’s inequality with best constants) Fix any number $p > 2$. Let $Y_1,\ldots,Y_t$ be independent random variables with $\mathbb{E}[Y_i] = 0$ and $\mathbb{E}[|Y_i|^p] < \infty$ for all $1\le i\le t$. Then,
$$\Big\|\sum_{i=1}^t Y_i\Big\|_p \;\le\; c\,\frac{p}{\log(p)}\cdot\max\bigg\{\Big\|\sum_{i=1}^t Y_i\Big\|_2,\ \Big(\sum_{i=1}^t \|Y_i\|_p^p\Big)^{1/p}\bigg\}. \qquad (62)$$
Proof See the paper Johnson et al. (1985). The statement above differs slightly from Theorem 4.1 in that paper, which requires symmetric random variables, but the remark on p.247 of that paper explains why the variables $Y_1,\ldots,Y_t$ need not be symmetric as long as they have mean 0.
for all x > 0, then the following bound holds for all r > 0,
Next, we employ the Hanson-Wright inequality (Lemma 11). By considering the “threshold” $u_* := \kappa^2\frac{\|H\|_F^2}{\|H\|_2}$, it is helpful to note that the quantities in the exponent of the Hanson-Wright inequality satisfy $\frac{u^2}{\kappa^4\|H\|_F^2} \le \frac{u}{\kappa^2\|H\|_2}$ if and only if $u\le u_*$. Hence,
$$\begin{aligned}
\mathbb{E}\big[\exp(|Q|/r)\big] &\le 1 + \frac{1}{r}\int_0^\infty \exp\Big(-c\,\min\Big\{\frac{u^2}{\kappa^4\|H\|_F^2},\ \frac{u}{\kappa^2\|H\|_2}\Big\}\Big)\cdot e^{u/r}\,du\\
&\le 1 + \frac{1}{r}\int_0^{u_*} e^{u/r}\,du + \frac{1}{r}\int_{u_*}^\infty \exp\Big(-u\Big(\frac{c}{\kappa^2\|H\|_2} - \frac{1}{r}\Big)\Big)\,du.
\end{aligned}$$
Note that the condition $C' > 0$, where $C' := \frac{c}{\kappa^2\|H\|_2} - \frac{1}{r}$ denotes the coefficient in the last integral, means that it is necessary to have $r > \frac{1}{c}\kappa^2\|H\|_2$. To finish the argument, we further require that $r$ is large enough so that (say)
$$\frac{u_*}{r} \le \frac{1}{4} \quad\text{and}\quad \frac{c\cdot r}{\kappa^2\|H\|_2} \ge 3, \qquad (65)$$
which ensures
$$\mathbb{E}\big[\exp(|Q|/r)\big] \;\le\; e^{1/4} + \tfrac{1}{2} \;<\; 2,$$
as desired. Note that the constraints (65) are the same as
$$r \ge 4\kappa^2\frac{\|H\|_F^2}{\|H\|_2} \quad\text{and}\quad r \ge \tfrac{3}{c}\kappa^2\|H\|_2.$$
Due to the basic fact that $\|H\|_2\le\|H\|_F$ for all matrices $H$, it follows that whenever $r \ge \max(4, \tfrac{3}{c})\,\kappa^2\frac{\|H\|_F^2}{\|H\|_2}$, we have $\mathbb{E}\big[\exp(|Q|/r)\big] < 2$.
Remark. The following lemma is a basic fact about the dLP metric, but may not be
widely known, and so we give a proof. Recall also that we use the generalized inverse
FV−1 (α) := inf{z ∈ R | FV (z) ≥ α}, where FV denotes the c.d.f. of V .
Lemma 15 Fix $\alpha\in(0,1/2)$ and suppose there is some $\epsilon\in(0,\alpha)$ such that random variables $U$ and $V$ satisfy
$$d_{LP}\big(\mathcal{L}(U),\mathcal{L}(V)\big) \;\le\; \epsilon.$$
Then, the quantiles of U and V satisfy
It is a fact that this metric is always dominated by the dLP metric in the sense that
for all scalar random variables U and V (Huber and Ronchetti, 2009, p.36). Based on the
definition of the dL metric, it is straightforward to check that the following inequalities hold
under the assumption of the lemma,
as needed.
where $r\ge 1$ is a free parameter to be adjusted. Based on the bound (29), it is easy to check that plugging $\delta_1(r)$ into $K_t(\cdot)$ and $J_t(\cdot)$ leads to
then $\delta_1(r)$ will satisfy both of the desired constraints (30) and (31). Solving the equation $\delta_1(r) = \delta_0$ gives
$$r = t^{3/8}\cdot\log^{-3/2}(d)\cdot\nu(A,B)^{-1/4},$$
and then the condition $r \ge c\log(\log(d)^4)$ is the same as
$$t \;\ge\; \Big(c\,\nu(A,B)^{1/4}\log(d)^{3/2}\cdot\log(\log(d)^4)\Big)^{8/3} \;=\; c\,\nu(A,B)^{2/3}\log(d)^4\cdot\big(\log(\log(d)^4)\big)^{8/3}, \qquad (67)$$
References
N. Ailon and B. Chazelle. Approximate nearest neighbors and the fast Johnson-
Lindenstrauss transform. In Annual ACM Symposium on Theory of Computing (STOC),
2006.
N. Ailon and E. Liberty. Fast dimension reduction using Rademacher series on dual BCH
codes. Discrete & Computational Geometry, 42(4):615–630, 2009.
C. Boutsidis and A. Gittens. Improved matrix algorithms via the subsampled randomized Hadamard transform. SIAM Journal on Matrix Analysis and Applications, 34(3):1301–1340, 2013.
C. Brezinski and M. R. Zaglia. Extrapolation methods: theory and practice. Elsevier, 2013.
C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM
Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011. URL http:
//www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
J. Chang, W. Zhou, W.-X. Zhou, and L. Wang. Comparing large covariance matrices
under weak conditions on the dependence structure and its application to gene clustering.
Biometrics, 2016.
P. Drineas, R. Kannan, and M. W. Mahoney. Fast Monte Carlo algorithms for matrices
I: Approximating matrix multiplication. SIAM Journal on Computing, 36(1):132–157,
2006a.
A. Frank and A. Asuncion. UCI machine learning repository, 2010. URL http://archive.
ics.uci.edu/ml.
M. E. Lopes, Z. Lin, and H.-G. Mueller. Bootstrapping max statistics in high dimensions:
Near-parametric rates under weak variance decay and application to functional data
analysis. arXiv:1807.04429, 2018a.
A. Magen and A. Zouzias. Low rank matrix-valued Chernoff bounds and approximate matrix
multiplication. In Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2011.
M. W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends
in Machine Learning, 3(2):123–224, 2011.
T. Sarlós. Improved approximation algorithms for large matrices via random projections.
In Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2006.
A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes.
Springer, 1996.
D. P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1–2):1–157, 2014.
F. Woolfe, E. Liberty, V. Rokhlin, and M. Tygert. A fast randomized algorithm for the
approximation of matrices. Applied and Computational Harmonic Analysis, 25(3):335–
366, 2008.