
Journal of Machine Learning Research 20 (2019) 1-40 Submitted 8/17; Revised 1/19; Published 2/19

A Bootstrap Method for Error Estimation in Randomized Matrix Multiplication

Miles E. Lopes melopes@ucdavis.edu


Department of Statistics
University of California at Davis
Davis, CA 95616, USA
Shusen Wang shusen.wang@stevens.edu
Department of Computer Science
Stevens Institute of Technology
Hoboken, NJ 07030, USA
Michael W. Mahoney mmahoney@stat.berkeley.edu
International Computer Science Institute and Department of Statistics
University of California at Berkeley
Berkeley, CA 94720, USA

Editor: Hui Zou

Abstract
In recent years, randomized methods for numerical linear algebra have received growing
interest as a general approach to large-scale problems. Typically, the essential ingredient
of these methods is some form of randomized dimension reduction, which accelerates
computations, but also creates random approximation error. In this way, the dimension
reduction step encodes a tradeoff between cost and accuracy. However, the exact numerical
relationship between cost and accuracy is typically unknown, and consequently, it may be
difficult for the user to precisely know (1) how accurate a given solution is, or (2) how much
computation is needed to achieve a given level of accuracy. In the current paper, we study
randomized matrix multiplication (sketching) as a prototype setting for addressing these
general problems. As a solution, we develop a bootstrap method for directly estimating the
accuracy as a function of the reduced dimension (as opposed to deriving worst-case bounds
on the accuracy in terms of the reduced dimension). From a computational standpoint, the
proposed method does not substantially increase the cost of standard sketching methods,
and this is made possible by an “extrapolation” technique. In addition, we provide both
theoretical and empirical results to demonstrate the effectiveness of the proposed method.
Keywords: matrix sketching, randomized matrix multiplication, bootstrap methods

1. Introduction
The development of randomized numerical linear algebra (RNLA or RandNLA) has led
to a variety of efficient methods for solving large-scale matrix problems, such as matrix
multiplication, least-squares approximation, and low-rank matrix factorization, among
others (Halko et al., 2011; Mahoney, 2011; Woodruff, 2014; Drineas and Mahoney, 2016).
A general feature of these methods is that they apply some form of randomized dimension
reduction to an input matrix, which reduces the cost of subsequent computations. In

©2019 Miles E. Lopes, Shusen Wang, and Michael W. Mahoney.


License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided
at http://jmlr.org/papers/v20/17-451.html.

exchange for the reduced cost, the randomization leads to some error in the resulting
solution, and consequently, there is a tradeoff between cost and accuracy.
For many canonical matrix problems, the relationship between cost and accuracy has
been the focus of a growing body of theoretical work, and the literature provides many
performance guarantees for RNLA methods. In general, these guarantees offer a good
qualitative description of how the accuracy depends on factors such as problem size, number
of iterations, condition numbers, and so on. Yet, it is also the case that such guarantees
tend to be overly pessimistic for any particular problem instance — often because the
guarantees are formulated to hold in the worst case among a large class of possible inputs.
Likewise, it is often impractical to use such guarantees to determine precisely how accurate
a given solution is, or precisely how much computation is needed to achieve a desired level
of accuracy.
In light of this situation, it is of interest to develop efficient methods for estimating the
exact relationship between the cost and accuracy of RNLA methods on a problem-specific
basis. Since the literature has been somewhat quiet on this general question, the aim of this
paper is to analyze randomized matrix multiplication as a prototype setting, and propose
an approach that may be pursued more broadly. (Extensions are discussed at the end of
the paper in Section 6.)

1.1. Randomized matrix multiplication


To describe our problem setting, we briefly review the rudiments of randomized matrix multiplication, which is often known as matrix sketching (Drineas et al., 2006a; Mahoney, 2011; Woodruff, 2014). If A ∈ R^{n×d} and B ∈ R^{n×d′} are fixed input matrices, then sketching methods are commonly used to approximate A^T B in the regime where max{d, d′} ≪ n. For instance, this regime corresponds to "big data" applications where A and B are data matrices with very large numbers of observations.
As a way of reducing the cost of ordinary matrix multiplication, the main idea of sketching is to compute the product Ã^T B̃ of smaller matrices à ∈ R^{t×d} and B̃ ∈ R^{t×d′}, for some choice of t ≪ n. These smaller matrices are referred to as "sketches", and they are generated randomly according to

    Ã := SA  and  B̃ := SB,    (1)

where S ∈ R^{t×n} is a random "sketching matrix" satisfying the condition

    E[S^T S] = I_n,    (2)

with I_n being the identity matrix. In particular, the relation (2) implies that the sketched product is an unbiased estimate, E[Ã^T B̃] = A^T B. Most commonly, the matrix S can be interpreted as acting on A and B by sampling their rows, or by randomly projecting their columns. In Section 2, we describe some popular examples of sketching matrices to be considered in our analysis.
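To make the setup concrete, the following short NumPy sketch (an illustration added here, not taken from the paper) forms the sketched product of equation (1) with a Gaussian sketching matrix scaled so that E[S^T S] = I_n; the matrix sizes and seeds are arbitrary.

import numpy as np

def gaussian_sketch_product(A, B, t, seed=None):
    # Sketch A and B as in equation (1) with a t x n Gaussian sketching
    # matrix scaled by 1/sqrt(t), so that E[S^T S] = I_n.
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    S = rng.standard_normal((t, n)) / np.sqrt(t)
    A_sk, B_sk = S @ A, S @ B          # sketches: t x d and t x d'
    return A_sk.T @ B_sk               # unbiased estimate of A^T B

# Illustration: the entrywise error typically shrinks as t grows.
rng = np.random.default_rng(0)
A = rng.standard_normal((5000, 30))
B = rng.standard_normal((5000, 20))
exact = A.T @ B
for t in (100, 400, 1600):
    approx = gaussian_sketch_product(A, B, t, seed=t)
    print(t, np.max(np.abs(approx - exact)))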

1.2. Problem formulation


When sketching is implemented, the choice of the sketch size t plays a central role, since it
directly controls the relationship between cost and accuracy. If t is small, then the sketched


[Figure 1: two panels plotting the ℓ∞ norm error against the sketch size t; the right panel highlights the 0.99-quantile curve.]

Figure 1: Left panel: The curve shows how ε_t fluctuates with varying sketch size t, as rows are added to S, with A and B held fixed. (Each row of A ∈ R^{8124×112} is a feature vector of the Mushroom dataset (Frank and Asuncion, 2010), and we set B = A.) The rows of S were generated randomly from a Gaussian distribution (see Section 2), and the matrix A was scaled so that ‖A^T A‖_∞ = 1. Right panel: There are 1,000 colored curves, each arising from a repetition of the simulation in the left panel. The thick black curve represents q_{0.99}(t).

product Ã^T B̃ may be computed quickly, but it is unlikely to be a good approximation to A^T B. Conversely, if t is large, then the sketched product is more expensive to compute, but it is more likely to be accurate. For this reason, we will parameterize the relationship between cost and accuracy in terms of t.
Conventionally, the error of an approximate matrix product is measured with a norm, and in particular, we will consider error as measured by the ℓ∞-norm,

    ε_t := ‖A^T S^T S B − A^T B‖_∞,    (3)

where ‖C‖_∞ := max_{i,j} |c_{ij}| for a matrix C = [c_{ij}]. (Further background on analysis of ℓ∞-norm or entry-wise error for matrix multiplication may be found in (Higham, 2002; Drineas et al., 2006a; Demmel et al., 2007; Pagh, 2013), among others.) In the context of sketching, it is crucial to note that ε_t is a random variable, due to the randomness in S. Consequently, it is natural to study the quantiles of ε_t, because they specify the tightest possible bounds on ε_t that hold with a prescribed probability. More specifically, for any α ∈ (0, 1), the (1 − α)-quantile of ε_t is defined as

    q_{1−α}(t) := inf{ q ∈ [0, ∞) | P(ε_t ≤ q) ≥ 1 − α }.    (4)

For example, the quantity q_{0.99}(t) is the tightest upper bound on ε_t that holds with probability at least 0.99. Hence, for any fixed α, the function q_{1−α}(t) represents a precise tradeoff curve for relating cost and accuracy. Moreover, the function q_{1−α}(t) is specific to the input matrices A and B.
To clarify the interpretation of q1−α (t), it is helpful to plot the fluctuations of εt . In
the left panel of Figure 1, we illustrate a simulation where randomly generated rows are


incrementally added to a sketching matrix S, with A and B held fixed. Each time a row is
added to S, the sketch size t increases by 1, and we plot the corresponding value of εt as t
ranges from 100 to 1,700. (Note that the user is typically unable to observe such a curve
in practice.) In the right panel, we display 1,000 repetitions of the simulation, with each
colored curve corresponding to one repetition. (The variation is due only to the different
draws of S.) In particular, the function q0.99 (t) is represented by the thick black curve,
delineating the top 1% of the colored curves at each value of t.
In essence, the right panel of Figure 1 shows that if the user had knowledge of the (unknown) function q_{1−α}(t), then two important purposes could be served. First, for any fixed value t, the user would have a sharp problem-specific bound on ε_t. Second, for any fixed error tolerance ϵ, the user could select t so that "just enough" computation is spent in order to achieve ε_t ≤ ϵ with probability at least 1 − α.
The estimation problem. The challenge we face is that a naive computation of q1−α (t)
by generating samples of εt would defeat the purpose of sketching. Indeed, generating
samples of εt by brute force would require running the sketching method many times,
and it would also require computing the entire product AT B. Consequently, the technical
problem of interest is to develop an efficient way to estimate q1−α (t), without adding much
cost to a single run of the sketching method.
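(For orientation, the brute-force computation just described can be written in a few lines; the sketch below, which is only for illustration and assumes a Gaussian sketching matrix, estimates q_{1−α}(t) by repeatedly re-sketching and comparing against the exact product A^T B. Its cost is exactly what the method of this paper is designed to avoid.)

import numpy as np

def naive_quantile(A, B, t, alpha=0.01, reps=200, seed=None):
    # Brute-force Monte Carlo estimate of q_{1-alpha}(t): draw many sketching
    # matrices, compute eps_t = ||A^T S^T S B - A^T B||_inf for each, and take
    # the empirical (1 - alpha)-quantile.  Requires the exact product A^T B.
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    exact = A.T @ B
    errs = []
    for _ in range(reps):
        S = rng.standard_normal((t, n)) / np.sqrt(t)
        errs.append(np.max(np.abs((S @ A).T @ (S @ B) - exact)))
    return np.quantile(errs, 1 - alpha)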

1.3. Contributions
From a conceptual standpoint, the main novelty of our work is that it bridges two sets of
ideas that are ordinarily studied in distinct communities. Namely, we apply the statistical
technique of bootstrapping to enhance algorithms for numerical linear algebra. To some
extent, this pairing of ideas might seem counterintuitive, since bootstrap methods are
sometimes labeled as “computationally intensive”, but it will turn out that the cost of
bootstrapping can be managed in our context. Another reason our approach is novel is
that we use the bootstrap to quantify error in the output of a randomized algorithm, rather
than for the usual purpose of quantifying uncertainty arising from data. In this way, our
approach harnesses the versatility of bootstrap methods, and we hope that our results in
the “use case” of matrix multiplication will encourage broader applications of bootstrap
methods in randomized computations. (See also Section 6, and note that in concurrent
work, we have pursued similar approaches in the contexts of randomized least-squares and
classification algorithms (Lopes et al., 2018b; Lopes, 2019).)
From a technical standpoint, our main contributions are a method for estimating the
function q1−α (t), as well as theoretical performance guarantees. Computationally, the
proposed method is efficient in the sense that its cost is comparable to a single run
of standard sketching methods (see Section 2). This efficiency is made possible by an
“extrapolation” technique, which allows us to bootstrap small “initial” sketches with t_0 rows, and inexpensively estimate q_{1−α}(t) at larger values t ≫ t_0. The empirical performance of
the extrapolation technique is also quite encouraging, as discussed in Section 5. Lastly, with
regard to theoretical analysis, our proofs circumvent some technical restrictions occurring
in the analysis of related bootstrap methods in the statistics literature.


1.4. Related work


Several works have considered the problem of error estimation for randomized matrix
computations—mostly in the context of low-rank approximation (Woolfe et al., 2008;
Liberty et al., 2007; Halko et al., 2011), least squares (Lopes et al., 2018b), or matrix
multiplication (Ar et al., 1993; Sarlós, 2006). With attention to matrix multiplication, the latter two papers offer methods for estimating high-probability bounds on the error η_t := ‖Ã^T B̃ − A^T B‖, where ‖·‖ is either the maximum absolute row sum norm, or the Frobenius norm. At a high level, all of the mentioned papers rely on a common technique, which is to randomly generate a sequence of “test-vectors”, say v_1, v_2, . . . , and then use the matrix-vector products w_i := Ã^T B̃ v_i − A^T (B v_i) to derive an estimated bound, say η̂_t, for η_t. The origin of this technique may be traced to the classical works (Dixon, 1983;
Freivalds, 1979).
Our approach differs from the “test-vector approach” in some essential ways. One
difference arises because the bounds on η̂t are generally constructed from the vectors {wi }
using conservative inequalities. By contrast, our approach avoids this conservativeness by
directly estimating q1−α (t), which is an optimal bound on εt in the sense of equation (4).
A second difference deals with computational demands. For example, in order to
compute the vectors {wi } in the test-vector approach, it is necessary to access the full
matrices A and B. On the other hand, our method does not encounter this difficulty,
because it only requires access to the much smaller sketches à and B̃. Also, in the test-
vector approach, the cost to compute each vector wi is proportional to the large dimension
n, while the cost to compute q̂1−α (t) with our method is independent of n. Finally, the
test-vector approach can only be used to check if the product ÃT B̃ is accurate after it has
been computed, whereas our approach can be used to dynamically “predict” an appropriate
sketch size t from a small “initial” sketching matrix (see Section 3.3).
With regard to the statistics literature, our work builds upon a line of research dealing with “multiplier bootstrap methods” in high-dimensional problems (Chernozhukov et al., 2013, 2014, 2017). Such methods are well-suited to approximating the distributions of statistics such as ‖x̄‖_∞, where x̄ ∈ R^p denotes the sample average of n independent mean-zero vectors, with n ≪ p. More recently, this approach has been substantially extended to other “max type” statistics arising from sample covariance matrices (Chang et al., 2016; Chen, 2018). Nevertheless, the strong results in these works do not readily translate to our context, either because the statistics are substantially different from the ℓ∞-norm (Chang et al., 2016), or because of technical assumptions (Chen, 2018). For instance, if the results in the latter work are applied to a sample covariance matrix of the form (1/n) Σ_{i=1}^n (x_i − x̄)(x_i − x̄)^T, where x_1, . . . , x_n ∈ R^p are mean-zero i.i.d. vectors, with x_1 = (X_{11}, . . . , X_{1p}), then it is necessary to make assumptions such as min_{j,k} var(X_{1j} X_{1k}) ≥ c, for some constant c > 0. As this relates to the sketching context, note that the sketched product may be written as Ã^T B̃ = (1/t) Σ_{i=1}^t A^T s_i s_i^T B, where s_1, . . . , s_t ∈ R^n are the rows of √t S. It follows that analogous variance assumptions would lead to conditions on the matrices A and B that could be violated if any column of A or B has many small entries, or is sparse. By contrast, our results do not rely on such variance assumptions, and we allow the matrices A and B to be unrestricted.

At a more technical level, the ability to avoid restrictions on A and B comes from our
use of the Lévy-Prohorov metric for distributional approximations — which differs from
the Kolmogorov metric that has been predominantly used in previous works on multiplier
bootstrap methods. More specifically, analyses based on the Kolmogorov metric typically
rely on “anti-concentration inequalities” (Chernozhukov et al., 2013, 2015), which ultimately
lead to the mentioned variance assumptions. On the other hand, our approach based on the
Lévy-Prohorov metric does not require the use of anti-concentration inequalities. Finally
it should be mentioned that the techniques used to control the LP metric are related to
those that have been developed for bootstrap approximations via coupling inequalities as
in Chernozhukov et al. (2016).
Outline. This paper is organized as follows. Section 2 introduces some technical
background. Section 3 describes the proposed bootstrap algorithm. Section 4 establishes
the main theoretical results, and then numerical performance is illustrated in Section 5.
Lastly, conclusions and extensions of the method are presented in Section 6, and all proofs
are given in the appendices.

2. Preliminaries
Notation and terminology. The set {1, . . . , n} is denoted as [n]. The ith standard basis vector is denoted as e_i. If C = [c_{ij}] is a real matrix, then ‖C‖_F = (Σ_{i,j} c_{ij}²)^{1/2} is the Frobenius norm, and ‖C‖_2 is the spectral norm (maximum singular value). If X is a random variable and p ≥ 1, we write ‖X‖_p = (E[|X|^p])^{1/p} for the usual L^p norm. If ψ : [0, ∞) → [0, ∞) is a non-decreasing convex function with ψ(0) = 0, then the ψ-Orlicz norm of X is defined as ‖X‖_ψ := inf{ r > 0 | E[ψ(|X|/r)] ≤ 1 }. In particular, we define ψ_p(x) := exp(x^p) − 1 for p ≥ 1, and we say that X is sub-Gaussian when ‖X‖_{ψ_2} < ∞, or sub-exponential when ‖X‖_{ψ_1} < ∞. In Appendix F, Lemma 9 summarizes the facts about Orlicz norms that will be used.
We will use c to denote a positive absolute constant that may change from line to line. The matrices A, B, and S are viewed as lying in a sequence of matrices indexed by the tuple (d, d′, t, n). For a pair of generic functions f and g, we write f(d, d′, t, n) ≲ g(d, d′, t, n) when there is a positive absolute constant c so that f(d, d′, t, n) ≤ c·g(d, d′, t, n) holds for all large values of d, d′, t, and n. Furthermore, if a and b are two quantities that satisfy both a ≲ b and b ≲ a, then we write a ≍ b. Lastly, we do not use the symbols ≲ or ≍ when relating random variables.
Examples of sketching matrices. Our theoretical results will deal with three common
types of sketching matrices, reviewed below.
• Row sampling. If (p_1, . . . , p_n) is a probability vector, then S ∈ R^{t×n} can be constructed by sampling its rows i.i.d. from the set { (1/√(t p_1))·e_1, . . . , (1/√(t p_n))·e_n } ⊂ R^n, where the vector (1/√(t p_i))·e_i is selected with probability p_i. Some of the most well known choices for the sampling probabilities include uniform sampling, with p_i ≡ 1/n, length sampling (Drineas et al., 2006a; Magen and Zouzias, 2011), with

    p_i = ( ‖e_i^T A‖_2 ‖e_i^T B‖_2 ) / ( Σ_{j=1}^n ‖e_j^T A‖_2 ‖e_j^T B‖_2 ),    (5)

and leverage score sampling, for which further background may be found in the papers (Drineas et al., 2006b, 2008, 2012).
• Sub-Gaussian projection. Gaussian projection is the most well-known random projection method, and is sometimes referred to as the Johnson-Lindenstrauss (JL) transform (Johnson and Lindenstrauss, 1984). In detail, if G ∈ R^{t×n} is a standard Gaussian matrix, with entries that are i.i.d. samples from N(0, 1), then S = (1/√t)·G is a Gaussian projection matrix. More generally, the entries of G can be drawn i.i.d. from a zero-mean sub-Gaussian distribution, which often leads to similar performance characteristics in RNLA applications.
• Subsampled randomized Hadamard transform (SRHT). Let n be a power of 2, and define the Walsh-Hadamard matrix H_n recursively by

    H_n := ( H_{n/2}   H_{n/2}
             H_{n/2}  −H_{n/2} )    with    H_2 := ( 1   1
                                                     1  −1 ).

(The restriction that n is a power of 2 can be relaxed with variants of SRHT matrices (Avron et al., 2010; Boutsidis and Gittens, 2013).) Next, let D°_n ∈ R^{n×n} be a random diagonal matrix with independent ±1 Rademacher variables along the diagonal, and let P ∈ R^{t×n} have rows uniformly sampled from { √(n/t)·e_1, . . . , √(n/t)·e_n }. Then, the t × n matrix

    S = P·( (1/√n) H_n )·D°_n    (6)

is called an SRHT matrix. This type of sketching matrix was introduced in the seminal paper (Ailon and Chazelle, 2006), and additional details regarding implementation may be found in the papers (Drineas et al., 2011; Wang, 2015). (The factor 1/√n is used so that (1/√n) H_n is an orthogonal matrix.) An important property of SRHT matrices is that they can be multiplied with any n × d matrix in O(n · d · log t) time (Ailon and Liberty, 2009), which is faster than the O(n · d · t) time usually required for a dense sketching matrix.
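For reference, here is one way the three constructions above can be written in NumPy/SciPy (an illustrative sketch added here, not code from the paper). The SRHT is built as a dense matrix purely for clarity; practical implementations would instead apply a fast Walsh-Hadamard transform to achieve the O(n · d · log t) cost mentioned above.

import numpy as np
from scipy.linalg import hadamard

def gaussian_sketch(n, t, rng):
    # Gaussian projection: S = G / sqrt(t), with G having i.i.d. N(0,1) entries.
    return rng.standard_normal((t, n)) / np.sqrt(t)

def length_sampling_sketch(A, B, t, rng):
    # Row sampling with the length-sampling probabilities of equation (5).
    p = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    p = p / p.sum()
    idx = rng.choice(A.shape[0], size=t, p=p)
    S = np.zeros((t, A.shape[0]))
    S[np.arange(t), idx] = 1.0 / np.sqrt(t * p[idx])
    return S

def srht_sketch(n, t, rng):
    # Subsampled randomized Hadamard transform, equation (6); n must be a power of 2.
    D = rng.choice([-1.0, 1.0], size=n)            # Rademacher diagonal
    H = hadamard(n) / np.sqrt(n)                   # orthogonal Walsh-Hadamard matrix
    idx = rng.choice(n, size=t)                    # uniform row sampling
    P = np.zeros((t, n))
    P[np.arange(t), idx] = np.sqrt(n / t)
    return P @ H @ np.diag(D)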

3. Methodology
Before presenting our method in algorithmic form, we first explain the underlying intuition.

3.1. Intuition for multiplier bootstrap method



If the row vectors of √t S are denoted s_1, . . . , s_t ∈ R^n, then S^T S may be conveniently expressed as a sample average

    S^T S = (1/t) Σ_{i=1}^t s_i s_i^T.    (7)

For row sampling, Gaussian projection, and SRHT, these row vectors satisfy E[s_i s_i^T] = I_n. Consequently, if we define the random d × d′ rank-1 (dyad) matrix

    D_i = A^T s_i s_i^T B,    (8)



then E[D_i] = A^T B, and it follows that the difference between the sketched and unsketched products can be viewed as a sample average of zero-mean random matrices

    A^T S^T S B − A^T B = (1/t) Σ_{i=1}^t (D_i − A^T B).    (9)

Furthermore, in the cases of length sampling and Gaussian projection, the matrices D_1, . . . , D_t are independent, and in the case of SRHT sketches, these matrices are “nearly” independent. So, in light of the central limit theorem, it is natural to suspect that the random matrix (9) will be well-approximated (in distribution) by a matrix with Gaussian entries. In particular, if we examine the (j_1, j_2) entry, then we may expect that e_{j1}^T (A^T S^T S B − A^T B) e_{j2} will approximately follow the distribution N(0, (1/t)·σ²_{j1,j2}), where the unknown parameter σ²_{j1,j2} can be estimated with

    σ̂²_{j1,j2} := (1/t) Σ_{i=1}^t ( e_{j1}^T (D_i − A^T S^T S B) e_{j2} )².

Based on these considerations, the idea of the proposed bootstrap method is to generate a random matrix whose (j_1, j_2) entry is sampled from N(0, (1/t)·σ̂²_{j1,j2}). It turns out that an efficient way of generating such a matrix is to sample i.i.d. random variables ξ_1, . . . , ξ_t ∼ N(0, 1), independent of S, and then compute

    (1/t) Σ_{i=1}^t ξ_i (D_i − A^T S^T S B).    (10)

In other words, if S is conditioned upon, then the distribution of the (j_1, j_2) entry of the above matrix is exactly N(0, (1/t)·σ̂²_{j1,j2}). (It is also possible to show that the joint distribution of the entries in the matrix (10) mimics that of A^T S^T S B − A^T B, but we omit such details to simplify the discussion.) Hence, if the matrix (10) is viewed as an “approximate sample” of A^T S^T S B − A^T B, then it is natural to use the ℓ∞-norm of the matrix (10) as an approximate sample of ε_t = ‖A^T S^T S B − A^T B‖_∞. Likewise, if we define the bootstrap sample

    ε*_t := ‖ (1/t) Σ_{i=1}^t ξ_i (D_i − A^T S^T S B) ‖_∞,    (11)

then the bootstrap algorithm will generate i.i.d. samples of ε*_t, conditionally on S. In turn, the (1 − α)-quantile of the bootstrap samples, say q̂_{1−α}(t), can be used to estimate q_{1−α}(t).

3.2. Multiplier bootstrap algorithm


We now explain how the proposed method can be implemented in just a few lines. This
description also reveals the important fact that the algorithm only requires access to
the sketches à and B̃ (rather than the full matrices A and B). Although the formula
for generating samples of ε?t given below may appear different from equation (11), it is
straightforward to check that these are equivalent. Lastly, the choice of the number of
bootstrap samples B will be discussed at the end of subsection 3.3.




Algorithm 1 (Multiplier bootstrap for ε_t).

Input: the number of bootstrap samples B, and the sketches à and B̃.
For b = 1, . . . , B do
  1. Draw an i.i.d. sample ξ_1, . . . , ξ_t from N(0, 1), independent of S;
  2. Compute the bootstrap sample ε*_{t,b} := ‖ ξ̄·(Ã^T B̃) − Ã^T Ξ B̃ ‖_∞, where ξ̄ := (1/t) Σ_{i=1}^t ξ_i and Ξ := diag(ξ_1, . . . , ξ_t).
Return: q̂_{1−α}(t) ←− the (1 − α)-quantile of the values ε*_{t,1}, . . . , ε*_{t,B}.
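In NumPy, Algorithm 1 can be transcribed as follows (a minimal sketch; A_sk and B_sk denote the sketches à = SA and B̃ = SB, and the default B = 20 mirrors the choice used in the experiments of Section 5).

import numpy as np

def multiplier_bootstrap_quantile(A_sk, B_sk, B_boot=20, alpha=0.01, seed=None):
    # Algorithm 1: estimate the (1 - alpha)-quantile of eps_t using only the sketches.
    rng = np.random.default_rng(seed)
    t = A_sk.shape[0]
    AtB_sk = A_sk.T @ B_sk                        # sketched product
    samples = np.empty(B_boot)
    for b in range(B_boot):
        xi = rng.standard_normal(t)               # multipliers xi_1, ..., xi_t ~ N(0,1)
        boot = xi.mean() * AtB_sk - A_sk.T @ (xi[:, None] * B_sk)
        samples[b] = np.max(np.abs(boot))         # l_inf norm of the bootstrap matrix
    return np.quantile(samples, 1 - alpha)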

3.3. Saving on computation with extrapolation


In its basic form, the cost of Algorithm 1 is O(B · t · d · d′), which has the favorable property of being independent of the large dimension n. Also, the computation of the samples ε*_{t,1}, . . . , ε*_{t,B} is embarrassingly parallel, with the cost of each sample being O(t · d · d′).
Moreover, due to the way that the quantile q1−α (t) scales with t, it is possible to reduce
the cost of Algorithm 1 even further — via the technique of extrapolation (also called
Richardson extrapolation) (Sidi, 2003; Brezinski and Zaglia, 2013).
The essential idea of extrapolation is to carry out Algorithm 1 for a modest “initial”
sketch size t0 , and then use an initial estimate q̂1−α (t0 ) to “look ahead” and predict a larger
value t for which q1−α (t) is small enough to satisfy the user’s desired level of accuracy.
The immediate benefit of this approach is that Algorithm 1 only needs to be applied to small
“initial versions” of à and B̃, each with t0 rows, which reduces the cost of the algorithm
to O(B · t0 · d · d0 ). Furthermore, this means that if Algorithm 1 is run in parallel, then it is
only necessary to communicate copies of the small initial sketching matrices. (To illustrate
the small size of the initial sketching matrices, our experiments include several examples
where the ratio t0 /n is approximately 1/100 or less.)
From a theoretical viewpoint, our use of extrapolation is based on the approximation q_{1−α}(t) ≈ κ/√t, where t is sufficiently large, and κ = κ(A, B, α) is an unknown number. A formal justification for this approximation can be made using Proposition 3 in Appendix A, but it is simpler to give an intuitive explanation here. Recall from Section 3.1 that as t becomes large, the (j_1, j_2) entry [Ã^T B̃ − A^T B]_{j1,j2} should be well-approximated in distribution by a Gaussian random variable of the form (1/√t) G_{j1,j2}. In turn, this suggests that ε_t should be well-approximated in distribution by (1/√t) max_{j1,j2} |G_{j1,j2}|, which has quantiles that are proportional to 1/√t.
In order to take advantage of the theoretical scaling q_{1−α}(t) ≈ κ/√t, we may use Algorithm 1 to compute q̂_{1−α}(t_0) with an initial sketch size t_0, and then approximate the value q_{1−α}(t) for t ≫ t_0 with the following extrapolated estimator

    q̂^{ext}_{1−α}(t) := (√t_0 / √t) · q̂_{1−α}(t_0).    (12)

Hence, if the user would like to determine a sketch size t so that q_{1−α}(t) ≤ ϵ, for some tolerance ϵ, then t should be selected so that q̂^{ext}_{1−α}(t) ≤ ϵ, which is equivalent to

    t ≥ t_0 · ( q̂_{1−α}(t_0) / ϵ )².    (13)


In our experiments in Section 5, we illustrate some examples where an accurate estimate of q_{1−α}(t) at t = 10,000 can be obtained from the rule (13) using an initial sketch size t_0 ≈ 500, yielding a roughly 20-fold speedup on the basic version of Algorithm 1.
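In code, the two rules above amount to the following (an illustrative sketch; q_hat_t0 is an estimate produced by Algorithm 1 at the initial sketch size t_0, for instance by the multiplier_bootstrap_quantile sketch given earlier).

import numpy as np

def extrapolated_quantile(q_hat_t0, t0, t):
    # Rule (12): q_{1-alpha}(t) is approximated by sqrt(t0 / t) * q_hat(t0).
    return np.sqrt(t0 / t) * q_hat_t0

def sketch_size_for_tolerance(q_hat_t0, t0, eps):
    # Rule (13): smallest t whose extrapolated estimate falls below the tolerance eps.
    return int(np.ceil(t0 * (q_hat_t0 / eps) ** 2))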
Comparison with the cost of sketching. Given that the purpose of Algorithm 1 is to enhance sketching methods, it is important to understand how the added cost of the bootstrap compares to the cost of running sketching methods in the standard way. As a point of reference, we compare with the cost of computing A^T S^T S B when S is chosen to be an SRHT matrix, since this is one of the most efficient sketching methods. If we temporarily assume for simplicity that A and B are both of size n × d, then it follows from Section 2 that computing A^T S^T S B has a cost of order O(t·d² + n·d·log(t)). Meanwhile, the cost of running Algorithm 1 with the extrapolation speedup based on an initial sketch size t_0 is O(B·t_0·d²). Consequently, the extra cost of the bootstrap does not exceed the stated cost of sketching when the number of bootstrap samples satisfies

    B = O( t/t_0 + n·log(t)/(d·t_0) ),    (14)

and in fact, this could be improved further if parallelization of Algorithm 1 is taken into account. It is also important to note that rather small values of B are shown to work well in our experiments, such as B = 20. Hence, as long as t_0 remains fairly small compared to t, the condition (14) may be expected to hold, and this is borne out in our experiments. The same reasoning also applies when n·log(t) ≫ d·t_0, which conforms with the fact that sketching methods are intended to handle situations where n is very large.

3.4. Relation with the non-parametric bootstrap


For readers who are more familiar with the “non-parametric bootstrap” (based on sampling
with replacement), the purpose of this short subsection is to explain the relationship with
the multiplier bootstrap in Algorithm 1. Indeed, an understanding of this relationship
may be helpful, since the non-parametric bootstrap might be viewed as more intuitive,
and perhaps easier to generalize to more complex situations. However, it turns out that
Algorithm 1 is technically more convenient to analyze, and that is why the paper focuses
primarily on Algorithm 1. Meanwhile, from a practical point of view, there is little difference
between the two approaches, since both have the same order of computational cost, and in
our experience, we have observed essentially the same performance in simulations. Also,
the extrapolation technique can be applied to both algorithms in the same way.
To spell out the connection, the only place where Algorithm 1 needs to be changed is
in step 1. Rather than choosing the multiplier variables ξ1 , . . . , ξt to be i.i.d. N (0, 1) as
in Algorithm 1, the non-parametric bootstrap chooses ξi = ζi − 1, where (ζ1 , . . . , ζt ) is a
sample from a multinomial distribution, based on tossing t balls into t equally likely bins,
where ζi is the number of balls in bin i. Hence, the mean and variance of each ξi are nearly
the same as before, with E[ξi ] = 0 and var(ξi ) = 1 − 1/t, but the variables ξ1 , . . . , ξt are no
longer independent.
From a more algorithmic viewpoint, it is simple to check that the choice of ξ1 , . . . , ξt
based on the multinomial distribution is equivalent to sampling with replacement from the
rows of à and B̃. The underlying intuition for this approach is based on the fact that for
many types of sketching matrices, the rows of S are i.i.d., which makes the rows of à i.i.d.,


and likewise for B̃. Hence, if S is conditioned upon, then sampling with replacement from
the rows of à and B̃ imitates the random mechanism that originally generated à and B̃.

Algorithm 2 (Non-parametric bootstrap for ε_t).

Input: the number of samples B, and the sketches à and B̃.
For b = 1, . . . , B do
  1. Draw a vector (i_1, . . . , i_t) by sampling t numbers with replacement from {1, . . . , t}.
  2. Form matrices Ã* ∈ R^{t×d} and B̃* ∈ R^{t×d′} by selecting (respectively) the rows from à and B̃ that are indexed by (i_1, . . . , i_t).
  3. Compute the bootstrap sample ε*_{t,b} := ‖ (Ã*)^T (B̃*) − Ã^T B̃ ‖_∞.
Return: q̂_{1−α}(t) ←− the (1 − α)-quantile of the values ε*_{t,1}, . . . , ε*_{t,B}.
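A direct NumPy transcription of Algorithm 2 is given below (a minimal sketch; as in the earlier example, A_sk and B_sk denote the sketches à and B̃).

import numpy as np

def nonparametric_bootstrap_quantile(A_sk, B_sk, B_boot=20, alpha=0.01, seed=None):
    # Algorithm 2: resample the rows of the sketches with replacement and
    # recompute the sketched product for each bootstrap replicate.
    rng = np.random.default_rng(seed)
    t = A_sk.shape[0]
    AtB_sk = A_sk.T @ B_sk
    samples = np.empty(B_boot)
    for b in range(B_boot):
        idx = rng.integers(0, t, size=t)          # sample t indices with replacement
        resampled = A_sk[idx].T @ B_sk[idx]       # (A~*)^T (B~*)
        samples[b] = np.max(np.abs(resampled - AtB_sk))
    return np.quantile(samples, 1 - alpha)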

4. Main results
Our main results quantify how well the estimate q̂1−α (t) from Algorithm 1 approximates
the true value q1−α (t), and this will be done by analyzing how well the distribution of a
bootstrap sample ε?t,1 approximates the distribution of εt . For the purposes of comparing
distributions, we will use the Lévy-Prohorov metric, defined below.
Lévy-Prohorov (LP) metric. Let L(U) denote the distribution of a random variable U, and let B denote the collection of Borel subsets of R. For any A ∈ B, and δ > 0, define the δ-neighborhood A^δ := { x ∈ R | inf_{y∈A} |x − y| ≤ δ }. Then, for any two random variables U and V, the d_LP metric between their distributions is given by

    d_LP(L(U), L(V)) := inf{ δ > 0 | P(U ∈ A) ≤ P(V ∈ A^δ) + δ for all A ∈ B }.

The dLP metric is a standard tool for comparing distributions, due to the fact that
convergence with respect to dLP is equivalent to convergence in distribution (Huber and
Ronchetti, 2009, Theorem 2.9).
Approximating quantiles. An important property of the d_LP metric is that if two distributions are close in this metric, then their quantiles are close in the following sense. Recall that if F_U is the distribution function of a random variable U, then the (1−α)-quantile of U is the same as the generalized inverse F_U^{−1}(1 − α) := inf{ q ∈ [0, ∞) | F_U(q) ≥ 1 − α }. Next, suppose that two random variables U and V satisfy

    d_LP( L(U), L(V) ) ≤ ϵ,

for some ϵ ∈ (0, α) with α ∈ (0, 1/2). Then, the quantiles of U and V are close in the sense that

    | F_U^{−1}(1 − α) − F_V^{−1}(1 − α) | ≤ ψ_α(ϵ),    (15)

where the function ψ_α(ϵ) := F_U^{−1}(1 − α + ϵ) − F_U^{−1}(1 − α − ϵ) + ϵ is strictly monotone, and satisfies ψ_α(0) = 0. (For a proof, see Lemma 15 of Appendix F.) In light of this fact, it will


be more convenient to express our results for approximating q1−α (t) in terms of the dLP
metric.

4.1. Statements of results


Our main assumption involves three separate cases, corresponding to different choices of
the sketching matrix S.

Assumption 1 The dimensions d and d′ satisfy d ≍ d′. Also, there is a positive absolute constant κ ≥ 1 such that d^{1/κ} ≲ t ≲ d^κ, which is to say that neither d nor t grows exponentially with the other. In addition, one of the following sets of conditions holds, involving the parameter ν(A, B) := √( ‖A^T A‖_∞ ‖B^T B‖_∞ ).

(a) (Sub-Gaussian case). The entries of the matrix S = [S_{i,j}] are zero-mean i.i.d. sub-Gaussian random variables, with E[S_{i,j}²] = 1/t, and max_{i,j} ‖√t S_{i,j}‖_{ψ_2} ≲ 1. Furthermore, t ≳ ν(A, B)^{2/3} (log d)^5.

(b) (Length sampling case). The matrix S is generated by length sampling, with the probabilities in equation (5), and also, t ≳ (‖A‖_F ‖B‖_F)^{2/3} (log d)^5.

(c) (SRHT case). The matrix S is an SRHT matrix as defined in equation (6), and also, t ≳ ν(A, B)^{2/3} (log n)^2 (log d)^5.

Clarifications on bootstrap approximation. Before stating our main results below,


it is worth clarifying a few technical items. First, since our analysis involves central limit type approximations of Ã^T B̃ − A^T B as a sum of t independent matrices, we will rescale the error variables by a factor of √t, obtaining

    Z_t := √t ε_t,    (16)

as well as its bootstrap analogue,

    Z*_t := √t ε*_t.    (17)

With regard to the original problem of estimating the quantile q_{1−α}(t) for ε_t, this rescaling makes no essential difference, since quantiles are homogeneous with respect to scaling, and in particular, the (1 − α)-quantile of Z_t is simply √t q_{1−α}(t).
As a second clarification, recall that the bootstrap method generates samples ε*_t based upon a particular realization of S. For this reason, the bootstrap approximation to L(Z_t) is the conditional distribution L(Z*_t | S). Consequently, it should be noted that L(Z*_t | S) is a random probability measure, and d_LP( L(Z_t), L(Z*_t | S) ) is a random variable, since they both depend on the random matrix S.


Theorem 1 Let h(x) = x^{1/2} + x^{3/4} for x ≥ 0. If Assumption 1 (a) holds, then there is an absolute constant c > 0 such that the following bound holds with probability at least 1 − 1/t − 1/(dd′),

    d_LP( L(Z_t), L(Z*_t | S) ) ≤ c · h(ν(A, B)) · √(log d) / t^{1/8}.

If Assumption 1 (b) holds, then there is an absolute constant c > 0 such that the following bound holds with probability at least 1 − 1/t − 1/(dd′),

    d_LP( L(Z_t), L(Z*_t | S) ) ≤ c · h(‖A‖_F ‖B‖_F) · √(log d) / t^{1/8}.
Remarks. A noteworthy property of the bounds is that they are dimension-free with
respect to the large dimension n. Also, they have a very mild logarithmic dependence on d.
With regard to the dependence on t, there are two other important factors to keep in mind.
First, the practical performance of the bootstrap method (shown in Section 5) is much
better than what the t−1/8 rate suggests. Second, the problem of finding the optimal rates
of approximation for multiplier bootstrap methods is a largely open problem — even in
the simpler setting of bootstrapping the coordinate-wise maximum of vectors (rather than
matrices). In the vector context, the literature has focused primarily on the Kolmogorov
metric (rather than the LP metric), and some quite recent improvements beyond the t−1/8
rate have been developed in Chernozhukov et al. (2017) and Lopes et al. (2018a). However,
these works also rely on model assumptions that would lead to additional restrictions on the
matrices A and B in our setup. Likewise, the problem of extending our results to achieve
faster rates or handle other metrics is a natural direction for future work.
The SRHT case. For the case of SRHT matrices, the analogue of Theorem 1 needs to
be stated in a slightly different way for technical reasons. From a qualitative standpoint,
the results for SRHT and sub-Gaussian matrices turn out to be similar.
The technical issue to be handled is that the rows of an SRHT matrix are not independent, due to their common dependence on the matrix D°_n. Fortunately, this inconvenience can be addressed by conditioning on D°_n. Theoretically, this simplifies the analysis of the bootstrap, since it “decouples” the rows of the SRHT matrix. Meanwhile, if we let q̃_{1−α}(t) denote the (1 − α)-quantile of the distribution L(ε_t | D°_n),

    q̃_{1−α}(t) := inf{ q ∈ [0, ∞) | P(ε_t ≤ q | D°_n) ≥ 1 − α },

then it is simple to check that q̃_{1−α}(t) acts as a “surrogate” for q_{1−α}(t), since‡

    P(ε_t ≤ q̃_{1−α}(t)) = E[ P(ε_t ≤ q̃_{1−α}(t) | D°_n) ] ≥ E[1 − α] = 1 − α.    (18)

‡ It is also possible to show that q̃_{1−α}(t) fluctuates around q_{1−α}(t). Indeed, if we define the random variable V := P(ε_t ≤ q_{1−α}(t) | D°_n), it can be checked that the event {V ≥ 1 − α} is equivalent to the event {q̃_{1−α}(t) ≤ q_{1−α}(t)}. Furthermore, if we suppose that 1 − α lies in the range of the c.d.f. of ε_t, then E[V] = 1 − α. In turn, it follows that the event {q̃_{1−α}(t) ≤ q_{1−α}(t)} occurs when V ≥ E[V], and conversely, the event {q̃_{1−α}(t) > q_{1−α}(t)} occurs when V < E[V].


For this reason, we will view q̃1−α (t) as the new parameter to estimate (instead of q1−α (t)),
and accordingly, the aim of the following result is to quantify how well the bootstrap
distribution L(Zt? |S) approximates the conditional distribution L(Zt |D◦n ).

Theorem 2 Let h(x) = x^{1/2} + x^{3/4} for x ≥ 0. If Assumption 1 (c) holds, then there is an absolute constant c > 0 such that the following bound holds with probability at least 1 − 1/t − 1/(dd′) − c/n,

    d_LP( L(Z_t | D°_n), L(Z*_t | S) ) ≤ c · h(ν(A, B) log(n)) · √(log d) / t^{1/8}.
Remarks. Up to a factor involving log(n), the bound for SRHT matrices matches that
for sub-Gaussian matrices. Meanwhile, from a more practical standpoint, our empirical
results will show that the bootstrap’s performance for SRHT matrices is generally similar
to that for both sub-Gaussian and length-sampling matrices.
Further discussion of results. To comment on the role of ν(A, B) and kAkF kBkF in
Theorems 1 and 2, it is possible to interpret them as problem-specific “scale parameters”.
Indeed, it is natural that the bounds on dLP should increase with the scale of A and B for
the following reason. Namely, if A or B is multiplied by a scale factor κ > 0, then it can
be checked that the quantile error |q̂1−α (t) − q1−α (t)| will also change by a factor of κ, and
furthermore, the inequality (15) demonstrates a monotone relationship between the sizes of
the quantile error and the dLP error. For this reason, the bootstrap may still perform well
in relation to the scale of the problem when the magnitudes of the parameters ν(A, B) and
kAkF kBkF are large. Alternatively, this idea can be seen by noting that the dLP bounds
can be made arbitrarily small by simply changing the units used to measure the entries of
A and B.
Beyond these considerations, it is still of interest to compare the results for different sketching matrices once a particular scaling has been fixed. For concreteness, consider a scaling where the spectral norms of A and B satisfy ‖A‖_2 ≍ ‖B‖_2 ≍ 1. (As an example, if we view A^T A as a sample covariance matrix, then the condition ‖A‖_2 ≍ 1 simply means that the largest principal component score is of order 1.) Under this scaling, it is simple to check that ν(A, B) = O(1), and ‖A‖_F ‖B‖_F = O(√(r(A) r(B))), where r(A) := ‖A‖_F²/‖A‖_2² is the “stable rank”. In particular, note that if A and B are approximately low rank, as is common in applications, then r(A) ≪ d, and r(B) ≪ d′. Accordingly, we may conclude that if the conditions of Theorems 1 and 2 hold, then bootstrap consistency occurs under the following limits

    √(log d) / t^{1/8} = o(1) in the sub-Gaussian case,    (19)

    (r(A) r(B))^{3/8} √(log d) / t^{1/8} = o(1) in the length-sampling case,    (20)

    (log n)^{3/4} √(log d) / t^{1/8} = o(1) in the SRHT case,    (21)

where we have used the simplifying assumption that d ≍ d′.


5. Experiments
This section outlines a set of experiments for evaluating the performance of Algorithm 1
with the extrapolation speed-up described in Section 3.3. The experiments involved both
synthetic and natural matrices, as described below.
Synthetic matrices. In order to generate the matrix A ∈ R^{n×d} synthetically, we selected the factors of its singular value decomposition A = U diag(σ) V^T in the following ways, fixing n = 30,000 and d = 1,000. In previous work, a number of other experiments in randomized matrix computations have been designed along these lines (Ma et al., 2014; Yang et al., 2016).
The factor U ∈ R^{n×d} was selected as the Q factor from the reduced QR factorization of a random matrix X ∈ R^{n×d}. The rows of X were sampled i.i.d. from a multivariate t-distribution, t_2(μ, C), with 2 degrees of freedom, mean μ = 0, and covariance c_{ij} = 2 × 0.5^{|i−j|}, where C = [c_{ij}]. (This choice causes the matrix A to have high row-coherence, which is of interest, since this is a challenging case for sampling-based sketching matrices.) Next, the factor V ∈ R^{d×d} was selected as the Q factor from a QR factorization of a d × d matrix with i.i.d. N(0, 1) entries. For the singular values σ ∈ R^d_+, we chose two options, leading to either a low or high stable rank r(A) = ‖A‖_F²/‖A‖_2². In the low stable rank case, we put σ_i = 10^{κ_i} for a set of equally spaced values κ_i between 0 and −6, yielding r(A) = 36.7. Alternatively, in the high stable rank case, the entries of σ were equally spaced between 0.1 and 1, yielding r(A) = 370.1. Finally, to make all numerical comparisons on a common scale, we normalized A so that ‖A^T A‖_∞ = 1.
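The synthetic construction just described can be reproduced along the following lines (an illustrative sketch; the multivariate t_2 rows are generated by the standard Gaussian-over-chi construction with C as the scale matrix, and the exact random seeds used by the authors are of course not specified).

import numpy as np

def synthetic_matrix(n=30000, d=1000, low_stable_rank=True, seed=None):
    # Build A = U diag(sigma) V^T with U from the QR of a matrix having
    # multivariate t_2 rows, and V from the QR of a Gaussian matrix.
    rng = np.random.default_rng(seed)
    idx = np.arange(d)
    C = 2.0 * 0.5 ** np.abs(idx[:, None] - idx[None, :])   # c_ij = 2 * 0.5^|i-j|
    L = np.linalg.cholesky(C)
    Z = rng.standard_normal((n, d)) @ L.T                   # Gaussian rows with scale C
    chi = np.sqrt(rng.chisquare(2, size=(n, 1)) / 2)        # t_2 = Gaussian / sqrt(chi2_2 / 2)
    U, _ = np.linalg.qr(Z / chi)
    V, _ = np.linalg.qr(rng.standard_normal((d, d)))
    if low_stable_rank:
        sigma = 10.0 ** np.linspace(0, -6, d)               # sigma_i = 10^{kappa_i}
    else:
        sigma = np.linspace(0.1, 1.0, d)
    A = (U * sigma) @ V.T
    return A / np.sqrt(np.max(np.abs(A.T @ A)))             # normalize so ||A^T A||_inf = 1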
Natural matrices. We also conducted experiments on five natural data matrices A
from the LIBSVM repository Chang and Lin (2011), named ‘Connect’, ‘DNA’, ‘MNIST’,
‘Mushrooms’, and ‘Protein’, with the same normalization that was used for the synthetic
matrices. These datasets are briefly summarized in Table 1.

Table 1: A summary of the natural datasets.

Dataset    Connect    DNA      MNIST    Mushrooms    Protein
n          67,557     2,000    60,000   8,124        17,766
d          126        180      780      112          356

5.1. Design of experiments


For each matrix A, natural or synthetic, we considered the task of estimating the quantile q_{0.99}(t) for the random sketching error ε_t = ‖A^T A − A^T S^T S A‖_∞. The sketching matrix S ∈ R^{t×n} was allowed to be one of three types: Gaussian projection, length-sampling, and SRHT, as described in Section 2.
Ground truth values. The ground truth values for q0.99 (t) were constructed in the
following way. For each matrix A, a grid of t values was specified, ranging from d/2 up to a
larger number as high as 10d or 20d, depending on A. Next, for each t value, and for each
type of sketching matrix, we used 1,000 realizations of S ∈ Rt×n , yielding 1,000 realizations


of the random variable εt . In turn, the 0.99 sample quantile of the 1,000 realizations of εt
was treated as the true value of q0.99 (t), and this appears as the black curve in all plots.
Extrapolated estimates. With regard to the bootstrap extrapolation method in Section 3.3, we fixed the value t_0 = d/2 as the initial sketch size to extrapolate from. For each A, and each type of sketching matrix, we applied Algorithm 1 to each of the 1,000 realizations of à = SA ∈ R^{t_0×d} generated previously. Each time Algorithm 1 was run, we used the modest choice of B = 20 for the number of bootstrap samples. From each set of 20 bootstrap samples, we used the 0.99 sample quantile as the estimate q̂_{0.99}(t_0).§ Hence, there were 1,000 realizations of q̂_{0.99}(t_0) altogether. Next, we used the scaling rule in equation (12) to obtain 1,000 realizations of the extrapolated estimate q̂^{ext}_{0.99}(t) for values t ≥ t_0.
In order to illustrate the variability of the estimate q̂^{ext}_{0.99}(t) over the 1,000 realizations, we plot three different curves as a function of t. The blue curve represents the average value of q̂^{ext}_{0.99}(t), while the green and yellow curves respectively correspond to the estimates ranking 100th and 900th out of the 1,000 realizations.

5.2. Comments on numerical results


Overall, the numerical results for the bootstrap extrapolation method are quite encouraging, and to a large extent, the method is accurate across many choices of A and S. Given that the blue curves representing E[q̂^{ext}_{0.99}(t)] are closely aligned with the black curves for q_{0.99}(t), we see that the extrapolated estimate is essentially unbiased. Moreover, the variance of the estimate is fairly low, as indicated by the small gap between the green and yellow curves. The low variance is also notable when considered in light of the fact that only B = 20 bootstrap samples are used to construct q̂^{ext}_{0.99}(t), since the variance should decrease as B becomes larger.
With attention to the extrapolation rule (12), there are two main points to note. First, the plots show that the extrapolation may be initiated at fairly low values of t_0, which are much less than the sketch sizes needed to achieve a small sketching error ε_t. Second, we see that q̂^{ext}_{0.99}(t) remains accurate for t much larger than t_0, well up to t = 10,000 and perhaps even farther. Consequently, the results show that the extrapolation technique is capable of saving quite a bit of computation without much detriment to statistical performance.
To consider the relationship between theory and practice, one basic observation is that all three types of sketching matrices obey roughly similar bounds in Theorems 1 and 2, and indeed, we also see generally similar numerical performance among the three types. At a more fine-grained level however, the Gaussian and SRHT sketching matrices tend to produce estimates q̂^{ext}_{0.99}(t) with somewhat higher variance than in the case of length sampling. Another difference between theory and simulation is that the actual performance of the method seems to be better than what the theory suggests, since the estimates are accurate at values of t_0 that are much smaller than what would be expected from the rates in Theorems 1 and 2.
§ Note that since 19/20 = 0.95 and 20/20 = 1, the 0.99 quantile was obtained by an interpolation rule.


(a) Low stable rank data.
(b) High stable rank data.

Figure 2: Results for synthetic matrices. The black line represents q_{0.99}(t) as a function of t. The blue star is the average bootstrap estimate at the initial sketch size t_0 = d/2 = 500, and the blue line represents the average extrapolated estimate E[q̂^{ext}_{0.99}(t)] derived from the starting value t_0. To display the variability of the estimates, the green and yellow curves correspond to the 100th and 900th largest among the 1,000 realizations of q̂^{ext}_{0.99}(t) at each t.

6. Conclusions and extensions


In this paper, we have focused on estimating the quantile q1−α (t) as a way of addressing
two fundamental issues in randomized matrix multiplication: (1) knowing how accurate a
given sketched product is, and (2) knowing how much computation is needed to achieve a
specified degree of accuracy. With regard to methodology, our approach is relatively novel
in that it uses the statistical technique of bootstrapping to serve a computational purpose
— by quantifying the error of a randomized sketching algorithm. A second important
component of our method is the extrapolation technique, which ensures that the cost of
estimating q1−α (t) does not substantially increase the overall cost of standard sketching
methods. Furthermore, our numerical results show that the extrapolated estimate is quite
accurate in a variety of different situations, suggesting that our method may offer a general
way to enhance sketching algorithms in practice.
Extensions. More generally, the problems we have addressed for randomized matrix
multiplication arise for many other large-scale matrix computations. Hence, it is natural to
consider extensions of our approach to more complex settings, and in the remainder of this
section, we briefly mention a few possibilities for future study.


(a) Connect (n = 67,557 and d = 126).
(b) DNA (n = 2,000 and d = 180).
(c) MNIST (n = 60,000 and d = 780).
(d) Mushrooms (n = 8,124 and d = 112).
(e) Protein (n = 17,766 and d = 356).

Figure 3: Results for natural matrices. The results for the natural matrices are plotted in the same way as described in the caption for the results on the synthetic matrices.


At a high level, each of the applications below deals with an object, say Θ, that is difficult to compute, as well as a randomized approximation, say Θ̃, that is built from a sketching matrix S with t rows. Next, if we consider the random error variable

    ε_t = ‖Θ̃ − Θ‖,

for an unspecified norm ‖·‖, then the problem of estimating the relationship between accuracy and computation can again be viewed as the problem of estimating the quantile function q_{1−α}(t) associated with ε_t. In turn, this leads to the question of how to develop a new bootstrap procedure that can generate approximate samples of ε_t, yielding an estimate q̂_{1−α}(t). However, instead of starting from the multiplier bootstrap (Algorithm 1) as before, it may be conceptually easier to extend the non-parametric bootstrap (Algorithm 2), because the latter bootstrap can be viewed as a “plug-in” procedure that replaces A^T B with Ã^T B̃, and replaces Ã^T B̃ with (Ã*)^T (B̃*).

• Linear regression. Consider a multi-response linear regression problem, where the rows of B ∈ R^{n×d′} are response vectors, and the rows of A ∈ R^{n×d} are input observations. The optimal solution to ℓ_2-regression is given by

    W_opt = argmin_{W ∈ R^{d×d′}} ‖AW − B‖_F² = (A^T A)^† A^T B,

which has O(nd² + ndd′) cost. In the case where max{d, d′} ≪ n, the matrix multiplications are a computational bottleneck, and an approximate solution can be obtained via

    W̃_opt = (Ã^T Ã)^† (Ã^T B̃),

which has a cost O(td² + tdd′) + C_sketch, where C_sketch is the cost of matrix sketching (Drineas et al., 2006b, 2011, 2012; Clarkson and Woodruff, 2013). In order to estimate the quantile function associated with the error variable ε_t = ‖W̃_opt − W_opt‖, we could consider generating bootstrap samples of the form ε*_t = ‖W̃*_opt − W̃_opt‖, where W̃*_opt = ((Ã*)^T (Ã*))^† (Ã*)^T (B̃*). For recent results in the case where W is a vector, we refer to the paper (Lopes et al., 2018b). (A small code sketch of this plug-in bootstrap appears after this list.)

• Functions of covariance matrices. If the rows of the matrix A are viewed as a sample of observations, then inferences on the population covariance structure are often based on functions of the form ψ(A^T A). For instance, the function ψ(A^T A) could be the top eigenvector, a set of eigenvalues, the condition number, or a test statistic. In any of these cases, if ψ(Ã^T Ã) is used as a fast approximation (Dasarathy et al., 2015), then the sketching error ε_t = ‖ψ(Ã^T Ã) − ψ(A^T A)‖ might be bootstrapped using ε*_t = ‖ψ((Ã*)^T (Ã*)) − ψ(Ã^T Ã)‖.

• Approximate Newton methods. In large-scale applications, Newton's method is often impractical, since it involves the costly processing of a Hessian matrix. As an example, consider an optimization problem arising in binary classification, where the rows of X ∈ R^{n×d} are observations x_1, . . . , x_n ∈ R^d, and y_1, . . . , y_n ∈ {0, 1} are labels. If an ℓ_2-regularized logistic classifier is used, this leads to minimizing the objective


function f(w) = Σ_{i=1}^n log(1 + e^{−y_i w^T x_i}) + (γ/2)‖w‖_2² over coefficient vectors w in R^d. The associated Newton step, with step size κ, is

    w ← w − κ H^{−1} ∇f,

involving the Hessian

    H = A^T A + γ I_d,  where  A = diag( 1 + e^{y_1 w^T x_1}, . . . , 1 + e^{y_n w^T x_n} )^{−1} X.

If d ≪ n, the cost of Newton's method is dominated by the formation of H at each iteration, and the Hessian matrix can be approximated by the sketched version H̃ = Ã^T Ã + γ I_d, which reduces the per-iteration cost from O(nd²) to O(td² + nd) + C_sketch (Pilanci and Wainwright, 2017; Roosta-Khorasani and Mahoney, 2016; Xu et al., 2016). In this context, the quality of the approximate Newton step could be assessed in terms of the error

    ε_t = ‖H̃^{−1} ∇f − H^{−1} ∇f‖,

and in turn, this might be bootstrapped using ε*_t = ‖(H̃*)^{−1} ∇f − H̃^{−1} ∇f‖, where H̃* = (Ã*)^T (Ã*) + γ I_d.
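To make the “plug-in” idea concrete for the regression example above, here is a minimal sketch (an illustration added here, not the paper's implementation): it bootstraps the sketched least-squares solution by resampling rows of the sketches, re-solving, and recording the change in the solution, with the entrywise maximum used as the norm purely for illustration.

import numpy as np

def ls_bootstrap_quantile(A_sk, B_sk, B_boot=20, alpha=0.01, seed=None):
    # Plug-in bootstrap for the sketched least-squares solution W~_opt.
    rng = np.random.default_rng(seed)
    t = A_sk.shape[0]
    W_sk = np.linalg.pinv(A_sk.T @ A_sk) @ (A_sk.T @ B_sk)       # W~_opt
    samples = np.empty(B_boot)
    for b in range(B_boot):
        idx = rng.integers(0, t, size=t)                          # resample sketch rows
        W_star = np.linalg.pinv(A_sk[idx].T @ A_sk[idx]) @ (A_sk[idx].T @ B_sk[idx])
        samples[b] = np.max(np.abs(W_star - W_sk))                # entrywise norm (illustrative)
    return np.quantile(samples, 1 - alpha)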

Acknowledgments

We thank the anonymous reviewers for their helpful suggestions. MEL thanks the National
Science Foundation for partial support under grant DMS-1613218. MWM would like
to thank the National Science Foundation, the Army Research Office, and the Defense
Advanced Research Projects Agency for providing partial support of this work.

Appendices
Outline of appendices. Appendix A explains the main conceptual ideas underlying the
proofs of Theorems 1 and 2. In particular, the proofs of these theorems will be decomposed
into two main results: Propositions 3 and 4, which are given in Appendix A.
Appendix B will prove the sub-Gaussian case of Proposition 3, and Appendix C will
prove the sub-Gaussian case of Proposition 4. Later on, Appendices D and E, will explain
how the arguments can be changed to handle the length-sampling and SRHT cases.

Conventions used in proofs. If either of the matrices A or B is 0, then ε_t has a


trivial point-mass distribution at 0. In this degenerate case, it is simple to check that the
bootstrap produces an exact approximation. So, without loss of generality, all proofs are
written under the assumption that A and B are non-zero. Next, since Assumption 1 is
formulated using the ≲ notation, there is no loss of generality in carrying out calculations under the assumption that all the numbers t, n, d, d′ are at least 8, which will ensure that
quantities such as log(d) are greater than 2. Lastly, if a numbered lemma is invoked in the
middle of a proof, the lemma may be found in Appendix F.


Appendix A. Gaussian and bootstrap approximations


Section A.1 introduces some notation that helps us to analyze the rescaled sketching error
Zt from the viewpoint of empirical processes. Next, in Section A.2, Theorem 1 will be
decomposed into two propositions that compare Zt and Zt? with the maximum of a suitable
Gaussian process. The proofs of these propositions may be found in Appendices B and C.

A.1. Making a link between empirical processes and sketching error


The main idea of our analysis is to view Zt as the maximum of an empirical process, which
we now define. Recall the notation
Di = AT si sTi B and Mi = AT (si sTi − In )B.
0
Let Gt (·) be the empirical process that acts on linear functions f : Rd×d → R, according
to
t t
1 X  1 X
Gt (f ) := √ f (Di ) − f (AT B) = √ f (Mi ).
t i=1 t i=1
For future reference, we also define the corresponding bootstrap process
t
1 X  
G?t (f ) := √ ξi · f (Di ) − f AT ST SB ,
t i=1

where ξ1 , . . . , ξt are i.i.d. N (0, 1) and independent of S.


Next, we define a certain collection F of linear functions from R^{d×d'} to R. Let j1 ∈ [d],
j2 ∈ [d'], s ∈ {−1, 1}, and j := (j1, j2, s). Then, for any matrix W ∈ R^{d×d'}, we put

    fj(W) := tr(CjᵀW),

where Cj := s·ej1 ej2ᵀ ∈ R^{d×d'}, and ej1 ∈ R^d, ej2 ∈ R^{d'} are standard basis vectors. In words,
the function fj merely picks out the (j1, j2) entry of W, and multiplies it by the sign s.
Likewise, let J be the collection of all the triples j, and define the class of linear functions

    F := { fj | j ∈ J }.

Clearly, card(F) = 2dd'. Under this definition, it is simple to check that Zt and Zt*, defined
in equations (16) and (17), can be expressed as

    Zt = max_{fj∈F} Gt(fj),   and   Zt* = max_{fj∈F} Gt*(fj).
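As a concrete illustration of these definitions, the short Python sketch below computes Zt and a handful of bootstrap replicates Zt* directly from the displays above. It assumes a Gaussian sketching matrix with i.i.d. N(0, 1/t) entries purely for illustration, and all function and variable names are ours.

    import numpy as np

    def sketch_and_bootstrap(A, B, t, n_boot=50, seed=0):
        """Minimal sketch of Z_t and bootstrap replicates Z_t^* (illustrative only).

        Assumes a Gaussian sketching matrix S with i.i.d. N(0, 1/t) entries, so
        that E[S^T S] = I_n; names here are ours, not the paper's.
        """
        rng = np.random.default_rng(seed)
        n = A.shape[0]
        S = rng.normal(scale=1.0 / np.sqrt(t), size=(t, n))
        s = np.sqrt(t) * S                       # rescaled rows s_1, ..., s_t
        AS, BS = s @ A, s @ B                    # row i holds s_i^T A and s_i^T B
        D = AS[:, :, None] * BS[:, None, :]      # D_i = A^T s_i s_i^T B
        D_bar = D.mean(axis=0)                   # equals A^T S^T S B
        Z_t = np.sqrt(t) * np.abs(D_bar - A.T @ B).max()
        Z_star = np.empty(n_boot)
        for b in range(n_boot):
            xi = rng.standard_normal(t)          # Gaussian multipliers
            Z_star[b] = np.abs((xi[:, None, None] * (D - D_bar)).sum(axis=0)).max() / np.sqrt(t)
        return Z_t, Z_star

Because each fj only selects a signed entry, the maxima over F reduce to entrywise maxima of absolute values, which is what the last two computations exploit.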

A.2. Statements of the approximation results


Theorems 1 and 2 are obtained by combining the following two results (Propositions 3 and 4)
via the triangle inequality. In essence, these results are based on a comparison with the
maximum of a certain Gaussian process. More specifically, let G : F → R be a zero-mean
Gaussian process whose covariance structure is defined according to

    E[ G(fj) G(fk) ] = cov( fj(D1), fk(D1) )
                     = E[ fj(D1) fk(D1) ] − fj(AᵀB) fk(AᵀB),    (22)

for all j, k ∈ J. In turn, define the following random variable as the maximum of this
Gaussian process,

    Z := max_{fj∈F} G(fj).
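For intuition, the law of Z can also be approached by direct simulation: estimate the covariance in line (22) by Monte Carlo over independent copies of D1, and then sample the maximum of the resulting Gaussian vector. The sketch below does this for a Gaussian sketching distribution; it only illustrates what Z is (the proofs never require simulating Z), and the names are ours.

    import numpy as np

    def simulate_Z(A, B, n_mc=2000, n_draws=2000, seed=0):
        """Monte Carlo sketch of the comparison variable Z = max_j G(f_j).

        Illustration only: estimates the covariance in (22) for a Gaussian
        sketch by sampling D_1 = A^T s_1 s_1^T B repeatedly, then draws the
        maximum of the corresponding Gaussian vector.
        """
        rng = np.random.default_rng(seed)
        n, d = A.shape
        dp = B.shape[1]
        samples = np.empty((n_mc, d * dp))
        for m in range(n_mc):
            s1 = rng.standard_normal(n)              # satisfies E[s_1 s_1^T] = I_n
            samples[m] = np.outer(A.T @ s1, B.T @ s1).ravel()
        cov = np.cov(samples, rowvar=False)          # covariance of the entries of D_1
        draws = rng.multivariate_normal(np.zeros(d * dp), cov, size=n_draws)
        return np.abs(draws).max(axis=1)             # approximate samples of Z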

In order to handle the case of SRHT matrices, define another zero-mean Gaussian process
G̃ : F → R (conditionally on a fixed realization of D◦n) to have its covariance structure
given by

    E[ G̃(fj) G̃(fk) | D◦n ] = cov( fj(D1), fk(D1) | D◦n )
                            = E[ fj(D1) fk(D1) | D◦n ] − fj(AᵀB) fk(AᵀB),    (23)

and let Z̃ denote the maximum of the process G̃,

    Z̃ := max_{fj∈F} G̃(fj).

We are now in position to state the approximation results.

Proposition 3 (Gaussian approximation) Under Assumption 1 (a), the following bound
holds,

    dLP( L(Zt), L(Z) ) ≤ c · ν(A,B)^{3/4} · √(log(d)) / t^{1/8}.

Under Assumption 1 (b), the following bound holds,

    dLP( L(Zt), L(Z) ) ≤ c · (‖A‖F ‖B‖F)^{3/4} · √(log(d)) / t^{1/8}.

Under Assumption 1 (c), the following bound holds with probability at least 1 − c/n,

    dLP( L(Zt), L(Z̃|D◦n) ) ≤ c · ν(A,B)^{3/4} · (log(n))^{3/4} · √(log(d)) / t^{1/8}.
Proposition 4 (Bootstrap approximation) If Assumption 1 (a) holds, then the following
bound holds with probability at least 1 − 1/t − 1/(dd'),

    dLP( L(Z), L(Zt*|S) ) ≤ c · ν(A,B)^{1/2} · √(log(d)) / t^{1/8}.

If Assumption 1 (b) holds, then the following bound holds with probability at least 1 − 1/t − 1/(dd'),

    dLP( L(Z), L(Zt*|S) ) ≤ c · (‖A‖F ‖B‖F)^{1/2} · √(log(d)) / t^{1/8}.

If Assumption 1 (c) holds, then the following bound holds with probability at least
1 − 1/t − 1/(dd') − c/n,

    dLP( L(Z̃|D◦n), L(Zt*|S) ) ≤ c · ν(A,B)^{1/2} · log(n)^{1/2} · √(log(d)) / t^{1/8}.


Appendix B. Proof of Proposition 3, part (a)


Let A ⊂ R be a Borel set. Due to Theorem 3.1 from the paper Chernozhukov et al. (2016),
we have for any δ > 0,

    P(Zt ∈ A) ≤ P(Z ∈ A^{cδ}) + ( c log²(d) / (δ³ √t) ) · ( Lt + Kt(δ) + Jt(δ) ),    (24)

where we define the following non-random quantities

    Lt := max_{fj∈F} (1/t) Σ_{i=1}^t E[ |fj(Mi)|³ ],    (25)

    Kt(δ) := E[ max_{fj∈F} |fj(M1)|³ · 1{ max_{fj∈F} |fj(M1)| > δ√t / log(card(F)) } ],    (26)

    Jt(δ) := E[ max_{fj∈F} |G(fj)|³ · 1{ max_{fj∈F} |G(fj)| > δ√t / log(card(F)) } ].    (27)

The remainder of the proof consists in bounding each of these quantities, and we will
establish the following two bounds for all δ > 0,

    Lt ≤ c ν(A,B)³,    (28)

    Kt(δ) + Jt(δ) ≤ c [ (δ√t / log(d))³ + log(d) ν(A,B) ] · exp( −δ√t / (c ν(A,B) log²(d)) ).    (29)

Recall also that card(F) = 2dd', and d ≍ d' under Assumption 1.


For the moment, we set aside the task of proving these bounds, and consider the choice
of δ. There are two constraints that we would like δ to satisfy. First, we would like to
choose δ so that the bounds on Lt and (Kt (δ) + Jt (δ)) are of the same order. In particular,
we desire
Kt (δ) + Jt (δ) ≤ c ν(A, B)3 . (30)
Second, with regard to line (24), we would like δ to solve the equation

    δ = log²(d) ν(A,B)³ / (δ³ √t),    (31)

so that the second term in line (24) is of order δ. The idea is that if δ satisfies both of the
conditions (30) and (31), then the definition of the dLP metric and line (24) imply

    dLP( L(Z), L(Zt) ) ≤ c δ.

To proceed, consider the choice

    δ0 := log^{1/2}(d) ν(A,B)^{3/4} / t^{1/8},

which clearly satisfies line (31). Furthermore, it can be checked that δ0 also satisfies the
constraint (30) under Assumption 1 (a). (The details of verifying this are somewhat tedious
and are given in Lemma 16 in Appendix F.)


To finish the proof, it remains to establish the bounds (28) and (29). To handle Lt, note
that¶

    E[ |fj(Mi)|³ ] = ‖fj(M1)‖₃³
                  ≤ c ‖fj(M1)‖ψ1³    (Lemma 9)    (32)
                  ≤ c ( ‖BCjᵀAᵀ‖F² / ‖BCjᵀAᵀ‖2 )³    (Lemma 14)
                  = c ( ‖BCjᵀAᵀ‖F )³    (since ‖H‖2 = ‖H‖F when H is rank-1)
                  = c ( tr(BCjᵀAᵀACjBᵀ) )^{3/2}
                  = c ( ej1ᵀAᵀAej1 · ej2ᵀBᵀBej2 )^{3/2}
                  ≤ c ν(A,B)³,    (33)

which proves the claimed bound in line (28).

Next, regarding Kt(δ), let us consider the random variable

    η := max_{fj∈F} |fj(M1)|.

It follows from Lemma 9 (part 4) and Lemma 13 in Appendix F that Kt(δ) can be bounded
in terms of the Orlicz norm ‖η‖ψ1,

    Kt(δ) ≤ c [ (δ√t / log(card(F)))³ + ‖η‖ψ1 ] · exp( −δ√t / (‖η‖ψ1 log(card(F))) ).

To handle ‖η‖ψ1, it follows from Lemma 9 (part 3) that

    ‖η‖ψ1 ≤ c log(card(F)) · max_{fj∈F} ‖fj(M1)‖ψ1.    (34)

Furthermore, due to the earlier calculation starting at line (32) above,

    ‖fj(M1)‖ψ1 ≤ c ν(A,B).    (35)

Combining the last few steps, we conclude that

    Kt(δ) ≤ c [ (δ√t / log(card(F)))³ + log(card(F)) ν(A,B) ] · exp( −δ√t / (c ν(A,B) log²(card(F))) ).    (36)

Lastly, we turn to bounding Jt(δ). Fortunately, much of the argument for bounding
Kt(δ) can be carried over. Specifically, consider the random variable

    ζ := max_{fj∈F} |G(fj)|.



¶ In this step, we use the assumption that ‖√t · Si,j‖ψ2 ≤ c for all i and j.


Lemma 13 in Appendix F shows that Jt(δ) can be bounded in terms of ‖ζ‖ψ1,

    Jt(δ) ≤ c [ (δ√t / log(card(F)))³ + ‖ζ‖ψ1 ] · exp( −δ√t / (‖ζ‖ψ1 log(card(F))) ).

Proceeding in a way that is similar to the bound for Kt(δ), it follows from part (3) of
Lemma 9 that

    ‖ζ‖ψ1 ≤ c log(card(F)) · max_{fj∈F} ‖G(fj)‖ψ1.

Furthermore, for every fj ∈ F, the facts in Lemma 9 imply

    ‖G(fj)‖ψ1 ≤ c ‖G(fj)‖ψ2
              ≤ c √( var(G(fj)) )
              = c √( var(fj(D1)) )    (by definition of G)
              ≤ c ‖fj(D1)‖₂
              ≤ c ‖fj(D1)‖ψ1    (37)
              ≤ c ‖fj(D1) − E[fj(D1)]‖ψ1 + c |E[fj(D1)]|
              = c ‖fj(M1)‖ψ1 + c |tr(BCjᵀA)|
              ≤ c ν(A,B),    (38)

where the last step follows from the bounds (32) through (33), and the fact that
|tr(BCjᵀA)| ≤ ν(A,B). Consequently, up to a constant factor, Jt(δ) satisfies the same
bound as Kt(δ) given in line (36), and this proves the claim in line (29).

Appendix C. Proof of Proposition 4, part (a)


We will show there is a set of “good” sketching matrices 𝒮 ⊂ R^{t×n} with the following two
properties. First, a randomly drawn sketching matrix S is likely to fall in 𝒮. Namely,

    P(S ∈ 𝒮) ≥ 1 − 1/t.    (39)

Second, whenever the event {S ∈ 𝒮} occurs, we have the following bound for any δ > 0
and any Borel set A ⊂ R,

    P( max_{fj∈F} Gt*(fj) ∈ A | S ) ≤ P( max_{fj∈F} G(fj) ∈ A^δ ) + c ν(A,B) log(card(F)) / (δ t^{1/4}).    (40)

If we set δ to the particular choice δ0 := t^{−1/8} √( ν(A,B) · log(card(F)) ), then δ0 solves the
equation

    δ0 = ν(A,B) log(card(F)) / (δ0 t^{1/4}).


Consequently, by the definition of the dLP metric, this implies that whenever the event
{S ∈ 𝒮} occurs, we have

    dLP( L(Zt*|S), L(Z) ) ≤ c t^{−1/8} √( ν(A,B) · log(card(F)) ),    (41)

and this implies the statement of Proposition 4.


To proceed with the main argument of constructing 𝒮 and demonstrating the two
properties (39) and (40), it is helpful to think of Gt* (conditionally on S) and G as Gaussian
vectors of dimension card(F) = 2dd'. From this point of view, we can compare the maxima
of these vectors using a result due to Chernozhukov et al. (2016, Theorem 3.2). Under our
assumptions, this result implies that for any realization of S, any number δ > 0, and any
Borel set A ⊂ R, we have

    P( max_{fj∈F} Gt*(fj) ∈ A | S ) ≤ P( max_{fj∈F} G(fj) ∈ A^δ ) + c √(∆t(S)) · log(card(F)) / δ,

where we define the following function of S,

    ∆t(S) := max_{(fj,fk)∈F×F} | E[ Gt*(fj) Gt*(fk) | S ] − E[ G(fj) G(fk) ] |.    (42)

When referencing Theorem 3.2 from the paper Chernozhukov et al. (2016), note that
E[G(fj)] = 0 and E[Gt*(fj)|S] = 0 for all fj ∈ F. To interpret ∆t(S), it may be viewed as
the ℓ∞-distance between the covariance matrices associated with Gt* (conditionally on S)
and G.
Using the above notation, we define the set of sketching matrices 𝒮 ⊂ R^{t×n} according
to

    S ∈ 𝒮 if and only if ∆t(S) ≤ (c/√t) · ν(A,B)² · log(card(F)).    (43)

Based on this definition, it is simple to check that the proof is reduced to showing that
the event {S ∈ 𝒮} occurs with probability at least 1 − 1/t − 1/(dd'). This is guaranteed by the
lemma below.

Lemma 5 Suppose Assumption 1 (a) holds. Then, the event

    ∆t(S) ≤ (c/√t) · ν(A,B)² · log(card(F))

occurs with probability at least 1 − 1/t − 1/(dd').

Proof We begin by bounding ∆t(S) with two other quantities (to be denoted ∆'t(S),
∆''t(S)) that are easier to bound. Using the fact that (1/t) Σ_{i=1}^t Di = AᵀSᵀSB, it can be
checked that

    E[ Gt*(fj) Gt*(fk) | S ] = (1/t) Σ_{i=1}^t fj(Di) fk(Di) − [ (1/t) Σ_{i=1}^t fj(Di) ] · [ (1/t) Σ_{i=1}^t fk(Di) ].

Similarly, recall from line (22) that

    E[ G(fj) G(fk) ] = E[ fj(D1) fk(D1) ] − E[fj(D1)] · E[fk(D1)].


From looking at the last two lines, it is natural to define the following zero-mean random
variables for any triple (i, j, k),‖

    Qi,j,k := fj(Di) fk(Di) − E[ fj(Di) fk(Di) ],

and

    Rt,j := (1/t) Σ_{i=1}^t ( fj(Di) − E[fj(Di)] ).

Then, some algebra shows that

    E[ Gt*(fj) Gt*(fk) | S ] − E[ G(fj) G(fk) ] = (1/t) Σ_{i=1}^t Qi,j,k − Rt,j · Rt,k
                                                 − E[fj(D1)] · Rt,k − E[fk(D1)] · Rt,j.

So, if we define the quantities

    ∆'t(S) := max_{(j,k)∈J×J} | (1/t) Σ_{i=1}^t Qi,j,k |,

    ∆''t(S) := max_{j∈J} | Rt,j |,

then

    ∆t(S) ≤ ∆'t(S) + ∆''t(S)² + 2 ν(A,B) · ∆''t(S),

where we have made use of the simple bound |E[fj(D1)]| ≤ ‖AᵀB‖∞ ≤ ν(A,B). The
following lemma establishes tail bounds for ∆'t(S) and ∆''t(S), which lead to the statement
of Lemma 5, and hence of Proposition 4.

Lemma 6 Suppose Assumption 1 (a) holds. Then, the event

    ∆'t(S) ≤ (c/√t) · ν(A,B)² · log(card(F))    (i)

occurs with probability at least 1 − 1/t, and the event

    ∆''t(S) ≤ (c/√t) · ν(A,B) · √( log(card(F)) )    (ii)

occurs with probability at least 1 − 1/(dd').
‖ Note that Qi,j,k is a multivariate polynomial of degree 4 in the variables Si,j, and so techniques based
on moment generating functions, like Chernoff bounds, are not generally applicable to controlling Qi,j,k. For
instance, if X ∼ N(0, 1), then the variable X⁴ does not have a moment generating function. Handling this
obstacle is a notable aspect of our analysis.


Proof of Lemma 6 (i). Let p > 2. Due to part (3) of Lemma 9 in Appendix F, we have

    ‖∆'t(S)‖p ≤ ( card(F)² )^{1/p} · max_{(j,k)∈J×J} ‖ (1/t) Σ_{i=1}^t Qi,j,k ‖p.    (44)

Note that each variable Qi,j,k has moments of all orders, and when j and k are held fixed,
the sequence {Qi,j,k}_{1≤i≤t} is i.i.d. For this reason, it is natural to use Rosenthal's inequality
to bound the Lp norm on the right side of the previous line. Specifically, the version of
Rosenthal's inequality∗∗ stated in Lemma 10 in Appendix F leads to

    ‖ (1/t) Σ_{i=1}^t Qi,j,k ‖p ≤ c · (p/log(p)) · (1/t) · max{ ‖ Σ_{i=1}^t Qi,j,k ‖₂ , ( Σ_{i=1}^t ‖Qi,j,k‖p^p )^{1/p} }.    (45)

The L2 norm on the right side of Rosenthal's inequality (45) satisfies the bound

    ‖ Σ_{i=1}^t Qi,j,k ‖₂ = √( var( Σ_{i=1}^t Qi,j,k ) )
                         = √( t · var(Q1,j,k) )
                         = √( t · var( fj(D1) fk(D1) ) )
                         ≤ √t · ‖ fj(D1) fk(D1) ‖₂
                         ≤ √t · ‖fj(D1)‖₄ · ‖fk(D1)‖₄    (Cauchy-Schwarz)
                         ≤ c √t · ‖fj(D1)‖ψ1 · ‖fk(D1)‖ψ1    (Lemma 9)
                         ≤ c √t · ν(A,B)²,

where the last step follows from the fact

    ‖fj(D1)‖ψ1 ≤ c ν(A,B),    (46)

obtained in the bounds (32) through (33).


Next, to handle the Lp norms in the bound (45), observe that

    ‖Q1,j,k‖p ≤ ‖ fj(D1) fk(D1) ‖p + | E[ fj(D1) fk(D1) ] |
              ≤ 2 ‖fj(D1)‖_{2p} · ‖fk(D1)‖_{2p}    (Cauchy-Schwarz)
              ≤ c p² ‖fj(D1)‖ψ1 · ‖fk(D1)‖ψ1    (Lemma 9 in Appendix F)
              ≤ c p² ν(A,B)²    (inequality (46)).

Hence, the second term in the Rosenthal bound (45) satisfies

    ( Σ_{i=1}^t ‖Qi,j,k‖p^p )^{1/p} ≤ c · p² · t^{1/p} · ν(A,B)²,
∗∗
Here we are using the version of Rosenthal’s inequality with the optimal dependence on p. It is a
notable aspect of our argument that it makes essential use of this scaling in p.


and as long as the first term in the Rosenthal bound dominates††, i.e.

    p² t^{1/p} ≲ t^{1/2},    (47)

then we conclude that for any j and k,

    ‖ (1/t) Σ_{i=1}^t Qi,j,k ‖p ≤ c · (p/log(p)) · ν(A,B)² / √t.

Since the previous bound does not depend on j or k, combining it with the first step in
line (44) leads to

    ‖∆'t(S)‖p ≤ c · (p/log(p)) · card(F)^{2/p} · ν(A,B)² / √t.

Next, we convert this norm bound into a tail bound. Specifically, if we consider the value

    xp := c · (p/log(p)) · card(F)^{2/p} · ( ν(A,B)² / √t ) · t^{1/p},

then Markov's inequality gives

    P( ∆'t(S) ≥ xp ) ≤ ‖∆'t(S)‖p^p / xp^p ≤ 1/t.

Considering the choice of p given by

    p = log(card(F)),

and noting that card(F)^{1/p} = e, it follows that under this choice of p,

    xp ≤ ( c · ν(A,B)² · log(card(F)) / √t ) · ( t^{1/p} / log(p) ).

Moreover, as long as t ≲ card(F)^κ for some absolute constant κ ≥ 1 (which holds under
Assumption 1), then the last factor on the right satisfies

    t^{1/p} / log(p) ≤ ( card(F)^{1/p} )^κ / log(p) = e^κ / log(log(card(F))) ≲ 1.

So, combining the last few steps, there is an absolute constant c such that

    P( ∆'t(S) ≥ c · ν(A,B)² · log(card(F)) / √t ) ≤ 1/t,

as needed.
†† Under the choice of p = log(card(F)) = log(2dd') that will be made at the end of this argument, it is
straightforward to check that the condition (47) holds under Assumption 1.


Proof of Lemma 6 (ii). Note that for each i ∈ [t] and j ∈ J, we have

    fj(Di) − E[fj(Di)] = fj(Mi) = siᵀ(BCjᵀAᵀ)si − tr(BCjᵀAᵀ),    (48)

which is a centered sub-Gaussian quadratic form. Due to the bound (35), we have

    ‖ fj(Di) − E[fj(Di)] ‖ψ1 ≤ c ν(A,B).    (49)

Furthermore, this can be combined with a standard concentration bound for sums of
independent sub-exponential random variables (Lemma 12) to show that for any r ≥ 0,

    P( | (1/t) Σ_{i=1}^t ( fj(Di) − E[fj(Di)] ) | ≥ r ν(A,B) ) ≤ 2 exp( −c · t · min(r², r) ).    (50)

Hence, taking a union bound over all j gives

    P( ∆''t(S) ≥ r ν(A,B) ) ≤ 2 exp( log(card(F)) − c · t · min(r², r) ).    (51)

Regarding the choice of r, note that by Assumption 1, we have √(log(card(F))) / √t ≲ 1. It
follows that there is a sufficiently large absolute constant c1 > 0 such that if we put

    r = (c1/√t) · √( log(card(F)) ),

then

    c t min(r², r) ≥ 2 log(card(F)),

where c is the same as in the bound (51). In turn, this implies

    P( ∆''t(S) ≥ (c1/√t) · √( log(card(F)) ) · ν(A,B) ) ≤ 2 exp( −log(card(F)) ) = 1/(dd'),    (52)

as desired.

Appendix D. Proof of Propositions 3 and 4 in case (b) (length sampling)


In order to carry out the proof of Propositions 3 and 4 in the case of length sampling
(Assumption 1 (b)), there are only two bounds that need to be updated. Namely, we
must derive new bounds on ‖fj(D1) − E[fj(D1)]‖ψ1 and ‖fj(D1)‖ψ1 in order to account for
the new distributional assumptions in case (b). Both of the new bounds will turn out to be
of order ‖A‖F ‖B‖F, and consequently, the result of the propositions in case (b) will have
the same form as in case (a), but with ‖A‖F ‖B‖F replacing ν(A,B).
To derive the bound on ‖fj(D1) − E[fj(D1)]‖ψ1, first note that

    |E[fj(D1)]| = |tr(CjᵀAᵀB)| ≤ ‖A‖F ‖B‖F.

Consequently,

    ‖ fj(D1) − E[fj(D1)] ‖ψ1 ≤ ‖fj(D1)‖ψ1 + ‖ E[fj(D1)] ‖ψ1 ≤ ‖fj(D1)‖ψ1 + c ‖A‖F ‖B‖F.    (53)

Hence, it remains to show that ‖fj(D1)‖ψ1 ≤ c ‖A‖F ‖B‖F, which is the content of Lemma 7
below.

Lemma 7 If S is generated by length sampling with the probabilities in line (5), then for
any j ∈ J, we have the bound

    ‖fj(D1)‖ψ1 ≤ 2 ‖A‖F ‖B‖F.    (54)

Proof By the definition of the ψ1-Orlicz norm, it suffices to find a value of r > 0 so that
E[ exp( |fj(D1)|/r ) ] is at most 2. Due to the Cauchy-Schwarz inequality, the non-zero
length-sampling probabilities pl satisfy

    1/pl ≤ √( Σ_{j=1}^n ‖ejᵀA‖₂² ) · √( Σ_{j=1}^n ‖ejᵀB‖₂² ) / ( ‖elᵀA‖₂ ‖elᵀB‖₂ ) = ‖A‖F ‖B‖F / ( ‖elᵀA‖₂ ‖elᵀB‖₂ ).

Consequently, for each r > 0 we have

    E[ exp( |fj(D1)| / r ) ] = Σ_{l∈[n]: pl>0} pl · exp( (1/r) | fj( (1/pl) Aᵀ el elᵀ B ) | )
                             ≤ max_{l∈[n]: pl>0} exp( (1/(r pl)) | fj( Aᵀ el elᵀ B ) | )
                             ≤ max_{l∈[n]} exp( (1/r) ‖A‖F ‖B‖F · | (elᵀB / ‖elᵀB‖₂) Cjᵀ (Aᵀel / ‖elᵀA‖₂) | )
                             ≤ exp( (1/r) ‖A‖F ‖B‖F ‖Cj‖₂ )
                             = exp( (1/r) ‖A‖F ‖B‖F ).

Hence, if we take r = 2 ‖A‖F ‖B‖F, then the right hand side is at most e^{1/2} ≤ 2.
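For readers who prefer code, the following Python sketch builds a length-sampling sketch with probabilities proportional to ‖elᵀA‖₂ ‖elᵀB‖₂, as used in the proof above, and returns the resulting estimate of AᵀB. The function name and the specific rescaling convention are our own illustrative choices.

    import numpy as np

    def length_sampling_sketch(A, B, t, seed=0):
        """Illustrative length-sampling sketch of A^T B (names are ours).

        Row l is sampled with probability p_l proportional to
        ||e_l^T A||_2 * ||e_l^T B||_2, and each sampled row is rescaled
        by 1/sqrt(t * p_l) so the product estimate is unbiased.
        """
        rng = np.random.default_rng(seed)
        n = A.shape[0]
        row_norms = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
        p = row_norms / row_norms.sum()
        idx = rng.choice(n, size=t, replace=True, p=p)
        scale = 1.0 / np.sqrt(t * p[idx])        # rescaling of the sampled rows
        A_sketch = scale[:, None] * A[idx]       # rows of S A
        B_sketch = scale[:, None] * B[idx]       # rows of S B
        return A_sketch.T @ B_sketch             # estimate of A^T B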

Appendix E. Proof of Propositions 3 and 4 in case (c) (SRHT)


The steps needed to extend the propositions in the case of SRHT matrices follow the same
pattern as in case (b). However, there is a small subtlety insofar as all of the analysis is
done conditionally on the matrix of signs D◦n in the product S = P (1/√n) Hn D◦n. Hence, it
suffices to bound the ψ1-Orlicz norm of fj(D1) conditionally on D◦n, as well as the conditional
expectation |E[fj(D1)|D◦n]|. Regarding the conditional expectation, it can be checked that
E[SᵀS|D◦n] = In, and it follows that

    |E[fj(D1)|D◦n]| = |tr(BCjᵀAᵀ)| ≤ ν(A,B).

Since we are not aware of a standard notation for a conditional Orlicz norm, we define

    ‖ fj(D1) | D◦n ‖ψ1 := inf{ r > 0 : E[ ψ1( |fj(D1)| / r ) | D◦n ] ≤ 1 },

which is a random variable, since it is a function of D◦n. The following lemma provides
a bound on this quantity, which turns out to be of order log(n) ν(A,B). For this reason,
the SRHT case (c) of Propositions 3 and 4 will have the same form as case (a), but with
log(n) ν(A,B) replacing ν(A,B).

Lemma 8 If S is an SRHT matrix, then the following bound holds with probability at least
1 − c/n,

    ‖ fj(D1) | D◦n ‖ψ1 ≤ c · log(n) · ν(A,B).    (55)

Proof By the definition of the conditional ψ1-Orlicz norm, it suffices to find a value of
r > 0 so that E[ exp( |fj(D1)|/r ) | D◦n ] is at most 2 (with the stated probability).

For an SRHT matrix S = P (1/√n) Hn D◦n, recall that the rows of √t·P are sampled uniformly
at random from the set { √n e1, . . . , √n en }. It follows that

    E[ exp( |fj(D1)| / r ) | D◦n ] = E[ exp( (1/r) tr( CjᵀAᵀ s1 s1ᵀ B ) ) | D◦n ]
                                   = (1/n) Σ_{l=1}^n exp( (1/r) elᵀ Hn D◦n B Cjᵀ Aᵀ D◦n Hnᵀ el ).

Next, let εl ∈ R^n be the lth row of Hn D◦n, which gives

    E[ exp( |fj(D1)| / r ) | D◦n ] = (1/n) Σ_{l=1}^n exp( εlᵀ(BCjᵀAᵀ)εl / r )
                                   ≤ exp( max_{l∈[n]} εlᵀ(BCjᵀAᵀ)εl / r ).    (56)

Recalling that D◦n = diag(ε), where ε ∈ R^n is a vector of i.i.d. Rademacher variables, and
that all entries of Hn are ±1, it follows that εl has the same distribution as ε for each l ∈ [n].
Consequently, each quadratic form εlᵀ(BCjᵀAᵀ)εl concentrates around tr(BCjᵀAᵀ), and
we can use a union bound to control the maximum of these quadratic forms. Note also
that the matrix BCjᵀAᵀ is rank-1, and so ‖BCjᵀAᵀ‖₂² = ‖BCjᵀAᵀ‖F². Hence, by choosing
the parameter u to be proportional to log(n)·‖BCjᵀAᵀ‖F in the Hanson-Wright inequality
(Lemma 11), and using a union bound, there is an absolute constant c > 0 such that

    P( max_{l∈[n]} εlᵀ(BCjᵀAᵀ)εl ≥ tr(BCjᵀAᵀ) + c log(n) ‖BCjᵀAᵀ‖F ) ≤ c/n.    (57)

Furthermore, noting that tr(BCjᵀAᵀ) and ‖BCjᵀAᵀ‖F are both at most ν(A,B), we have

    P( max_{l∈[n]} εlᵀ(BCjᵀAᵀ)εl ≥ 2c log(n) ν(A,B) ) ≤ c/n.    (58)


Finally, this means that if we take r = 4c log(n) ν(A,B) in the bound (56), then the event

    E[ exp( |fj(D1)| / r ) | D◦n ] ≤ e^{1/2}

holds with probability at least 1 − c/n, which completes the proof, since e^{1/2} ≤ 2.
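As a companion to this argument, the Python sketch below constructs an SRHT sketch of the form S = P (1/√n) Hn D◦n and applies it to a matrix, assuming n is a power of two; the iterative fast Walsh-Hadamard transform and the sampling/rescaling convention shown here are our own illustrative choices.

    import numpy as np

    def srht_sketch(A, t, seed=0):
        """Apply a minimal SRHT sketch S = P (1/sqrt(n)) H_n D_n to A (illustrative).

        Assumes n = A.shape[0] is a power of two; P uniformly samples t rows
        and rescales by sqrt(n/t) so that E[S^T S] = I_n under this convention.
        """
        rng = np.random.default_rng(seed)
        n = A.shape[0]
        assert n & (n - 1) == 0, "n must be a power of two for H_n"
        signs = rng.choice([-1.0, 1.0], size=n)      # diagonal of D_n
        X = signs[:, None] * A                       # D_n A
        # In-place fast Walsh-Hadamard transform along the row-index dimension
        h = 1
        while h < n:
            for i in range(0, n, 2 * h):
                a, b = X[i:i + h].copy(), X[i + h:i + 2 * h].copy()
                X[i:i + h], X[i + h:i + 2 * h] = a + b, a - b
            h *= 2
        X /= np.sqrt(n)                              # (1/sqrt(n)) H_n D_n A
        idx = rng.integers(0, n, size=t)             # uniform row sampling (P)
        return np.sqrt(n / t) * X[idx]               # sketched matrix S A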

Appendix F. Technical Lemmas


Lemma 9 (Facts about Orlicz norms) Orlicz norms have the following properties,
where c, c1 , and c2 are positive absolute constants.

1. For any random variable X, and any p ≥ 1,

    ‖X‖p ≤ c p ‖X‖ψ1,    (59)

    ‖X‖p ≤ c √p ‖X‖ψ2,    (60)

    ‖X‖ψ1 ≤ c ‖X‖ψ2.    (61)

2. If X ∼ N (0, σ 2 ), then kXkψ2 ≤ c σ.

3. Let p ≥ 1. For any sequence of random variables X1, . . . , Xd,

    ‖ max_{1≤j≤d} Xj ‖p ≤ d^{1/p} max_{1≤j≤d} ‖Xj‖p

and

    ‖ max_{1≤j≤d} Xj ‖ψ1 ≤ c log(d) max_{1≤j≤d} ‖Xj‖ψ1.

4. Let X be any random variable. Then, for any x > 0 and p ≥ 1, we have

    P( |X| ≥ x ) ≤ ( ‖X‖p / x )^p,

and

    P( |X| ≥ x ) ≤ c1 exp( −c2 x / ‖X‖ψ1 ).


Proof In part 1, line (59) follows from line 5.11 of Vershynin (2012), line (60) follows
from definition 5.13 of Vershynin (2012), and line (61) follows from p.94 of van der Vaart
and Wellner (1996). Next, part 2 follows from the definition of the ψ2 -Orlicz norm and
the moment generating function for N (0, σ 2 ). Part 3 is due to Lemma 2.2.2 of van der
Vaart and Wellner (1996). Lastly, part 4 follows from Markov’s inequality and line 5.14
of Vershynin (2012).


Lemma 10 (Rosenthal's inequality with best constants) Fix any number p > 2.
Let Y1, . . . , Yt be independent random variables with E[Yi] = 0 and E[|Yi|^p] < ∞ for all
1 ≤ i ≤ t. Then,

    ‖ Σ_{i=1}^t Yi ‖p ≤ c (p / log(p)) · max{ ‖ Σ_{i=1}^t Yi ‖₂ , ( Σ_{i=1}^t ‖Yi‖p^p )^{1/p} }.    (62)

Proof See the paper Johnson et al. (1985). The statement above differs slightly from
Theorem 4.1 in that paper, which requires symmetric random variables, but the remark on
p. 247 of that paper explains why the variables Y1, . . . , Yt need not be symmetric as long
as they have mean 0.

Lemma 11 (Hanson-Wright inequality) Let x = (X1, . . . , Xn) be a vector of independent
sub-Gaussian random variables with E[Xj] = 0 and ‖Xj‖ψ2 ≤ κ for all 1 ≤ j ≤ n.
Also, let H ∈ R^{n×n} be any fixed non-zero matrix. Then, there is an absolute constant c > 0
such that for any u ≥ 0,

    P( | xᵀHx − E[xᵀHx] | ≥ u ) ≤ 2 exp( −c · min{ u² / (κ⁴ ‖H‖F²) , u / (κ² ‖H‖₂) } ).    (63)
F

Proof See the paper Rudelson and Vershynin (2013).

Lemma 12 (Bernstein inequality for sub-exponential variables) Let Y1, . . . , Yt be
independent random variables with E[Yi] = 0 and ‖Yi‖ψ1 ≤ κ for all 1 ≤ i ≤ t. Then,
there is an absolute constant c > 0, such that for any u ≥ 0,

    P( | (1/t) Σ_{i=1}^t Yi | ≥ κ u ) ≤ 2 exp( −c · t · min(u², u) ).    (64)

Proof See Proposition 16 in Vershynin (2012).

Lemma 13 ((Chernozhukov et al., 2016)) If η is a non-negative random variable, and
there are numbers a, b > 0 such that

    P(η > x) ≤ a e^{−x/b}

for all x > 0, then the following bound holds for all r > 0,

    E[ η³ · 1{η > r} ] ≤ 6a (r + b)³ e^{−r/b}.

Proof See Lemma 6.6 in Chernozhukov et al. (2016).


Remark. The following lemma may be of independent interest, since it provides an


explicit bound on the ψ1 -Orlicz norm of a centered sub-Gaussian quadratic form. Although
this bound follows from the Hanson-Wright inequality, we have not seen it stated in the
literature.

Lemma 14 Let x = (X1, . . . , Xn) be independent random variables satisfying E[Xj] = 0
and ‖Xj‖ψ2 ≤ κ for all 1 ≤ j ≤ n. Also, let H ∈ R^{n×n} be a non-zero fixed matrix. Then,
there is an absolute constant c > 0 such that

    ‖ xᵀHx − E[xᵀHx] ‖ψ1 ≤ c κ² ‖H‖F² / ‖H‖₂.

Proof Define the random variable Q := xᵀHx − E[xᵀHx]. By the definition of the ψ1-
Orlicz norm, it suffices to find a value r > 0 such that E[exp(|Q|/r)] ≤ 2. Using the
tail-sum formula, and the change of variable v = e^{u/r}, we have

    E[exp(|Q|/r)] ≤ 1 + ∫₁^∞ P( exp(|Q|/r) > v ) dv
                  = 1 + (1/r) ∫₀^∞ P( |Q| > u ) · e^{u/r} du.

Next, we employ the Hanson-Wright inequality (Lemma 11). By considering the “threshold”
u* := κ² ‖H‖F² / ‖H‖₂, it is helpful to note that the quantities in the exponent of the Hanson-Wright
inequality satisfy u² / (κ⁴ ‖H‖F²) ≤ u / (κ² ‖H‖₂) if and only if u ≤ u*. Hence,

    E[exp(|Q|/r)] ≤ 1 + (1/r) ∫₀^∞ exp( −c min{ u² / (κ⁴ ‖H‖F²) , u / (κ² ‖H‖₂) } ) · e^{u/r} du
                  ≤ 1 + (1/r) ∫₀^{u*} e^{u/r} du + (1/r) ∫_{u*}^∞ exp( −u ( c/(κ² ‖H‖₂) − 1/r ) ) du.

Evaluating the last two integrals directly, if we let C' := c/(κ² ‖H‖₂) − 1/r and choose r so that
C' > 0, then

    E[exp(|Q|/r)] ≤ e^{u*/r} + (1/(rC')) e^{−u* C'}
                  ≤ e^{u*/r} + 1 / ( c r/(κ² ‖H‖₂) − 1 ).

Note that the condition C' > 0 means that it is necessary to have r > (1/c) κ² ‖H‖₂. To finish
the argument, we further require that r is large enough so that (say)

    u*/r ≤ 1/4 and c r/(κ² ‖H‖₂) ≥ 3,    (65)

which ensures

    E[exp(|Q|/r)] ≤ e^{1/4} + 1/2 < 2,

as desired. Note that the constraints (65) are the same as

    r ≥ 4 κ² ‖H‖F² / ‖H‖₂ and r ≥ (3/c) κ² ‖H‖₂.

Due to the basic fact that ‖H‖₂ ≤ ‖H‖F for all matrices H, it follows that whenever
r ≥ max(4, 3/c) · κ² ‖H‖F² / ‖H‖₂, we have E[exp(|Q|/r)] < 2.
 

Remark. The following lemma is a basic fact about the dLP metric, but may not be
widely known, and so we give a proof. Recall also that we use the generalized inverse
FV−1 (α) := inf{z ∈ R | FV (z) ≥ α}, where FV denotes the c.d.f. of V .

Lemma 15 Fix α ∈ (0, 1/2) and suppose there is some ε ∈ (0, α) such that random
variables U and V satisfy

    dLP( L(U), L(V) ) ≤ ε.

Then, the quantiles of U and V satisfy

    | F_U^{−1}(1 − α) − F_V^{−1}(1 − α) | ≤ ψα(ε),    (66)

where the right side is defined as

    ψα(ε) := F_U^{−1}(1 − α + ε) − F_U^{−1}(1 − α − ε) + ε.

Proof Consider the Lévy metric, defined as

    dL( L(U), L(V) ) := inf{ ε > 0 | F_U(x − ε) − ε ≤ F_V(x) ≤ F_U(x + ε) + ε for all x ∈ R }.

It is a fact that this metric is always dominated by the dLP metric in the sense that

    dL( L(U), L(V) ) ≤ dLP( L(U), L(V) ),

for all scalar random variables U and V (Huber and Ronchetti, 2009, p. 36). Based on the
definition of the dL metric, it is straightforward to check that the following inequalities hold
under the assumption of the lemma,

    F_U^{−1}(1 − α − ε) − ε ≤ F_V^{−1}(1 − α) ≤ F_U^{−1}(1 − α + ε) + ε.

(Specifically, consider the choices x = F_V^{−1}(1 − α) and x = F_U^{−1}(1 − α + ε) + ε.) Next,
if we subtract F_U^{−1}(1 − α) from each side of the inequalities above, and note that F_U^{−1}(·)
is non-decreasing, it follows that if we put a = F_U^{−1}(1 − α + ε) − F_U^{−1}(1 − α) + ε and
b = F_U^{−1}(1 − α) − F_U^{−1}(1 − α − ε) + ε, then

    | F_V^{−1}(1 − α) − F_U^{−1}(1 − α) | ≤ max{a, b} ≤ ψα(ε),

as needed.

Lemma 16 Under Assumption 1 (a), the quantity

    δ0 = t^{−1/8} log^{1/2}(d) ν(A,B)^{3/4}

satisfies conditions (30) and (31).


Proof Consider the number

    δ1(r) := ( log²(d) ν(A,B) / √t ) · r,

where r ≥ 1 is a free parameter to be adjusted. Based on the bound (29), it is easy to check
that plugging δ1(r) into Kt(·) and Jt(·) leads to

    Kt(δ1(r)) + Jt(δ1(r)) ≤ c · ν(A,B)³ · log(d)³ · r³ · exp(−r/c),

and if we take r ≥ c log(log(d)⁴), then

    Kt(δ1(r)) + Jt(δ1(r)) ≤ c ν(A,B)³,

as desired in (30). Hence, as long as there is a choice of r satisfying

    r ≥ c log(log(d)⁴) and δ1(r) = δ0,

then δ1(r) will satisfy both of the desired constraints (30) and (31). Solving the equation
δ1(r) = δ0 gives

    r = t^{3/8} · log^{−3/2}(d) · ν(A,B)^{−1/4},

and then the condition r ≥ c log(log(d)⁴) is the same as

    t ≥ ( c ν(A,B)^{1/4} log(d)^{3/2} · log(log(d)⁴) )^{8/3}
      = c ν(A,B)^{2/3} log(d)⁴ · ( log(log(d)⁴) )^{8/3},    (67)

which holds under Assumption 1 (a).

References
N. Ailon and B. Chazelle. Approximate nearest neighbors and the fast Johnson-
Lindenstrauss transform. In Annual ACM Symposium on Theory of Computing (STOC),
2006.

N. Ailon and E. Liberty. Fast dimension reduction using Rademacher series on dual BCH
codes. Discrete & Computational Geometry, 42(4):615–630, 2009.

S. Ar, M. Blum, B. Codenotti, and P. Gemmell. Checking approximate computations over


the reals. In Annual ACM Symposium on Theory of Computing (STOC), 1993.

H. Avron, P. Maymounkov, and S. Toledo. Blendenpik: Supercharging lapack’s least-squares


solver. SIAM Journal on Scientific Computing, 32(3):1217–1236, 2010.

C. Boutsidis and A. Gittens. Improved matrix algorithms via the subsampled randomized
hadamard transform. SIAM Journal on Matrix Analysis and Applications, 34(3):1301–
1340, 2013.


C. Brezinski and M. R. Zaglia. Extrapolation methods: theory and practice. Elsevier, 2013.

C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM
Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011. URL http:
//www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.

J. Chang, W. Zhou, W.-X. Zhou, and L. Wang. Comparing large covariance matrices
under weak conditions on the dependence structure and its application to gene clustering.
Biometrics, 2016.

X. Chen. Gaussian and bootstrap approximations for high-dimensional u-statistics and


their applications. The Annals of Statistics, 46(2):642–678, 2018.

V. Chernozhukov, D. Chetverikov, and K. Kato. Gaussian approximations and multiplier


bootstrap for maxima of sums of high-dimensional random vectors. The Annals of
Statistics, 41(6):2786–2819, 2013.

V. Chernozhukov, D. Chetverikov, and K. Kato. Gaussian approximation of suprema of


empirical processes. The Annals of Statistics, 42(4):1564–1597, 2014.

V. Chernozhukov, D. Chetverikov, and K. Kato. Comparison and anti-concentration bounds


for maxima of Gaussian random vectors. Probability Theory and Related Fields, 162(1-2):
47–70, 2015.

V. Chernozhukov, D. Chetverikov, and K. Kato. Empirical and multiplier bootstraps for


suprema of empirical processes of increasing complexity, and related Gaussian couplings.
Stochastic Processes and their Applications, 2016.

V. Chernozhukov, D. Chetverikov, and K. Kato. Central limit theorems and bootstrap in


high dimensions. The Annals of Probability, 45(4):2309–2352, 2017.

K. L. Clarkson and D. P. Woodruff. Low rank approximation and regression in input


sparsity time. In Annual ACM Symposium on theory of computing (STOC), 2013.

G. Dasarathy, P. Shah, B. Narayan Bhaskar, and R. D. Nowak. Sketching sparse matrices,


covariances, and graphs via tensor products. IEEE Transactions on Information Theory,
61(3):1373–1388, 2015.

J. Demmel, I. Dumitriu, O. Holtz, and R. Kleinberg. Fast matrix multiplication is stable.


Numerische Mathematik, 106(2):199–224, 2007.

J. D. Dixon. Estimating extremal eigenvalues and condition numbers of matrices. SIAM


Journal on Numerical Analysis, 20(4):812–814, 1983.

P. Drineas and M. W. Mahoney. RandNLA: randomized numerical linear algebra.


Communications of the ACM, 59(6):80–90, 2016.

P. Drineas, R. Kannan, and M. W. Mahoney. Fast Monte Carlo algorithms for matrices
I: Approximating matrix multiplication. SIAM Journal on Computing, 36(1):132–157,
2006a.


P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Sampling algorithms for `2 regression


and applications. In Annual ACM-SIAM Symposium on Discrete Algorithm (SODA),
2006b.

P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix


decompositions. SIAM Journal on Matrix Analysis and Applications, 30(2):844–881,
September 2008.

P. Drineas, M. W. Mahoney, S. Muthukrishnan, and T. Sarlós. Faster least squares


approximation. Numerische Mathematik, 117(2):219–249, 2011.

P. Drineas, M. Magdon-Ismail, M. W. Mahoney, and D. P. Woodruff. Fast approximation


of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13:
3475–3506, 2012.

A. Frank and A. Asuncion. UCI machine learning repository, 2010. URL
http://archive.ics.uci.edu/ml.

R. Freivalds. Fast probabilistic algorithms. Mathematical Foundations of Computer Science,


pages 57–69, 1979.

N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness:


probabilistic algorithms for constructing approximate matrix decompositions. SIAM
Review, 53(2):217–288, 2011.

N. J. Higham. Accuracy and stability of numerical algorithms. SIAM, 2002.

P. J. Huber and E. M. Ronchetti. Robust Statistics. Wiley, 2009.

W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space.


Contemporary mathematics, 26(189-206), 1984.

W. B. Johnson, G. Schechtman, and J. Zinn. Best constants in moment inequalities for


linear combinations of independent and exchangeable random variables. The Annals of
Probability, pages 234–253, 1985.

E. Liberty, F. Woolfe, P.-G. Martinsson, V. Rokhlin, and M. Tygert. Randomized algorithms


for the low-rank approximation of matrices. Proceedings of the National Academy of
Sciences, 104(51):20167–20172, 2007.

M. E. Lopes. Estimating the algorithmic variance of randomized ensembles via the


bootstrap. The Annals of Statistics, 47(2):1088–1112, 2019.

M. E. Lopes, Z. Lin, and H.-G. Mueller. Bootstrapping max statistics in high dimensions:
Near-parametric rates under weak variance decay and application to functional data
analysis. arXiv:1807.04429, 2018a.

M. E. Lopes, S. Wang, and M. W. Mahoney. Error estimation for randomized least-squares


algorithms via the bootstrap. In Proceedings of the 35th International Conference on
Machine Learning (ICML), 2018b.


P. Ma, M. W. Mahoney, and B. Yu. A statistical perspective on algorithmic leveraging. In


International Conference on Machine Learning (ICML), 2014.

A. Magen and A. Zouzias. Low rank matrix-valued Chernoff bounds and approximate matrix
multiplication. In Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2011.

M. W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends
in Machine Learning, 3(2):123–224, 2011.

R. Pagh. Compressed matrix multiplication. ACM Transactions on Computation Theory,


5(3):9, 2013.

M. Pilanci and M. J. Wainwright. Newton sketch: a near linear-time optimization algorithm


with linear-quadratic convergence. SIAM Journal on Optimization, 27(1):205–245, 2017.

F. Roosta-Khorasani and M. W. Mahoney. Sub-sampled Newton methods II: local


convergence rates. arXiv:1601.04738, 2016.

M. Rudelson and R. Vershynin. Hanson-Wright inequality and sub-Gaussian concentration.


Electronic Communications in Probability, 18:9 pp., 2013.

T. Sarlós. Improved approximation algorithms for large matrices via random projections.
In Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2006.

A. Sidi. Practical Extrapolation Methods: Theory and Applications. Cambridge University


Press, 2003.

A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes.
Springer, 1996.

R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In


Compressed Sensing, Theory and Applications. Cambridge University Press, 2012.

S. Wang. A practical guide to randomized matrix computations with MATLAB


implementations. arXiv:1505.07570, 2015.

D. P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends
in Theoretical Computer Science, 10(1–2):1–157, 2014.

F. Woolfe, E. Liberty, V. Rokhlin, and M. Tygert. A fast randomized algorithm for the
approximation of matrices. Applied and Computational Harmonic Analysis, 25(3):335–
366, 2008.

P. Xu, J. Yang, F. Roosta-Khorasani, C. Ré, and M. W. Mahoney. Sub-sampled Newton


methods with non-uniform sampling. In Advances in Neural Information Processing
Systems (NIPS), pages 3000–3008, 2016.

J. Yang, X. Meng, and M. W. Mahoney. Implementing randomized matrix algorithms in


parallel and distributed environments. Proceedings of the IEEE, 104(1):58–92, 2016.
