SGM With Random Features

Luigi Carratino¹    Alessandro Rudi²    Lorenzo Rosasco¹,³
luigi.carratino@dibris.unige.it    alessandro.rudi@inria.fr    lrosasco@mit.edu
Abstract
Sketching and stochastic gradient methods are arguably the most common tech-
niques to derive efficient large scale learning algorithms. In this paper, we investigate
their application in the context of nonparametric statistical learning. More precisely,
we study the estimator defined by stochastic gradient with mini-batches and random
features. The latter can be seen as a form of nonlinear sketching and can be used to define
approximate kernel methods. The considered estimator is not explicitly penalized or constrained;
regularization is implicit. Indeed, our study highlights how different parameters,
such as the number of features, iterations, step-size and mini-batch size, control the learning
properties of the solution. We do this by deriving optimal finite sample bounds,
under standard assumptions. The obtained results are corroborated and illustrated by
numerical experiments.
1 Introduction
The interplay between statistical and computational performances is key for modern machine
learning algorithms [1]. On the one hand, the ultimate goal is to achieve the best possible
prediction error. On the other hand, budgeted computational resources need to be factored
in when designing algorithms. Indeed, time and memory requirements are unavoidable
constraints, especially in large-scale problems.
In this view, stochastic gradient methods [2] and sketching techniques [3] have emerged
as fundamental algorithmic tools. Stochastic gradient methods make it possible to process data
points individually, or in small batches, keeping good convergence rates while reducing
computational complexity [4]. Sketching techniques reduce data dimensionality, hence
memory requirements, by random projections [3]. Combining the benefits of both
methods is tempting and indeed it has attracted much attention, see [5] and references
therein.
In this paper, we investigate these ideas for nonparametric learning. Within a least
squares framework, we consider an estimator defined by mini-batched stochastic gradients
and random features [6]. The latter are typically defined by nonlinear sketching: random
projections followed by a component-wise nonlinearity [3]. They can be seen as shallow
networks with random weights [7], but also as approximate kernel methods [8]. Indeed,
random features provide a standard approach to overcome the memory bottleneck that
¹ DIBRIS – Università degli Studi di Genova, Genova, Italy.
² INRIA – Département d'informatique, École Normale Supérieure – PSL Research University, Paris, France.
³ LCSL – Istituto Italiano di Tecnologia, Genova, Italy & MIT, Cambridge, USA.
prevents large-scale applications of kernel methods. The theory of reproducing kernel
Hilbert spaces [9] provides a rigorous mathematical framework to study the properties
of the stochastic gradient method with random features. The approach we consider is not
based on penalizations or explicit constraints; regularization is implicit and controlled by
different parameters. In particular, our analysis shows how the number of random features,
iterations, step-size and mini-batch size control the stability and learning properties of the
solution. By deriving finite sample bounds, we investigate how optimal learning rates can
be achieved with different parameter choices. In particular, we show that, similarly to ridge
regression [10], a number of random features proportional to the square root of the number
of samples suffices for $O(1/\sqrt{n})$ error bounds.
The rest of the paper is organized as follows. We introduce the problem, the background and
the proposed algorithm in section 2. We present our main results in section 3 and illustrate
numerical experiments in section 4.
given only a training set of pairs $(x_i, y_i)_{i=1}^n \in (X \times \mathbb{R})^n$, $n \in \mathbb{N}$, sampled independently
according to ρ. Here the minimum is intended over all functions for which the above integral
is well defined, and ρ is assumed fixed but known only through the samples.
In practice, the search for a solution needs to be restricted to a suitable space of
hypotheses to allow efficient computations and reliable estimation [12]. In this paper, we
consider functions of the form
Here T ∈ N is the number of iterations and J = {j1, . . . , jbT} denotes the strategy to select
training set points. In particular, in this work we assume the points to be drawn uniformly
at random with replacement. Note that, given this sampling strategy, one pass over the
data is reached on average after ⌈n/b⌉ iterations. Our analysis allows considering multiple as
well as single passes. For b = 1 the above algorithm reduces to a simple stochastic gradient
iteration. For b > 1 it is a mini-batch version, where b points are used in each iteration to
compute a gradient estimate. The parameter γt is the step-size.
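To make the iteration concrete, the following is a minimal Python sketch of the mini-batch update just described, for the squared loss and a generic random feature map; function and variable names are illustrative, and the update is the standard stochastic gradient step for least squares on the features (the precise recursion used in the analysis is the one in Eq. (3)).

```python
import numpy as np

def sgd_random_features(phi, X, y, b, gamma, T, seed=0):
    """Mini-batch SGD for least squares on a random feature map phi.

    phi maps an (m, d) array of inputs to an (m, M) array of features.
    At each of the T iterations, b training points are drawn uniformly
    with replacement and a stochastic gradient of the squared loss is used.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    M = phi(X[:1]).shape[1]
    w = np.zeros(M)
    for _ in range(T):
        idx = rng.integers(0, n, size=b)        # uniform sampling with replacement
        Phi = phi(X[idx])                       # (b, M) feature matrix of the mini-batch
        grad = Phi.T @ (Phi @ w - y[idx]) / b   # gradient of the empirical squared loss
        w -= gamma * grad                       # constant step-size gamma
    return w
```

With b = 1 this is plain SGD on the random features; larger b gives the mini-batch variant, and gamma plays the role of the step-size γt discussed above.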
The algorithm requires specifying different parameters. In the following, we study how
their choice is related and can be performed to achieve optimal learning bounds. Before
doing this, we further discuss the class of feature maps we consider.
where σ : R → R is a nonlinear function, for example σ(a) = cos(a) [6], σ(a) = |a|+ =
max(a, 0), a ∈ R [7]. If we write the corresponding function (2) explicitly, we get
$$f(x) = \sum_{j=1}^{M} w_j\,\sigma(\langle s_j, x\rangle), \qquad \forall x \in X, \tag{5}$$
that is, a shallow neural network with random weights [7] (offsets can be added easily).
For many examples of random features, the inner product
$$\langle \phi_M(x), \phi_M(x')\rangle = \sum_{j=1}^{M} \sigma(\langle x, s_j\rangle)\,\sigma(\langle x', s_j\rangle) \tag{6}$$
converges to a well-known kernel as M increases, as the following example illustrates.
Example 1 (Random features and kernels). Let $\sigma(a) = \cos(a)$ and consider $\langle x, s\rangle + b$
in place of the inner product $\langle x, s\rangle$, with $s$ drawn from a Gaussian distribution
with variance $\sigma^2$, and $b$ uniformly from $[0, 2\pi]$. These are the so-called Fourier random
features and recover the Gaussian kernel $k(x, x') = e^{-\|x - x'\|^2/(2\sigma^2)}$ [6] as $M$ increases. If
instead $\sigma(a) = a$ and $s$ is sampled according to a Gaussian distribution with variance $\sigma^2$, the linear kernel
$k(x, x') = \sigma^2\langle x, x'\rangle$ is recovered in the limit [15].
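As a quick sanity check of Example 1, the sketch below builds Fourier random features with standard Gaussian frequencies and verifies by Monte Carlo that their inner product approaches the corresponding Gaussian kernel. The $\sqrt{2/M}$ normalization is one common convention (Eq. (6) leaves the scaling implicit), and all names are illustrative.

```python
import numpy as np

def fourier_features(X, S, q):
    """phi_M(x) with entries sqrt(2/M) * cos(<s_j, x> + q_j)."""
    M = S.shape[0]
    return np.sqrt(2.0 / M) * np.cos(X @ S.T + q)

rng = np.random.default_rng(0)
d, M = 5, 50000
S = rng.normal(size=(M, d))                  # standard Gaussian frequencies
q = rng.uniform(0.0, 2.0 * np.pi, size=M)    # uniform offsets in [0, 2*pi]

x, xp = rng.normal(size=d), rng.normal(size=d)
Phi = fourier_features(np.vstack([x, xp]), S, q)
approx = Phi[0] @ Phi[1]                                # <phi_M(x), phi_M(x')>
exact = np.exp(-np.linalg.norm(x - xp) ** 2 / 2.0)      # Gaussian kernel, unit bandwidth
print(approx, exact)                                    # close up to Monte Carlo error
```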
These last observations allow us to establish a connection with kernel methods [10] and
the theory of reproducing kernel Hilbert spaces [9]. Recall that a reproducing kernel
Hilbert space H is a Hilbert space of functions for which there is a symmetric positive
definite function $k : X \times X \to \mathbb{R}$, called reproducing kernel, such that $k(x, \cdot) \in H$ and
$\langle f, k(x, \cdot)\rangle = f(x)$ for all $f \in H$, $x \in X$. It is also useful to recall that $k$ is a reproducing
kernel if and only if there exists a Hilbert (feature) space $F$ and a (feature) map $\phi : X \to F$
such that
$$k(x, x') = \langle\phi(x), \phi(x')\rangle, \qquad \forall x, x' \in X, \tag{7}$$
where $F$ can be infinite dimensional.
The connection to RKHS is interesting in at least two ways. First, it allows using
results and techniques from the theory of RKHS to analyze random features. Second, it
shows that random features can be seen as an approach to derive scalable kernel methods
[10]. Indeed, kernel methods have complexity at least quadratic in the number of points,
while random features have complexity which is typically linear in the number of points.
From this point of view, the intuition behind random features is to relax (7), replacing the
exact feature map with the finite-dimensional approximation (6).
2.3 Related approaches
We comment on the connection to related algorithms. Random features are typically used
within an empirical risk minimization framework [18]. Results considering convex Lipschitz
loss functions and $\ell_\infty$ constraints are given in [19], while [20] considers $\ell_2$ constraints.
A ridge regression framework is considered in [8], where it is shown that it is possible
to achieve optimal statistical guarantees with a number of random features in the order
of $\sqrt{n}$. The combination of random features and gradient methods is less explored. A
stochastic coordinate descent approach is considered in [21], see also [22, 23]. A related
approach is based on subsampling and is often called Nyström method [24, 25]. Here a
shallow network is defined considering a nonlinearity which is a positive definite kernel,
and weights chosen as a subset of training set points. This idea can be used within a
penalized empirical risk minimization framework [26, 27, 28] but also considering gradient
[29, 30] and stochastic gradient [31] techniques. An empirical comparison between the Nyström
method, random features and the full kernel method is given in [23], where the empirical
risk minimization problem is solved by block coordinate descent. Note that numerous
works have combined stochastic gradient and kernel methods with no random projection
approximation [32, 33, 34, 35, 36, 5]. The above list of references is only partial and focuses
on papers providing theoretical analysis. In the following, after stating our main results, we
provide a further quantitative comparison with related results.
3 Main Results
In this section, we first discuss our main results under basic assumptions and then more
refined results under further conditions.
The above class of random features covers all the examples described in section 2.1, as
well as many others, see [8, 20] and references therein. Next we introduce the positive
definite kernel defined by the above random features. Let $k : X \times X \to \mathbb{R}$ be defined by
$$k(x, x') = \int \psi(x, \omega)\,\psi(x', \omega)\,d\pi(\omega), \qquad \forall x, x' \in X.$$
It is easy to check that k is a symmetric and positive definite kernel. To control basic
properties of the induced kernel (continuity, boundedness), we require the following assump-
tion, which is again satisfied by the examples described in section 2.1 (see also [8, 20] and
references therein).
Assumption 2. The function ψ is continuous and there exists κ ≥ 1 such that |ψ(x, ω)| ≤ κ
for any x ∈ X, ω ∈ Ω.
The kernel introduced above allows comparing random feature maps of different sizes
and expressing the regularity of the largest function class they induce. In particular, we
require a standard assumption in the context of non-parametric regression (see [11]), which
consists in assuming that the expected risk attains a minimum over the space of functions
induced by the kernel.
To conclude, we need some basic assumption on the data distribution. For all x ∈ X,
we denote by ρ(y|x) the conditional probability of ρ and by ρX the corresponding marginal
probability on X. We need a standard moment assumption to derive probabilistic results.
$$\mathbb{E}_J\,\mathcal{E}(\hat f_{t+1}) - \mathcal{E}(f_H) \;\lesssim\; \frac{\gamma}{b} + \left(\frac{\gamma t}{M} + 1\right)\frac{\gamma t\,\log\frac{1}{\delta}}{n} + \frac{1}{\gamma t} + \frac{1}{M}. \tag{11}$$
The above theorem bounds the excess risk with a sum of terms controlled by the different
parameters. The following corollary shows how these parameters can be chosen to derive
finite sample bounds.
Corollary 1. Under the same assumptions as Theorem 1, for one of the following conditions:

(c1.1) b = 1, γt ≃ 1/n, and T = n√n iterations (√n passes over the data);

(c1.2) b = 1, γt ≃ 1/√n, and T = n iterations (1 pass over the data);

(c1.3) b = √n, γt ≃ 1, and T = √n iterations (1 pass over the data);

(c1.4) b = n, γt ≃ 1, and T = √n iterations (√n passes over the data);

a number
$$M = \tilde{O}(\sqrt{n}) \tag{12}$$
of random features is sufficient to guarantee with high probability that
$$\mathbb{E}_J\,\mathcal{E}(\hat f_T) - \mathcal{E}(f_H) \;\lesssim\; \frac{1}{\sqrt{n}}. \tag{13}$$
The above learning rate is the same achieved by the exact kernel ridge regression (KRR)
estimator [11, 37, 38], which has been proved to be optimal in a minimax sense [11] under
the same assumptions. Further, the number of random features required to achieve this
bound is the same as for the kernel ridge regression estimator with random features [8]. Notice
that, in the limit where the number of random features grows to infinity, Corollary 1
under conditions (c1.2) and (c1.3) recovers the same results as one-pass SGD [39, 40].
In this limit, our results are also related to those in [41], where, however, averaging of the
iterates is used to achieve larger step-sizes.
Note that conditions (c1.1) and (c1.2) in the corollary above show that, when no mini-batches
are used (b = 1) and 1/n ≤ γ ≤ 1/√n, the step-size γ determines the number
of passes over the data required for optimal generalization. In particular, the number of
passes varies from constant, when γ = 1/√n, to √n, when γ = 1/n. In order to increase the
step-size beyond 1/√n, the algorithm needs to be run with mini-batches. The step-size can then
be increased up to a constant if b is chosen equal to √n (condition (c1.3)), requiring the
same number of passes over the data as setting (c1.2). Interestingly, condition (c1.4)
shows that increasing the mini-batch size beyond √n does not allow taking larger step-sizes,
while it seems to increase the number of passes over the data required to reach optimality.
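To make these trade-offs concrete, the small helper below spells out the parameter settings of Corollary 1 (constants and logarithmic factors omitted, names illustrative); each setting uses roughly √n random features and differs only in its computational profile.

```python
import math

def corollary1_parameters(n, condition="c1.3"):
    """Parameter choices (b, gamma, T, M) from Corollary 1, up to constants.

    All four settings target the O(1/sqrt(n)) excess-risk bound with
    M ~ sqrt(n) random features; they differ in batch size, step-size
    and number of passes over the data.
    """
    s = math.isqrt(n)                                  # ~ sqrt(n)
    settings = {
        "c1.1": dict(b=1, gamma=1.0 / n, T=n * s),     # sqrt(n) passes
        "c1.2": dict(b=1, gamma=1.0 / s, T=n),         # 1 pass
        "c1.3": dict(b=s, gamma=1.0, T=s),             # 1 pass
        "c1.4": dict(b=n, gamma=1.0, T=s),             # sqrt(n) passes
    }
    params = dict(settings[condition], M=s)
    params["passes"] = params["T"] * params["b"] / n
    return params

print(corollary1_parameters(10**4, "c1.3"))
# {'b': 100, 'gamma': 1.0, 'T': 100, 'M': 100, 'passes': 1.0}
```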
We now compare the time complexity of algorithm (3) with some closely related methods
which achieve the same optimal rate of 1/√n. Computing the classical KRR estimator [11] has
a complexity of roughly O(n³) in time and O(n²) in memory. Lowering this computational
cost is possible with random projection techniques. Both random features and the Nyström
method applied to KRR [8, 26] lower the time complexity to O(n²) and the memory complexity
to O(n√n) while preserving the statistical accuracy. The same time complexity is achieved by
the stochastic gradient method solving the full kernel problem [33, 36], but with the higher space
complexity of O(n²). The combination of the stochastic gradient iteration, random features
and mini-batches allows our algorithm to achieve a complexity of O(n√n) in time and
O(n) in space for certain choices of the free parameters (such as (c1.2) and (c1.3)). Note that
these time and memory complexities are lower than those of stochastic gradient
with mini-batches and Nyström approximation, which are O(n²) and O(n) respectively
[31]. A method with complexity similar to SGD with RF is FALKON [30], which
has a time complexity of O(n√n log(n)) and O(n) space complexity. This method
blends together Nyström approximation, a sketched preconditioner and conjugate gradient.
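For convenience, the complexities quoted in this paragraph are collected below (all methods attain the 1/√n rate under the assumptions discussed above):

Method                                        Time              Memory
KRR (exact) [11]                              O(n³)             O(n²)
KRR + random features / Nyström [8, 26]       O(n²)             O(n√n)
SGD, full kernel [33, 36]                     O(n²)             O(n²)
SGD + Nyström, mini-batches [31]              O(n²)             O(n)
FALKON [30]                                   O(n√n log n)      O(n)
SGD + random features (this paper)            O(n√n)            O(n)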
3.2 Refined analysis and fast rates
We next discuss how the above results can be refined under an additional regularity
assumption. We need some preliminary definitions. Let H be the RKHS defined by $k$, and let
$L : L^2(X, \rho_X) \to L^2(X, \rho_X)$ be the integral operator
$$Lf(x) = \int_X k(x, x')\,f(x')\,d\rho_X(x'), \qquad \forall f \in L^2(X, \rho_X),\ x \in X.$$
The operator $L$ is symmetric and positive definite. Moreover, Assumption 1 ensures that the kernel is
bounded, which in turn ensures $L$ is trace class, hence compact [18].
Assumption 5. For any $\lambda > 0$, define the effective dimension as $\mathcal{N}(\lambda) = \operatorname{Tr}((L + \lambda I)^{-1}L)$,
and assume there exist $Q > 0$ and $\alpha \in [0, 1]$ such that
$$\mathcal{N}(\lambda) \le Q\,\lambda^{-\alpha}.$$
The above assumption describes the capacity/complexity of the RKHS H and the
measure ρ. It is equivalent to classic entropy/covering number conditions, see e.g. [18]. The
case α = 1 corresponds to making no assumptions on the kernel and reduces to the worst-case
analysis in the previous section. The smaller α, the more stringent the capacity
condition. A classic example is $X = \mathbb{R}^D$ with $d\rho_X(x) = p(x)dx$, where $p$ is a
probability density, strictly positive and bounded away from zero, and H a Sobolev
space with smoothness $s > D/2$. Indeed, in this case $\alpha = D/(2s)$ and classical nonparametric
statistics assumptions are recovered as a special case. Note that in particular the worst
case is $s = D/2$.
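For instance (an illustrative choice of dimensions), for inputs in $\mathbb{R}^3$ and a Sobolev space of smoothness $s = 3$ the assumption holds with
$$\alpha = \frac{D}{2s} = \frac{3}{6} = \frac{1}{2},$$
which is the value used in the worked instance after Corollary 2 below.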
The following theorem is a refined version of Theorem 1 where we also consider the
above capacity condition (Assumption 5).
Theorem 2. Let $n, M \in \mathbb{N}_+$, $\delta \in (0, 1)$ and $t \in [T]$. Under Assumptions 1 to 4, for $b \in [n]$ and
$\gamma_t = \gamma$ such that $\gamma \le \frac{n}{9T\log\frac{n}{\delta}} \wedge \frac{1}{8(1+\log T)}$, $n \ge 32\log^2\frac{2}{\delta}$ and $M \gtrsim \gamma T$, the following holds with
high probability:
$$\mathbb{E}_J\,\mathcal{E}(\hat f_{t+1}) - \mathcal{E}(f_H) \;\lesssim\; \frac{\gamma}{b} + \left(\frac{\gamma t}{M} + 1\right)\frac{\mathcal{N}\!\big(\frac{1}{\gamma t}\big)\log\frac{1}{\delta}}{n} + \frac{1}{\gamma t} + \frac{1}{M}. \tag{15}$$
The main difference is the presence of the effective dimension providing a sharper control
of the stability of the considered estimator. As before, explicit learning bounds can be
derived considering different parameter settings.
Corollary 2. Under the same assumptions as Theorem 2, for one of the following conditions:

(c2.1) b = 1, γt ≃ 1/n, and $T = n^{\frac{2+\alpha}{1+\alpha}}$ iterations ($n^{\frac{1}{1+\alpha}}$ passes over the data);

(c2.2) b = 1, $\gamma_t \simeq n^{-\frac{1}{1+\alpha}}$, and $T = n^{\frac{2}{1+\alpha}}$ iterations ($n^{\frac{1-\alpha}{1+\alpha}}$ passes over the data);

(c2.3) $b = n^{\frac{1}{1+\alpha}}$, γt ≃ 1, and $T = n^{\frac{1}{1+\alpha}}$ iterations ($n^{\frac{1-\alpha}{1+\alpha}}$ passes over the data);

(c2.4) b = n, γt ≃ 1, and $T = n^{\frac{1}{1+\alpha}}$ iterations ($n^{\frac{1}{1+\alpha}}$ passes over the data);

a number
$$M = \tilde{O}\big(n^{\frac{1}{1+\alpha}}\big) \tag{16}$$
of random features suffices to guarantee with high probability that
$$\mathbb{E}_J\,\mathcal{E}(\hat w_T) - \mathcal{E}(f_H) \;\lesssim\; n^{-\frac{1}{1+\alpha}}. \tag{17}$$
The corollary above shows that multi-pass SGD achieves a learning rate that is the
same as kernel ridge regression under the regularity Assumption 5 and is again minimax
optimal (see [11] with r = 1/2 and γ = α). Moreover, we obtain the minimax optimal
rate with the same number of random features required for ridge regression with random
features [8] under the same assumptions. Finally, when the number of random features goes
to infinity, we also recover the results for the infinite-dimensional case of the single-pass and
multiple-pass stochastic gradient method [33] when fH ∈ H.
It is worth noting that, under the additional regularity Assumption 5, the number of
both random features and passes over the data sufficient for optimal learning rates increases
with respect to the one required in the worst case (see Corollary 1). The same effect
occurs in the context of ridge regression with random features, as noted in [8]. In this latter
paper, it is observed that this issue can be tackled using more refined, possibly more costly,
sampling schemes [20].
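As a concrete instance of the corollary (assuming the Sobolev example above with $s = D$, hence $\alpha = 1/2$), condition (c2.3) with (16) and (17) reads
$$b \simeq n^{\frac{2}{3}}, \quad \gamma_t \simeq 1, \quad T \simeq n^{\frac{2}{3}}\ \big(n^{\frac{1}{3}}\ \text{passes}\big), \quad M = \tilde{O}\big(n^{\frac{2}{3}}\big), \qquad \mathbb{E}_J\,\mathcal{E}(\hat w_T) - \mathcal{E}(f_H) \lesssim n^{-\frac{2}{3}}.$$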
Finally, we present a general result from which all our previous results follow as special
cases. We consider a more general setting where we allow decreasing step-sizes.
Theorem 3. Let $n, M, T \in \mathbb{N}$, $b \in [n]$ and $\gamma > 0$. Let $\delta \in (0, 1)$ and $\hat w_{t+1}$ be the estimator
in Eq. (3) with $\gamma_t = \gamma\kappa^{-2} t^{-\theta}$ and $\theta \in [0, 1[$. Under Assumptions 1 to 4, when $n \ge 32\log^2\frac{2}{\delta}$,
$$\gamma \;\le\; \begin{cases} \dfrac{n^{\theta\wedge(1-\theta)}}{9\,T^{1-\theta}\log\frac{n}{\delta}} & \theta \in\, ]0,1[,\\[10pt] \dfrac{n}{9\,T\log\frac{n}{\delta}} \wedge \dfrac{1}{8(1+\log T)} & \text{otherwise,} \end{cases} \tag{18}$$
and moreover
$$M \ge \big(4 + 18\,\gamma\, T^{1-\theta}\big)\log\frac{12\,\gamma\, T^{1-\theta}}{\delta}, \tag{19}$$
then, for any $t \in [T]$ the following holds with probability at least $1 - 9\delta$:
$$\mathbb{E}_J\,\mathcal{E}(\hat w_{t+1}) - \inf_{w\in F}\mathcal{E}(w) \;\le\; c_1\,(\log t \vee 1)\,\frac{\gamma}{b\,t^{\min(\theta,1-\theta)}} \tag{20}$$
$$\qquad +\; c_2\,\frac{\big(\log^2(t)\vee 1\big)\log^2\frac{M}{\delta}}{M} \;+\; c_3\,\frac{\mathcal{N}\!\big(\frac{\kappa^2}{\gamma t^{1-\theta}}\big)\log^2\frac{4}{\delta}}{n}\,\big(\log \gamma t^{1-\theta}\vee 1\big) \tag{21}$$
$$\qquad +\; c_4\left(\frac{1}{\gamma\,t^{1-\theta}} + \frac{1}{M}\log\frac{M}{\delta}\right) \;+\; c_5\,\frac{\log^2\!\big(\gamma t^{1-\theta}\big)}{M}\log\frac{2}{\delta} \;+\; c_6\,\frac{1}{\gamma\,t^{1-\theta}}, \tag{22}$$
We note that, as the number of random features M goes to infinity, we recover the same
bound of [33] for decreasing step-sizes and fH ∈ H. Moreover, the above theorem shows
that there is no apparent gain in using a decreasing step-size (i.e. θ > 0) with respect to the
regimes identified in Corollaries 1 and 2.
3.3 Sketch of the Proof
In this section, we sketch the main ideas in the proof. We relate $\hat f_t$ and $f_H$ by introducing
several intermediate functions. In particular, the following iterations are useful:
$$\hat v_1 = 0; \qquad \hat v_{t+1} = \hat v_t - \gamma_t\,\frac{1}{n}\sum_{i=1}^{n}\big(\langle \hat v_t, \phi_M(x_i)\rangle - y_i\big)\,\phi_M(x_i), \qquad \forall t \in [T]. \tag{23}$$
$$\tilde v_1 = 0; \qquad \tilde v_{t+1} = \tilde v_t - \gamma_t\int\big(\langle \tilde v_t, \phi_M(x)\rangle - y\big)\,\phi_M(x)\,d\rho(x, y), \qquad \forall t \in [T]. \tag{24}$$
$$v_1 = 0; \qquad v_{t+1} = v_t - \gamma_t\int_X\big(\langle v_t, \phi_M(x)\rangle - f_H(x)\big)\,\phi_M(x)\,d\rho_X(x), \qquad \forall t \in [T]. \tag{25}$$
Further, we let
$$\tilde u_\lambda = \operatorname*{argmin}_{u\in\mathbb{R}^M}\,\int_X\big(\langle u, \phi_M(x)\rangle - f_H(x)\big)^2\, d\rho_X(x) + \lambda\|u\|^2, \qquad \lambda > 0, \tag{26}$$
$$u_\lambda = \operatorname*{argmin}_{u\in F}\,\int\big(\langle u, \phi(x)\rangle - y\big)^2\, d\rho(x, y) + \lambda\|u\|^2, \qquad \lambda > 0, \tag{27}$$
where $(F, \phi)$ are the feature space and feature map associated to the kernel $k$. The first three
vectors are defined by the random features and can be seen as empirical and population
batch gradient descent iterations. The last two vectors can be seen as population versions
of ridge regression defined by the random features and by the feature map $\phi$, respectively.
Since the above objects (23), (24), (25), (26), (27) belong to different spaces, instead of
comparing them directly we compare the functions in $L^2(X, \rho_X)$ associated to them, letting
$$\hat g_t = \langle \hat v_t, \phi_M(\cdot)\rangle, \quad \tilde g_t = \langle \tilde v_t, \phi_M(\cdot)\rangle, \quad g_t = \langle v_t, \phi_M(\cdot)\rangle, \quad \tilde g_\lambda = \langle \tilde u_\lambda, \phi_M(\cdot)\rangle, \quad g_\lambda = \langle u_\lambda, \phi(\cdot)\rangle.$$
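In the notation above, writing $\hat f_t$ for the function defined by the SGD iterate of Eq. (3), the excess risk is then controlled by splitting the error into differences between consecutive objects, mirroring the decomposition (47)–(52) in the Appendix:
$$\hat f_t - f_H = (\hat f_t - \hat g_t) + (\hat g_t - \tilde g_t) + (\tilde g_t - g_t) + (g_t - \tilde g_\lambda) + (\tilde g_\lambda - g_\lambda) + (g_\lambda - f_H).$$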
The first two terms control how SGD deviates from batch gradient descent and the effect
of noise and sampling. They are studied in Lemmas 1, 2, 3, 4, 5, 6 in the Appendix, borrowing
and adapting ideas from [33, 36, 8]. The following terms account for the approximation
properties of random features and the bias of the algorithm. Here the basic idea and novel
result is the study of how the population gradient descent and ridge regression are related
(Lemma 8 in the Appendix). Then, results from the analysis of ridge regression
with random features are used [8].
Figure 1: Classification error on the SUSY (left) and HIGGS (right) datasets as the number of random features varies.
Figure 2: Classification error on the SUSY (left) and HIGGS (right) datasets as step-size and mini-batch size vary.
4 Experiments
We study the behavior of the SGD with RF algorithm on subsets of $n = 2\times 10^5$ points of
the SUSY² and HIGGS³ datasets [42]. The measures we show in the following experiments
are averages over 10 repetitions of the algorithm. Further, we consider random Fourier
features, which are known to approximate translation invariant kernels [6]. We use random
features of the form $\psi(x, \omega) = \cos(w^\top x + q)$, with $\omega := (w, q)$, $w$ sampled according to the
normal distribution and $q$ sampled uniformly at random between $0$ and $2\pi$. Note that the
random features defined this way satisfy Assumption 2.
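A compact, self-contained sketch of this experimental pipeline is given below; synthetic binary labels stand in for the SUSY/HIGGS data, the features are normalized by $1/\sqrt{M}$ so that Assumption 2 holds with κ = 1 (a convention we assume), and the batch size, step-size and number of passes follow the setup described in this section. For brevity, the error is computed on the training subset rather than a separate test set.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20000, 18
M = int(np.sqrt(n))                            # ~ sqrt(n) random features

# Synthetic stand-in for SUSY/HIGGS: binary labels from a noisy nonlinear rule
X = rng.normal(size=(n, d))
y = np.sign(np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=n))

# Random Fourier features psi(x, omega) = cos(w^T x + q)
W = rng.normal(size=(M, d))
q = rng.uniform(0.0, 2.0 * np.pi, size=M)

def phi(Z):
    return np.cos(Z @ W.T + q) / np.sqrt(M)    # bounded features, ||phi(x)|| <= 1

# Mini-batch SGD: b = sqrt(n), step-size 1, 5 passes over the data
b, gamma = int(np.sqrt(n)), 1.0
T = 5 * (n // b)
w = np.zeros(M)
for _ in range(T):
    idx = rng.integers(0, n, size=b)
    Phi = phi(X[idx])
    w -= gamma * Phi.T @ (Phi @ w - y[idx]) / b

err = np.mean(np.sign(phi(X) @ w) != y)        # classification error (training subset)
print(f"classification error: {err:.3f}")
```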
Our theoretical analysis suggests that a number of RF of the order of √n suffices
to attain optimal learning properties. Hence we study how the number of RF affects the
accuracy of the algorithm on test sets of $10^5$ points. In Figure 1 we show the classification
error after 5 passes over the data of SGD with RF as the number of RF increases, with a
fixed batch size of √n and a step-size of 1. We can observe that over a certain threshold of
the order of √n, increasing the number of RF does not improve the accuracy, confirming
what our theoretical results suggest.

² https://archive.ics.uci.edu/ml/datasets/SUSY
³ https://archive.ics.uci.edu/ml/datasets/HIGGS
Further, the theory suggests that the step-size can be increased as the mini-batch size
increases to reach an optimal accuracy, and that for a mini-batch size larger than the order of
√n more than one pass over the data is required to reach the same accuracy. We show in
Figure 2 the classification error of SGD with RF after 1 pass over the data, with a fixed
number of random features √n, as mini-batch size and step-size vary, on test sets of $10^5$
points. As suggested by the theory, to reach the lowest error as the mini-batch size grows the
step-size needs to grow as well. Further, for mini-batch sizes bigger than √n the lowest error
cannot be reached in only one pass, even when increasing the step-size.
5 Conclusions
In this paper we investigated the combination of sketching and stochastic techniques in
the context of non-parametric regression. In particular, we studied the statistical and
computational properties of the estimator defined by stochastic gradient descent with
multiple passes, mini-batches and random features. We proved that the estimator achieves
optimal statistical properties with a number of random features in the order of √n (with n
the number of examples). Moreover, we analyzed possible trade-offs between the number
of passes, the step-size and the dimension of the mini-batches, showing that there exist
different configurations which achieve the same optimal statistical guarantees, with different
computational impacts.
Our work can be extended in several ways. First, (a) we can study the effect of combining
random features with accelerated/averaged stochastic techniques as in [32]. Second, (b) we
can extend our analysis to consider more refined assumptions, generalizing [35] to SGD
with random features. Additionally, (c) we can study the statistical properties of the
considered estimator in the context of classification, with the goal of showing fast decay
of the classification error, as in [34]. Moreover, (d) we can apply the proposed method
in the more general context of least squares frameworks for multitask learning [43, 44]
or structured prediction [45, 46, 47], with the goal of obtaining faster algorithms while
retaining strong statistical guarantees. Finally, (e) we can integrate our analysis with more
refined methods to select the random features, analogously to [48, 49] in the context of
column sampling.
Acknowledgments.
This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM),
funded by NSF STC award CCF-1231216, and the Italian Institute of Technology. We gratefully
acknowledge the support of NVIDIA Corporation for the donation of the Titan Xp GPUs and the
Tesla k40 GPU used for this research. L. R. acknowledges the support of the AFOSR projects
FA9550-17-1-0390 and BAA-AFRL-AFOSR-2016-0007 (European Office of Aerospace Research and
Development), and the EU H2020-MSCA-RISE project NoMADS - DLV-777826. A. R. acknowledges
the support of the European Research Council (grant SEQUOIA 724063).
References
[1] Alekh Agarwal, Sahand Negahban, and Martin J Wainwright. Stochastic optimization
and sparse statistical recovery: Optimal algorithms for high dimensions. In Advances
in Neural Information Processing Systems, pages 1538–1546, 2012.
[2] Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals
of mathematical statistics, pages 400–407, 1951.
[3] Haim Avron, Vikas Sindhwani, and David Woodruff. Sketching structured matrices
for faster nonlinear regression. In Advances in neural information processing systems,
pages 2994–3002, 2013.
[4] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances
in neural information processing systems, pages 161–168, 2008.
[5] Francesco Orabona. Simultaneous model selection and optimization through parameter-
free stochastic learning. In Advances in Neural Information Processing Systems, pages
1116–1124, 2014.
[6] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In
Advances in neural information processing systems, pages 1177–1184, 2008.
[7] Youngmin Cho and Lawrence K Saul. Kernel methods for deep learning. In Advances
in neural information processing systems, pages 342–350, 2009.
[8] Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with
random features. In Advances in Neural Information Processing Systems 30, pages
3215–3225. 2017.
[10] Bernhard Schölkopf and Alexander J Smola. Learning with kernels: support vector
machines, regularization, optimization, and beyond. 2002.
[11] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-
squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
[12] Luc Devroye, László Györfi, and Gábor Lugosi. A probabilistic theory of pattern
recognition, volume 31. Springer Science & Business Media, 2013.
[13] David P Woodruff et al. Sketching as a tool for numerical linear algebra. Foundations
and Trends in Theoretical Computer Science, 10(1–2):1–157, 2014.
[14] Bharath Sriperumbudur and Zoltán Szabó. Optimal rates for random fourier features.
In Advances in Neural Information Processing Systems, pages 1144–1152, 2015.
[15] Raffay Hamid, Ying Xiao, Alex Gittens, and Dennis DeCoste. Compact random feature
maps. In International Conference on Machine Learning, pages 19–27, 2014.
[16] X Yu Felix, Ananda Theertha Suresh, Krzysztof M Choromanski, Daniel N Holtmann-
Rice, and Sanjiv Kumar. Orthogonal random features. In Advances in Neural Infor-
mation Processing Systems, pages 1975–1983, 2016.
[17] Quoc Le, Tamás Sarlós, and Alex Smola. Fastfood-approximating kernel expansions
in loglinear time. In Proceedings of the international conference on machine learning,
volume 85, 2013.
[18] Ingo Steinwart and Andreas Christmann. Support vector machines. Springer Science
& Business Media, 2008.
[19] Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing
minimization with randomization in learning. In Advances in neural information
processing systems, pages 1313–1320, 2009.
[20] Francis Bach. On the equivalence between kernel quadrature rules and random feature
expansions. Journal of Machine Learning Research, 18(21):1–38, 2017.
[21] Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina F Balcan, and
Le Song. Scalable kernel methods via doubly stochastic gradients. In Advances in
Neural Information Processing Systems, pages 3041–3049, 2014.
[22] Junhong Lin and Lorenzo Rosasco. Generalization properties of doubly online learning
algorithms. arXiv preprint arXiv:1707.00577, 2017.
[23] Stephen Tu, Rebecca Roelofs, Shivaram Venkataraman, and Benjamin Recht. Large
scale kernel learning using block coordinate descent. arXiv preprint arXiv:1602.05310,
2016.
[24] Alex J Smola and Bernhard Schölkopf. Sparse greedy matrix approximation for machine
learning. 2000.
[25] Christopher KI Williams and Matthias Seeger. Using the nyström method to speed up
kernel machines. In Advances in neural information processing systems, pages 682–688,
2001.
[26] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström
computational regularization. In Advances in Neural Information Processing Systems,
pages 1657–1665, 2015.
[27] Yun Yang, Mert Pilanci, and Martin J Wainwright. Randomized sketches for kernels:
Fast and optimal non-parametric regression. arXiv preprint arXiv:1501.06195, 2015.
[28] Ahmed Alaoui and Michael W Mahoney. Fast randomized kernel ridge regression with
statistical guarantees. In Advances in Neural Information Processing Systems, pages
775–783, 2015.
[29] Raffaello Camoriano, Tomás Angles, Alessandro Rudi, and Lorenzo Rosasco. Nytro:
When subsampling meets early stopping. In Artificial Intelligence and Statistics, pages
1403–1411, 2016.
[30] Alessandro Rudi, Luigi Carratino, and Lorenzo Rosasco. FALKON: An optimal large
scale kernel method. In Advances in Neural Information Processing Systems, pages
3891–3901, 2017.
[31] Junhong Lin and Lorenzo Rosasco. Optimal rates for learning with nyström stochastic
gradient methods. arXiv preprint arXiv:1710.07797, 2017.
[32] Aymeric Dieuleveut, Nicolas Flammarion, and Francis Bach. Harder, better, faster,
stronger convergence rates for least-squares regression. The Journal of Machine
Learning Research, 18(1):3520–3570, 2017.
[33] Junhong Lin and Lorenzo Rosasco. Optimal rates for multi-pass stochastic gradient
methods. Journal of Machine Learning Research, 18(97):1–47, 2017.
[34] Loucas Pillaud-Vivien, Alessandro Rudi, and Francis Bach. Exponential convergence
of testing error for stochastic gradient methods. In Proceedings of the 31st Conference
On Learning Theory, volume 75, pages 250–296, 2018.
[35] Loucas Pillaud-Vivien, Alessandro Rudi, and Francis Bach. Statistical optimality of
stochastic gradient descent on hard learning problems through multiple passes. arXiv
preprint arXiv:1805.10074, 2018.
[36] Lorenzo Rosasco and Silvia Villa. Learning with incremental iterative regularization.
In Advances in Neural Information Processing Systems, pages 1630–1638, 2015.
[37] Ingo Steinwart, Don R Hush, Clint Scovel, et al. Optimal rates for regularized least
squares regression. In COLT, 2009.
[38] Junhong Lin, Alessandro Rudi, Lorenzo Rosasco, and Volkan Cevher. Optimal rates
for spectral algorithms with least-squares regression over hilbert spaces. Applied and
Computational Harmonic Analysis, 2018.
[39] Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization:
Convergence results and optimal averaging schemes. In International Conference on
Machine Learning, pages 71–79, 2013.
[40] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed on-
line prediction using mini-batches. Journal of Machine Learning Research, 13(Jan):165–
202, 2012.
[42] Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Searching for exotic particles in
high-energy physics with deep learning. Nature communications, 5:4308, 2014.
[43] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multi-task
feature learning. Machine Learning, 73(3):243–272, 2008.
[44] Carlo Ciliberto, Alessandro Rudi, Lorenzo Rosasco, and Massimiliano Pontil. Consistent
multitask learning with nonlinear output relations. In Advances in Neural Information
Processing Systems, pages 1986–1996, 2017.
[45] Carlo Ciliberto, Lorenzo Rosasco, and Alessandro Rudi. A consistent regularization
approach for structured prediction. Advances in Neural Information Processing Systems
29 (NIPS), pages 4412–4420, 2016.
[46] Anton Osokin, Francis Bach, and Simon Lacoste-Julien. On structured prediction
theory with calibrated convex surrogate losses. In Advances in Neural Information
Processing Systems, pages 302–313, 2017.
[47] Carlo Ciliberto, Francis Bach, and Alessandro Rudi. Localized structured prediction.
arXiv preprint arXiv:1806.02402, 2018.
[48] Petros Drineas, Malik Magdon-Ismail, Michael W Mahoney, and David P Woodruff.
Fast approximation of matrix coherence and statistical leverage. Journal of Machine
Learning Research, 13(Dec):3475–3506, 2012.
[49] Alessandro Rudi, Daniele Calandriello, Luigi Carratino, and Lorenzo Rosasco. On fast
leverage score sampling and optimal learning. arXiv preprint arXiv:1810.13258, 2018.
[50] Felipe Cucker and Steve Smale. On the mathematical foundations of learning. Bulletin
of the American mathematical society, 39(1):1–49, 2002.
[51] Ernesto De Vito, Lorenzo Rosasco, Andrea Caponnetto, Umberto De Giovannini, and
Francesca Odone. Learning from examples as an inverse problem. Journal of Machine
Learning Research, 6(May):883–904, 2005.
[52] Alessandro Rudi, Guillermo D Canas, and Lorenzo Rosasco. On the sample complexity
of subspace learning. In Advances in Neural Information Processing Systems, pages
2067–2075, 2013.
A Appendix
We start by recalling some definitions and defining some new operators.
then
$$P f_\rho = S f_H,$$
or equivalently, there exists $g \in L^2(X, \rho_X)$ such that
$$P f_\rho = L^{\frac12} g.$$
In particular, we have $R := \|f_H\|_H = \|g\|_{L^2(X,\rho_X)}$.
With the operators introduced above and Remark 1, we can rewrite the auxiliary objects
(23), (24), (25), (26), (27) respectively as
$$\hat v_1 = 0; \qquad \hat v_{t+1} = (I - \gamma_t \hat C_M)\,\hat v_t + \gamma_t\, \hat S_M^{*}\, \hat y, \qquad \forall t \in [T], \tag{39}$$
$$\tilde v_1 = 0; \qquad \tilde v_{t+1} = (I - \gamma_t C_M)\,\tilde v_t + \gamma_t\, S_M^{*} f_\rho, \qquad \forall t \in [T], \tag{40}$$
$$v_1 = 0; \qquad v_{t+1} = (I - \gamma_t C_M)\,v_t + \gamma_t\, S_M^{*} P f_\rho, \qquad \forall t \in [T], \tag{41}$$
$$\tilde u_\lambda = S_M^{*} L_{M,\lambda}^{-1} P f_\rho, \tag{42}$$
$$u_\lambda = S^{*} L_{\lambda}^{-1} P f_\rho. \tag{43}$$

$$\begin{aligned} S_M \hat w_t - P f_\rho \;=\;& \; S_M \hat w_t - S_M \hat v_t &&\text{(47)}\\ &+ S_M \hat v_t - S_M \tilde v_t &&\text{(48)}\\ &+ S_M \tilde v_t - S_M v_t &&\text{(49)}\\ &+ S_M v_t - L_M L_{M,\lambda}^{-1} P f_\rho &&\text{(50)}\\ &+ L_M L_{M,\lambda}^{-1} P f_\rho - L L_\lambda^{-1} P f_\rho &&\text{(51)}\\ &+ L L_\lambda^{-1} P f_\rho - P f_\rho. &&\text{(52)}\end{aligned}$$
A.3 Lemmas
The first three lemmas we present are technical lemmas used when bounding the first
three terms (47), (48), (49) of the error decomposition.
$$\|\tilde v_t - v_t\| = 0 \quad a.s. \tag{53}$$
Proof. Given (45), (46) and defining $A_{Mt} = \sum_{i=1}^{t}\gamma_i\prod_{k=i+1}^{t}(I - \gamma_k C_M)$, we can write
$$\|\tilde v_t - v_t\| = \big\|A_{Mt}\, S_M^{*}(I - P)f_\rho\big\| \le \|A_{Mt}\|\,\big\|S_M^{*}(I - P)\big\|\,\|f_\rho\|. \tag{54}$$
Under Assumption 2, by Lemma 2 of [8], we have $\|S_M^{*}(I - P)\| = 0$, which completes the
proof.
Lemma 2. Let $M \in \mathbb{N}$. Under Assumptions 2 and 3, let $\gamma_t\kappa^2 \le 1$ and $\delta \in\, ]0, 1]$. Then the following
holds with probability $1 - \delta$ for all $t \in [T]$:
$$\|\tilde v_{t+1}\| \;\le\; 2R + \sqrt{\frac{36R^2\kappa^2}{M}\log\frac{M}{\delta}}\;\max\!\left(\Big(\sum_{i=1}^{t}\gamma_i\Big)^{\!\frac12},\,\kappa^{-1}\right). \tag{55}$$
Proof. We have
$$\|\tilde v_{t+1}\| \le \|\tilde v_{t+1} - v_{t+1}\| + \|v_{t+1}\| = \|v_{t+1}\|, \tag{56}$$
where in the last equality we used the result from Lemma 1. Using Assumption 3 (see also
Remark 1), we derive
$$\|v_{t+1}\| = \bigg\|\sum_{i=1}^{t}\gamma_i\, S_M^{*}\prod_{k=i+1}^{t}(I - \gamma_k L_M)\,P f_\rho\bigg\| \;\le\; R\,\bigg\|\sum_{i=1}^{t}\gamma_i\, S_M^{*}\prod_{k=i+1}^{t}(I - \gamma_k L_M)\,L^{\frac12}\bigg\|. \tag{57}$$
Define $Q_{Mt} = \sum_{i=1}^{t}\gamma_i\, S_M^{*}\prod_{k=i+1}^{t}(I - \gamma_k L_M)$. Note that $\|L_{M,\eta}^{-1/2} L^{1/2}\| \le 2$ holds with
probability $1-\delta$ when $\frac{9\kappa^2}{M}\log\frac{M}{\delta} \le \eta \le \|L\|$ (see Lemma 5 in [26]). Moreover, when
$\eta \ge \|L\|$, we have that $\|L_{M,\eta}^{-1/2} L^{1/2}\| \le \eta^{-1/2}\|L^{1/2}\| \le 1$. So $\|L_{M,\eta}^{-1/2} L^{1/2}\| \le 2$ with
probability $1-\delta$, when $\frac{9\kappa^2}{M}\log\frac{M}{\delta} \le \eta$. So
$$R\,\big\|Q_{Mt} L^{\frac12}\big\| \le R\,\big\|Q_{Mt} L_{M,\eta}^{\frac12}\big\|\,\big\|L_{M,\eta}^{-\frac12} L^{\frac12}\big\| \le 2R\,\big\|Q_{Mt} L_{M,\eta}^{\frac12}\big\| \le 2R\left(\big\|Q_{Mt} L_M^{\frac12}\big\| + \eta^{\frac12}\,\|Q_{Mt}\|\right) \tag{58}$$
(see Lemma B.10(i) in [36] or Lemma 16 of [33]). We use (59) with $r = \frac12$ and $r = 0$ to
bound $\|Q_{Mt} L_M^{1/2}\|$ and $\|Q_{Mt}\|$ respectively and plug the results in (58). To complete the
proof we take $\eta = \frac{9\kappa^2}{M}\log\frac{M}{\delta}$.
Lemma 3. Let $\lambda > 0$, $R \in \mathbb{N}$ and $\delta \in (0, 1)$. Let $\zeta_1, \dots, \zeta_R$ be i.i.d. random vectors
bounded by $\kappa > 0$. Denote by $Q_R = \frac{1}{R}\sum_{j=1}^{R}\zeta_j\otimes\zeta_j$ and by $Q$ the expectation of $Q_R$.
Then, for any $\lambda \ge \frac{9\kappa^2}{R}\log\frac{R}{\delta}$, we have
paper, we have that $\|(Q_R + \lambda I)^{-1/2}(Q + \lambda I)^{1/2}\| \le 2$, with probability at least $1-\delta$. To
cover the case $\lambda > \|Q\|$, note that
We need the following technical lemma that complements Proposition 10 of [8] when
λ ≥ kLk, and that we will need for the proof of Lemma 6.
Lemma 4. Let $M \in \mathbb{N}$ and $\delta \in (0, 1]$. For any $\lambda > 0$ such that
$$M \ge \left(4 + \frac{18\kappa^2}{\lambda}\right)\log\frac{12\kappa^2}{\lambda\delta},$$
the following holds with probability $1-\delta$:
$$\mathcal{N}_M(\lambda) := \int_X \big\|(L_M + \lambda I)^{-\frac12}\phi_M(x)\big\|^2\, d\rho_X(x) \;\le\; \max\!\left(2.55,\; \frac{2\kappa^2}{\|L\|}\right)\mathcal{N}(\lambda).$$
Proof. First of all note that
$$\mathcal{N}_M(\lambda) := \int_X \big\|(L_M + \lambda I)^{-\frac12}\phi_M(x)\big\|^2\, d\rho_X(x) = \operatorname{Tr}\big(L_{M,\lambda}^{-\frac12} L_M L_{M,\lambda}^{-\frac12}\big) = \operatorname{Tr}\big(L_{M,\lambda}^{-1} L_M\big).$$
Now consider the case when $\lambda \le \|L\|$. By applying Proposition 10 of [8] we have that, under
the required condition on $M$, the following holds with probability at least $1 - \delta$.
For the case $\lambda > \|L\|$, note that $\operatorname{Tr}(A A_\lambda^{-1})$ satisfies the following inequality for any trace
class positive linear operator $A$ with trace bounded by $\kappa^2$ and $\lambda > 0$:
$$\frac{\|A\|}{\|A\| + \lambda} \;\le\; \operatorname{Tr}\big(A A_\lambda^{-1}\big) \;\le\; \frac{\operatorname{Tr}(A)}{\lambda}.$$
So, when $\lambda > \|L\|$, since $\mathcal{N}_M(\lambda) = \operatorname{Tr}(C_M C_{M\lambda}^{-1})$ and $\mathcal{N}(\lambda) = \operatorname{Tr}(L L_\lambda^{-1})$, and both $L$ and
$\hat C_M$ have trace bounded by $\kappa^2$, we have $\mathcal{N}_M(\lambda) \le \frac{\kappa^2}{\lambda}$ and $\mathcal{N}(\lambda) \ge \frac{\|L\|}{\|L\|+\lambda}$. So, by selecting
$q = \frac{\kappa^2(\|L\|+\lambda)}{\lambda\|L\|}$, we have
$$\mathcal{N}_M(\lambda) \le \frac{\kappa^2}{\lambda} = q\,\frac{\|L\|}{\|L\|+\lambda} \le q\,\mathcal{N}(\lambda).$$
Finally note that
$$q \;\le\; \sup_{\lambda > \|L\|}\frac{\kappa^2(\|L\|+\lambda)}{\lambda\|L\|} \;\le\; 2\,\frac{\kappa^2}{\|L\|}.$$
We now start bounding the different parts of the error decomposition. The next two lemmas
bound the first two terms, (47) and (48). To bound these we require the above lemmas and
adapt ideas from [33, 36, 8].
Lemma 5. Under Assumptions 2 and 4, let $\delta \in\, ]0, 1[$, $n \ge 32\log^2\frac{2}{\delta}$, and $\gamma_t = \gamma\kappa^{-2}t^{-\theta}$ for
all $t \in [T]$, with $\theta \in [0, 1[$ and $\gamma$ such that
$$0 < \gamma \le \frac{t^{\min(\theta,1-\theta)}}{8(\log t + 1)}, \qquad \forall t \in [T]. \tag{60}$$
When
$$\frac{1}{\gamma\, t^{1-\theta}} \ge \frac{9}{n}\log\frac{n}{\delta} \tag{61}$$
for all $t \in [T]$, with probability at least $1 - 2\delta$,
$$\mathbb{E}_J\,\big\|S_M(\hat w_{t+1} - \hat v_{t+1})\big\|^2 \;\le\; \frac{208\,B_p}{(1-\theta)\,b}\,\gamma\, t^{-\min(\theta,1-\theta)}\,(\log t \vee 1). \tag{62}$$
Proof. The proof is derived by applying Proposition 6 in [33] with $\gamma$ satisfying condition
(60), $\lambda = \frac{1}{\gamma_t t}$, $\delta_2 = \delta_3 = \delta$, and some changes that we now describe. Instead of the stochastic
iteration $w_t$ and the batch gradient iteration $\nu_t$ as defined in [33] we consider (3) and (39)
respectively, as well as the operators $S_M, C_M, L_M, \hat S_M, \hat C_M, \hat L_M$ defined in Section 2 instead
of $S_\rho, T_\rho, L_\rho, S_{\mathbf x}, T_{\mathbf x}, L_{\mathbf x}$ defined in [33]. Instead of assuming that there exists a $\kappa \ge 1$ for which
$\langle x, x'\rangle \le \kappa^2$, $\forall x, x' \in X$, we have Assumption 2, which implies the same $\kappa^2$ upper bound
on the operators used in the proof. To apply this version of Proposition 6 note that their
Equation (63) is satisfied by Lemma 25 of [33], while their Equation (47) is satisfied by our
Lemma 3, from which we obtain the condition (61).
Lemma 6. Under Assumptions 2, 4 and 3, let $\delta \in\, ]0, 1[$ and $\gamma_t = \gamma\kappa^{-2}t^{-\theta}$ for all $t \in [T]$,
with $\gamma \in\, ]0, 1]$ and $\theta \in [0, 1[$. When
$$M \ge \big(4 + 18\,\gamma\, t^{1-\theta}\big)\log\frac{12\,\gamma\, t^{1-\theta}}{\delta}, \tag{63}$$
for all $t \in [T]$, with probability at least $1 - 3\delta$,
$$\big\|S_M(\hat v_{t+1} - \tilde v_{t+1})\big\| \;\le\; 4\sqrt{2}\left(R\left(1+\sqrt{\frac{32R^2}{M}\log\frac{M}{\delta}}\right)\Big(\sqrt{\gamma t^{1-\theta}}\vee 1\Big)+\sqrt{B}\right)\times$$
$$\qquad\times\left(\frac{8\sqrt{\gamma t^{1-\theta}}}{(1-\theta)\,n}\,\big(4\log t + 4 + 2\gamma\big) + \frac{2\sqrt{q_0\,\mathcal{N}\!\big(\frac{\kappa^2}{\gamma t^{1-\theta}}\big)}}{\sqrt{n}}\right)\log\frac{4}{\delta}, \tag{64}$$
where $q_0 = \max\!\big(2.55, \frac{2\kappa^2}{\|L\|}\big)$.
Proof. The proof can be derived from the one of Theorem 5 in [33] with $\lambda = \frac{1}{\gamma_t t}$, $\delta_1 = \delta_2 = \delta$,
and some changes that we now describe. Instead of the iterations $\nu_t$ and $\mu_t$ defined in [33] we
consider (39) and (40) respectively, as well as the operators $S_M, C_M, L_M, \hat S_M, \hat C_M, \hat L_M$
defined in Section 2 instead of $S_\rho, T_\rho, L_\rho, S_{\mathbf x}, T_{\mathbf x}, L_{\mathbf x}$ defined in [33]. Instead of assuming
that there exists a $\kappa \ge 1$ for which $\langle x, x'\rangle \le \kappa^2$, $\forall x, x' \in X$, we have Assumption 2, which implies
the same $\|C_M\| \le \kappa^2$ upper bound on the operators used in the proof. Further, when in the
proof we need to bound $\|v_{t+1}\|$ we use our Lemma 2 instead of Lemma 16 of [33]. In addition,
instead of Lemma 18 of [33] we use Lemma 6 of [8], together with Lemma 4, obtaining the
desired result with probability $1 - 3\delta$, when $M$ satisfies $M \ge (4 + 18\gamma_t t)\log\frac{12\gamma_t t}{\delta}$. Under
the assumption that $\gamma_t = \gamma\kappa^{-2}t^{-\theta}$, the two conditions above can be rewritten as (63).
The next lemma states that the third term (49) of the error decomposition is equal to zero.

Lemma 7. Under Assumption 3 the following holds for any $t, M, n \in \mathbb{N}$:
$$\big\|S_M \tilde v_t - S_M v_t\big\| = 0 \quad a.s. \tag{65}$$
Proof. From Lemma 1 and the definition of the operator norm the result follows trivially.
The next lemma is one of our main contributions and studies how the population gradient
descent with RF and ridge regression with RF are related (50).

Lemma 8. Under Assumption 3 the following holds with probability $1 - \delta$ for all $t \in [T]$:
$$\Big\|S_M v_{t+1} - L_{M,\lambda}^{-1} L_M\, P f_\rho\Big\|_{\rho} \;\le\; 2R\left(\lambda^{\frac12} + \Big(\frac{1}{2e\sum_{k=1}^{t}\gamma_k}\Big)^{\frac12} + 2\sqrt{\frac{9\kappa^2}{M}\log\frac{M}{\delta}}\right). \tag{66}$$
Proof. We start by noting that, given the sequence (41) and applying it recursively,
$$S_M v_{t+1} = \sum_{i=1}^{t}\gamma_i\prod_{k=i+1}^{t}(I-\gamma_k L_M)\,L_M\,P f_\rho = \sum_{i=1}^{t}\big(I - (I-\gamma_i L_M)\big)\prod_{k=i+1}^{t}(I-\gamma_k L_M)\,P f_\rho$$
$$= \sum_{i=1}^{t}\prod_{k=i+1}^{t}(I-\gamma_k L_M)\,P f_\rho \;-\; \sum_{i=1}^{t}\prod_{k=i}^{t}(I-\gamma_k L_M)\,P f_\rho = \left(I - \prod_{k=1}^{t}(I-\gamma_k L_M)\right)P f_\rho \tag{67}$$
and also
$$L_{M,\lambda}^{-1} L_M\, P f_\rho = \big(I - \lambda L_{M,\lambda}^{-1}\big)\,P f_\rho. \tag{68}$$
Now taking the difference between (67) and (68),
$$\left(I - \prod_{k=1}^{t}(I-\gamma_k L_M)\right)P f_\rho - \big(I - \lambda L_{M,\lambda}^{-1}\big)P f_\rho = \left(\lambda L_{M,\lambda}^{-1} - \prod_{k=1}^{t}(I-\gamma_k L_M)\right)P f_\rho = \left(\lambda L_{M,\lambda}^{-1} - \prod_{k=1}^{t}(I-\gamma_k L_M)\right)L^{\frac12} g. \tag{69}$$
Now letting $A_{M\lambda t} = \lambda L_{M,\lambda}^{-1} - \prod_{k=1}^{t}(I-\gamma_k L_M)$, taking the norm of the above quantity, and
recalling that $R \ge \|g\|$, we have
$$\big\|A_{M\lambda t}\, L^{\frac12} g\big\| \le R\,\big\|A_{M\lambda t}\, L_{M,\eta}^{\frac12}\big\|\,\big\|L_{M,\eta}^{-\frac12} L^{\frac12}\big\| \tag{70}$$
$$\le R\,\big\|A_{M\lambda t}\big(L_M^{\frac12} + \eta^{\frac12} I\big)\big\|\,\big\|L_{M,\eta}^{-\frac12} L^{\frac12}\big\| \tag{71}$$
$$\le 2R\,\big\|A_{M\lambda t}\big(L_M^{\frac12} + \eta^{\frac12} I\big)\big\|, \tag{72}$$
where in the last inequality we used the fact that $\|L_{M,\eta}^{-1/2} L^{1/2}\| \le 2$ holds with probability
$1-\delta$ when $\frac{9\kappa^2}{M}\log\frac{M}{\delta} \le \eta \le \|L\|$ (see Lemma 5 in [26]). Further, note that when $\eta \ge \|L\|$,
we have that $\|L_{M,\eta}^{-1/2} L^{1/2}\| \le \eta^{-1/2}\|L^{1/2}\| \le 1$. This allows us to say that $\|L_{M,\eta}^{-1/2} L^{1/2}\| \le 2$
with probability $1-\delta$, when $\frac{9\kappa^2}{M}\log\frac{M}{\delta} \le \eta$. We now split (72) into two terms:
$$2R\,\big\|A_{M\lambda t}\big(L_M^{\frac12} + \eta^{\frac12} I\big)\big\| \le 2R\,\big\|A_{M\lambda t} L_M^{\frac12}\big\| + 2R\,\eta^{\frac12}\,\|A_{M\lambda t}\|. \tag{73}$$
Considering the last term in the right hand side of (75), using results from Lemma 15 in
[33] we can write
$$\bigg\|\prod_{k=1}^{t}(I-\gamma_k L_M)\,L_M^{\frac12}\bigg\| \le \left(\frac{1}{2e\sum_{k=1}^{t}\gamma_k}\right)^{\frac12}. \tag{76}$$
For the first term in the right hand side of (75) we can derive
$$\big\|\lambda L_{M,\lambda}^{-1} L_M^{\frac12}\big\| \le \lambda^{\frac12}\,\big\|\lambda^{\frac12} L_{M,\lambda}^{-\frac12}\big\|\,\big\|L_{M,\lambda}^{-\frac12} L_M^{\frac12}\big\| \le \lambda^{\frac12}. \tag{77}$$
Plugging (74), (76) and (77) in (72), and choosing $\eta = \frac{9\kappa^2}{M}\log\frac{M}{\delta}$, we complete the proof.
The next lemma is a known result, from Lemma 8 of [8], which bounds the distance between
the Tikhonov solution with RF and the Tikhonov solution without RF (51).

Lemma 9. Under Assumptions 2 and 3, for any $\lambda > 0$ and $\delta \in (0, 1/2]$, when
$$M \ge \left(4 + \frac{18\kappa^2}{\lambda}\right)\log\frac{8\kappa^2}{\lambda\delta}, \tag{78}$$
the following holds with probability at least $1 - 2\delta$:
$$\big\|L L_\lambda^{-1} P f_\rho - L_M L_{M,\lambda}^{-1} P f_\rho\big\| \le 8R\,\frac{1}{\sqrt{M}}\,\log\frac{11\kappa^2}{\lambda}\,\sqrt{\log\frac{2}{\delta}}. \tag{79}$$
The last result is a classical bound on the approximation error for the Tikhonov filter (52),
see [11].

Lemma 10 (From [11] or Lemma 5 of [8]). Under Assumption 3,
$$\big\|L L_\lambda^{-1} P f_\rho - P f_\rho\big\| \le R\,\lambda^{\frac12}. \tag{80}$$
when (81) holds. Similarly, from Lemma 10,
$$\big\|L L_\lambda^{-1} P f_\rho - P f_\rho\big\|^2 \le \frac{R^2\kappa^2}{\gamma\,t^{1-\theta}}. \tag{85}$$
The desired result is obtained by gathering the results in (62), (82), (83), (84), (85), and
requiring $\gamma$, $M$ to satisfy the associated conditions (81), (60), (61). In particular, note that
(60) is satisfied when $\theta = 0$ by $\gamma \le (8(\log T + 1))^{-1}$, while, if $\theta > 0$, we have
$$\inf_{z\ge 1}\;\frac{z}{8\left(\frac{\log z}{\min(\theta,1-\theta)}+1\right)} \;\ge\; e^{-\min(\theta,1-\theta)}\,\frac{e\,\min(\theta,1-\theta)}{8} \;\ge\; e^{-\min(\theta,1-\theta)}\,\frac{\min(\theta,1-\theta)}{4},$$
where we performed the change of variable $t^{\min(\theta,1-\theta)} = z$. Finally, note that $e^{-\min(\theta,1-\theta)} \ge e^{-1/2}$ for any $\theta \in (0,1)$. Moreover, (81) and (61) are satisfied for any $t \in [T]$ by requiring
them to hold for $t = T$.