
Nips 2007


The Tradeoffs of Large Scale Learning

Léon Bottou
NEC laboratories of America
Princeton, NJ 08540, USA
leon@bottou.org

Olivier Bousquet
Google Zürich
8002 Zurich, Switzerland
olivier.bousquet@m4x.org

Abstract

This contribution develops a theoretical framework that takes into account the
effect of approximate optimization on learning algorithms. The analysis shows
distinct tradeoffs for the case of small-scale and large-scale learning problems.
Small-scale learning problems are subject to the usual approximation–estimation
tradeoff. Large-scale learning problems are subject to a qualitatively different
tradeoff involving the computational complexity of the underlying optimization
algorithms in non-trivial ways.

1 Motivation

The computational complexity of learning algorithms has seldom been taken into account by
learning theory. Valiant [1] states that a problem is “learnable” when there exists a probably
approximately correct learning algorithm with polynomial complexity. Whereas much progress has been
made on the statistical aspect (e.g., [2, 3, 4]), very little has been said about the complexity side of
this proposal (e.g., [5]).
Computational complexity becomes the limiting factor when one envisions large amounts of training
data. Two important examples come to mind:

• Data mining exists because competitive advantages can be achieved by analyzing the
masses of data that describe the life of our computerized society. Since virtually every
computer generates data, the data volume is proportional to the available computing power.
Therefore one needs learning algorithms that scale roughly linearly with the total volume
of data.
• Artificial intelligence attempts to emulate the cognitive capabilities of human beings. Our
biological brains can learn quite efficiently from the continuous streams of perceptual data
generated by our six senses, using limited amounts of sugar as a source of power. This
observation suggests that there are learning algorithms whose computing time requirements
scale roughly linearly with the total volume of data.

This contribution finds its source in the idea that approximate optimization algorithms might be
sufficient for learning purposes. The first part proposes a new decomposition of the test error where
an additional term represents the impact of approximate optimization. In the case of small-scale
learning problems, this decomposition reduces to the well known tradeoff between approximation
error and estimation error. In the case of large-scale learning problems, the tradeoff is more com-
plex because it involves the computational complexity of the learning algorithm. The second part
explores the asymptotic properties of the large-scale learning tradeoff for various prototypical learn-
ing algorithms under various assumptions regarding the statistical estimation rates associated with
the chosen objective functions. This part clearly shows that the best optimization algorithms are not
necessarily the best learning algorithms. Maybe more surprisingly, certain algorithms perform well
regardless of the assumed rate for the statistical estimation error.
2 Approximate Optimization
2.1 Setup

Following [6, 2], we consider a space of input-output pairs (x, y) ∈ X × Y endowed with a proba-
bility distribution P (x, y). The conditional distribution P (y|x) represents the unknown relationship
between inputs and outputs. The discrepancy between the predicted output ŷ and the real output
y is measured with a loss function ℓ(ŷ, y). Our benchmark is the function f ∗ that minimizes the
expected risk
\[ E(f) = \int \ell(f(x), y)\, dP(x, y) = \mathbb{E}\,[\ell(f(x), y)], \]
that is,
\[ f^*(x) = \arg\min_{\hat{y}} \mathbb{E}\,[\,\ell(\hat{y}, y) \mid x\,]. \]

Although the distribution P (x, y) is unknown, we are given a sample S of n independently drawn
training examples (xi , yi ), i = 1 . . . n. We define the empirical risk
\[ E_n(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i) = \mathbb{E}_n[\ell(f(x), y)]. \]
Our first learning principle consists in choosing a family F of candidate prediction functions and
finding the function fn = arg minf ∈F En (f ) that minimizes the empirical risk. Well known com-
binatorial results (e.g., [2]) support this approach provided that the chosen family F is sufficiently
restrictive. Since the optimal function f ∗ is unlikely to belong to the family F, we also define
fF∗ = arg minf ∈F E(f ). For simplicity, we assume that f ∗ , fF∗ and fn are well defined and unique.
We can then decompose the excess error as
\[ \mathbb{E}\,[E(f_n) - E(f^*)] = \mathbb{E}\,[E(f_{\mathcal{F}}^*) - E(f^*)] + \mathbb{E}\,[E(f_n) - E(f_{\mathcal{F}}^*)] = \mathcal{E}_{app} + \mathcal{E}_{est}, \tag{1} \]
where the expectation is taken with respect to the random choice of training set. The approximation
error Eapp measures how closely functions in F can approximate the optimal solution f ∗ . The
estimation error Eest measures the effect of minimizing the empirical risk En (f ) instead of the
expected risk E(f ). The estimation error is determined by the number of training examples and by
the capacity of the family of functions [2]. Large families of functions (see footnote 1) have smaller
approximation errors but lead to higher estimation errors. This tradeoff has been extensively discussed
in the literature [2, 3] and leads to excess errors that scale between the inverse and the inverse square
root of the number of examples [7, 8].
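
The quantities defined in this section can be illustrated with a minimal Python sketch. It computes the empirical risk En(f) under the squared loss and selects the empirical minimizer fn from a small finite family F of linear predictors. The toy sample, the finite family, the choice of loss, and the names `empirical_risk` and `erm` are assumptions made for illustration only.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, float]          # one (x, y) pair with scalar input and output
Predictor = Callable[[float], float]   # a candidate prediction function f


def squared_loss(y_hat: float, y: float) -> float:
    """Loss l(y_hat, y); the squared loss is an illustrative choice."""
    return (y_hat - y) ** 2


def empirical_risk(f: Predictor, sample: Sequence[Example]) -> float:
    """E_n(f): average loss of f over the n training examples."""
    return sum(squared_loss(f(x), y) for x, y in sample) / len(sample)


def erm(family: Sequence[Predictor], sample: Sequence[Example]) -> Predictor:
    """f_n: the member of the family F with the smallest empirical risk."""
    return min(family, key=lambda f: empirical_risk(f, sample))


# Toy sample and a finite family F of linear predictors x -> a * x (hypothetical data).
sample: List[Example] = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1), (3.0, 2.9)]
slopes = [0.0, 0.5, 1.0, 1.5]
family: List[Predictor] = [lambda x, a=a: a * x for a in slopes]

f_n = erm(family, sample)
print(empirical_risk(f_n, sample))
```

Enlarging the list of slopes plays the role of enlarging F: the best achievable risk can only decrease, while the selected predictor becomes more sensitive to the particular sample.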

2.2 Optimization Error

Finding fn by minimizing the empirical risk En (f ) is often a computationally expensive operation.


Since the empirical risk En (f ) is already an approximation of the expected risk E(f ), it should
not be necessary to carry out this minimization with great accuracy. For instance, we could stop an
iterative optimization algorithm long before its convergence.
Let us assume that our minimization algorithm returns an approximate solution f̃n such that
\[ E_n(\tilde{f}_n) < E_n(f_n) + \rho \]
where ρ ≥ 0 is a predefined tolerance. An additional term Eopt = E[E(f̃n) − E(fn)] then appears
in the decomposition of the excess error E = E[E(f̃n) − E(f∗)]:
\[
\begin{aligned}
\mathcal{E} &= \mathbb{E}\,[E(f_{\mathcal{F}}^*) - E(f^*)] + \mathbb{E}\,[E(f_n) - E(f_{\mathcal{F}}^*)] + \mathbb{E}\,[E(\tilde{f}_n) - E(f_n)] \\
&= \mathcal{E}_{app} + \mathcal{E}_{est} + \mathcal{E}_{opt}.
\end{aligned}
\tag{2}
\]


We call this additional term optimization error. It reflects the impact of the approximate optimization
on the generalization performance. Its magnitude is comparable to ρ (see section 3.1).
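
As a rough illustration of stopping an iterative optimizer early, the sketch below runs gradient descent on a one-parameter least-squares problem and stops as soon as the empirical risk is within ρ of its minimum. The closed-form minimizer is only available because the toy problem is quadratic; the data, step size, and function names are assumptions, not part of the original analysis.

```python
def empirical_risk(a, sample):
    """E_n for the predictor x -> a * x under the squared loss."""
    return sum((a * x - y) ** 2 for x, y in sample) / len(sample)


def exact_minimizer(sample):
    """Closed-form least-squares slope, standing in for f_n."""
    return sum(x * y for x, y in sample) / sum(x * x for x, _ in sample)


def approximate_minimizer(sample, rho, step=0.05, a0=0.0, max_iter=10_000):
    """Gradient descent stopped once E_n(a) < E_n(f_n) + rho (rho > 0)."""
    target = empirical_risk(exact_minimizer(sample), sample) + rho
    a, iterations = a0, 0
    while empirical_risk(a, sample) >= target and iterations < max_iter:
        grad = sum(2.0 * (a * x - y) * x for x, y in sample) / len(sample)
        a -= step * grad
        iterations += 1
    return a, iterations


# Hypothetical toy sample.
sample = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1), (3.0, 2.9)]
a_loose, k_loose = approximate_minimizer(sample, rho=1e-2)
a_tight, k_tight = approximate_minimizer(sample, rho=1e-6)
print(k_loose, k_tight)
```

A looser tolerance ρ stops after fewer iterations, which is the computing time that the large-scale tradeoff later spends on processing more examples.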
Footnote 1: We often consider nested families of functions of the form Fc = {f ∈ H, Ω(f ) ≤ c}. Then, for each
value of c, function fn is obtained by minimizing the regularized empirical risk En (f ) + λΩ(f ) for a suitable
choice of the Lagrange coefficient λ. We can then control the estimation-approximation tradeoff by choosing
λ instead of c.

2.3 The Approximation–Estimation–Optimization Tradeoff

This decomposition leads to a more complicated compromise. It involves three variables and two
constraints. The constraints are the maximal number of available training examples and the maximal
computation time. The variables are the size of the family of functions F, the optimization accuracy
ρ, and the number of examples n. This is formalized by the following optimization problem.

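\[
\min_{\mathcal{F},\,\rho,\,n} \; \mathcal{E} = \mathcal{E}_{app} + \mathcal{E}_{est} + \mathcal{E}_{opt}
\quad \text{subject to} \quad
\begin{cases} n \le n_{\max} \\ T(\mathcal{F}, \rho, n) \le T_{\max} \end{cases}
\tag{3}
\]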
The number n of training examples is a variable because we could choose to use only a subset of
the available training examples in order to complete the optimization within the allotted time. This
happens often in practice. Table 1 summarizes the typical evolution of the quantities of interest when
the three variables F, n, and ρ increase.

Table 1: Typical variations when F, n, and ρ increase.


                               F     n     ρ
Eapp (approximation error)     ↘
Eest (estimation error)        ↗     ↘
Eopt (optimization error)      ···   ···   ↗
T    (computation time)        ↗     ↗     ↘

The solution of the optimization program (3) depends critically on which budget constraint is active:
constraint n ≤ nmax on the number of examples, or constraint T ≤ Tmax on the training time.

• We speak of a small-scale learning problem when (3) is constrained by the maximal number
of examples nmax . Since the computing time is not limited, we can reduce the optimization
error Eopt to insignificant levels by choosing ρ arbitrarily small. The excess error is then
dominated by the approximation and estimation errors, Eapp and Eest . Taking n = nmax ,
we recover the approximation-estimation tradeoff that is the object of abundant literature.
• We speak of a large-scale learning problem when (3) is constrained by the maximal computing
time Tmax . Approximate optimization, that is, choosing ρ > 0, can possibly achieve
better generalization because more training examples can be processed during the allowed
time. The specifics depend on the computational properties of the chosen optimization
algorithm through the expression of the computing time T (F, ρ, n).
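
To make the role of the active constraint concrete, here is a small sketch that assumes a gradient descent cost model T = n d κ log(1/ρ), that is, the bound O(ndκ log(1/ρ)) discussed in section 3.2 with the hidden constant set to 1. The time units, the constant, and the function names are assumptions chosen for illustration.

```python
import math


def gd_time(n, d, kappa, rho):
    """Assumed cost model T(F, rho, n) = n * d * kappa * log(1/rho), with 0 < rho < 1."""
    return n * d * kappa * math.log(1.0 / rho)


def max_examples(t_max, n_max, d, kappa, rho):
    """Largest n satisfying both n <= n_max and T(F, rho, n) <= t_max."""
    affordable = int(t_max // (d * kappa * math.log(1.0 / rho)))
    return min(n_max, affordable)


def regime(t_max, n_max, d, kappa, rho):
    """'small-scale' if all n_max examples fit in the time budget, else 'large-scale'."""
    return "small-scale" if gd_time(n_max, d, kappa, rho) <= t_max else "large-scale"


# Hypothetical budgets: a looser accuracy rho leaves time for more examples.
n_tight = max_examples(t_max=1e9, n_max=10**6, d=100, kappa=10.0, rho=1e-6)
n_loose = max_examples(t_max=1e9, n_max=10**6, d=100, kappa=10.0, rho=1e-2)
print(n_tight, n_loose, regime(1e9, 10**6, 100, 10.0, 1e-6))
```

Under this model the time budget, not the data set size, caps n in the large-scale regime, so increasing ρ directly buys additional training examples.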

3 The Asymptotics of Large-scale Learning


In the previous section, we have extended the classical approximation-estimation tradeoff by taking
into account the optimization error. We have given an objective criterion to distinguish small-scale
and large-scale learning problems. In the small-scale case, we recover the classical tradeoff between
approximation and estimation. The large-scale case is substantially different because it involves
the computational complexity of the learning algorithm. In order to clarify the large-scale learning
tradeoff with sufficient generality, this section makes several simplifications:

• We are studying upper bounds of the approximation, estimation, and optimization er-
rors (2). It is often accepted that these upper bounds give a realistic idea of the actual
convergence rates [9, 10, 11, 12]. Another way to find comfort in this approach is to say
that we study guaranteed convergence rates instead of the possibly pathological special
cases.
• We are studying the asymptotic properties of the tradeoff when the problem size increases.
Instead of carefully balancing the three terms, we write E = O(Eapp ) + O(Eest ) + O(Eopt )
and only need to ensure that the three terms decrease with the same asymptotic rate.
• We are considering a fixed family of functions F and therefore avoid taking into account
the approximation error Eapp . This part of the tradeoff covers a wide spectrum of practical
realities such as choosing models and choosing features. In the context of this work, we do
not believe we can meaningfully address this without discussing, for instance, the thorny
issue of feature selection. Instead we focus on the choice of optimization algorithm.
• Finally, in order to keep this paper short, we consider that the family of functions F is
linearly parametrized by a vector w ∈ Rd . We also assume that x, y and w are bounded,
ensuring that there is a constant B such that 0 ≤ ℓ(fw (x), y) ≤ B and ℓ(·, y) is Lipschitz.
We first explain how the uniform convergence bounds provide convergence rates that take the op-
timization error into account. Then we discuss and compare the asymptotic learning properties of
several optimization algorithms.

3.1 Convergence of the Estimation and Optimization Errors

The optimization error Eopt depends directly on the optimization accuracy ρ. However, the accuracy
ρ involves the empirical quantity En (f˜n ) − En (fn ), whereas the optimization error Eopt involves
its expected counterpart E(f̃n) − E(fn). This section discusses the impact of the optimization
error Eopt and of the optimization accuracy ρ on generalization bounds that leverage the uniform
convergence concepts pioneered by Vapnik and Chervonenkis (e.g., [2]).
In this discussion, we use the letter c to refer to any positive constant. Multiple occurrences of the
letter c do not necessarily imply that the constants have identical values.

3.1.1 Simple Uniform Convergence Bounds


Recall that we assume that F is linearly parametrized by w ∈ Rd . Elementary uniform convergence
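results then state that
\[ \mathbb{E}\Big[ \sup_{f \in \mathcal{F}} |E(f) - E_n(f)| \Big] \le c \sqrt{\frac{d}{n}}, \]
where the expectation is taken with respect to the random choice of the training set (see footnote 2). This result
immediately provides a bound on the estimation error:
\[
\begin{aligned}
\mathcal{E}_{est} &= \mathbb{E}\Big[ \big(E(f_n) - E_n(f_n)\big) + \big(E_n(f_n) - E_n(f_{\mathcal{F}}^*)\big) + \big(E_n(f_{\mathcal{F}}^*) - E(f_{\mathcal{F}}^*)\big) \Big] \\
&\le 2\, \mathbb{E}\Big[ \sup_{f \in \mathcal{F}} |E(f) - E_n(f)| \Big] \le c \sqrt{\frac{d}{n}}.
\end{aligned}
\]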
This same result also provides a combined bound for the estimation and optimization errors:
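\[
\begin{aligned}
\mathcal{E}_{est} + \mathcal{E}_{opt} &= \mathbb{E}\big[E(\tilde{f}_n) - E_n(\tilde{f}_n)\big] + \mathbb{E}\big[E_n(\tilde{f}_n) - E_n(f_n)\big] \\
&\quad + \mathbb{E}\big[E_n(f_n) - E_n(f_{\mathcal{F}}^*)\big] + \mathbb{E}\big[E_n(f_{\mathcal{F}}^*) - E(f_{\mathcal{F}}^*)\big] \\
&\le c \sqrt{\frac{d}{n}} + \rho + 0 + c \sqrt{\frac{d}{n}} = c \left( \rho + \sqrt{\frac{d}{n}} \right).
\end{aligned}
\]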
Unfortunately, this convergence rate is known to be pessimistic in many important cases. More
sophisticated bounds are required.
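
For readers who want the term-by-term justification of the combined bound in section 3.1.1, the following block spells out one way to bound each of the four terms. It only uses the definitions of f̃n, fn, and fF∗ given earlier and the uniform convergence result; the labels (i) to (iv) are introduced here for convenience.

```latex
\begin{aligned}
\text{(i)}\quad & \mathbb{E}\big[E(\tilde{f}_n) - E_n(\tilde{f}_n)\big]
   \le \mathbb{E}\Big[\sup_{f \in \mathcal{F}} |E(f) - E_n(f)|\Big] \le c\sqrt{d/n}
   && \text{since } \tilde{f}_n \in \mathcal{F}, \\
\text{(ii)}\quad & \mathbb{E}\big[E_n(\tilde{f}_n) - E_n(f_n)\big] \le \rho
   && \text{by the definition of the tolerance } \rho, \\
\text{(iii)}\quad & \mathbb{E}\big[E_n(f_n) - E_n(f_{\mathcal{F}}^*)\big] \le 0
   && \text{since } f_n \text{ minimizes } E_n \text{ over } \mathcal{F} \ni f_{\mathcal{F}}^*, \\
\text{(iv)}\quad & \mathbb{E}\big[E_n(f_{\mathcal{F}}^*) - E(f_{\mathcal{F}}^*)\big]
   \le \mathbb{E}\Big[\sup_{f \in \mathcal{F}} |E(f) - E_n(f)|\Big] \le c\sqrt{d/n}
   && \text{since } f_{\mathcal{F}}^* \in \mathcal{F}.
\end{aligned}
```

Summing (i) to (iv) and absorbing constants into c gives the stated rate c(ρ + √(d/n)).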

3.1.2 Faster Rates in the Realizable Case


When the loss function ℓ(ŷ, y) is positive, with probability 1 − e^(−τ) for any τ > 0, relative uniform
convergence bounds state that
\[ \sup_{f \in \mathcal{F}} \frac{E(f) - E_n(f)}{\sqrt{E(f)}} \le c \sqrt{\frac{d}{n} \log\frac{n}{d} + \frac{\tau}{n}}. \]
This result is very useful because it provides faster convergence rates O(log n/n) in the realizable
case, that is when ℓ(fn (xi ), yi ) = 0 for all training examples (xi , yi ). We have then En (fn ) = 0,
En (f̃n ) ≤ ρ, and we can write
\[ E(\tilde{f}_n) - \rho \le c \sqrt{E(\tilde{f}_n) \left( \frac{d}{n} \log\frac{n}{d} + \frac{\tau}{n} \right)}. \]
Footnote 2: Although the original Vapnik-Chervonenkis bounds have the form c √((d/n) log(n/d)), the logarithmic
term can be eliminated using the “chaining” technique (e.g., [10]).

Viewing this as a second degree polynomial inequality in variable √E(f̃n), we obtain
\[ E(\tilde{f}_n) \le c \left( \rho + \frac{d}{n} \log\frac{n}{d} + \frac{\tau}{n} \right). \]
Integrating this inequality using a standard technique (see, e.g., [13]), we obtain a better convergence
rate of the combined estimation and optimization error:
\[ \mathcal{E}_{est} + \mathcal{E}_{opt} = \mathbb{E}\big[E(\tilde{f}_n) - E(f_{\mathcal{F}}^*)\big] \le \mathbb{E}\big[E(\tilde{f}_n)\big] = c \left( \rho + \frac{d}{n} \log\frac{n}{d} \right). \]

3.1.3 Fast Rate Bounds


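Many authors (e.g., [10, 4, 12]) obtain fast statistical estimation rates in more general conditions.
These bounds have the general form
\[ \mathcal{E}_{app} + \mathcal{E}_{est} \le c \left( \mathcal{E}_{app} + \left( \frac{d}{n} \log\frac{n}{d} \right)^{\alpha} \right) \quad \text{for } \tfrac{1}{2} \le \alpha \le 1. \tag{4} \]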
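This result holds when one can establish the following variance condition:
\[ \forall f \in \mathcal{F} \quad \mathbb{E}\Big[ \big( \ell(f(X), Y) - \ell(f_{\mathcal{F}}^*(X), Y) \big)^2 \Big] \le c\, \big( E(f) - E(f_{\mathcal{F}}^*) \big)^{2 - \frac{1}{\alpha}}. \tag{5} \]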

The convergence rate of (4) is described by the exponent α which is determined by the quality of
the variance bound (5). Works on fast statistical estimation identify two main ways to establish such
a variance condition.
• Exploiting the strict convexity of certain loss functions [12, theorem 12]. For instance, Lee
et al. [14] establish an O(log n/n) rate using the squared loss ℓ(ŷ, y) = (ŷ − y)².
• Making assumptions on the data distribution. In the case of pattern recognition problems,
for instance, the “Tsybakov condition” indicates how cleanly the posterior distributions
P (y|x) cross near the optimal decision boundary [11, 12]. The realizable case discussed in
section 3.1.2 can be viewed as an extreme case of this.
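Despite their much greater complexity, fast rate estimation results can accommodate the optimization
accuracy ρ using essentially the methods illustrated in sections 3.1.1 and 3.1.2. We then obtain a
bound of the form
\[ \mathcal{E} = \mathcal{E}_{app} + \mathcal{E}_{est} + \mathcal{E}_{opt} = \mathbb{E}\big[E(\tilde{f}_n) - E(f^*)\big] \le c \left( \mathcal{E}_{app} + \left( \frac{d}{n} \log\frac{n}{d} \right)^{\alpha} + \rho \right). \tag{6} \]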
For instance, a general result with α = 1 is provided by Massart [13, theorem 4.2]. Combining this
result with standard bounds on the complexity of classes of linear functions (e.g., [10]) yields the
following result:
\[ \mathcal{E} = \mathcal{E}_{app} + \mathcal{E}_{est} + \mathcal{E}_{opt} = \mathbb{E}\big[E(\tilde{f}_n) - E(f^*)\big] \le c \left( \mathcal{E}_{app} + \frac{d}{n} \log\frac{n}{d} + \rho \right). \tag{7} \]
See also [15, 4] for more bounds taking into account the optimization accuracy.

3.2 Gradient Optimization Algorithms

We now discuss and compare the asymptotic learning properties of four gradient optimization
algorithms. Recall that the family of functions F is linearly parametrized by w ∈ Rd . Let wF∗ and wn
correspond to the functions fF∗ and fn defined in section 2.1. In this section, we assume that the
functions w ↦ ℓ(fw (x), y) are convex and twice differentiable with continuous second derivatives.
Convexity ensures that the empirical cost function C(w) = En (fw ) has a single minimum.
Two matrices play an important role in the analysis: the Hessian matrix H and the gradient covari-
ance matrix G, both measured at the empirical optimum wn .
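\[ H = \frac{\partial^2 C}{\partial w^2}(w_n) = \mathbb{E}_n\!\left[ \frac{\partial^2 \ell(f_{w_n}(x), y)}{\partial w^2} \right], \tag{8} \]
\[ G = \mathbb{E}_n\!\left[ \left( \frac{\partial \ell(f_{w_n}(x), y)}{\partial w} \right) \left( \frac{\partial \ell(f_{w_n}(x), y)}{\partial w} \right)' \right]. \tag{9} \]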
The relation between these two matrices depends on the chosen loss function. In order to summarize
them, we assume that there are constants λmax ≥ λmin > 0 and ν > 0 such that, for any η > 0,
we can choose the number of examples n large enough to ensure that the following assertion is true
with probability greater than 1 − η :
tr(G H −1 ) ≤ ν and EigenSpectrum(H) ⊂ [ λmin , λmax ] (10)
The condition number κ = λmax /λmin characterizes the optimization difficulty (e.g., [16]).
The condition λmin > 0 avoids complications with stochastic gradient algorithms. Note that this
condition only implies strict convexity around the optimum. For instance, consider a loss function
obtained by smoothing the well known hinge loss ℓ(z, y) = max{0, 1−yz} in a small neighborhood
of its non-differentiable points. Function C(w) is then piecewise linear with smoothed edges and
vertices. It is not strictly convex. However its minimum is likely to be on a smoothed vertex with a
non singular Hessian. When we have strict convexity, the argument of [12, theorem 12] yields fast
estimation rates α ≈ 1 in (4) and (6). This is not necessarily the case here.
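
The constants ν and κ of assertion (10) can be computed explicitly on a toy problem. The sketch below does so for a two-parameter linear model with the per-example loss 0.5 (w·x − y)², for which H = En[x x′] and G = En[r² x x′] with r the residual at the empirical optimum wn. The choice of loss, the 2×2 restriction, the data, and all function names are assumptions made to keep the example dependency-free.

```python
import math
from typing import List, Sequence, Tuple

Vec = Tuple[float, float]
Mat = List[List[float]]


def outer(u: Vec, v: Vec) -> Mat:
    """Outer product u v' of two 2-vectors."""
    return [[u[i] * v[j] for j in range(2)] for i in range(2)]


def mean_matrix(mats: Sequence[Mat]) -> Mat:
    """Entrywise average of 2x2 matrices (the empirical expectation E_n)."""
    n = len(mats)
    return [[sum(m[i][j] for m in mats) / n for j in range(2)] for i in range(2)]


def inverse(m: Mat) -> Mat:
    """Inverse of a non-singular 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


def curvature_summary(xs: Sequence[Vec], ys: Sequence[float]):
    """Return (nu, kappa) with nu = tr(G H^-1) and kappa = lambda_max / lambda_min."""
    n = len(xs)
    H = mean_matrix([outer(x, x) for x in xs])             # equation (8) for this loss
    H_inv = inverse(H)
    b = [sum(y * x[i] for x, y in zip(xs, ys)) / n for i in range(2)]
    w_n = [H_inv[i][0] * b[0] + H_inv[i][1] * b[1] for i in range(2)]  # empirical optimum
    grads = []
    for x, y in zip(xs, ys):
        r = w_n[0] * x[0] + w_n[1] * x[1] - y              # residual at w_n
        grads.append((r * x[0], r * x[1]))
    G = mean_matrix([outer(g, g) for g in grads])           # equation (9)
    nu = sum(G[i][k] * H_inv[k][i] for i in range(2) for k in range(2))
    trace = H[0][0] + H[1][1]
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    gap = math.sqrt(trace * trace - 4.0 * det)              # eigenvalues are (trace +/- gap) / 2
    return nu, (trace + gap) / (trace - gap)


# Hypothetical data: inputs (1, t) so that w holds a bias and a slope.
xs = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 3.0)]
ys = [0.1, 0.9, 2.1, 2.9]
nu, kappa = curvature_summary(xs, ys)
print(nu, kappa)
```

On this data the residuals are small, so ν is small, while the poorly scaled inputs give a condition number κ well above 1.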
The four algorithms considered in this paper use information about the gradient of the cost function
to iteratively update their current estimate w(t) of the parameter vector.

• Gradient Descent (GD) iterates
\[ w(t+1) = w(t) - \eta \frac{\partial C}{\partial w}(w(t)) = w(t) - \eta\, \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial w} \ell\big(f_{w(t)}(x_i), y_i\big) \]

where η > 0 is a small enough gain. GD is an algorithm with linear convergence [16]:
when η = 1/λmax , this algorithm requires O(κ log(1/ρ)) iterations to reach accuracy ρ.
The exact number of iterations depends on the choice of the initial parameter vector.
• Second Order Gradient Descent (2GD) iterates
\[ w(t+1) = w(t) - H^{-1} \frac{\partial C}{\partial w}(w(t)) = w(t) - \frac{1}{n} H^{-1} \sum_{i=1}^{n} \frac{\partial}{\partial w} \ell\big(f_{w(t)}(x_i), y_i\big) \]

where matrix H −1 is the inverse of the Hessian matrix (8). This is more favorable than
Newton’s algorithm because we do not evaluate the local Hessian at each iteration but
simply assume that we know in advance the Hessian at the optimum. 2GD is a superlinear
optimization algorithm with quadratic convergence [16]. When the cost is quadratic, a
single iteration is sufficient. In the general case, O(log log(1/ρ)) iterations are required to
reach accuracy ρ.
• Stochastic Gradient Descent (SGD) picks a random training example (xt , yt ) at each
iteration and updates the parameter w on the basis of this example only,
\[ w(t+1) = w(t) - \frac{\eta}{t} \frac{\partial}{\partial w} \ell\big(f_{w(t)}(x_t), y_t\big). \]
Murata [17, section 2.2] characterizes the mean ES [w(t)] and variance VarS [w(t)] with
respect to the distribution implied by the random examples drawn from a given training
set S at each iteration. Applying this result to the discrete training set distribution for
η = 1/λmin , we have δw(t)² = O(1/t) where δw(t) is a shorthand notation for w(t) − wn .
We can then write
\[
\begin{aligned}
\mathbb{E}_S[\, C(w(t)) - \inf C \,] &= \mathbb{E}_S\big[ \mathrm{tr}\big( H\, \delta w(t)\, \delta w(t)' \big) \big] + o\big(\tfrac{1}{t}\big) \\
&= \mathrm{tr}\big( H\, \mathbb{E}_S[\delta w(t)]\, \mathbb{E}_S[\delta w(t)]' + H\, \mathrm{Var}_S[w(t)] \big) + o\big(\tfrac{1}{t}\big) \\
&\le \frac{\mathrm{tr}(GH)}{t} + o\big(\tfrac{1}{t}\big) \le \frac{\nu \kappa^2}{t} + o\big(\tfrac{1}{t}\big).
\end{aligned}
\tag{11}
\]
Therefore the SGD algorithm reaches accuracy ρ after less than νκ²/ρ + o(1/ρ) iterations
on average. The SGD convergence is essentially limited by the stochastic noise induced
by the random choice of one example at each iteration. Neither the initial value of the
parameter vector w nor the total number of examples n appear in the dominant term of this
bound! When the training set is large, one could reach the desired accuracy ρ measured on
the whole training set without even visiting all the training examples. This is in fact a kind
of generalization bound.
Table 2: Asymptotic results for gradient algorithms (with probability 1). Compare the second-to-last
column (time to optimize) with the last column (time to reach the excess test error ε).
Legend: n number of examples; d parameter dimension; κ, ν see equation (10).

Algorithm   Cost of one   Iterations          Time to reach                Time to reach
            iteration     to reach ρ          accuracy ρ                   E ≤ c (Eapp + ε)
GD          O(nd)         O(κ log(1/ρ))       O(ndκ log(1/ρ))              O((d²κ/ε^(1/α)) log²(1/ε))
2GD         O(d² + nd)    O(log log(1/ρ))     O((d² + nd) log log(1/ρ))    O((d²/ε^(1/α)) log(1/ε) log log(1/ε))
SGD         O(d)          νκ²/ρ + o(1/ρ)      O(dνκ²/ρ)                    O(dνκ²/ε)
2SGD        O(d²)         ν/ρ + o(1/ρ)        O(d²ν/ρ)                     O(d²ν/ε)

• Second Order Stochastic Gradient Descent (2SGD) replaces the gain η by the inverse of
the Hessian matrix H:
\[ w(t+1) = w(t) - \frac{1}{t} H^{-1} \frac{\partial}{\partial w} \ell\big(f_{w(t)}(x_t), y_t\big). \]
Unlike standard gradient algorithms, using the second order information does not change
the influence of ρ on the convergence rate but improves the constants. Using again [17,
theorem 4], accuracy ρ is reached after ν/ρ + o(1/ρ) iterations.
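
The four update rules above can be sketched in a few lines of Python. The example assumes a two-parameter linear model with the per-example loss 0.5 (w·x − y)², so the cost is quadratic and the Hessian H = En[x x′] is the same everywhere. The data, the starting point w = 0, and all function names are illustrative assumptions rather than the setup analyzed in the paper.

```python
import random
from typing import Tuple

Example = Tuple[Tuple[float, float], float]  # ((x1, x2), y)


def grad_example(w, x, y):
    """Gradient in w of the per-example loss 0.5 * (w.x - y)^2."""
    r = w[0] * x[0] + w[1] * x[1] - y
    return [r * x[0], r * x[1]]


def grad_full(w, data):
    """Gradient of the empirical cost C(w): average of the per-example gradients."""
    n = len(data)
    g = [0.0, 0.0]
    for x, y in data:
        gx = grad_example(w, x, y)
        g = [g[0] + gx[0] / n, g[1] + gx[1] / n]
    return g


def hessian_inverse(data):
    """Inverse of H = E_n[x x'], constant for this quadratic cost."""
    n = len(data)
    a = sum(x[0] * x[0] for x, _ in data) / n
    b = sum(x[0] * x[1] for x, _ in data) / n
    c = sum(x[1] * x[1] for x, _ in data) / n
    det = a * c - b * b
    return [[c / det, -b / det], [-b / det, a / det]]


def apply(m, v):
    """Matrix-vector product for a 2x2 matrix."""
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


def gd(data, eta, steps):
    """GD: full gradient, fixed gain eta."""
    w = [0.0, 0.0]
    for _ in range(steps):
        g = grad_full(w, data)
        w = [w[0] - eta * g[0], w[1] - eta * g[1]]
    return w


def second_order_gd(data, steps):
    """2GD: full gradient rescaled by the inverse Hessian."""
    h_inv, w = hessian_inverse(data), [0.0, 0.0]
    for _ in range(steps):
        g = apply(h_inv, grad_full(w, data))
        w = [w[0] - g[0], w[1] - g[1]]
    return w


def sgd(data, eta, steps, rng):
    """SGD: one random example per iteration, gain eta / t."""
    w = [0.0, 0.0]
    for t in range(1, steps + 1):
        x, y = rng.choice(data)
        g = grad_example(w, x, y)
        w = [w[0] - eta / t * g[0], w[1] - eta / t * g[1]]
    return w


def second_order_sgd(data, steps, rng):
    """2SGD: one random example per iteration, gain H^-1 / t."""
    h_inv, w = hessian_inverse(data), [0.0, 0.0]
    for t in range(1, steps + 1):
        x, y = rng.choice(data)
        g = apply(h_inv, grad_example(w, x, y))
        w = [w[0] - g[0] / t, w[1] - g[1] / t]
    return w


# Hypothetical data; the empirical optimum is w_n = (0.06, 0.96).
data = [((1.0, 0.0), 0.1), ((1.0, 1.0), 0.9), ((1.0, 2.0), 2.1), ((1.0, 3.0), 2.9)]
print(second_order_gd(data, 1), gd(data, 0.2, 500))
print(second_order_sgd(data, 5000, random.Random(0)))
```

Because the cost is quadratic, a single 2GD iteration lands on the empirical optimum, matching the remark in the 2GD item. Each GD or 2GD iteration touches all n examples, whereas each SGD or 2SGD iteration touches only one, which is the cost difference recorded in the first column of table 2.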

For each of the four gradient algorithms, the first three columns of table 2 report the time for a single
iteration, the number of iterations needed to reach a predefined accuracy ρ, and their product, the
time needed to reach accuracy ρ. These asymptotic results are valid with probability 1, since the
probability of their complement is smaller than η for any η > 0.
The fourth column bounds the time necessary to reduce the excess error E below c (Eapp + ε) where c
is the constant from (6). This is computed by observing that choosing ρ ∼ ((d/n) log(n/d))^α in (6) achieves
the fastest rate for ε, with minimal computation time. We can then use the asymptotic equivalences
ρ ∼ ε and n ∼ (d/ε^(1/α)) log(1/ε). Setting the fourth column expressions to Tmax and solving for ε yields
the best excess error achieved by each algorithm within the limited time Tmax . This provides the
asymptotic solution of the Estimation–Optimization tradeoff (3) for large scale problems satisfying
our assumptions.
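
The last step, setting the fourth column of table 2 to Tmax and solving for ε, can be sketched numerically. The code below treats the O(·) expressions as exact costs with all hidden constants set to 1, which is an assumption: the resulting numbers only illustrate scaling, not actual running times. The bisection bounds and function names are likewise illustrative.

```python
import math


def time_to_excess_error(algorithm, eps, d, kappa, nu, alpha):
    """Fourth column of table 2 with hidden constants set to 1 (an assumption)."""
    log_inv = math.log(1.0 / eps)
    if algorithm == "GD":
        return d ** 2 * kappa / eps ** (1.0 / alpha) * log_inv ** 2
    if algorithm == "2GD":
        return d ** 2 / eps ** (1.0 / alpha) * log_inv * math.log(log_inv)
    if algorithm == "SGD":
        return d * nu * kappa ** 2 / eps
    if algorithm == "2SGD":
        return d ** 2 * nu / eps
    raise ValueError(f"unknown algorithm: {algorithm}")


def best_excess_error(algorithm, t_max, d, kappa, nu, alpha, lo=1e-12, hi=0.1):
    """Smallest eps in [lo, hi] whose modeled time fits in t_max, or None if none does.

    The modeled time decreases as eps grows on (0, 0.1], so geometric bisection applies.
    """
    def cost(eps):
        return time_to_excess_error(algorithm, eps, d, kappa, nu, alpha)

    if cost(hi) > t_max:
        return None          # budget too small even for the loosest target
    if cost(lo) <= t_max:
        return lo
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        if cost(mid) <= t_max:
            hi = mid
        else:
            lo = mid
    return hi


# Hypothetical constants; alpha = 0.5 corresponds to a slow estimation rate.
for name in ("GD", "2GD", "SGD", "2SGD"):
    print(name, best_excess_error(name, t_max=1e6, d=10, kappa=10.0, nu=1.0, alpha=0.5))
```

For SGD and 2SGD the equation can be solved by hand, giving ε = dνκ²/Tmax and ε = d²ν/Tmax under the same unit-constant assumption; the bisection is only needed for the GD and 2GD expressions with their logarithmic factors.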
These results clearly show that the generalization performance of large-scale learning systems de-
pends on both the statistical properties of the objective function and the computational properties of
the chosen optimization algorithm. Their combination leads to surprising consequences:

• The SGD and 2SGD results do not depend on the estimation rate α. When the estimation
rate is poor, there is less need to optimize accurately. That leaves time to process more
examples. A potentially more useful interpretation leverages the fact that (11) is already a
kind of generalization bound: its fast rate trumps the slower rate assumed for the estimation
error.
• Second order algorithms bring little asymptotic improvement in ε. Although the superlinear
2GD algorithm improves the logarithmic term, all four algorithms are dominated by
the polynomial term in (1/ε). However, there are important variations in the influence of
the constants d, κ and ν. These constants are very important in practice.
• Stochastic algorithms (SGD, 2SGD) yield the best generalization performance despite be-
ing the worst optimization algorithms. This had been described before [18] and observed
in experiments.

In contrast, since the optimization error Eopt of small-scale learning systems can be reduced to
insignificant levels, their generalization performance is solely determined by the statistical properties
of the objective function.
4 Conclusion
Taking into account budget constraints on both the number of examples and the computation time,
we find qualitative differences between the generalization performance of small-scale learning sys-
tems and large-scale learning systems. The generalization properties of large-scale learning systems
depend on both the statistical properties of the objective function and the computational proper-
ties of the optimization algorithm. We illustrate this fact with some asymptotic results on gradient
algorithms.
Considerable refinements of this framework can be expected. Extending the analysis to regular-
ized risk formulations would make results on the complexity of primal and dual optimization algo-
rithms [19, 20] directly exploitable. The choice of surrogate loss function [7, 12] could also have a
non-trivial impact in the large-scale case.

Acknowledgments Part of this work was funded by NSF grant CCR-0325463.

References
[1] Leslie G. Valiant. A theory of the learnable. Proc. of the 1984 STOC, pages 436–445, 1984.
[2] Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Series in Statistics.
Springer-Verlag, Berlin, 1982.
[3] Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classification: a survey of recent
advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
[4] Peter L. Bartlett and Shahar Mendelson. Empirical minimization. Probability Theory and Related Fields,
135(3):311–334, 2006.
[5] J. Stephen Judd. On the complexity of loading shallow neural networks. Journal of Complexity, 4(3):177–
192, 1988.
[6] Richard O. Duda and Peter E. Hart. Pattern Classification And Scene Analysis. Wiley and Sons, 1973.
[7] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk mini-
mization. The Annals of Statistics, 32:56–85, 2004.
[8] Clint Scovel and Ingo Steinwart. Fast rates for support vector machines. In Peter Auer and Ron Meir,
editors, Proceedings of the 18th Conference on Learning Theory (COLT 2005), volume 3559 of Lecture
Notes in Computer Science, pages 279–294, Bertinoro, Italy, June 2005. Springer-Verlag.
[9] Vladimir N. Vapnik, Esther Levin, and Yann LeCun. Measuring the VC-dimension of a learning machine.
Neural Computation, 6(5):851–876, 1994.
[10] Olivier Bousquet. Concentration Inequalities and Empirical Processes Theory Applied to the Analysis of
Learning Algorithms. PhD thesis, Ecole Polytechnique, 2002.
[11] Alexandre B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics,
32(1), 2004.
[12] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification and risk bounds.
Journal of the American Statistical Association, 101(473):138–156, March 2006.
[13] Pascal Massart. Some applications of concentration inequalities to statistics. Annales de la Faculté des
Sciences de Toulouse, (2):245–303, 2000.
[14] Wee S. Lee, Peter L. Bartlett, and Robert C. Williamson. The importance of convexity in learning with
squared loss. IEEE Transactions on Information Theory, 44(5):1974–1980, 1998.
[15] Shahar Mendelson. A few notes on statistical learning theory. In Shahar Mendelson and Alexander J.
Smola, editors, Advanced Lectures in Machine Learning, volume 2600 of Lecture Notes in Computer
Science, pages 1–40. Springer-Verlag, Berlin, 2003.
[16] John E. Dennis, Jr. and Robert B. Schnabel. Numerical Methods For Unconstrained Optimization and
Nonlinear Equations. Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1983.
[17] Noboru Murata. A statistical study of on-line learning. In David Saad, editor, Online Learning and Neural
Networks. Cambridge University Press, Cambridge, UK, 1998.
[18] Léon Bottou and Yann LeCun. Large scale online learning. In Sebastian Thrun, Lawrence Saul, and Bern-
hard Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge,
MA, 2004.
[19] Thorsten Joachims. Training linear SVMs in linear time. In Proceedings of KDD’06, Philadelphia, PA,
USA, August 20-23 2006. ACM.
[20] Don Hush, Patrick Kelly, Clint Scovel, and Ingo Steinwart. QP algorithms with guaranteed accuracy and
run time for support vector machines. Journal of Machine Learning Research, 7:733–769, 2006.
