Efficient and Numerically Stable Sparse Learning
Sihong Xie1 , Wei Fan2 , Olivier Verscheure2 , and Jiangtao Ren1
1
2
Sun Yat-Sen University, Guangzhou, China
{xiesihong1, issrjt}@gmail.com
IBM T.J. Watson Research Center, New York, USA
{weifan, ov1}@us.ibm.com
Abstract. We consider the problem of numerical stability and model
density growth when training a sparse linear model from massive data.
We focus on scalable algorithms that optimize certain loss function using
gradient descent, with either ℓ0 or ℓ1 regularization. We observed numerical stability problems in several existing methods, leading to divergence
and low accuracy. In addition, these methods typically have weak controls over sparsity, such that model density grows faster than necessary.
We propose a framework to address the above problems. First, the update rule is numerically stable with convergence guarantee and results in
more reasonable models. Second, besides ℓ1 regularization, it exploits the
sparsity of data distribution and achieves a higher degree of sparsity with
a PAC generalization error bound. Lastly, it is parallelizable and suitable
for training large margin classifiers on huge datasets. Experiments show
that the proposed method converges consistently and outperforms other
baselines using 10% of features by as much as 6% reduction in error rate
on average. Datasets and software are available from the authors.
1
Introduction
In this paper, we focus on training a sparse large margin model. Assume that we
are given m labeled examples Z = {(x1 , y1 ), . . . , (xm , ym )}, where xi ∈ Rd , i =
1, . . . , m. We aim at solving the following optimization problem:
m
min
w∈Rd
1 X
L(hw, xi i, yi ) + λkwk1
m i=1
(1)
L is any smooth and differentiable loss function such as logistic or hinge loss. λ
is the parameter for trading off between loss and ℓ1 regularization. We are interested in scalable algorithm, with parallelizable data accesses and small communication cost. One can find such application in text mining and webspam
detection, where the number of features could be in millions.
Sparse learning aims at accurate models using a small number of non-zero
elements, with the advantages of efficiency and generalizability [13]. Some existing methods produce sparse models by forward-backward feature selection [13]
or boosting [3]. Though effective, these methods have to scan the training set
Table 1: Notations and definitions
Notation definition
x
y
w
wZ
w̃
S(·, λ)
λ
η
J
kvk1
kvk∞
Notation Definition
Example in Rd
X
Label of x
Z
Linear model in Rd
θ
Linear model learned from Z
w∗
Vector to be thresholded
H(·, s)
Soft-thresholding function
∇f (·)
ℓ1 regularization parameter
t
Learning rate
[d]
Set of indices, J ⊆ [d]
kvk0
The sum of |vi |, i ∈ [d]
kvk2
The maximum of |vi |
σi
Data matrix
A series of m labeled examples
Dual vector of w in Eq. (8)
Optimal linear model
Hard-thresholding function
Gradient of f (·)
Index of iterations
Set of indices {1, . . . , d}
The support of v
The Euclidean length of v
Eigenvalues
at each iteration, which is expensive or even impossible for large datasets. Under some restrictive assumptions [14], one can achieve sparse models via convex
programming with ℓ1 regularization or constraint. Though scalable as methods
in [8, 12], there are two main drawbacks. First, numerical problems can lead to
iteration divergence or summation cancellation, making the algorithm less useful. Second, ℓ1 regularization only encourages sparsity and may not be enough
for sparse learning. As the training proceeds, model complexity can grow faster
than necessary and result in dense models, regardless of the regularization.
In this paper, we propose a perceptron-based algorithm to address the above
problems. First, it is numerically stable with convergence guarantee. Second, in
addition to ℓ1 regularization, it further takes advantage of the sparsity of training examples to achieve a higher degree of model sparsity, and furthermore, a
generalization error bound which is not provided in [8, 12]. Experiments show
that the proposed method converges with steadily reduced test error rates as
the training proceeds. and consistently outperforms previous state-of-the-art algorithms by as much as 8.4% in accuracy using only 10% of all features in one
of the tasks.
2
Numerical Challenges in Sparse Learning
Notations are summarized in Table 1. Note that an element of a vector is in
normal font, for example, wj is the j-th element of the vector w. About norms,
for any v ∈ Rd , kvkp1 ≤ kvkp2 for any p1 ≥ p2 [6]. If not otherwise specified,
k · k is equivalent to k · k2
We have observed numerical problems in two sparse learning algorithms using
either ℓ1 or ℓ0 regularization.
2.1
Numerical Problems of Direct Iterative Methods
The first problem arises when directly applying matrix iterative methods to solve
a system of linear equations
y = Xw
(2)
where X = (x1 , . . . , xm )⊤ . ℓ1 or ℓ0 regularization are used to obtain sparse
model, as in compressive sensing or sparse learning. Suppose there exists an
optimal solution w∗ satisfying Eq. (2), namely y − Xw∗ = 0 (0 denotes the zero
vector). The direct method tries to find w∗ by minimizing the potential function
Ψ (w) =
1
ky − Xwk2
2
(3)
Starting with an arbitrary vector w(0) (usually 0), it follows the gradient direction at each iteration to reduce the potential function value. The gradient of Eq.
(3) is ∇Ψ (w) = −X ⊤ (y − Xw) and we update w(t) by
w(t) = w(t−1) + ηX ⊤ (y − Xw(t−1) )
(4)
where η is the learning rate. To see when the above iteration fails to converge,
let ε(t) = w(t) − w∗ be the error vector.
ε(t) = w(t−1) + ηX ⊤ (y − Xw(t−1) ) − w∗ − ηX ⊤ (y − Xw∗ )
= (I − ηX ⊤ X)(w(t−1) − w∗ ) = M ε(t−1)
where M = I − ηX ⊤ X is the iteration matrix. Thus for t > 0 we have
kε(t) k = kw(t) − w∗ k = kM ε(t−1) k = · · · = kM t ε(0) k
The sufficient and necessary condition for the update rule (4) to converge is
that the spectral radius of M , ρ(M ) = maxi {|σi |}less than 1, where σi is the
eigenvalues of M . Formally,
lim kM t k = 0 ⇔ ρ(M ) < 1
t→∞
(5)
The above analysis shows that if one cannot guarantee ρ(M ) < 1 at each iteration, then update rule (4) may not be applicable in reducing the potential
function.
In addition, we would like w(t) to be sparse. It is shown in [5] that under the
RIP condition [1], a truncated version of update rule (4) reduces the potential
function at each iteration while maintaining a sparse solution. We show that the
truncation does not automatically guarantee the convergence of ℓ0 constrained
gradient descent. Consider the following iteration
w(t) = H(w̃(t) , s) = DJ (t) (w(t−1) + ηX ⊤ (y − Xw(t−1) ))
(6)
with w(0) being a zero vector. w̃(t) is the t-th solution before thresholding.
H(v, s) : Rd → Rd keeps only s elements with largest absolute values in v
and set other elements to zero. H(v, s) equals the row selection operation as
shown in Eq. (6). J (t) ⊆ [d] ( J (t) = s, ∀t = 1, 2, . . . ) represents the set of
indices of s remaining elements in H(w̃, s). DJ (t) is a diagonal matrix with
Dj,j = 1, ∀j ∈ J (t) and 0 otherwise. The error vector of iteration rule (6) can
then be written as
kε(t) k2 = kw(t) − w∗ k2 = kDJ (t) w̃(t) − w∗ k2
≥ kDJ (t) w̃(t) − DJ (t) w∗ k2
= kDJ (t) [w(t−1) + ηX ⊤ (y − Xw(t−1) )]
(7)
−DJ (t) [w∗ + ηX ⊤ (y − Xw∗ )]k2
= kDJ (t) M (w(t−1) − w∗ )k2
= kDJ (t) M ε(t−1) k2
Inequality (7) follows since wj∗ , j 6∈ J (t) are held out of the sum of the ℓ2 -norm.
Therefore, kε(t) k2 is lower-bounded by kDJ (t) M ε(t−1) k2 . For the ℓ0 projected
gradient descent to converge, DJ (t) M must satisfy ρ(DJ (t) M ) < 1, otherwise,
update rule (6) diverges. Simple truncation using H(v, s) doesn’t guarantee such
condition, as we demonstrated in experiments.
2.2
Numerical Problems of Mirror Descent
The mirror descent algorithm (MDA) has been recognized as an effective online
learning algorithm. For example, MDA using Bregman divergence is proposed to
approximate the exponentiated gradient (EG) algorithm, which have cumulative
loss bound logarithmically to the number of irrelevant features in the target
weight vector [6]. In convergence rate, it is proved that MDA is superior to the
usual stochastic gradient descent methods [12], where MDA is adopted in online
sparse learning. However, the price MDA pays for these advantages is numerical
unstability, especially when the dimensionality of data is high, leading to less
discriminative models. As proposed in [12], the main component of MDA for
sparse learning is the alternative updates of two vectors θ, w ∈ Rn by the
following rules:
(t)
(t)
sgn(θj )|θj |p−1
(t)
wj = fj−1 (θ(t) ) =
, ∀j
(8)
kθ(t) kp−2
p
θ(t+1) = S(θ(t) − η∇L(w(t) ), λ)
(9)
where S(·, λ) is the soft-thresholding operator [2]:
S(wj , λ) = sgn(wj )(wj − λ)+
wj − λ, if wj > 0 and |wj | > λ
= wj + λ, if wj < 0 and |wj | > λ
0, if |wj | < λ
(10)
S(v, λ) means applying Eq. (10) at each element of v. As suggested in [6], the
parameter p in Eq. (8) is set to O(ln(d)) (Note that different p lead to different
“algorithms”, for example, in binary classification, one obtain Perceptron when
p = 2 and Weighted Majority as p → ∞. However, for MDA to approximate
the EG algorithm, p should be sufficiently large such as logarithmic p). w and
θ are called primal and dual vector, respectively. In the t-th iteration, MDA
first computes w(t) using w(t) = f −1 (θ (t) ), with θ(0) = 0. Then the stochastic
gradient of the loss function ∇L(w(t) ) is estimated using (xi , yi ) at w(t) . Finally
gradient descent and soft-thresholding update θ (t) to get θ(t+1) . Converting θ to
w using Eq. (8) can cause numerical problem in MDA. Specifically, elements of
w could become very small and sensitive to difference of θj ’s scale. The following
lemma reveals the numerical problem that update rules (8) and (9) can bring.
Lemma 1. Assume that the values of features and λ in Eq. (9) are in the scale
of O(1). At the t-th iteration of MDA minimizing logistic loss, where t = O(p),
(t)
(t)
θj is also in O(p) and wj is at most in the order of O(a−p ) for some a > 1.
Proof. Elements of the gradient of the logistic loss ∇j L(·, ·) with respect to
(t)
the first argument is L′ (hw, xi, y)x and thus in the order of O(1). θj is the
summation of t terms in O(1) and thus in the order of O(p). Since kθkp > kθk∞ ,
∃ǫ > 0 s.t. kθkp > (1 + ǫ)kθk∞ = (1 + ǫ) maxi |θi |. By Eq. (8),
|θj |p−1
|θj |p−1
<
(1 + ǫ)p−2 (maxi |θi |)p−2
kθkp−2
p
p−2
|θj |
= |θj |
(1 + ǫ) maxi |θi |
|wj | =
= O(p)O(a2−p ) = O(a−p )
where a = (1 + ǫ) maxi |θi |/|θj | > 1.
Particularly, because wj is exponential in p = O(ln(d)), small difference between
exponents of dimensions could be greatly amplified in the resulting model. Consider two entries of θ: θ1 and θ2 . Without loss of generality, assume that θ1 is
only one order smaller than θ2 in magnitude: |θ1 | ≈ 10−1 |θ2 |. Reconstruct the
primal vector w from θ, we can see that w1 is p order smaller than w2 in magnitude : |w1 | ≈ 10−p |w2 |. When computing the inner product between w and x, if
the difference between exponents of w1 and w2 is larger than the precision that
the data type supports, then the 2nd element in x would be totally lost. For
example, double precision floating point number supported by C++ (64 bits in
length according to IEEE 754-2008) typically has precision of 10−16 . Therefore
1.0 + 1.0−17 = 1.0 in machine addition. In general, the larger p is, the more
difference between wj ’s exponents. We’ll show in experiment (Section 4.3) how
the parameter p affects the performance of MDA.
3
Efficient and Numerically Stable Sparse Learning
We present the proposed sparse learning algorithm in Section 3.1. In Section 3.2
we show that the method is guaranteed to converge, with numerical errors taken
into account. Finally we show generalization error bound in Section 3.3.
3.1
The Proposed Method
Given the labeled data Z = {(x1 , y1 ), . . . , (xm , ym )}, drawn from some distribution D over Rd × {−1, 1}, we consider learning a sparse large margin linear
classifier. This reduces to minimizing certain loss function. There are three concerns. First, the algorithm should be scalable. One of the best practical methods
known is online learning or stochastic gradient descent, which iteratively minimizes the loss function and keeps only the footprint of the sparse model without
storing any example in memory. Second, the algorithm should be numerically
stable. We prefer algorithms which rely on practical assumption and are robust
to small difference in the scale between dimensions of data. Finally, for efficiency consideration, algorithms achieve better classification performance using
less features are desirable.
Algorithm 1 Numerically Stable Sparse Learning
1: Input: margin threshold τ , learning rate η,
regularization parameter λ, density upper bound s
2: initialize linear model w = 0, model density=0
3: while model density less than s do
4:
Receive instance (xi , yi ) and compute z = yi wxi
5:
if z ≤ τ then
6:
for Each non-zero element xij do
7:
Update wj by wj = wj + ηxij yi
8:
Soft threshold wj using Eq. (10)
9:
end for
10:
end if
11: end while
We alter the perceptron algorithm such that it robustly produces sparse
model with convergence and generalization error guarantees (see Section 3.2
and 3.3). The proposed method is described in Algorithm 1. It begins with a
zero vector. During each iteration, it loads an example from secondary storage
and updates the model as the original perceptron (line 4-7). The difference is
that after each online update, the algorithm suppresses the weights in the model
(line 8) using the soft-thresholding operator (Eq. (10)). Since we assume that
the data is separable by a sparse linear classifier, by the fact that the data
is in high dimensional space, it is expected that there would be many noisy
or irrelevant features. For an example classified with margin higher than τ ,
it is more likely that the features in the current model w(t) are sufficient to
make correct decision, and the example is not used for update. This prevents
unnecessarily introducing new features into the current model and maximally
preserves the sparsity without increasing regret. In general, perceptron algorithm
exploits the sparsity of the instance space, and further leads to a model with
fewer features.
The above framework can be applied to very large datasets. The main task
at each iteration is computing hw(t) , xi, which is in the form of summation
and can be distributed to r machines. One can partition the dataset on a perfeature basis into r partitions, with each assigned to a machine. Each machine
computes O(⌈d/r⌉) components in hw(t) , xi. Adding these parts up obtains the
inner product. There are O(⌈d/r⌉) cost on each machine for I/O and O(r) cost
for communication. Thus the algorithm is scalable via parallelization on columns
of the data matrix X.
3.2
Convergence of the Proposed Method
Without loss of generality, we assume that the learning rate equals 1 in the
following analysis. We alter the convergence theorem in [10] and prove similar
results of perceptron with soft-thresholding, and analyze conditions under which
the soft-thresholding version perceptron converges. Assume that the t-th mistake
happens on example (x, y). Let J0 = {j|xj = 0} and J1 = {j|xj 6= 0}. Then
(t−1)
(t)
for j ∈ J0 and we can only pay attention to the difference between
wj = wj
(t)
wj
(t−1)
and wj
for j ∈ J1 . By Algorithm 1 and Eq. (10),
w(t) = S(w̃(t) , λ) = S(w(t−1) + yx, λ)
Project w(t−1) , w(t) and x onto J0 and J1 respectively, and let the projection
of a vector v denoted by the vector with the index set as its subscript (for
example, xJ0 means keep the elements on indices in J0 unchanged and set the
other elements to zero). By simple algebra, w(t) can be decomposed into the sum
(t)
(t−1)
of two orthogonal vectors w(t) = wJ0 + wJ1 . We have the following lemma.
(t)
Lemma 2. Assume that ∀(x, y) ∈ Z, kxk < R. Let J+ denote the set {j :
P
(t)
(t)
(t)
(t)
(t)
|w̃j | ≥ λ}∩J1 and J− be the set {j : |w̃j | < λ}∩J1 . Let ǫλ = λ j∈J (t) |w̃j |+
+
P
(t)
(t) 2
(t) 2
2
(t) |w̃
j | . If τ ≤ ǫλ /2, kw k ≤ tR .
j∈J
−
Proof. Prove by induction,
(t−1)
kw(t) k2 = kwJ0
(t)
(t−1) 2
≤ kwJ0
=
≤
(t−1) 2
(t)
+ wJ1 k2 = kwJ0
(t)
k + kw̃J1 k2 − ǫλ
(t−1)
kwJ0 k2
(t−1)
kwJ0 k2
+
−
≤ kw(t−1) k2 +
(t)
k + kwJ1 k2
(t)
(t−1)
kwJ1 + yxk2 − ǫλ
(t)
(t−1)
ǫλ + kwJ1 k2 + kyxk2
(t)
R2 − ǫλ + 2τ
(11)
(12)
+ 2τ
(13)
(14)
Inequality (11) follows from the fact that at each iteration t, w̃(t) suffered from
(t)
at least ǫλ of shrinkage. Inequality (13) holds because update occurs only when
(t)
hw, yxi ≤ τ . Since τ ≤ ǫλ /2, then kw(t) k2 ≤ kw(t−1) k2 + R2 .
(t)
Discussion We assume the condition τ ≤ ǫλ /2 for the convenience of proof. If
τ = 0, then the condition is satisfied for ∀λ > 0. Particularly, if λ is set too small,
then the algorithm asymptotically becomes Perceptron; By using a large λ, one
can achieve a sparser model with sacrifice of accuracy. This is a more general
problem of model selection and one need to trade off sparsity for accuracy[11]. In
the experiment, we choose the parameters λ and τ via validation, then they’re
fixed during iterations.
We proved the above lemma without considering numerical errors. Let the
machine precision be eps = 2−r . Soft-thresholding introduces errors if λ and
the truncated elements are rounded-off. This gives an error up to |ε1 | ≤ 2eps
which is minor and can be ignored. Note that kw(t) k usually grows with the
iterations, thus more importantly, we must consider the numerical errors associ(t−1)
ated with this term. With |ε2 | ≤ eps, the exact term kwJ1 + yxk2 in Eq. (12)
(t−1)
becomes k(1 + ε2 )(wJ1
+ yx)k2 and can be approximately upper-bounded by
(t−1) 2
(kwJ1
k + kyxk2 + 2τ )(1 + 2ε2 ) by ignoring the higher order term ε22 (an exact
(t−1)
upper-bound should be (kwJ1 k2 + kyxk2 + 2τ )(1 + 2ε2 + ε22 ) ). Substitute this
error into Eq. (12), we can prove the following theorem, which is the key to the
convergence proof.
(t)
Theorem 1. Given λ ≥ 0, let ǫ∗λ = mint ǫλ , then kw(t) k2 < tR2 holds, ∀
t ∈ Z + such that
2τ
ǫ∗ − 2τ
− 2
(15)
t≤ λ
2
2ε2 R
R
Proof. The inequality (14) becomes
(t−1) 2
kw(t−1) k2 +R2 − ǫ∗λ + 2τ + 2ε2 (kwJ1
|
{z
k + kyxk2 + 2τ )
}
As in the proof of Lemma 2, we need to prove that the under-braced sum is
(t−1)
smaller than R2 . Since by induction, kwJ1 k2 ≤ kw(t−1) k2 ≤ (t − 1)R2 and
2
2
kyxk ≤ R by assumption, we need only to prove that
−ǫ∗λ + 2τ + 2ε2 (tR2 + 2τ ) ≤ 0
Solving for t gets the conclusion.
In C++, ε2 ≈ 10−16 in double-precision. If we set λ and τ such that ǫ∗λ − 2τ is
not too small (for example, in the order of o(R2 )), then kw(t) k2 < tR2 holds for
a sufficient large t in a reasonable training process.
Theorem 2. Assume for all examples xi ∈ Z, kxi k < R. If there exists a
linear model u such that kuk = 1 and 0 < ǫλ,γ < 1 such that h(yi xi − v), ui ≥
(1 − ǫλ,γ )γ for ∀(x, y) ∈ Z and any v, kvk∞ < λ, then the number of mistakes
made by the online perceptron algorithm on Z is at most k = (R/(1 − ǫλ,γ )γ)2 .
Proof. For any t satisfying Eq. (15),
w(t) · u = w(t−1) · u + (∆(t) + yi xi ) · u
≥w
(t−1)
(16)
· u + (1 − ǫλ,γ )γ
where ∆(t) = S(w̃(t) , λ) − w̃(t) . Thus w(t) · u ≥ (1 − ǫλ,γ )tγ. By Lemma (2),
kw(t) k2 ≤ tR2 .
√
(1 − ǫλ,γ )tγ ≤ w(t) · u ≤ kw(t) k ≤ tR
This theorem indicates that if data in Z can be separated with margin at least
γ, and this margin does not shrink too much compared with γ (up to ǫλ,γ )
given the examples are soft-thresholded, then the Algorithm 1 needs at most
(R/(1 − ǫλ,γ )γ)2 examples among Z to learn a linear classifier consistent with all
examples in Z. For the inseparable case, according to the method in [4], one can
extend all the examples x to x′ and the linear model u to u′ . In this extended
space, (x′ , y) is linearly separable by u′ .
3.3
Generalization Error Bound of the Proposed Method
In Algorithm 1, features enter the model only when necessary. Such strategy
also allow us to construct a consistent classifier using a subset of the training
set. Combining these two properties, it is possible to learn a sparse model with
generalization error guarantee. Algorithm 1 can be seen as a compression scheme
[9] consisting of two mappings. The first mapping κ maps any training set Z of
size m to Zk ⊂ Z, called the “kernel” of Z, where k is the kernel size. The
second mapping π uses Zk to reconstruct the labels of all the examples in Z.
κ can be seen as an algorithm, learning from m training examples and encoding the produced hypothesis h using only k of them. The mapping π requires
that h is consistent with all the training examples. An algorithm with the data
compression property is guaranteed with generalization error bound [9].
Lemma 3. For any compression scheme with kernel size k, the probability that
the generalization error of the learned
(with respect to distribution D)
hypothesis
m−k
being larger than ǫ is less than m
(1
−
ǫ)
k
With a smaller k where k < ⌊m/2⌋, the bound on generalization error becomes
smaller. Combining Theorem 2 and Lemma 3, with high probability over the
randomly drawn training set Z, the soft-thresholding perceptron algorithm can
obtain a classifier wZ of good generalizability using only k examples out of the
total m training examples.
Theorem 3. With probability at least 1−δ over the random draw of the training
set Z of size m, given the conditions in Theorem 2, the generalization error of
the classifier found by the proposed algorithm is less than
1
m
1
ln
(17)
+ ln(m) + ln
m−k
δ
k
where k = (R/(1 − ǫλ,γ )γ)2 for some 0 < ǫλ,γ < 1.
Table 2: Description of classification tasks. Columns sequentially show the task ids,
the tasks’ names, the size of the training sets (#Tr), validation set (#Valid), test set
(#Test) and number of total features (#Dim)
No.
1
2
3
4
5
6
7
8
Task
comp vs rec
comp vs sci
comp vs talk
rec vs talk
sci vs rec
sci vs talk
rcv1
webspam
#Tr
4278
4248
3960
3456
3772
3397
20242
50000
#Valid
1644
1766
1577
1423
1598
1445
3357
4999
#Test #Dim
2948 21317
2829 22944
2607 24178
2353 23253
2561 22839
2363 24777
6854 47236
50000 16609143
Proof. The kernel size of the model learned from Z by the soft-thresholding
algorithm is bounded by k = (R/(1 − ǫλ,γ )γ)2 , for some 0 < ǫλ,γ < 1. By
Lemma 3,
m (k−m)ǫ
m
m−k
(1 − ǫ)
≤
e
(18)
k
k
Let the right hand size of the above inequality equal to δ and solve for ǫ, we
obtain the conclusion.
This theorem is similar to the results given in [7] where the perceptron algorithm
is run in the dual form and the output is a linear classifier in the kernel space.
The difference is that the dual perceptron algorithm needs to store k training
examples for prediction while the proposed method stores only a sparse vector.
4
Experiment
In this section, we present experiment results of the proposed and other methods,
focusing on the numerical stability and efficiency of sparse models. After briefing
the experiment settings, three subsections address the problems listed below:
i Does it converge when applying gradient descent directly on real-world data.
ii How numerical unstability affects MDA’s convergence and model accuracy.
iii To what extent of sparsity can one achieve with performance guarantee.
4.1
Experiment Settings
We conducted experiments on three datasets. The first dataset is 20newsgroups,
from which we construct six binary classification tasks. We used word vector
representation with TFIDF weighting. The seventh and eighth tasks are constructed from rcv1 and webspam datasets, respectively. Note that the webspam
dataset is from Pascal Large Scale Learning Challenge3 . See Table 2 for details
3
http://largescale.first.fraunhofer.de/instructions/
300
max absolute eigenvalues
max absolute eigenvalues (103)
10
9.5
9
8.5
8
7.5
7
comp_vs_rec
comp_vs_sci
comp_vs_talk
sci_vs_rec
sci_vs_talk
rec_vs_talk
6.5
6
5.5
0
2
4
6
8
280
260
240
220
180
160
10
comp_vs_rec
comp_vs_sci
comp_vs_talk
sci_vs_rec
sci_vs_talk
rec_vs_talk
200
0
2
iterations (gamma=3)
Fig. 1: Spectral Radius of Iteration
Matrices of GraDes [5] (γ = 3)
8
10
30
squared error in log10 scale
squared error in log10 scale
6
Fig. 2: Spectral Radius of Iteration
Matrices of GraDes (γ = 100)
40
35
30
25
20
comp_vs_rec
comp_vs_sci
comp_vs_talk
sci_vs_rec
sci_vs_talk
rec_vs_talk
15
10
5
4
iterations (gamma=100)
1
2
3
4
5
6
7
8
iterations (gamma=3)
9
10
Fig. 3: Potential function value of
GraDes of the first 10 iterations (γ = 3)
25
20
15
10
comp_vs_rec
comp_vs_sci
comp_vs_talk
sci_vs_rec
sci_vs_talk
rec_vs_talk
5
0
1
2
3
4
5
6
7
8
iterations (gamma=100)
9
10
Fig. 4: Potential function value of
GraDes of the first 10 iterations (γ = 100)
of tasks. We compared 4 baselines with the proposed method, focusing on numerical stability, model sparsity and accuracy. The baselines are: GraDes [5],
SMIDAS (Stochastic Mirror Descent Algorithm made Sparse), SCD (Stochastic
Coordinate Descent) [12] and TG(Truncated Gradient) [8]. We next demonstrate
the numerical problems of GraDes and SMIDAS, shown in Section 2.1 and 2.2.
4.2
Numerical Problems of Direct Gradient Descent
The GraDes algorithm attempts to find a sparse solution for the least squared
error regression problem using ℓ0 regularization. The accuracy of GraDes is not
directly comparable to the proposed method due to the difference of the loss
functions (GraDes uses squared loss while other three use logistic loss in our
experiments), therefore we report the numerical problem of GraDes without
comparing to other algorithms. As we analyzed in Section 2.1, for the direct
gradient descent method to converge, one should keep the spectral radius of the
iteration matrix M less than 1. For GraDes, though by sparsification using hard
thresholding function (Eq. 6), the iteration can still diverge.
In practice, it is unrealistic to verify that the spectral radius of M is less than
1. This is restricted by the space and time complexity (M is a dense matrix in
size of O(d2 )). For instance, in one of the authors’ machines with 2GB memory,
MATLAB ran out of memory when computing the largest eigenvalue of a 20000×
20000 matrix. It costs even more if we have to compute the spectral radius of
Table 3: Distribution and statistics of exponents
p in Eq. (8)
range
comp vs rec
comp vs sci
comp vs talk
sci vs rec
sci vs talk
rec vs talk
rcv1
webspam
p = 2 ln(d)
within mean±std
0.095 -80.3±23.6
0.258 -69.8±32.4
0.282 -78.6±43.1
0.381 -63.6±29.4
0.273 -69.3±26.9
0.347 -68.4±33.7
0.012 -136.8±32.6
NA
NA
p = 0.5 ln(d)
within mean±std
1.000 -12.7±7.2
0.999 -15.4±7.1
0.996 -17.2±9.1
1.000 -14.3±6.1
1.000 -16.0±5.9
0.999 -15.8±7.5
0.998 -26.3±6.5
0.976 -71.4±22.1
the thresholded iteration matrix DJ (t) M to decide how to truncate w̃(t) to w(t) .
For the above 20newsgroup tasks, we reduce the number of features to around
5000 during text preprocessing while keeping the number of examples the same.
Then we compute the top eigenvalues of the iteration matrices DJ (t) M at each
iteration. |J (t) | is set to 2000 and the elements of J (t) are determined by the
GraDes algorithm, DJ (0) is the identity matrix. According to [], γ is set to 3 and
100, respectively, Note that the learning rate is 1/γ, thus we have two different
learning rate settings. The spectral radii are plotted as a function of the number
of iterations ,under two learning rates, as shown in Fig. 1 and Fig. 2.
One can easily observe that the spectral radius is far more than 1, indicating
a fast divergence speed of the solution. For example, in Fig. 1, before the first
iteration, the iteration matrix M has a spectral radius at least 7.5×103 (task
“comp vs rec”, shown in the red line with plus “+”). Although in the first iteration, the spectral radii are reduced by the hard thresholding function, they are
still in the scale of 103 . Moreover, the spectral radii increase as the algorithm
proceeds. The radius of the iteration matrix for task “comp vs rec” increases
from about 6 × 103 at the first iteration to about 7 × 103 at the 10-th iteration.
All the other spectral radii increase to some extent and remain above 103 . Fig.
2 shows similar situations, the spectral radii are over 160. The large spectral
radii of the iteration matrices indicate that the solution should go beyond the
optimal solution and therefore increase the potential function. To show this, for
each iteration of GraDes, we compute the the value of Ψ (w), which is expected
to be reduced by the algorithm. However, the potential function values also go
up quickly. In Fig. 3 and Fig. 4, for the 6 20newsgroups tasks, we plot Ψ (w) in
log10 scale at each GraDes iteration under two settings of γ. The scales climb
up linearly, indicating an exponential increase of Ψ (w).
4.3
Numerical Problem of Mirror Descent Algorithm
We show the evidence of the numerical problem leading to the low accuracy
of MDA. Typical IEEE 745 double precision floating numbers have at most 52
bits in the mantissa. If p in Eq. (8) is set too large, then small difference in
exponents of θj would be greatly amplified in wj . Adding two binary numerals
with difference in exponent larger than 52 would cause the smaller one to be
truncated. Usually, elements in one example of the data are normalized to be
in the same scale, such as in the range of [0, 1], the great disparity between
exponents in the weights would cancel out parts of the data.
For each task, we trained a model with 40% of density (i.e. the percentage of
non-zeros in the model, with 100% density we include all features in the model)
using SMIDAS. p is set to 2 ln(d) and 0.5 ln(d), respectively. For the last task,
the density is set to 0.1% due to the large number of features (over 16 millions),
and p ≈ 33 when p = 2 ln(d), this will obviously cause data truncation when
computing inner products. Thus we only report results when p = 0.5 ln(d) for
this task. We calculated the exponents of wj in 2-based numeric. Denote the
largest exponent by em . We showed in Table 3 the ratios of exponents falling
within [em − 51, em ] (the column “within”). The columns “mean±std” show the
means and standard deviations of exponents. As we can see, for p = 2 ln(d),
most of the wj have their exponents out of [em − 51, em ]. During prediction,
the corresponding features of these elements in the data are totally truncated,
therefore the models SMIDAS produce lack of sufficient discriminability and
lead to poor performance (black lines with empty squares in Fig. (5)). The
standard deviations are large, indicating a wide dynamic range of exponents.
For p = 0.5 ln(d), SMIDAS approximates the normal gradient descent [6] and
produces more reasonable models. Most of the exponents are in the range of
[em − 51, em ] and the dynamic ranges become much smaller. The performances
go up dramatically, but still inferior to the proposed method (blue lines with
filled squares in Fig. (5)).
4.4
Sparsity and Performance Comparison
We focus on how the performance varies as a function of sparsity, and how numerical problem affects the MDA. The proposed method is compared with three
scalable algorithms, SMIDAS, TG and SCD, all use logistic loss function. We
randomly split all labeled data of each task into training, test and validation
set (see Table 2). The training sets are further randomly shuffled (for task 1-7)
or split (task 8) to create 10 copies of training data. For the proposed algorithm and TG, for each parameter setting, we trained 10 models using these 10
copies and calculated the average performance of the models on validation set.
For SMIDAS and SCD, since they randomly shuffle the examples and features
respectively, only one copy of training data of each task is needed. We chose the
best parameters according to this average performance. Finally we reported the
best models’ averaged accuracy on test sets. All algorithms require a regularization parameter λ to control the intensity of sparsification. With a larger λ, we
expect more sparsity of the model. We tuned λ using 0.0001, 0.0005, 0.001, 0.01
and 0.1. For learning rate η needed in SMIDAS, TG and the proposed method,
we varied it from 0.1 to 0.5 with step size 0.1. The proposed method requires
one more parameter τ determining the minimum margin above which to skip
one instance. τ was searched using 0.001, 0.01 and 0.1.
To compare the performance with sparsity restriction, we require the percentage of the non-zeros in the model (model density) to be at most 40% for
task 1-7 and 0.1% for task 8. We recorded the error rates when the density reach
pre-specified percentages. For all algorithms, we stopped running the program
when they reach density upper bounds, or the maximal scans of training set (10
scans for all tasks). These error rates of each algorithms in 8 tasks are depicted
in Fig. 5. In general, the proposed algorithm achieves lower error rate using
fewer features compared with the other three. Particularly, in task 2, 7 and 8,
though the density of the proposed algorithm went up to 40%, the errors are
consistently lower than other methods. In the rest 5 tasks, the proposed method
stop updating the model before the model is too dense, even when the maximum
number of scans is reached. These not only demonstrated the convergence and
generalizability of the proposed algorithm (see Section 3.2 and 3.3), but also the
sparsity of the models it produces. Specifically, in task 1, the proposed algorithm
converges before the density reaches 30%, with approximately 5% of error rate,
which outperformed the remaining algorithms, even they used 40% of features.
In task 3, the proposed method achieved 5% of error rate at density 20% which
can only be obtained when SCD reached 40% of density, with other two have
their error rates higher than that of SCD. Note that in task 1-7 SMIDAS with
p = 2 ln(d) was beat by all other 3 methods and itself with p = 0.5 ln(d). This
showed that SMIDAS is not applicable for approximating the EG algorithm,
especially when the dimensionality is high.
In task 1-6, when d and m differ by at most one order, TG converges slower
than SCD. SCD adds features according to global information provided by all
training examples, while TG works based on stochastic gradient. This is inaccurate compared with SCD. However, in task 8, when d are several order larger
than m and sparser model are required, SCD fails to converge before reaching
40% of density. TG converges since it exploits more features than SCD. The
proposed method takes the best of both. For any training example, it is always
possible to reduce the loss following the gradient direction. As long as the current
example does not increase the regret of online learning, we simply hold it out
of the model. In this way, the proposed method achieves better generalization
ability with a higher degree of sparsity.
5
Conclusions
We addressed several problems in existing sparse learning methods. Though RIP
provides theoretical guarantee to recover the sparse solutions, it is mostly satisfied by designed matrices in compressive sensing, rather than data collected in
the real world. Failing to meet RIP can cause matrix iteration-based gradient descent to diverge [5], while ℓ0 constraint doesn’t solve the problem. Second, though
MDA using Bregman divergence enjoys fast convergence and approximates the
EG algorithm, its update rules produce inaccurate models when dimensionality
is high. Lastly, ℓ1 regularization is insufficient for finding an sparser solution, in
several existing methods, density of model grows faster than necessary.
We have proposed to combine the perceptron algorithm with soft-thresholding.
The algorithm converges with numerical error taken into account. We have also
provided generalization error bound of the algorithm. Finally, the algorithm is
highly scalable, data access can be distributed on a per-feature basis, making it
more suitable for real-world applications. Experiments have shown that the proposed method outperformed 4 existing sparse learning algorithms in numerical
stability, accuracy and model sparsity control. In one task with approximately
3.5GB of training data (16 million features, 50k examples), the proposed approach achieved 5.9% error rate using model with 0.8% density. But the best
competing approach (TG for this dataset) can only get 8.2% error rate, using
model with 0.7% density.
References
1. E. J. Candes and M. B. Wakin. An introduction to compressive sampling. IEEE
Signal Processing Magazine, 25(2):21–30, March 2008.
2. David Donoho and Iain M. Johnstone. Adapting to unknown smoothness via
wavelet shrinkage. Journal of the American Statistical Association, 90:1200–1224,
1995.
3. John Duchi and Yoram Singer. Boosting with structural sparsity. In ICML,
page 38, 2009.
4. Yoav Freund and Robert E. Schapire. Large margin classification using the perceptron algorithm. In Machine Learning, pages 277–296, 1998.
5. Rahul Garg and Rohit Khandekar. Gradient descent with sparsification: an iterative algorithm for sparse recovery with restricted isometry property. In ICML,
pages 337–344, 2009.
6. Claudio Gentile and Nick Littlestone. The robustness of the p-norm algorithms. In
Proceeding of 12th Annual Conference on Computer Learning Theory, pages 1–11.
ACM Press, New York, NY, 1999.
7. Thore Graepel and Ralf Herbrich. From margin to sparsity. In In Advances in
Neural Information Processing Systems 13, pages 210–216. MIT Press, 2001.
8. John Langford, Lihong Li, and Tong Zhang. Sparse online learning via truncated
gradient. Journal of Machine Learning Research, 10:777–801, 2009.
9. N. Littlestone and M. Warmuth. Relating data compression and learnability, 1986.
10. Albert B. Novikoff. On convergence proofs for perceptrons. In Proceedings of the
Symposium on the Mathematical Theory of Automata, volume 12, pages 615–622,
1963.
11. Shai Shalev-Shwartz, Nathan Srebro, and Tong Zhang. Trading accuracy for sparsity. Technical report, Toyota Technological Institute at Chicago, 2009.
12. Shai S. Shwartz and Ambuj Tewari. Stochastic methods for ℓ1 regularized loss
minimization. In ICML, pages 929–936. ACM, 2009.
13. Tong Zhang. Adaptive forward-backward greedy algorithm for sparse learning with
linear models. In NIPS, pages 1921–1928, 2008.
14. Peng Zhao and Bin Yu. On model selection consistency of lasso. Journal of Machine
Learning Research, 7:2541–2563, 2006.
comp_vs_sci
SMIDAS (p=2)
SMIDAS (p=0.5)
SCD
TG
Proposed
0.4
SMIDAS (p=2)
SMIDAS (p=0.5)
SCD
TG
Proposed
0.35
Error rate
Error rate
comp_vs_rec
0.4
0.35
0.3
0.25
0.2
0.15
0.1
0.05
0
0.3
0.25
0.2
0.15
0.1
0.05
5
10
15
20
25
30
35
40
5
10
15
Density (%)
SMIDAS (p=2)
SMIDAS (p=0.5)
SCD
TG
Proposed
5
10
15
20 25 30
Density (%)
35
0.4
0.35
0.3
0.25
0.2
0.15
0.1
0.05
0
40
5
Error rate
Error rate
0.3
0.25
0.2
0.15
0.1
0.05
5
10
15
10
20 25 30
Density (%)
35
40
10
15
20
25
20 25 30
Density (%)
35
40
35
40
SMIDAS (p=2)
SMIDAS (p=0.5)
SCD
TG
Proposed
5
Error rate
Error rate
40
10
15
20 25 30
Density (%)
webspam
SMIDAS (p=2)
SMIDAS (p=0.5)
SCD
TG
Proposed
5
15
0.4
0.35
0.3
0.25
0.2
0.15
0.1
0.05
0
rcv1
0.4
0.35
0.3
0.25
0.2
0.15
0.1
0.05
0
35
rec_vs_talk
SMIDAS (p=2)
SMIDAS (p=0.5)
SCD
TG
Proposed
0.35
30
SMIDAS (p=2)
SMIDAS (p=0.5)
SCD
TG
Proposed
sci_vs_talk
0.4
25
sci_vs_rec
Error rate
Error rate
comp_vs_talk
0.4
0.35
0.3
0.25
0.2
0.15
0.1
0.05
0
20
Density (%)
30
35
40
0.4
0.35
0.3
0.25
0.2
0.15
0.1
0.05
0
Proposed
SCD
TG
SMIDAS (p=2)
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8
Density (%)
Fig. 5: Comparison of accuracy
Density (%)