Efficient and Numerically Stable Sparse Learning

Sihong Xie(1), Wei Fan(2), Olivier Verscheure(2), and Jiangtao Ren(1)

(1) Sun Yat-Sen University, Guangzhou, China. {xiesihong1, issrjt}@gmail.com
(2) IBM T.J. Watson Research Center, New York, USA. {weifan, ov1}@us.ibm.com

Abstract. We consider the problems of numerical stability and model density growth when training a sparse linear model from massive data. We focus on scalable algorithms that optimize a loss function by gradient descent with either ℓ0 or ℓ1 regularization. We observe numerical stability problems in several existing methods, leading to divergence and low accuracy. In addition, these methods typically exercise only weak control over sparsity, so that model density grows faster than necessary. We propose a framework that addresses both problems. First, its update rule is numerically stable, comes with a convergence guarantee, and yields more reasonable models. Second, besides ℓ1 regularization, it exploits the sparsity of the data distribution and achieves a higher degree of sparsity with a PAC generalization error bound. Lastly, it is parallelizable and suitable for training large margin classifiers on huge datasets. Experiments show that the proposed method converges consistently and outperforms the baselines using 10% of the features, by as much as a 6% reduction in error rate on average. Datasets and software are available from the authors.

1 Introduction

In this paper we focus on training a sparse large margin model. Assume that we are given m labeled examples Z = {(x_1, y_1), ..., (x_m, y_m)}, where x_i ∈ R^d, i = 1, ..., m. We aim at solving the following optimization problem:

    \min_{w \in \mathbb{R}^d} \; \frac{1}{m}\sum_{i=1}^{m} L(\langle w, x_i \rangle, y_i) + \lambda \|w\|_1    (1)

Here L is any smooth and differentiable loss function, such as the logistic or hinge loss, and λ trades off the loss against the ℓ1 regularization term. We are interested in scalable algorithms with parallelizable data access and small communication cost. Such applications arise in text mining and web-spam detection, where the number of features can be in the millions.

Sparse learning aims at accurate models with a small number of non-zero elements, which brings efficiency and generalizability [13]. Some existing methods produce sparse models by forward-backward feature selection [13] or boosting [3]. Though effective, these methods have to scan the training set at each iteration, which is expensive or even impossible for large datasets. Under some restrictive assumptions [14], one can achieve sparse models via convex programming with an ℓ1 regularizer or constraint. Though scalable, as are the methods in [8, 12], such approaches have two main drawbacks. First, numerical problems can lead to divergent iterations or summation cancellation, making the algorithms less useful. Second, ℓ1 regularization only encourages sparsity and may not be enough for sparse learning.

Table 1: Notations and definitions

Notation    Definition                          Notation    Definition
x           Example in R^d                      X           Data matrix
y           Label of x                          Z           A series of m labeled examples
w           Linear model in R^d                 θ           Dual vector of w in Eq. (8)
w_Z         Linear model learned from Z         w*          Optimal linear model
w̃           Vector to be thresholded            H(·, s)     Hard-thresholding function
S(·, λ)     Soft-thresholding function          ∇f(·)       Gradient of f(·)
λ           ℓ1 regularization parameter         t           Index of iterations
η           Learning rate                       [d]         Set of indices {1, ..., d}
J           Set of indices, J ⊆ [d]             ‖v‖_0       The support of v
‖v‖_1       The sum of |v_i|, i ∈ [d]           ‖v‖_2       The Euclidean length of v
‖v‖_∞       The maximum of |v_i|                σ_i         Eigenvalues
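To make Eq. (1) concrete, the following minimal sketch evaluates the objective for the logistic loss. It is only an illustration of the formula, not the authors' implementation; the function names and the dense-matrix representation are our own choices.

```python
import numpy as np

def logistic_loss(z):
    # L(<w, x>, y) for the logistic loss, with z = y * <w, x>;
    # log(1 + exp(-z)) computed in a numerically stable way.
    return np.logaddexp(0.0, -z)

def objective(w, X, y, lam):
    # Eq. (1): average loss over the m examples plus the l1 penalty.
    margins = y * (X @ w)
    return logistic_loss(margins).mean() + lam * np.abs(w).sum()
```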
As training proceeds, model complexity can grow faster than necessary and result in dense models, regardless of the regularization.

In this paper, we propose a perceptron-based algorithm that addresses the above problems. First, it is numerically stable with a convergence guarantee. Second, in addition to ℓ1 regularization, it exploits the sparsity of the training examples to achieve a higher degree of model sparsity and, furthermore, a generalization error bound that is not provided in [8, 12]. Experiments show that the proposed method converges with steadily decreasing test error rates as training proceeds, and consistently outperforms previous state-of-the-art algorithms by as much as 8.4% in accuracy using only 10% of all features in one of the tasks.

2 Numerical Challenges in Sparse Learning

Notations are summarized in Table 1. An element of a vector is written in normal font; for example, w_j is the j-th element of the vector w. Regarding norms, for any v ∈ R^d, ‖v‖_{p_1} ≤ ‖v‖_{p_2} for any p_1 ≥ p_2 [6]. If not otherwise specified, ‖·‖ denotes ‖·‖_2. We have observed numerical problems in two sparse learning algorithms, one using ℓ1 and one using ℓ0 regularization.

2.1 Numerical Problems of Direct Iterative Methods

The first problem arises when directly applying matrix iterative methods to solve a system of linear equations

    y = Xw    (2)

where X = (x_1, ..., x_m)^⊤. An ℓ1 or ℓ0 regularizer is used to obtain a sparse model, as in compressive sensing or sparse learning. Suppose there exists an optimal solution w* satisfying Eq. (2), namely y − Xw* = 0 (0 denotes the zero vector). The direct method tries to find w* by minimizing the potential function

    \Psi(w) = \frac{1}{2}\|y - Xw\|^2    (3)

Starting with an arbitrary vector w^{(0)} (usually 0), it follows the gradient direction at each iteration to reduce the potential function value. The gradient of Eq. (3) is ∇Ψ(w) = −X^⊤(y − Xw), and we update w^{(t)} by

    w^{(t)} = w^{(t-1)} + \eta X^\top (y - X w^{(t-1)})    (4)

where η is the learning rate. To see when this iteration fails to converge, let ε^{(t)} = w^{(t)} − w* be the error vector. Since y − Xw* = 0,

    \varepsilon^{(t)} = w^{(t-1)} + \eta X^\top (y - X w^{(t-1)}) - w^* - \eta X^\top (y - X w^*)
                      = (I - \eta X^\top X)(w^{(t-1)} - w^*) = M \varepsilon^{(t-1)}

where M = I − ηX^⊤X is the iteration matrix. Thus for t > 0 we have

    \|\varepsilon^{(t)}\| = \|w^{(t)} - w^*\| = \|M \varepsilon^{(t-1)}\| = \cdots = \|M^t \varepsilon^{(0)}\|

The necessary and sufficient condition for the update rule (4) to converge is that the spectral radius of M, ρ(M) = max_i |σ_i|, is less than 1, where the σ_i are the eigenvalues of M. Formally,

    \lim_{t \to \infty} \|M^t\| = 0 \iff \rho(M) < 1    (5)

This analysis shows that if one cannot guarantee ρ(M) < 1 at each iteration, then update rule (4) may fail to reduce the potential function. In addition, we would like w^{(t)} to be sparse. It is shown in [5] that under the RIP condition [1], a truncated version of update rule (4) reduces the potential function at each iteration while maintaining a sparse solution. We show that the truncation does not automatically guarantee the convergence of ℓ0 constrained gradient descent. Consider the following iteration

    w^{(t)} = H(\tilde{w}^{(t)}, s) = D_{J^{(t)}} \bigl( w^{(t-1)} + \eta X^\top (y - X w^{(t-1)}) \bigr)    (6)

with w^{(0)} being the zero vector. Here w̃^{(t)} is the t-th solution before thresholding. H(v, s): R^d → R^d keeps only the s elements of v with the largest absolute values and sets the other elements to zero; it equals the row-selection operation shown in Eq. (6). J^{(t)} ⊆ [d] (with |J^{(t)}| = s for all t = 1, 2, ...) is the set of indices of the s elements kept by H(w̃^{(t)}, s).
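As an illustration of update rules (4) and (6) and the convergence condition (5), the sketch below runs the hard-thresholded iteration and also reports ρ(M). This is our own minimal sketch, not the GraDes implementation of [5]; it forms M densely and is therefore only feasible for small d.

```python
import numpy as np

def hard_threshold(v, s):
    # H(v, s): keep the s largest-magnitude entries of v, zero out the rest.
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-s:]
    out[keep] = v[keep]
    return out

def truncated_iteration(X, y, s, eta, iters=50):
    # Eq. (6): hard-thresholded gradient step on Psi(w) = 0.5 * ||y - X w||^2.
    # Also reports rho(M) from Eq. (5); forming M densely is only feasible for small d.
    d = X.shape[1]
    M = np.eye(d) - eta * X.T @ X
    rho = np.max(np.abs(np.linalg.eigvals(M)))
    w = np.zeros(d)
    for _ in range(iters):
        w = hard_threshold(w + eta * X.T @ (y - X @ w), s)
    return w, rho
```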
D_{J^{(t)}} is a diagonal matrix with D_{j,j} = 1 for all j ∈ J^{(t)} and 0 otherwise. The error vector of iteration rule (6) can then be written as

    \|\varepsilon^{(t)}\|^2 = \|w^{(t)} - w^*\|^2 = \|D_{J^{(t)}} \tilde w^{(t)} - w^*\|^2
      \ge \|D_{J^{(t)}} \tilde w^{(t)} - D_{J^{(t)}} w^*\|^2                                              (7)
      = \|D_{J^{(t)}} [w^{(t-1)} + \eta X^\top (y - X w^{(t-1)})] - D_{J^{(t)}} [w^* + \eta X^\top (y - X w^*)]\|^2
      = \|D_{J^{(t)}} M (w^{(t-1)} - w^*)\|^2 = \|D_{J^{(t)}} M \varepsilon^{(t-1)}\|^2

Inequality (7) follows because the terms w*_j with j ∉ J^{(t)} are dropped from the sum in the squared ℓ2-norm. Therefore, ‖ε^{(t)}‖² is lower-bounded by ‖D_{J^{(t)}} M ε^{(t-1)}‖². For the ℓ0 projected gradient descent to converge, D_{J^{(t)}} M must satisfy ρ(D_{J^{(t)}} M) < 1; otherwise, update rule (6) diverges. Simple truncation using H(v, s) does not guarantee this condition, as we demonstrate in the experiments.

2.2 Numerical Problems of Mirror Descent

The mirror descent algorithm (MDA) is recognized as an effective online learning algorithm. For example, MDA with a Bregman divergence has been proposed to approximate the exponentiated gradient (EG) algorithm, whose cumulative loss bound grows only logarithmically with the number of irrelevant features in the target weight vector [6]. In terms of convergence rate, MDA is provably superior to the usual stochastic gradient descent methods [12], where it is adopted for online sparse learning. However, the price MDA pays for these advantages is numerical instability, especially when the dimensionality of the data is high, leading to less discriminative models. As proposed in [12], the main component of MDA for sparse learning is the alternating update of two vectors θ, w ∈ R^d by the following rules:

    w_j^{(t)} = f_j^{-1}(\theta^{(t)}) = \frac{\operatorname{sgn}(\theta_j^{(t)})\,|\theta_j^{(t)}|^{p-1}}{\|\theta^{(t)}\|_p^{p-2}}, \quad \forall j    (8)

    \theta^{(t+1)} = S(\theta^{(t)} - \eta \nabla L(w^{(t)}), \lambda)    (9)

where S(·, λ) is the soft-thresholding operator [2]:

    S(w_j, \lambda) = \operatorname{sgn}(w_j)\,(|w_j| - \lambda)_+ =
      \begin{cases}
        w_j - \lambda, & \text{if } w_j > 0 \text{ and } |w_j| > \lambda \\
        w_j + \lambda, & \text{if } w_j < 0 \text{ and } |w_j| > \lambda \\
        0,             & \text{if } |w_j| \le \lambda
      \end{cases}    (10)

S(v, λ) means applying Eq. (10) to each element of v. As suggested in [6], the parameter p in Eq. (8) is set to O(ln(d)). (Note that different p lead to different "algorithms"; for example, in binary classification one obtains the Perceptron when p = 2 and Weighted Majority as p → ∞. For MDA to approximate the EG algorithm, however, p should be sufficiently large, such as logarithmic in d.) w and θ are called the primal and dual vectors, respectively. In the t-th iteration, MDA first computes w^{(t)} = f^{-1}(θ^{(t)}), with θ^{(0)} = 0. Then the stochastic gradient of the loss function, ∇L(w^{(t)}), is estimated using (x_i, y_i) at w^{(t)}. Finally, gradient descent and soft-thresholding turn θ^{(t)} into θ^{(t+1)}.

Converting θ to w using Eq. (8) can cause a numerical problem in MDA. Specifically, elements of w can become very small and are sensitive to differences in the scale of the θ_j. The following lemma reveals the numerical problem that update rules (8) and (9) can bring.

Lemma 1. Assume that the values of the features and of λ in Eq. (9) are of scale O(1). At the t-th iteration of MDA minimizing the logistic loss, where t = O(p), θ_j^{(t)} is also O(p) and w_j^{(t)} is at most of order O(a^{-p}) for some a > 1.

Proof. Each element ∇_j L(·, ·) of the gradient of the logistic loss with respect to the first argument is L'(⟨w, x⟩, y) x_j and thus of order O(1). θ_j^{(t)} is the sum of t terms of order O(1) and is therefore of order O(p). Since ‖θ‖_p > ‖θ‖_∞, there exists ǫ > 0 such that ‖θ‖_p > (1 + ǫ)‖θ‖_∞ = (1 + ǫ) max_i |θ_i|. By Eq. (8),

    |w_j| = \frac{|\theta_j|^{p-1}}{\|\theta\|_p^{p-2}}
          < \frac{|\theta_j|^{p-1}}{(1+\epsilon)^{p-2}\,(\max_i |\theta_i|)^{p-2}}
          = |\theta_j| \left( \frac{|\theta_j|}{(1+\epsilon)\max_i |\theta_i|} \right)^{p-2}
          = O(p)\,O(a^{2-p}) = O(a^{-p})

where a = (1 + ǫ) max_i |θ_i| / |θ_j| > 1.
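The following sketch implements update rules (8)-(10) for the logistic loss exactly as they are written above. It is a minimal illustration of the equations, not the SMIDAS code of [12]; the function names are ours.

```python
import numpy as np

def soft_threshold(v, lam):
    # Eq. (10): S(v, lam) applied elementwise.
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def dual_to_primal(theta, p):
    # Eq. (8): w_j = sgn(theta_j) * |theta_j|^(p-1) / ||theta||_p^(p-2)
    norm_p = np.linalg.norm(theta, ord=p)
    if norm_p == 0.0:
        return np.zeros_like(theta)
    return np.sign(theta) * np.abs(theta) ** (p - 1) / norm_p ** (p - 2)

def mda_step(theta, x, y, eta, lam, p):
    # One MDA iteration for the logistic loss: Eq. (8) followed by Eq. (9).
    w = dual_to_primal(theta, p)
    grad = -y * x / (1.0 + np.exp(y * np.dot(w, x)))  # gradient of log(1 + exp(-y<w,x>))
    return soft_threshold(theta - eta * grad, lam)
```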
In particular, because w_j is exponential in p = O(ln(d)), a small difference between the exponents of different dimensions can be greatly amplified in the resulting model. Consider two entries of θ, θ_1 and θ_2, and assume without loss of generality that θ_1 is only one order of magnitude smaller than θ_2: |θ_1| ≈ 10^{-1}|θ_2|. Reconstructing the primal vector w from θ, we find that w_1 is about p orders of magnitude smaller than w_2: |w_1| ≈ 10^{-p}|w_2|. When computing the inner product between w and x, if the difference between the exponents of w_1 and w_2 exceeds the precision supported by the data type, then the element of x paired with the smaller weight is lost entirely. For example, a double-precision floating point number in C++ (64 bits according to IEEE 754-2008) typically has a precision of about 10^{-16}, so 1.0 + 10^{-17} = 1.0 in machine arithmetic. In general, the larger p is, the larger the differences between the exponents of the w_j. We show in the experiments (Section 4.3) how the parameter p affects the performance of MDA.

3 Efficient and Numerically Stable Sparse Learning

We present the proposed sparse learning algorithm in Section 3.1. In Section 3.2 we show that the method is guaranteed to converge, with numerical errors taken into account. Finally, we give a generalization error bound in Section 3.3.

3.1 The Proposed Method

Given labeled data Z = {(x_1, y_1), ..., (x_m, y_m)} drawn from some distribution D over R^d × {−1, 1}, we consider learning a sparse large margin linear classifier, which reduces to minimizing a certain loss function. There are three concerns. First, the algorithm should be scalable; one of the best practical approaches is online learning, or stochastic gradient descent, which iteratively minimizes the loss function and keeps only the footprint of the sparse model without storing any example in memory. Second, the algorithm should be numerically stable; we prefer algorithms that rely on practical assumptions and are robust to small differences in scale between the dimensions of the data. Finally, for efficiency, algorithms that achieve better classification performance using fewer features are desirable.

Algorithm 1 Numerically Stable Sparse Learning
 1: Input: margin threshold τ, learning rate η, regularization parameter λ, density upper bound s
 2: Initialize the linear model w = 0 and the model density to 0
 3: while the model density is less than s do
 4:     Receive an instance (x_i, y_i) and compute z = y_i ⟨w, x_i⟩
 5:     if z ≤ τ then
 6:         for each non-zero element x_{ij} do
 7:             Update w_j by w_j = w_j + η x_{ij} y_i
 8:             Soft-threshold w_j using Eq. (10)
 9:         end for
10:     end if
11: end while

We alter the perceptron algorithm so that it robustly produces a sparse model with convergence and generalization error guarantees (see Sections 3.2 and 3.3). The proposed method is described in Algorithm 1. It begins with a zero vector. During each iteration, it loads an example from secondary storage and updates the model as the original perceptron does (lines 4-7). The difference is that after each online update, the algorithm suppresses the weights in the model (line 8) using the soft-thresholding operator (Eq. (10)). Since we assume that the data is separable by a sparse linear classifier, and the data lies in a high dimensional space, many features are expected to be noisy or irrelevant. For an example classified with margin higher than τ, the features in the current model w^{(t)} are likely sufficient to make the correct decision, and the example is not used for an update.
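A compact sketch of Algorithm 1 in the spirit of the pseudo-code above is given below. It is our own illustration (dense NumPy vectors for readability; a real implementation would stream sparse feature vectors from disk), not the authors' released software.

```python
import numpy as np

def soft_threshold(v, lam):
    # Eq. (10), applied elementwise.
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_perceptron(stream, d, tau, eta, lam, max_density):
    # Sketch of Algorithm 1: perceptron update on small-margin examples,
    # followed by soft-thresholding of the touched weights.
    w = np.zeros(d)
    for x, y in stream:                         # x is a dense length-d array here
        if np.count_nonzero(w) / d >= max_density:
            break                               # line 3: density upper bound reached
        if y * np.dot(w, x) <= tau:             # lines 4-5: margin at most tau
            nz = np.nonzero(x)[0]               # line 6: only non-zero features of x
            w[nz] += eta * y * x[nz]            # line 7: perceptron update
            w[nz] = soft_threshold(w[nz], lam)  # line 8: suppress small weights
    return w
```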
This prevents unnecessarily introducing new features into the current model and maximally preserves sparsity without increasing the regret. In general, the perceptron algorithm exploits the sparsity of the instance space and thus leads to a model with fewer features.

The above framework can be applied to very large datasets. The main task at each iteration is computing ⟨w^{(t)}, x⟩, which is a summation and can be distributed over r machines. One can partition the dataset on a per-feature basis into r partitions, each assigned to one machine. Each machine computes O(⌈d/r⌉) components of ⟨w^{(t)}, x⟩, and adding these partial sums yields the inner product. The I/O cost on each machine is O(⌈d/r⌉) and the communication cost is O(r). Thus the algorithm is scalable via parallelization over the columns of the data matrix X.

3.2 Convergence of the Proposed Method

Without loss of generality, we assume that the learning rate equals 1 in the following analysis. We adapt the convergence theorem in [10], prove a similar result for the perceptron with soft-thresholding, and analyze the conditions under which this soft-thresholding perceptron converges. Assume that the t-th mistake happens on example (x, y). Let J_0 = {j | x_j = 0} and J_1 = {j | x_j ≠ 0}. Then w_j^{(t)} = w_j^{(t-1)} for j ∈ J_0, and we need only pay attention to the difference between w_j^{(t)} and w_j^{(t-1)} for j ∈ J_1. By Algorithm 1 and Eq. (10),

    w^{(t)} = S(\tilde w^{(t)}, \lambda) = S(w^{(t-1)} + yx, \lambda)

Project w^{(t-1)}, w^{(t)} and x onto J_0 and J_1, and denote the projection of a vector v by the vector with the index set as its subscript (for example, x_{J_0} keeps the elements with indices in J_0 unchanged and sets the other elements to zero). By simple algebra, w^{(t)} can be decomposed into the sum of two orthogonal vectors, w^{(t)} = w_{J_0}^{(t-1)} + w_{J_1}^{(t)}. We have the following lemma.

Lemma 2. Assume that ‖x‖ < R for all (x, y) ∈ Z. Let J_+^{(t)} denote the set {j : |w̃_j^{(t)}| ≥ λ} ∩ J_1 and J_-^{(t)} the set {j : |w̃_j^{(t)}| < λ} ∩ J_1. Let

    \epsilon_\lambda^{(t)} = \lambda \sum_{j \in J_+^{(t)}} |\tilde w_j^{(t)}| + \sum_{j \in J_-^{(t)}} |\tilde w_j^{(t)}|^2

If τ ≤ ǫ_λ^{(t)}/2, then ‖w^{(t)}‖² ≤ tR².

Proof. By induction,

    \|w^{(t)}\|^2 = \|w_{J_0}^{(t-1)} + w_{J_1}^{(t)}\|^2 = \|w_{J_0}^{(t-1)}\|^2 + \|w_{J_1}^{(t)}\|^2
      \le \|w_{J_0}^{(t-1)}\|^2 + \|\tilde w_{J_1}^{(t)}\|^2 - \epsilon_\lambda^{(t)}                      (11)
      = \|w_{J_0}^{(t-1)}\|^2 + \|w_{J_1}^{(t-1)} + yx\|^2 - \epsilon_\lambda^{(t)}                        (12)
      \le \|w_{J_0}^{(t-1)}\|^2 + \|w_{J_1}^{(t-1)}\|^2 + \|yx\|^2 - \epsilon_\lambda^{(t)} + 2\tau         (13)
      \le \|w^{(t-1)}\|^2 + R^2 - \epsilon_\lambda^{(t)} + 2\tau                                            (14)

Inequality (11) follows from the fact that at iteration t, w̃^{(t)} suffers at least ǫ_λ^{(t)} of shrinkage. Inequality (13) holds because an update occurs only when ⟨w, yx⟩ ≤ τ. Since τ ≤ ǫ_λ^{(t)}/2, it follows that ‖w^{(t)}‖² ≤ ‖w^{(t-1)}‖² + R², and the claim follows by induction.
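The shrinkage used in inequality (11) can be checked directly from Eq. (10); the short derivation below is not part of the original proof and only spells out that step.

```latex
\begin{align*}
j \in J_+^{(t)}\ (|\tilde w_j^{(t)}| \ge \lambda):\quad
  S(\tilde w_j^{(t)},\lambda)^2 &= \bigl(|\tilde w_j^{(t)}| - \lambda\bigr)^2
    = |\tilde w_j^{(t)}|^2 - 2\lambda|\tilde w_j^{(t)}| + \lambda^2
    \le |\tilde w_j^{(t)}|^2 - \lambda|\tilde w_j^{(t)}|, \\
j \in J_-^{(t)}\ (|\tilde w_j^{(t)}| < \lambda):\quad
  S(\tilde w_j^{(t)},\lambda)^2 &= 0 = |\tilde w_j^{(t)}|^2 - |\tilde w_j^{(t)}|^2 .
\end{align*}
```

Summing the two cases over J_1 = J_+^{(t)} ∪ J_-^{(t)} gives ‖S(w̃_{J_1}^{(t)}, λ)‖² ≤ ‖w̃_{J_1}^{(t)}‖² − ǫ_λ^{(t)}, which is exactly the amount of shrinkage used in (11).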
Discussion. We assume the condition τ ≤ ǫ_λ^{(t)}/2 for convenience of the proof. If τ = 0, the condition is satisfied for every λ > 0. In particular, if λ is set too small, the algorithm asymptotically becomes the Perceptron; with a large λ, one can achieve a sparser model at some sacrifice in accuracy. This is an instance of the more general model selection problem, in which sparsity must be traded off against accuracy [11]. In the experiments, we choose the parameters λ and τ by validation, and they are then fixed during the iterations.

The above lemma was proved without considering numerical errors. Let the machine precision be eps = 2^{-r}. Soft-thresholding introduces errors when λ and the truncated elements are rounded off; this gives an error of at most |ε_1| ≤ 2 eps, which is minor and can be ignored. Since ‖w^{(t)}‖ usually grows with the iterations, it is more important to consider the numerical error associated with that term. With |ε_2| ≤ eps, the exact term ‖w_{J_1}^{(t-1)} + yx‖² in Eq. (12) becomes ‖(1 + ε_2)(w_{J_1}^{(t-1)} + yx)‖² and can be approximately upper-bounded by (‖w_{J_1}^{(t-1)}‖² + ‖yx‖² + 2τ)(1 + 2ε_2) when the higher-order term ε_2² is ignored (an exact upper bound is (‖w_{J_1}^{(t-1)}‖² + ‖yx‖² + 2τ)(1 + 2ε_2 + ε_2²)). Substituting this error into Eq. (12), we can prove the following theorem, which is the key to the convergence proof.

Theorem 1. Given λ ≥ 0, let ǫ*_λ = min_t ǫ_λ^{(t)}. Then ‖w^{(t)}‖² < tR² holds for all t ∈ Z_+ such that

    t \le \frac{\epsilon^*_\lambda - 2\tau}{2\varepsilon_2 R^2} - \frac{2\tau}{R^2}    (15)

Proof. Inequality (14) becomes

    \|w^{(t)}\|^2 \le \|w^{(t-1)}\|^2 + R^2 - \epsilon^*_\lambda + 2\tau + 2\varepsilon_2 \bigl( \|w_{J_1}^{(t-1)}\|^2 + \|yx\|^2 + 2\tau \bigr)

As in the proof of Lemma 2, we need to show that the sum of all terms after ‖w^{(t-1)}‖² is smaller than R². Since, by induction, ‖w_{J_1}^{(t-1)}‖² ≤ ‖w^{(t-1)}‖² ≤ (t − 1)R², and ‖yx‖² ≤ R² by assumption, it suffices to show that

    -\epsilon^*_\lambda + 2\tau + 2\varepsilon_2 (tR^2 + 2\tau) \le 0

Solving for t gives the conclusion.

In C++ double precision, ε_2 ≈ 10^{-16}. If we set λ and τ such that ǫ*_λ − 2τ is not too small relative to R², then ‖w^{(t)}‖² < tR² holds for all t reachable in a reasonable training process.

Theorem 2. Assume that ‖x_i‖ < R for all examples x_i ∈ Z. If there exists a linear model u with ‖u‖ = 1 and a constant 0 < ǫ_{λ,γ} < 1 such that ⟨y_i x_i − v, u⟩ ≥ (1 − ǫ_{λ,γ})γ for all (x_i, y_i) ∈ Z and any v with ‖v‖_∞ < λ, then the number of mistakes made by the online perceptron algorithm on Z is at most k = (R/((1 − ǫ_{λ,γ})γ))².

Proof. For any t satisfying Eq. (15),

    w^{(t)} \cdot u = w^{(t-1)} \cdot u + (\Delta^{(t)} + y_i x_i) \cdot u \ge w^{(t-1)} \cdot u + (1 - \epsilon_{\lambda,\gamma})\gamma    (16)

where Δ^{(t)} = S(w̃^{(t)}, λ) − w̃^{(t)}. Thus w^{(t)} · u ≥ (1 − ǫ_{λ,γ}) t γ. By Lemma 2, ‖w^{(t)}‖² ≤ tR². Hence

    (1 - \epsilon_{\lambda,\gamma})\, t\, \gamma \le w^{(t)} \cdot u \le \|w^{(t)}\| \le \sqrt{t}\, R

and solving for t gives the stated mistake bound.

This theorem indicates that if the data in Z can be separated with margin at least γ, and this margin does not shrink too much relative to γ (by at most a factor ǫ_{λ,γ}) when the updates are soft-thresholded, then Algorithm 1 needs at most (R/((1 − ǫ_{λ,γ})γ))² examples from Z to learn a linear classifier consistent with all examples in Z. For the inseparable case, following the method in [4], one can extend each example x to x′ and the linear model u to u′ such that, in the extended space, (x′, y) is linearly separable by u′.

3.3 Generalization Error Bound of the Proposed Method

In Algorithm 1, features enter the model only when necessary. Such a strategy also allows us to construct a consistent classifier using only a subset of the training set. Combining these two properties, it is possible to learn a sparse model with a generalization error guarantee. Algorithm 1 can be seen as a compression scheme [9] consisting of two mappings. The first mapping κ maps any training set Z of size m to Z_k ⊂ Z, called the "kernel" of Z, where k is the kernel size. The second mapping π uses Z_k to reconstruct the labels of all the examples in Z. κ can be seen as an algorithm that learns from the m training examples and encodes the produced hypothesis h using only k of them. The mapping π requires that h be consistent with all the training examples. Any algorithm with this data compression property comes with a generalization error bound [9].

Lemma 3. For any compression scheme with kernel size k, the probability that the generalization error (with respect to distribution D) of the learned hypothesis is larger than ǫ is less than \binom{m}{k}(1 − ǫ)^{m−k}.

With a smaller k, where k < ⌊m/2⌋, the bound on the generalization error becomes smaller.
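To give a feel for the mistake bound in Theorem 2, here is a purely hypothetical numerical example; the values of R, γ and ǫ_{λ,γ} below are illustrative choices, not quantities measured on the datasets of Section 4.

```latex
% Hypothetical values: unit-norm examples, margin 0.05, margin shrinkage factor 0.2.
\[
  R = 1,\qquad \gamma = 0.05,\qquad \epsilon_{\lambda,\gamma} = 0.2
  \quad\Longrightarrow\quad
  k = \left(\frac{R}{(1-\epsilon_{\lambda,\gamma})\,\gamma}\right)^{2}
    = \left(\frac{1}{0.8 \times 0.05}\right)^{2} = 25^{2} = 625 .
\]
```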
Combining Theorem 2 and Lemma 3, with high probability over the randomly drawn training set Z, the soft-thresholding perceptron algorithm obtains a classifier w_Z of good generalizability using only k examples out of the m training examples.

Theorem 3. With probability at least 1 − δ over the random draw of the training set Z of size m, and given the conditions in Theorem 2, the generalization error of the classifier found by the proposed algorithm is less than

    \frac{1}{m-k}\left( \ln\binom{m}{k} + \ln(m) + \ln\frac{1}{\delta} \right)    (17)

where k = (R/((1 − ǫ_{λ,γ})γ))² for some 0 < ǫ_{λ,γ} < 1.

Proof. The kernel size of the model learned from Z by the soft-thresholding algorithm is bounded by k = (R/((1 − ǫ_{λ,γ})γ))², for some 0 < ǫ_{λ,γ} < 1. By Lemma 3,

    \binom{m}{k} (1 - \epsilon)^{m-k} \le \binom{m}{k}\, e^{(k-m)\epsilon}    (18)

Setting the right-hand side of this inequality equal to δ and solving for ǫ, we obtain the conclusion.

This theorem is similar to the results given in [7], where the perceptron algorithm is run in the dual form and the output is a linear classifier in the kernel space. The difference is that the dual perceptron algorithm needs to store k training examples for prediction, while the proposed method stores only a sparse vector.
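Continuing the hypothetical numbers from the example after Theorem 2, the bound (17) can be evaluated as follows; m and δ are again illustrative choices, not experimental values.

```latex
% Hypothetical: m = 20000 training examples, kernel size k = 625, confidence delta = 0.05.
\[
  \ln\binom{m}{k} \;\le\; k \ln\frac{em}{k} \;=\; 625\,\ln(e \cdot 32) \;\approx\; 2.79 \times 10^{3},
\]
\[
  \frac{1}{m-k}\left(\ln\binom{m}{k} + \ln(m) + \ln\frac{1}{\delta}\right)
  \;\lesssim\; \frac{2791 + 9.9 + 3.0}{19375} \;\approx\; 0.145 .
\]
```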
4 Experiment

In this section, we present experimental results for the proposed and other methods, focusing on numerical stability and the efficiency of sparse models. After describing the experimental settings, three subsections address the following questions:

i. Does gradient descent converge when applied directly to real-world data?
ii. How does numerical instability affect MDA's convergence and model accuracy?
iii. What degree of sparsity can be achieved with a performance guarantee?

4.1 Experiment Settings

We conducted experiments on three datasets. The first dataset is 20newsgroups, from which we construct six binary classification tasks using a word vector representation with TF-IDF weighting. The seventh and eighth tasks are constructed from the rcv1 and webspam datasets, respectively; the webspam dataset is from the Pascal Large Scale Learning Challenge (http://largescale.first.fraunhofer.de/instructions/). See Table 2 for details of the tasks.

Table 2: Description of classification tasks. Columns show the task id, the task name, the sizes of the training set (#Tr), validation set (#Valid) and test set (#Test), and the total number of features (#Dim)

No.  Task           #Tr    #Valid  #Test  #Dim
1    comp vs rec    4278   1644    2948   21317
2    comp vs sci    4248   1766    2829   22944
3    comp vs talk   3960   1577    2607   24178
4    rec vs talk    3456   1423    2353   23253
5    sci vs rec     3772   1598    2561   22839
6    sci vs talk    3397   1445    2363   24777
7    rcv1           20242  3357    6854   47236
8    webspam        50000  4999    50000  16609143

[Fig. 1: Spectral Radius of Iteration Matrices of GraDes [5] (γ = 3); x-axis: iterations, y-axis: max absolute eigenvalue (×10^3).]
[Fig. 2: Spectral Radius of Iteration Matrices of GraDes (γ = 100).]
[Fig. 3: Potential function value of GraDes over the first 10 iterations (γ = 3); y-axis: squared error in log10 scale.]
[Fig. 4: Potential function value of GraDes over the first 10 iterations (γ = 100).]

We compare four baselines with the proposed method, focusing on numerical stability, model sparsity and accuracy. The baselines are GraDes [5], SMIDAS (Stochastic Mirror Descent Algorithm made Sparse), SCD (Stochastic Coordinate Descent) [12], and TG (Truncated Gradient) [8]. We next demonstrate the numerical problems of GraDes and SMIDAS analyzed in Sections 2.1 and 2.2.

4.2 Numerical Problems of Direct Gradient Descent

The GraDes algorithm attempts to find a sparse solution to the least squared error regression problem using ℓ0 regularization. The accuracy of GraDes is not directly comparable to that of the proposed method because of the difference in loss functions (GraDes uses the squared loss, while the other three methods use the logistic loss in our experiments); we therefore report the numerical problem of GraDes without comparing it to the other algorithms.

As analyzed in Section 2.1, for the direct gradient descent method to converge, one should keep the spectral radius of the iteration matrix M below 1. For GraDes, even with sparsification by the hard-thresholding function (Eq. (6)), the iteration can still diverge. In practice, it is unrealistic to verify that the spectral radius of M is less than 1; this is restricted by space and time complexity (M is a dense matrix of size O(d²)). For instance, on one of the authors' machines with 2GB of memory, MATLAB ran out of memory when computing the largest eigenvalue of a 20000 × 20000 matrix. It costs even more if we have to compute the spectral radius of the thresholded iteration matrix D_{J^{(t)}} M in order to decide how to truncate w̃^{(t)} to w^{(t)}.

For the 20newsgroups tasks above, we reduce the number of features to around 5000 during text preprocessing while keeping the number of examples the same, and then compute the top eigenvalues of the iteration matrices D_{J^{(t)}} M at each iteration. |J^{(t)}| is set to 2000, the elements of J^{(t)} are determined by the GraDes algorithm, and D_{J^{(0)}} is the identity matrix. Following the settings in [5], γ is set to 3 and 100, respectively; since the learning rate is 1/γ, this gives two different learning rate settings.
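The quantity monitored here, ρ(D_{J^{(t)}} M), can be computed as in the following minimal sketch. It is our own illustration, not the authors' MATLAB code; it forms M densely, which is why the experiments reduce d to around 5000.

```python
import numpy as np

def spectral_radius_truncated(X, eta, J):
    # rho(D_J M) with M = I - eta * X^T X and D_J the diagonal 0/1 matrix
    # selecting the indices in J.  Forming M densely restricts this check
    # to moderate d.
    d = X.shape[1]
    M = np.eye(d) - eta * X.T @ X
    D = np.zeros((d, d))
    D[J, J] = 1.0
    return np.max(np.abs(np.linalg.eigvals(D @ M)))
```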
The spectral radii are plotted as a function of the number of iterations under the two learning rates in Fig. 1 and Fig. 2. One can easily observe that the spectral radius is far larger than 1, indicating fast divergence of the solution. For example, in Fig. 1, before the first iteration the iteration matrix M has a spectral radius of at least 7.5 × 10³ (task "comp vs rec", shown as the red line with plus markers). Although the spectral radii are reduced by the hard-thresholding function in the first iteration, they remain on the order of 10³. Moreover, the spectral radii increase as the algorithm proceeds: the radius of the iteration matrix for task "comp vs rec" grows from about 6 × 10³ at the first iteration to about 7 × 10³ at the 10-th iteration, and all the other spectral radii increase to some extent and remain above 10³. Fig. 2 shows a similar situation, with spectral radii above 160. Such large spectral radii indicate that the iterates overshoot the optimal solution and therefore increase the potential function. To show this, at each GraDes iteration we compute the value of Ψ(w), which the algorithm is expected to reduce. However, the potential function values also go up quickly. In Fig. 3 and Fig. 4, for the six 20newsgroups tasks, we plot Ψ(w) in log10 scale at each GraDes iteration under the two settings of γ. The curves climb roughly linearly on the log scale, indicating an exponential increase of Ψ(w).

4.3 Numerical Problem of Mirror Descent Algorithm

We now show evidence of the numerical problem leading to the low accuracy of MDA. Typical IEEE 754 double-precision floating point numbers have at most 52 bits in the mantissa. If p in Eq. (8) is set too large, then small differences between the exponents of the θ_j are greatly amplified in the w_j, and adding two binary numbers whose exponents differ by more than 52 causes the smaller one to be truncated. Since the elements of an example are usually normalized to the same scale, such as the range [0, 1], the great disparity between the exponents of the weights cancels out parts of the data.

For each task, we trained a model of 40% density (density is the percentage of non-zeros in the model; at 100% density all features are included) using SMIDAS, with p set to 2 ln(d) and 0.5 ln(d), respectively. For the last task, the density is set to 0.1% due to the large number of features (over 16 million), and p ≈ 33 when p = 2 ln(d), which obviously causes data truncation when computing inner products; we therefore only report results for p = 0.5 ln(d) on this task.

We calculated the base-2 exponents of the w_j. Denote the largest exponent by e_m. Table 3 shows the ratio of exponents falling within [e_m − 51, e_m] (the column "within"); the columns "mean±std" give the means and standard deviations of the exponents. For p = 2 ln(d), most of the w_j have exponents outside [e_m − 51, e_m]. During prediction, the corresponding features in the data are completely truncated, so the models SMIDAS produces lack sufficient discriminability and perform poorly (black lines with empty squares in Fig. 5). The standard deviations are large, indicating a wide dynamic range of exponents. For p = 0.5 ln(d), SMIDAS approximates normal gradient descent [6] and produces more reasonable models: most of the exponents fall within [e_m − 51, e_m] and the dynamic ranges become much smaller. The performance improves dramatically, but is still inferior to that of the proposed method (blue lines with filled squares in Fig. 5).

Table 3: Distribution and statistics of the base-2 exponents of w_j under the two settings of p in Eq. (8)

                 p = 2 ln(d)             p = 0.5 ln(d)
Task             within  mean±std        within  mean±std
comp vs rec      0.095   -80.3±23.6      1.000   -12.7±7.2
comp vs sci      0.258   -69.8±32.4      0.999   -15.4±7.1
comp vs talk     0.282   -78.6±43.1      0.996   -17.2±9.1
sci vs rec       0.381   -63.6±29.4      1.000   -14.3±6.1
sci vs talk      0.273   -69.3±26.9      1.000   -16.0±5.9
rec vs talk      0.347   -68.4±33.7      0.999   -15.8±7.5
rcv1             0.012   -136.8±32.6     0.998   -26.3±6.5
webspam          NA      NA              0.976   -71.4±22.1
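The statistics in Table 3 can be gathered with a few lines of code; the sketch below is our own illustration of one way to do it (using NumPy's frexp), not the exact procedure used by the authors.

```python
import numpy as np

def exponent_stats(w):
    # Base-2 exponents of the non-zero weights, as summarized in Table 3:
    # the fraction lying within [e_m - 51, e_m] (e_m = largest exponent) is
    # the part that can survive the 52-bit mantissa of a double when added
    # to the largest-magnitude weight.
    nz = np.abs(w[w != 0.0])
    _, exps = np.frexp(nz)
    e_m = exps.max()
    within = float(np.mean(exps >= e_m - 51))
    return within, float(exps.mean()), float(exps.std())
```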
4.4 Sparsity and Performance Comparison

We focus on how performance varies as a function of sparsity, and on how the numerical problem affects MDA. The proposed method is compared with three scalable algorithms, SMIDAS, TG and SCD, all using the logistic loss. We randomly split the labeled data of each task into training, test and validation sets (see Table 2). The training sets are further randomly shuffled (tasks 1-7) or split (task 8) to create 10 copies of training data. For the proposed algorithm and TG, for each parameter setting we trained 10 models using these 10 copies and calculated their average performance on the validation set. For SMIDAS and SCD, which randomly shuffle the examples and features respectively, only one copy of the training data of each task is needed. We chose the best parameters according to this average validation performance, and finally report the averaged accuracy of the best models on the test sets.

All algorithms require a regularization parameter λ that controls the intensity of sparsification; a larger λ yields a sparser model. We tuned λ over 0.0001, 0.0005, 0.001, 0.01 and 0.1. The learning rate η needed by SMIDAS, TG and the proposed method was varied from 0.1 to 0.5 with step size 0.1. The proposed method requires one more parameter, τ, the minimum margin above which an instance is skipped; τ was searched over 0.001, 0.01 and 0.1.

To compare performance under a sparsity restriction, we require the percentage of non-zeros in the model (the model density) to be at most 40% for tasks 1-7 and 0.1% for task 8, and we record the error rates when the density reaches pre-specified percentages. For all algorithms, we stop the program when it reaches the density upper bound or the maximum number of scans of the training set (10 scans for all tasks). The error rates of each algorithm on the 8 tasks are depicted in Fig. 5.

In general, the proposed algorithm achieves lower error rates using fewer features than the other three methods. In particular, in tasks 2, 7 and 8, although the density of the proposed algorithm went up to 40%, its errors are consistently lower than those of the other methods. In the remaining five tasks, the proposed method stops updating the model before the model becomes too dense, even when the maximum number of scans is reached. This demonstrates not only the convergence and generalizability of the proposed algorithm (see Sections 3.2 and 3.3), but also the sparsity of the models it produces. Specifically, in task 1, the proposed algorithm converges before the density reaches 30%, with approximately 5% error rate, outperforming the remaining algorithms even when they use 40% of the features. In task 3, the proposed method achieves a 5% error rate at 20% density, a level SCD only reaches at 40% density, while the other two methods have error rates higher than SCD's. Note that in tasks 1-7, SMIDAS with p = 2 ln(d) is beaten by the other three methods and by SMIDAS itself with p = 0.5 ln(d), which shows that SMIDAS is not suitable for approximating the EG algorithm, especially when the dimensionality is high.

In tasks 1-6, where d and m differ by at most one order of magnitude, TG converges more slowly than SCD: SCD adds features according to global information provided by all training examples, while TG works from stochastic gradients, which are less accurate. However, in task 8, where d is several orders of magnitude larger than m and a sparser model is required, SCD fails to converge before reaching 40% density, whereas TG converges since it exploits more features than SCD. The proposed method takes the best of both: for any training example it is always possible to reduce the loss by following the gradient direction, but as long as the current example does not increase the regret of online learning, we simply hold it out of the model. In this way, the proposed method achieves better generalization ability with a higher degree of sparsity.
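For completeness, the selection protocol described at the start of this subsection can be summarized in a short sketch. It is hypothetical: sparse_perceptron refers to the Algorithm 1 sketch given earlier, the grids mirror the values listed above, and evaluate_error is a helper defined here for illustration only.

```python
import itertools
import numpy as np

def evaluate_error(w, labeled_set):
    # Fraction of examples misclassified by sign(<w, x>).
    return float(np.mean([y * np.dot(w, x) <= 0 for x, y in labeled_set]))

# Grids from Section 4.4; sparse_perceptron is the Algorithm 1 sketch above.
LAMBDAS = [0.0001, 0.0005, 0.001, 0.01, 0.1]
ETAS = [0.1, 0.2, 0.3, 0.4, 0.5]
TAUS = [0.001, 0.01, 0.1]

def tune(train_copies, valid_set, d, max_density):
    # Pick (lambda, eta, tau) minimizing the average validation error of the
    # models trained on the shuffled copies of the training set.
    best_params, best_err = None, np.inf
    for lam, eta, tau in itertools.product(LAMBDAS, ETAS, TAUS):
        errs = [evaluate_error(sparse_perceptron(copy, d, tau, eta, lam, max_density),
                               valid_set)
                for copy in train_copies]
        if np.mean(errs) < best_err:
            best_params, best_err = (lam, eta, tau), np.mean(errs)
    return best_params
```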
5 Conclusions

We addressed several problems in existing sparse learning methods. Though RIP provides a theoretical guarantee for recovering sparse solutions, it is mostly satisfied by designed matrices in compressive sensing, rather than by data collected in the real world. Failing to meet RIP can cause matrix-iteration-based gradient descent to diverge [5], and the ℓ0 constraint does not solve the problem. Second, though MDA with a Bregman divergence enjoys fast convergence and approximates the EG algorithm, its update rules produce inaccurate models when the dimensionality is high. Lastly, ℓ1 regularization alone is insufficient for finding a sparser solution; in several existing methods, model density grows faster than necessary. We have proposed to combine the perceptron algorithm with soft-thresholding. The algorithm converges with numerical error taken into account, and we have provided a generalization error bound for it. Finally, the algorithm is highly scalable: data access can be distributed on a per-feature basis, making it more suitable for real-world applications. Experiments have shown that the proposed method outperforms four existing sparse learning algorithms in numerical stability, accuracy and control of model sparsity. In one task with approximately 3.5GB of training data (16 million features, 50k examples), the proposed approach achieved a 5.9% error rate using a model of 0.8% density, while the best competing approach (TG on this dataset) only reaches an 8.2% error rate with a model of 0.7% density.

References

1. E. J. Candes and M. B. Wakin. An introduction to compressive sampling. IEEE Signal Processing Magazine, 25(2):21-30, March 2008.
2. David Donoho and Iain M. Johnstone. Adapting to unknown smoothness via wavelet shrinkage. Journal of the American Statistical Association, 90:1200-1224, 1995.
3. John Duchi and Yoram Singer. Boosting with structural sparsity. In ICML, page 38, 2009.
4. Yoav Freund and Robert E. Schapire. Large margin classification using the perceptron algorithm. In Machine Learning, pages 277-296, 1998.
5. Rahul Garg and Rohit Khandekar. Gradient descent with sparsification: an iterative algorithm for sparse recovery with restricted isometry property. In ICML, pages 337-344, 2009.
6. Claudio Gentile and Nick Littlestone. The robustness of the p-norm algorithms. In Proceedings of the 12th Annual Conference on Computational Learning Theory, pages 1-11. ACM Press, New York, NY, 1999.
7. Thore Graepel and Ralf Herbrich. From margin to sparsity. In Advances in Neural Information Processing Systems 13, pages 210-216. MIT Press, 2001.
8. John Langford, Lihong Li, and Tong Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:777-801, 2009.
9. N. Littlestone and M. Warmuth. Relating data compression and learnability, 1986.
10. Albert B. Novikoff. On convergence proofs for perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, volume 12, pages 615-622, 1963.
11. Shai Shalev-Shwartz, Nathan Srebro, and Tong Zhang. Trading accuracy for sparsity. Technical report, Toyota Technological Institute at Chicago, 2009.
12. Shai Shalev-Shwartz and Ambuj Tewari. Stochastic methods for ℓ1 regularized loss minimization. In ICML, pages 929-936. ACM, 2009.
13. Tong Zhang. Adaptive forward-backward greedy algorithm for sparse learning with linear models. In NIPS, pages 1921-1928, 2008.
14. Peng Zhao and Bin Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541-2563, 2006.
[Fig. 5: Comparison of accuracy. Error rate vs. model density (%) on the eight tasks (comp_vs_rec, comp_vs_sci, comp_vs_talk, sci_vs_rec, sci_vs_talk, rec_vs_talk, rcv1, webspam) for SMIDAS (p = 2 ln d), SMIDAS (p = 0.5 ln d), SCD, TG, and the proposed method.]