Abstract
The rapid explosion in data accumulation has yielded large-scale data mining problems, many of which have intrinsically unbalanced or rare class distributions. Standard classification algorithms, which focus on overall classification accuracy, often perform poorly in these cases. Recently, Tayal et al. (IEEE Trans Knowl Data Eng 27(12):3347–3359, 2015) proposed a kernel method called RankRC for large-scale unbalanced learning. RankRC uses a ranking loss to overcome biases inherent in standard classification-based loss functions, while achieving computational efficiency by enforcing a rare class hypothesis representation. In this paper we derive a theoretical bound for RankRC by establishing an equivalence between instantiating a hypothesis using a subset of training points and instantiating a hypothesis using the full training set but with the feature mapping replaced by the orthogonal projection of the original mapping. This bound suggests that it is optimal to select points from the rare class first when choosing the subset of data points for a hypothesis representation. In addition, we show that for an arbitrary loss function, the Nyström kernel matrix approximation is equivalent to instantiating a hypothesis using a subset of data points. Consequently, a theoretical bound for the Nyström kernel SVM can be established based on a perturbation analysis of the orthogonal projection in the feature mapping. This generally leads to a tighter bound than perturbation analysis based on kernel matrix approximation. To further illustrate the computational effectiveness of RankRC, we apply a multi-level rare class kernel ranking method to the Heritage Provider Network's Health Prize competition problem and compare the performance of RankRC with other existing methods.
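As an informal illustration of this subset-representation view (a minimal numerical sketch, not the authors' code: the Gaussian kernel, the subset R, and all variable names below are hypothetical), the following checks that a hypothesis expanded over a subset of training points coincides with a linear hypothesis in the Nyström feature map built from that subset:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))        # hypothetical training inputs
R = np.arange(20)                        # hypothetical subset (e.g., rare class points)

def rbf(A, B, gamma=0.5):
    # Gaussian kernel matrix between the rows of A and B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

K_RR = rbf(X[R], X[R])                   # kernel among subset points
beta = rng.standard_normal(len(R))       # arbitrary coefficients on the subset

# Hypothesis expanded over the subset only: f(x) = sum_{i in R} beta_i k(x_i, x)
f_subset = rbf(X, X[R]) @ beta

# Nystrom feature map z(x) = K_RR^{-1/2} k_R(x); the same hypothesis is linear
# in z with weight vector w = K_RR^{1/2} beta
evals, evecs = np.linalg.eigh(K_RR)
K_half = evecs @ np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
Z = rbf(X, X[R]) @ np.linalg.pinv(K_half)
w = K_half @ beta
f_nystrom = Z @ w

print(np.allclose(f_subset, f_nystrom))  # True up to numerical tolerance

In RankRC the subset is taken from the rare class, which is the choice the bound derived in the paper favors.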
Notes
The competition ran for over two years, ending in April 2013, and was highly publicized due to its potential impact on US healthcare and a US$3 million prize. The authors participated in the competition, placing 4th out of more than 1600 teams. Our final submission also used additional dataset variants and results from other methods, which were combined using a model stacking approach.
References
Achlioptas D, McSherry F, Schölkopf B (2001) Sampling techniques for kernel methods. Adv Neural Inf Process Syst 14:335–342
Agarwal S, Niyogi P (2005) Stability and generalization of bipartite ranking algorithms. In: Proceedings of the 18th annual conference on learning theory, COLT’05, pp 32–47
Airola A, Pahikkala T, Salakoski T (2011a) On learning and cross-validation with decomposed Nyström approximation of kernel matrix. Neural Process Lett 33(1):17–30
Airola A, Pahikkala T, Salakoski T (2011b) Training linear ranking SVMs in linearithmic time using red-black trees. Pattern Recognit Lett 32(9):1328–1336
Baker CTH (1977) The numerical treatment of integral equations. Clarendon Press, Oxford
Bartlett PL, Jordan MI, McAuliffe JD (2006) Convexity, classification, and risk bounds. J Am Stat Assoc 101(473):138–156
Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20–29
Bordes A, Ertekin S, Weston J, Bottou L (2005) Fast kernel classifiers with online and active learning. J Mach Learn Res 6:1579–1619
Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on Computational learning theory. ACM, pp 144–152
Bousquet O, Elisseeff A (2002) Stability and generalization. J Mach Learn Res 2:499–526
Bradley AP (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit 30:1145–1159
Branch MA, Coleman TF, Li Y (1999) A subspace, interior, and conjugate gradient method for large-scale bound-constrained minimization problems. SIAM J Sci Comput 21(1):1–23
Chang C-C, Lin C-J (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2:27:1–27:27. http://www.csie.ntu.edu.tw/~cjlin/libsvm
Chapelle O (2007) Training a support vector machine in the primal. Neural Comput 19(5):1155–1178
Chapelle O, Keerthi SS (2010) Efficient algorithms for ranking with SVMs. Inf Retr J 13(3):201–215
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
Chawla NV, Japkowicz N, Kotcz A (2004) Editorial: special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1–6
Coleman TF, Li Y (1994) On the convergence of interior-reflective Newton methods for nonlinear minimization subject to bounds. Math Program 67(1–3):189–224
Collobert R, Sinz F, Weston J, Bottou L (2006) Trading convexity for scalability. In: Proceedings of the 23rd international conference on machine learning, pp 201–208
Cortes C, Mohri M, Talwalkar A (2010) On the impact of kernel approximation on learning accuracy. In: Proceedings of the 13th international workshop on artificial intelligence and statistics
Cotter A, Shalev-Shwartz S, Srebro N (2013) Learning optimally sparse support vector machines. In: Proceedings of the 30th international conference on machine learning, pp 266–274
DeLong ER, DeLong DM, Clarke-Pearson DL (1988) Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44(3):837–845
El Emam K, Arbuckle L, Koru G, Eze B, Gaudette L, Neri E, Rose S, Howard J, Gluck J (2012) De-identification methods for open health data: the case of the heritage health prize claims dataset. J Med Internet Res 14(1):e33
Ezawa K, Singh M, Norton SW (1996) Learning goal oriented Bayesian networks for telecommunications risk management. In: International conference on machine learning, pp 139–147
Farahat AK, Ghodsi A, Kamel MS (2011) A novel greedy algorithm for Nyström approximation. In: International conference on artificial intelligence and statistics, pp 269–277
Fine S, Scheinberg K (2002) Efficient SVM training using low-rank kernel representations. J Mach Learn Res 2:243–264
Fowlkes C, Belongie S, Chung F, Malik J (2004) Spectral grouping using the Nyström method. IEEE Trans Pattern Anal Mach Intell 26(2):214–225
Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1):29–36
He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284
Herbrich R, Graepel T, Obermayer K (2000) Large margin rank boundaries for ordinal regression. MIT Press, Cambridge
Heritage Provider Network Health Prize. http://www.heritagehealthprize.com/c/hhp. Accessed 31 Aug 2013
Japkowicz N, Stephen S (2002) The class imbalance problem: a systematic study. Intell Data Anal 6(5):429–449
Joachims T (2002) Optimizing search engines using clickthrough data. In: SIGKDD international conference on knowledge discovery and data mining, pp 133–142
Joachims T (2005) A support vector method for multivariate performance measures. In: International conference on machine learning, pp 377–384
Joachims T, Yu C-N (2009) Sparse kernel SVMs via cutting-plane training. Mach Learn 76(2–3):179–193
Karakoulas G, Shawe-Taylor J (1999) Optimizing classifiers for imbalanced training sets. In: Proceedings of the 11th international conference on neural information processing systems, pp 253–259
Kimeldorf G, Wahba G (1970) A correspondence between Bayesian estimation of stochastic processes and smoothing by splines. Ann Math Stat 41:495–502
Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: one-sided selection. In: International conference on machine learning, pp 179–186
Kumar S, Mohri M, Talwalkar A (2009) Sampling techniques for the Nyström method. In: International conference on artificial intelligence and statistics, pp 304–311
Kumar S, Mohri M, Talwalkar A (2012) Sampling methods for the Nyström method. J Mach Learn Res 13:981–1006
Kuo T-M, Lee C-P, Lin C-J (2014) Large-scale kernel rankSVM. In: Proceedings of SIAM international conference on data mining
Lee Y-J, Huang S-Y (2007) Reduced support vector machines: a statistical theory. IEEE Trans Neural Netw 18(1):1–13
Lin Y, Lee Y, Wahba G (2000) Support vector machines for classification in nonstandard situations. Mach Learn 46(1):192–202
Maloof MA (2003) Learning when data sets are imbalanced and when costs are unequal and unknown. In: International conference on machine learning
Metz CE (1978) Basic principles of ROC analysis. Semin Nuclear Med 8(4):283–298
Nguyen X, Wainwright MJ, Jordan MI (2009) On surrogate loss functions and f-divergences. Ann Stat 37(2):876–904
Platt JC (1999) Fast training of support vector machines using sequential minimal optimization. In: Advances in kernel methods-support vector learning, pp 185–208
Platt JC (2005) FastMap, MetricMap, and Landmark MDS are all Nyström algorithms. In: International conference on artificial intelligence and statistics, pp 261–268
Provost F, Fawcett T, Kohavi R (1997) The case against accuracy estimation for comparing induction algorithms. In: International conference on machine learning, pp 445–453
Raskutti B, Kowalczyk A (2004) Extreme rebalancing for SVMs: a case study. SIGKDD Explor Newsl 6(1):60–69
Rifkin R, Yeo G, Poggio T (2003) Regularized least-squares classification. Nato Sci Ser Sub Ser III Comput Syst Sci 190:131–154
Rosset S, Zhu J, Hastie T (2003) Margin maximizing loss functions. In: Advances in neural information processing systems, pp 1237–1244
Schölkopf B, Herbrich R, Smola AJ (2001) A generalized representer theorem. In: Proceedings of the 14th annual conference on computational learning theory, pp 416–426
Smola AJ, Schölkopf B (2000) Sparse greedy matrix approximation for machine learning. In: International conference on machine learning, pp 911–918
Sun Y, Kamel MS, Wong AKC, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognit 40(12):3358–3378
Talwalkar A (2010) Matrix approximation for large-scale learning. PhD thesis, Courant Institute of Mathematical Sciences, New York University, New York, NY
Tayal A, Coleman TF, Li Y (2015) RankRC: large-scale nonlinear rare class ranking. IEEE Trans Knowl Data Eng 27(12):3347–3359
Tsang IW, Kwok JT, Cheung PM (2005) Core vector machines: fast SVM training on very large datasets. J Mach Learn Res 6:364–392
Turney PD (2000) Types of cost in inductive concept learning. In: International conference on machine learning
Waegeman W, Baets BD, Boullart L (2006) A comparison of different ROC measures for ordinal regression. In: International conference on machine learning
Weiss GM (2004) Mining with rarity: a unifying framework. SIGKDD Explor Newsl 6(1):7–19
Williams C, Seeger M (2001) Using the Nyström method to speed up kernel machines. In: Proceedings of the 13th international conference on neural information processing systems, pp 682–688
Woods K, Doss C, Bowyer K, Solka J, Priebe C, Kegelmeyer P (1993) Comparative evaluation of pattern recognition techniques for detection of microcalcifications in mammography. Int J Pattern Recognit Artif Intell 7:1417–1436
Wu G, Chang EY (2003) Class-boundary alignment for imbalanced dataset learning. In: International conference on machine learning, pp 49–56
Zhang J, Mani I (2003) KNN approach to unbalanced data distributions: a case study involving information extraction. In: International conference on machine learning
Zhang K, Tsang IW, Kwok JT (2008) Improved Nyström low-rank approximation and error analysis. In: International conference on machine learning, pp 1232–1239
Zhang K, Lan L, Wang Z, Moerchen F (2012) Scaling up kernel SVM on limited resources: a low-rank linearization approach. Int Conf Artif Intell Stat 22:1425–1434
Zhu M, Su W, Chipman HA (2006) LAGO: a computationally efficient approach for statistical detection. Technometrics 48:193–205
Acknowledgements
The authors acknowledge the comments from the anonymous referees, which have significantly improved the presentation of the paper.
Additional information
Responsible editor: Chih-Jen Lin.
All three authors acknowledge funding from the Natural Sciences and Engineering Research Council of Canada. Thomas F. Coleman acknowledges funding from the Ophelia Lazaridis University Research Chair. The views expressed herein are solely those of the authors.
Appendix A: Proof Sketch for Theorem 5
Proof
Assume that \(\mathbf {w}^*\) and \(\mathbf {w}_{\mathscr {R}}^*\) are minimizers of (28) and (30), respectively, with total loss for multi-level RankSVM given by:
and
Let \(\Delta \mathbf {w}= \mathbf {w}_{\mathscr {R}}^* - \mathbf {w}^*\).
A convex function \(g\) satisfies \(g\bigl((1-t)\mathbf {v} + t\mathbf {u}\bigr) \le (1-t)\,g(\mathbf {v}) + t\,g(\mathbf {u})\) for all \(\mathbf {u}, \mathbf {v}\) and all \(t \in [0,1]\). Since \(\ell _h\) is convex, \(R_\phi \) and \(R_{\phi _{\mathscr {R}}}\) are convex. Then
for all \(t \in [0,1]\).
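For concreteness, a sketch of the two inequalities this convexity step yields, using \(\mathbf {w}^* + t\Delta \mathbf {w}= (1-t)\mathbf {w}^* + t\mathbf {w}_{\mathscr {R}}^*\) and \(\mathbf {w}_{\mathscr {R}}^* - t\Delta \mathbf {w}= (1-t)\mathbf {w}_{\mathscr {R}}^* + t\mathbf {w}^*\) (the numbered displays (A.3)–(A.4) themselves are not reproduced here):
\[
\begin{aligned}
R_\phi (\mathbf {w}^* + t\Delta \mathbf {w}) - R_\phi (\mathbf {w}^*) &\le t\bigl(R_\phi (\mathbf {w}_{\mathscr {R}}^*) - R_\phi (\mathbf {w}^*)\bigr),\\
R_{\phi _{\mathscr {R}}}(\mathbf {w}_{\mathscr {R}}^* - t\Delta \mathbf {w}) - R_{\phi _{\mathscr {R}}}(\mathbf {w}_{\mathscr {R}}^*) &\le t\bigl(R_{\phi _{\mathscr {R}}}(\mathbf {w}^*) - R_{\phi _{\mathscr {R}}}(\mathbf {w}_{\mathscr {R}}^*)\bigr).
\end{aligned}
\]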
Since \(\mathbf {w}^*\) and \(\mathbf {w}_{\mathscr {R}}^*\) are minimizers of \(F_\phi \) and \(F_{\phi _{\mathscr {R}}}\), for any \(t \in [0,1]\), we have
Summing (A.5) and (A.6), using \(F_\phi (\mathbf {w}) = R_\phi (\mathbf {w}) + \frac{\lambda }{2} \Vert \mathbf {w}\Vert _2^2 \) and the identity
we obtain
Substituting (A.3) and (A.4) into (A.7), dividing by \(\lambda t\), and taking the limit \(t \rightarrow 0\) gives
where the last inequality uses the multi-level RankSVM definitions of \(R_\phi \) and \(R_{\phi _{\mathscr {R}}}\). Since \(\ell _h(\cdot )\) is 1-Lipschitz, we obtain
where the last equality is obtained by factoring over common index sets \(\{i : y_i = r\}\).
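To illustrate the Lipschitz step (a sketch, assuming the multi-level ranking risks \(R_\phi \) and \(R_{\phi _{\mathscr {R}}}\) are sums of \(\ell _h\) over ordered pairs with \(y_i > y_j\); any normalization constant carries through unchanged): for any fixed \(\mathbf {w}\) and any such pair,
\[
\bigl|\ell _h\bigl(\langle \mathbf {w}, \phi (\mathbf {x}_i) - \phi (\mathbf {x}_j)\rangle \bigr) - \ell _h\bigl(\langle \mathbf {w}, \phi _{\mathscr {R}}(\mathbf {x}_i) - \phi _{\mathscr {R}}(\mathbf {x}_j)\rangle \bigr)\bigr|
\le \bigl|\langle \mathbf {w}, \phi _{\mathscr {R}}^\perp (\mathbf {x}_i) - \phi _{\mathscr {R}}^\perp (\mathbf {x}_j)\rangle \bigr|
\le \Vert \mathbf {w}\Vert \bigl(\Vert \phi _{\mathscr {R}}^\perp (\mathbf {x}_i)\Vert + \Vert \phi _{\mathscr {R}}^\perp (\mathbf {x}_j)\Vert \bigr),
\]
so each pairwise term in the difference of risks is controlled by the norms of the orthogonal components, which are then grouped by the common index sets \(\{i : y_i = r\}\).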
From \(\phi (\mathbf {x}) = \phi _{\mathscr {R}}(\mathbf {x}) + \phi _{\mathscr {R}}^\perp (\mathbf {x})\), where \(\phi _{\mathscr {R}}(\mathbf {x}) \in {\mathscr {S}}_{\mathscr {R}}\) and \(\phi _{\mathscr {R}}^\perp (\mathbf {x})\) lies in the space orthogonal to \({\mathscr {S}}_{\mathscr {R}}\), we have, for \(i=1,\ldots ,m\),
In addition, recall that RankSVM is equivalent to a 1-class SVM on an enlarged dataset with the set of points \(\mathscr {P} = \{ \phi (\mathbf {x}_i) - \phi (\mathbf {x}_j) : y_i > y_j,\; i,j=1,\ldots ,m\}\). Therefore \(\mathbf {w}\) can be expressed in terms of the dual variables \(0 \le \alpha _{ij}^* \le C\) of an SVM problem trained on \(\mathscr {P}\) with \(C = \frac{1}{\lambda m_+ m_-}\), as follows,
Since \(\Vert \phi (\mathbf {x}) \Vert \le \sqrt{\kappa }\) and \(C = \frac{1}{\lambda m_+ m_-}\), we get \( \Vert \mathbf {w}^*\Vert \le \sqrt{\kappa } C m_- m_+ + \sqrt{\kappa } C m_+ m_- = \frac{2\sqrt{\kappa }}{\lambda } \). Similarly, \(\Vert \phi _{\mathscr {R}}(\mathbf {x}) \Vert \le \Vert \phi (\mathbf {x}) \Vert \le \sqrt{\kappa }\) and \(\Vert \mathbf {w}_{\mathscr {R}}^* \Vert \le \frac{2\sqrt{\kappa }}{\lambda }\). Together with (A.9), we can then bound (A.8) by
where \(M = \sum _{r < s} m_r m_s\).
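As a check of the norm bound used above (a sketch based on the dual expansion of the preceding step; \(N_{\mathrm p}\) is a symbol introduced here for the number of ordered pairs in \(\mathscr {P}\) and is not part of the original notation):
\[
\Vert \mathbf {w}^*\Vert = \Bigl\Vert \sum _{y_i > y_j} \alpha _{ij}^* \bigl(\phi (\mathbf {x}_i) - \phi (\mathbf {x}_j)\bigr) \Bigr\Vert \le \sum _{y_i > y_j} \alpha _{ij}^* \bigl(\Vert \phi (\mathbf {x}_i)\Vert + \Vert \phi (\mathbf {x}_j)\Vert \bigr) \le 2\sqrt{\kappa }\, C N_{\mathrm p},
\]
which reduces to \(\frac{2\sqrt{\kappa }}{\lambda }\) with the pair counting and choice of \(C\) stated in the text; the same argument applied to \(\phi _{\mathscr {R}}\) gives \(\Vert \mathbf {w}_{\mathscr {R}}^*\Vert \le \frac{2\sqrt{\kappa }}{\lambda }\).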
Therefore, we obtain
This completes the proof. \(\square \)