
Bounding the difference between RankRC and RankSVM and application to multi-level rare class kernel ranking

Abstract

The rapid explosion in data accumulation has yielded large-scale data mining problems, many of which have intrinsically unbalanced or rare class distributions. Standard classification algorithms, which focus on overall classification accuracy, often perform poorly in these cases. Recently, Tayal et al. (IEEE Trans Knowl Data Eng 27(12):3347–3359, 2015) proposed a kernel method called RankRC for large-scale unbalanced learning. RankRC uses a ranking loss to overcome biases inherent in standard classification-based loss functions, while achieving computational efficiency by enforcing a rare class hypothesis representation. In this paper we establish a theoretical bound for RankRC by showing the equivalence between instantiating a hypothesis using a subset of training points and instantiating a hypothesis using the full training set but with the feature mapping equal to the orthogonal projection of the original mapping. This bound suggests that it is optimal to select points from the rare class first when choosing the subset of data points for a hypothesis representation. In addition, we show that for an arbitrary loss function, the Nyström kernel matrix approximation is equivalent to instantiating a hypothesis using a subset of data points. Consequently, a theoretical bound for the Nyström kernel SVM can be established based on the perturbation analysis of the orthogonal projection in the feature mapping. This generally leads to a tighter bound in comparison to perturbation analysis based on kernel matrix approximation. To further illustrate the computational effectiveness of RankRC, we apply a multi-level rare class kernel ranking method to the Heritage Provider Network's Health Prize competition problem and compare the performance of RankRC to other existing methods.
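
To make the two ingredients of the abstract concrete, the following NumPy sketch (illustrative only, not the authors' implementation; the kernel choice, subset and coefficients are made up) shows a hypothesis instantiated from a subset R of training points, f(x) = sum_{i in R} beta_i k(x_i, x), and the standard Nyström approximation of the kernel matrix built from the same subset, which the paper relates to such a subset representation.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian RBF kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))           # training inputs (toy data)
R = np.arange(20)                       # indices of the chosen subset, e.g. rare-class points

# Hypothesis represented only by the subset R: f(x) = sum_{i in R} beta_i k(x_i, x).
beta = rng.normal(size=len(R))          # placeholder; RankRC would fit these by minimizing a ranking loss
x_new = rng.normal(size=(1, 5))
f_new = rbf_kernel(x_new, X[R]) @ beta  # prediction touches only the |R| subset points

# Nystrom approximation of the full kernel matrix built from the same subset.
K = rbf_kernel(X, X)
C = K[:, R]                             # m x |R| block
W = K[np.ix_(R, R)]                     # |R| x |R| block
K_nys = C @ np.linalg.pinv(W) @ C.T     # rank-|R| approximation of K
print(np.linalg.norm(K - K_nys) / np.linalg.norm(K))
```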

Notes

  1. The competition ran for over two years, ending in April 2013, and was highly publicized due to its potential impact on US healthcare and a US $3 million prize. The authors participated in the competition, placing 4th out of 1600+ teams. Our final submission used additional dataset variants and results from other methods as well, which were combined using a model stacking approach.

References

  • Achlioptas D, McSherry F, Schölkopf B (2001) Sampling techniques for kernel methods. Adv Neural Inf Process Syst 14:335–342

  • Agarwal S, Niyogi P (2005) Stability and generalization of bipartite ranking algorithms. In: Proceedings of the 18th annual conference on learning theory, COLT’05, pp 32–47

  • Airola A, Pahikkala T, Salakoski T (2011a) On learning and cross-validation with decomposed Nyström approximation of kernel matrix. Neural Process Lett 33(1):17–30

  • Airola A, Pahikkala T, Salakoski T (2011b) Training linear ranking SVMs in linearithmic time using red-black trees. Pattern Recognit Lett 32(9):1328–1336

  • Baker CTH (1977) The numerical treatment of integral equations. Clarendon Press, Oxford

  • Bartlett PL, Jordan MI, McAuliffe JD (2006) Convexity, classification, and risk bounds. J Am Stat Assoc 101(473):138–156

  • Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20–29

  • Bordes A, Ertekin S, Weston J, Bottou L (2005) Fast kernel classifiers with online and active learning. J Mach Learn Res 6:1579–1619

  • Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on Computational learning theory. ACM, pp 144–152

  • Bousquet O, Elisseeff A (2002) Stability and generalization. J Mach Learn Res 2:499–526

  • Bradley AP (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit 30:1145–1159

  • Branch MA, Coleman TF, Li Y (1999) A subspace, interior, and conjugate gradient method for large-scale bound-constrained minimization problems. SIAM J Sci Comput 21(1):1–23

  • Chang C-C, Lin C-J (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2:27:1–27:27. http://www.csie.ntu.edu.tw/~cjlin/libsvm

  • Chapelle O (2007) Training a support vector machine in the primal. Neural Comput 19(5):1155–1178

  • Chapelle O, Keerthi SS (2010) Efficient algorithms for ranking with SVMs. Inf Retr J 13(3):201–215

  • Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357

  • Chawla NV, Japkowicz N, Kotcz A (2004) Editorial: special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1–6

  • Coleman TF, Li Y (1994) On the convergence of interior-reflective Newton methods for nonlinear minimization subject to bounds. Math Program 67(1–3):189–224

  • Collobert R, Sinz F, Weston J, Bottou L (2006) Trading convexity for scalability. In: Proceedings of the 23rd international conference on machine learning, pp 201–208

  • Cortes C, Mohri M, Talwalkar A (2010) On the impact of kernel approximation on learning accuracy. In: Proceedings of the 13th international workshop on artificial intelligence and statistics

  • Cotter A, Shalev-Shwartz S, Srebro N (2013) Learning optimally sparse support vector machines. In: Proceedings of the 30th international conference on machine learning, pp 266–274

  • DeLong ER, DeLong DM, Clarke-Pearson DL (1988) Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44(3):837–845

  • El Emam K, Arbuckle L, Koru G, Eze B, Gaudette L, Neri E, Rose S, Howard J, Gluck J (2012) De-identification methods for open health data: the case of the heritage health prize claims dataset. J Med Internet Res 14(1):e33

  • Ezawa K, Singh M, Norton SW (1996) Learning goal oriented Bayesian networks for telecommunications risk management. In: International conference on machine learning, pp 139–147

  • Farahat AK, Ghodsi A, Kamel MS (2011) A novel greedy algorithm for Nyström approximation. In: International conference on artificial intelligence and statistics, pp 269–277

  • Fine S, Scheinberg K (2002) Efficient SVM training using low-rank kernel representations. J Mach Learn Res 2:243–264

  • Fowlkes C, Belongie S, Chung F, Malik J (2004) Spectral grouping using the Nyström method. IEEE Trans Pattern Anal Mach Intell 26(2):214–225

  • Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1):29–36

  • He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284

  • Herbrich R, Graepel T, Obermayer K (2000) Large margin rank boundaries for ordinal regression. MIT Press, Cambridge

  • Heritage Provider Network Health Prize. http://www.heritagehealthprize.com/c/hhp. Accessed 31 Aug 2013

  • Japkowicz N, Stephen S (2002) The class imbalance problem: a systematic study. Intell Data Anal 6(5):429–449

  • Joachims T (2002) Optimizing search engines using clickthrough data. In: SIGKDD international conference on knowledge discovery and data mining, pp 133–142

  • Joachims T (2005) A support vector method for multivariate performance measures. In: International conference on machine learning, pp 377–384

  • Joachims T, Yu C-N (2009) Sparse kernel SVMs via cutting-plane training. Mach Learn 76(2–3):179–193

  • Karakoulas G, Shawe-Taylor J (1999) Optimizing classifiers for imbalanced training sets. In: Proceedings of the 11th international conference on neural information processing systems, pp 253–259

  • Kimeldorf G, Wahba G (1970) A correspondence between Bayesian estimation of stochastic processes and smoothing by splines. Ann Math Stat 41:495–502

  • Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: one-sided selection. In: International conference on machine learning, pp 179–186

  • Kumar S, Mohri M, Talwalkar A (2009) Sampling techniques for the Nyström method. In: International conference on artificial intelligence and statistics, pp 304–311

  • Kumar S, Mohri M, Talwalkar A (2012) Sampling methods for the Nyström method. J Mach Learn Res 13:981–1006

  • Kuo T-M, Lee C-P, Lin C-J (2014) Large-scale kernel rankSVM. In: Proceedings of SIAM international conference on data mining

  • Lee Y-J, Huang S-Y (2007) Reduced support vector machines: a statistical theory. IEEE Trans Neural Netw 18(1):1–13

  • Lin Y, Lee Y, Wahba G (2000) Support vector machines for classification in nonstandard situations. Mach Learn 46(1):192–202

  • Maloof MA (2003) Learning when data sets are imbalanced and when costs are unequal and unknown. In: International conference of machine learning

  • Metz CE (1978) Basic principles of ROC analysis. Semin Nuclear Med 8(4):283–298

  • Nguyen X, Wainwright MJ, Jordan MI (2009) On surrogate loss functions and f-divergences. Ann Stat 37(2):876–904

  • Platt JC (1999) Fast training of support vector machines using sequential minimal optimization. In: Advances in kernel methods-support vector learning, pp 185–208

  • Platt JC (2005) Fastmap, metricmap, and landmark mds are all Nyström algorithms. In: International conference on artificial intelligence and statistics, pp 261–268

  • Provost F, Fawcett T, Kohavi R (1997) The case against accuracy estimation for comparing induction algorithms. In: International conference on machine learning, pp 445–453

  • Raskutti B, Kowalczyk A (2004) Extreme rebalancing for SVMs: a case study. SIGKDD Explor Newsl 6(1):60–69

  • Rifkin R, Yeo G, Poggio T (2003) Regularized least-squares classification. Nato Sci Ser Sub Ser III Comput Syst Sci 190:131–154

  • Rosset S, Zhu J, Hastie T (2003) Margin maximizing loss functions. In: Advances in neural information processing systems, pp 1237–1244

  • Schölkopf B, Herbrich R, Smola AJ (2001) A generalized representer theorem. In: Proceedings of the 14th annual conference on computational learning theory, pp 416–426

  • Smola AJ, Schölkopf B (2000) Sparse greedy matrix approximation for machine learning. In: International conference on machine learning, pp 911–918

  • Sun Y, Kamel MS, Wong AKC, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognit 40(12):3358–3378

  • Talwalkar A (2010) Matrix approximation for large-scale learning. PhD thesis, Courant Institute of Mathematical Sciences, New York University, New York, NY

  • Tayal A, Coleman TF, Li Y (2015) RankRC: large-scale nonlinear rare class ranking. IEEE Trans Knowl Data Eng 27(12):3347–3359

  • Tsang IW, Kwok JT, Cheung PM (2005) Core vector machines: fast SVM training on very large datasets. J Mach Learn Res 6:364–392

  • Turney PD (2000) Types of cost in inductive concept learning. In: International conference on machine learning

  • Waegeman W, Baets BD, Boullart L (2006) A comparison of different ROC measures for ordinal regression. In: International conference on machine learning

  • Weiss GM (2004) Mining with rarity: a unifying framework. SIGKDD Explor Newsl 6(1):7–19

  • Williams C, Seeger M (2001) Using the Nyström method to speed up kernel machines. In: Proceedings of the 13th international conference on neural information processing systems, pp 682–688

  • Woods K, Doss C, Bowyer K, Solka J, Priebe C, Kegelmeyer P (1993) Comparative evaluation of pattern recognition techniques for detection of microcalcifications in mammography. Int J Pattern Recognit Artif Intell 7:1417–1436

  • Wu G, Chang EY (2003) Class-boundary alignment for imbalanced dataset learning. In: International conference on machine learning, pp 49–56

  • Zhang J, Mani I (2003) KNN approach to unbalanced data distributions: a case study involving information extraction. In: International conference on machine learning

  • Zhang K, Tsang IW, Kwok JT (2008) Improved Nyström low-rank approximation and error analysis. In: International conference on machine learning, pp 1232–1239

  • Zhang K, Lan L, Wang Z, Moerchen F (2012) Scaling up kernel svm on limited resources: a low-rank linearization approach. Int Conf Artif Intell Stat 22:1425–1434

  • Zhu M, Su W, Chipman HA (2006) LAGO: a computationally efficient approach for statistical detection. Technometrics 48:193–205

Acknowledgements

The authors acknowledge comments from anonymous referees, which have significantly improved the presentation of the paper.

Author information

Corresponding author

Correspondence to Aditya Tayal.

Additional information

Responsible editor: Chih-Jen Lin.

All three authors acknowledge funding from the Natural Sciences and Engineering Research Council of Canada. Thomas F. Coleman acknowledges funding from the Ophelia Lazaridis University Research Chair. The views expressed herein are solely those of the authors.

Appendix A: Proof Sketch for Theorem 5

Proof

Assume that \(\mathbf {w}^*\) and \(\mathbf {w}_{\mathscr {R}}^*\) are minimizers of (28) and (30), respectively, with total loss for multi-level RankSVM given by:

$$\begin{aligned} R_\phi (\mathbf {w}) = \frac{1}{\sum _{r < s} m_r m_s} \sum _{r=1}^R \sum _{ \{i : y_i = r\} } \sum _{ \{j : y_i > y_j \}} \ell _h \left( \mathbf {w}^T\phi (\mathbf {x}_i) - \mathbf {w}^T\phi (\mathbf {x}_j) \right) \end{aligned}$$
(A.1)

and

$$\begin{aligned} R_{\phi _{\mathscr {R}}}(\mathbf {w}) = \frac{1}{\sum _{r < s} m_r m_s} \sum _{r=1}^R \sum _{ \{i : y_i = r\} } \sum _{ \{j : y_i > y_j \}} \ell _h \left( \mathbf {w}^T\phi _{\mathscr {R}}(\mathbf {x}_i) - \mathbf {w}^T\phi _{\mathscr {R}}(\mathbf {x}_j) \right) . \end{aligned}$$
(A.2)
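
As a concrete reading of (A.1)–(A.2), the sketch below evaluates a pairwise loss of this form directly from scores \(s_i = \mathbf {w}^T\phi (\mathbf {x}_i)\), assuming \(\ell _h\) is the hinge loss \(\ell _h(z) = \max (0, 1-z)\) used by RankSVM; the data and names are illustrative only.

```python
import numpy as np

def hinge(z):
    """Assumed form of the 1-Lipschitz surrogate: ell_h(z) = max(0, 1 - z)."""
    return np.maximum(0.0, 1.0 - z)

def multilevel_rank_loss(scores, y):
    """Pairwise ranking loss of the form (A.1), evaluated from scores s_i = w^T phi(x_i).

    Averages ell_h(s_i - s_j) over all pairs with y_i > y_j and normalizes by
    sum_{r<s} m_r m_s, the total number of such pairs.
    """
    scores, y = np.asarray(scores, dtype=float), np.asarray(y)
    _, counts = np.unique(y, return_counts=True)
    n_pairs = sum(counts[a] * counts[b]              # sum_{r<s} m_r m_s
                  for a in range(len(counts)) for b in range(a))
    total = sum(hinge(scores[i] - scores[j])
                for i in range(len(y)) for j in range(len(y)) if y[i] > y[j])
    return total / n_pairs

# toy example with three ordinal levels 0 < 1 < 2 (level 2 is the rare class)
y = np.array([0, 0, 0, 1, 1, 2])
scores = np.array([-1.0, -0.5, 0.1, 0.4, 0.8, 2.0])
print(multilevel_rank_loss(scores, y))
```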

Let \(\Delta \mathbf {w}= \mathbf {w}_{\mathscr {R}}^* - \mathbf {w}^*\).

A convex function g satisfies

$$\begin{aligned} g(\mathbf {u}+ t(\mathbf {v}-\mathbf {u})) - g(\mathbf {u}) \le t ( g(\mathbf {v}) - g(\mathbf {u}) ) \end{aligned}$$

for all \(\mathbf {u}, \mathbf {v}\) and all \(t \in [0,1]\). Since \(\ell _h\) is convex, \(R_\phi \) and \(R_{\phi _{\mathscr {R}}}\) are convex. Then

$$\begin{aligned} R_{\phi }( \mathbf {w}^* + t \Delta \mathbf {w}) - R_{\phi }( \mathbf {w}^*)&\le t ( R_{\phi }(\mathbf {w}_{\mathscr {R}}^*) - R_{\phi }( \mathbf {w}^*) )\end{aligned}$$
(A.3)
$$\begin{aligned} \text {and} \quad R_{\phi _{\mathscr {R}}}( \mathbf {w}_{\mathscr {R}}^* - t \Delta \mathbf {w}) - R_{\phi _{\mathscr {R}}}( \mathbf {w}_{\mathscr {R}}^*)&\le t ( R_{\phi _{\mathscr {R}}}( \mathbf {w}^*) - R_{\phi _{\mathscr {R}}}(\mathbf {w}_{\mathscr {R}}^*)), \end{aligned}$$
(A.4)

for all \(t \in [0,1]\).

Since \(\mathbf {w}^*\) and \(\mathbf {w}_{\mathscr {R}}^*\) are minimizers of \(F_\phi \) and \(F_{\phi _{\mathscr {R}}}\), for any \(t \in [0,1]\), we have

$$\begin{aligned} F_{\phi }( \mathbf {w}^*)&\le F_{\phi }(\mathbf {w}^* + t \Delta \mathbf {w}) \end{aligned}$$
(A.5)
$$\begin{aligned} \text {and} \quad F_{\phi _{\mathscr {R}}}( \mathbf {w}_{\mathscr {R}}^*)&\le F_{\phi _{\mathscr {R}}}(\mathbf {w}_{\mathscr {R}}^* - t \Delta \mathbf {w}). \end{aligned}$$
(A.6)

Summing (A.5) and (A.6), using \(F_\phi (\mathbf {w}) = R_\phi (\mathbf {w}) + \frac{\lambda }{2} \Vert \mathbf {w}\Vert _2^2 \) and the identity

$$\begin{aligned} \left( \Vert \mathbf {w}^*\Vert ^2 - \Vert \mathbf {w}^* + t \Delta \mathbf {w}\Vert ^2 \right) + \left( \Vert \mathbf {w}_{\mathscr {R}}^*\Vert ^2 - \Vert \mathbf {w}_{\mathscr {R}}^* - t \Delta \mathbf {w}\Vert ^2 \right) = 2t(1-t)\Vert \Delta \mathbf {w}\Vert ^2, \end{aligned}$$

we obtain

$$\begin{aligned} \lambda t(1-t) \Vert \Delta \mathbf {w}\Vert ^2 \le \left( R_{\phi }( \mathbf {w}^* + t \Delta \mathbf {w}) - R_{\phi }( \mathbf {w}^*) \right) + \left( R_{\phi _{\mathscr {R}}}(\mathbf {w}_{\mathscr {R}}^* - t \Delta \mathbf {w}) - R_{\phi _{\mathscr {R}}}( \mathbf {w}_{\mathscr {R}}^*) \right) \end{aligned}$$
(A.7)
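
The norm identity above is elementary but easy to misread; the following check (illustrative only, with random vectors standing in for \(\mathbf {w}^*\) and \(\mathbf {w}_{\mathscr {R}}^*\)) confirms it numerically for several values of \(t\).

```python
import numpy as np

rng = np.random.default_rng(1)
w, w_R = rng.normal(size=10), rng.normal(size=10)   # stand-ins for w* and w_R*
dw = w_R - w                                        # Delta w
for t in (0.0, 0.3, 0.7, 1.0):
    lhs = (w @ w - (w + t * dw) @ (w + t * dw)) \
        + (w_R @ w_R - (w_R - t * dw) @ (w_R - t * dw))
    rhs = 2.0 * t * (1.0 - t) * (dw @ dw)
    assert np.isclose(lhs, rhs)                     # identity holds for all t in [0, 1]
```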

Substituting (A.3) and (A.4) into (A.7), dividing by \(\lambda t\), and taking the limit \(t \rightarrow 0\) gives

$$\begin{aligned}&\Vert \Delta \mathbf {w}\Vert ^2\\&\quad \le \frac{1}{\lambda } \left( R_{\phi }(\mathbf {w}_{\mathscr {R}}^*) - R_{\phi _{\mathscr {R}}}( \mathbf {w}_{\mathscr {R}}^*) + R_{\phi _{\mathscr {R}}}( \mathbf {w}^*) - R_{\phi }( \mathbf {w}^*) \right) \\&\quad = \frac{1}{\lambda \sum _{r < s} m_r m_s} \sum _{r=1}^R \sum _{ \{i : y_i = r\} } \sum _{ \{j : y_i > y_j \}} \Big [ \ell _h \left( (\mathbf {w}_{\mathscr {R}}^*)^T \phi (\mathbf {x}_i) - (\mathbf {w}_{\mathscr {R}}^*)^T \phi (\mathbf {x}_j) \right) \\&\qquad -\,\ell _h \left( (\mathbf {w}_{\mathscr {R}}^*)^T \phi _{\mathscr {R}}(\mathbf {x}_i) - (\mathbf {w}_{\mathscr {R}}^*)^T \phi _{\mathscr {R}}(\mathbf {x}_j) \right) + \ell _h \left( (\mathbf {w}^*)^T \phi _{\mathscr {R}}(\mathbf {x}_i) - (\mathbf {w}^*)^T \phi _{\mathscr {R}}(\mathbf {x}_j) \right) \\&\qquad - \,\ell _h \left( (\mathbf {w}^*)^T \phi (\mathbf {x}_i) - (\mathbf {w}^*)^T \phi (\mathbf {x}_j) \right) \Big ] \\ \end{aligned}$$

where the last equality uses the multi-level RankSVM definitions of \(R_\phi \) and \(R_{\phi _{\mathscr {R}}}\). Since \(\ell _h(\cdot )\) is 1-Lipschitz, we obtain

$$\begin{aligned} \Vert \Delta \mathbf {w}\Vert ^2&\le \frac{ \Vert \mathbf {w}_{\mathscr {R}}^*\Vert + \Vert \mathbf {w}^*\Vert }{\lambda \sum _{r < s} m_r m_s} \sum _{r=1}^R \sum _{ \{i : y_i = r\} } \sum _{ \{j : y_i > y_j \}} \Big ( \Vert \phi (\mathbf {x}_i) - \phi _{\mathscr {R}}(\mathbf {x}_i) \Vert + \Vert \phi (\mathbf {x}_j) - \phi _{\mathscr {R}}(\mathbf {x}_j) \Vert \Big ) \\&= \frac{ \Vert \mathbf {w}_{\mathscr {R}}^*\Vert + \Vert \mathbf {w}^*\Vert }{\lambda \sum _{r < s} m_r m_s} \sum _{r=0}^R \Bigg ( (m_0 + m_1 + \cdots + m_{r-1} + m_{r+1} + \cdots + m_R ) \sum _{ \{i : y_i = r\} } \Vert \phi (\mathbf {x}_i) - \phi _{\mathscr {R}}(\mathbf {x}_i) \Vert \Bigg ), \end{aligned}$$
(A.8)

where the last equality is obtained by factoring over common index sets \(\{i : y_i = r\}\).

From \(\phi (\mathbf {x}) = \phi _{\mathscr {R}}(\mathbf {x}) + \phi _{\mathscr {R}}^\perp (\mathbf {x})\), where \(\phi _{\mathscr {R}}(\mathbf {x}) \in {\mathscr {S}}_{\mathscr {R}}\) and \(\phi _{\mathscr {R}}^\perp (\mathbf {x})\) lies in the space orthogonal to \({\mathscr {S}}_{\mathscr {R}}\), we have, for \(i=1,\ldots ,m\),

$$\begin{aligned} \Vert \phi (\mathbf {x}_i) - \phi _{\mathscr {R}}(\mathbf {x}_i) \Vert = \Vert \phi _{\mathscr {R}}^\perp (\mathbf {x}_i) \Vert \le {\left\{ \begin{array}{ll} \Vert \phi (\mathbf {x}_i) \Vert = \sqrt{ k(\mathbf {x}_i,\mathbf {x}_i) } \le \sqrt{\kappa } , &{} \text {if}\ i \not \in {\mathscr {R}} \\ 0, &{} \text {if}\ i \in {\mathscr {R}}. \end{array}\right. } \end{aligned}$$
(A.9)
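
Since \(\phi _{\mathscr {R}}\) is the orthogonal projection of \(\phi \) onto \({\mathscr {S}}_{\mathscr {R}}\), the squared residual in (A.9) can be computed without the explicit feature map as \(\Vert \phi _{\mathscr {R}}^\perp (\mathbf {x})\Vert ^2 = k(\mathbf {x},\mathbf {x}) - \mathbf {k}_{\mathscr {R}}(\mathbf {x})^T K_{\mathscr {R}\mathscr {R}}^{-1}\mathbf {k}_{\mathscr {R}}(\mathbf {x})\), a standard property of projections in the reproducing kernel Hilbert space. The sketch below (illustrative only; Gaussian kernel, so \(k(\mathbf {x},\mathbf {x}) = 1\) and \(\kappa = 1\), with a pseudo-inverse in place of the inverse) checks both cases of (A.9) numerically.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
R = np.arange(10)                       # subset indices

K_RR = rbf_kernel(X[R], X[R])           # kernel block on the subset
K_xR = rbf_kernel(X, X[R])              # k_R(x_i) for every training point
k_diag = np.ones(len(X))                # k(x, x) = 1 for the RBF kernel, so kappa = 1

# squared norm of the projection residual phi_R_perp(x_i)
res_sq = k_diag - np.einsum('ij,jk,ik->i', K_xR, np.linalg.pinv(K_RR), K_xR)
res_sq = np.clip(res_sq, 0.0, None)     # guard against tiny negative round-off

assert np.allclose(res_sq[R], 0.0, atol=1e-6)   # zero residual for i in R
assert np.all(res_sq <= k_diag + 1e-8)          # bounded by k(x_i, x_i) <= kappa otherwise
```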

In addition, recall that RankSVM is equivalent to a 1-class SVM on an enlarged dataset with the set of points \(\mathscr {P} = \{ \phi (\mathbf {x}_i) - \phi (\mathbf {x}_j) : y_i > y_j,\ i,j=1,\ldots ,m\}\). Therefore \(\mathbf {w}^*\) can be expressed in terms of the dual variables \(0 \le \alpha _{ij}^* \le C\) of an SVM problem trained on \(\mathscr {P}\) with \(C = \frac{1}{\lambda m_+ m_-}\), as follows,

$$\begin{aligned} \mathbf {w}^*&= \sum _{ \{ i,j: y_i > y_j\}} \alpha _{ij}^* ( \phi (\mathbf {x}_i) - \phi (\mathbf {x}_j) ) = \sum _{ \{i : y_i = +1 \} } \sum _{ \{ j : y_j = -1 \}} \alpha _{ij}^* ( \phi (\mathbf {x}_i) - \phi (\mathbf {x}_j) ) \\&= \sum _{ \{i : y_i = +1 \} } \phi (\mathbf {x}_i) \left( \sum _{ \{ j : y_j = -1 \}} \alpha _{ij}^* \right) - \sum _{ \{ j : y_j = -1 \}} \phi (\mathbf {x}_j) \left( \sum _{ \{i : y_i = +1 \} } \alpha _{ij}^* \right) . \end{aligned}$$

Since \(\Vert \phi (\mathbf {x}) \Vert \le \sqrt{\kappa }\) and \(C = \frac{1}{\lambda m_+ m_-}\), we get \( \Vert \mathbf {w}^*\Vert \le \sqrt{\kappa } C m_- m_+ + \sqrt{\kappa } C m_+ m_- = \frac{2\sqrt{\kappa }}{\lambda } \). Similarly, \(\Vert \phi _{\mathscr {R}}(\mathbf {x}) \Vert \le \Vert \phi (\mathbf {x}) \Vert \le \sqrt{\kappa }\) and \(\Vert \mathbf {w}_{\mathscr {R}}^* \Vert \le \frac{2\sqrt{\kappa }}{\lambda }\). Together with (A.9), we can then bound (A.8) by

$$\begin{aligned} \Vert \Delta \mathbf {w}\Vert ^2&\le \frac{ 4 \kappa }{\lambda ^2 M } \sum _{r=0}^R \Bigg ( (m_0 + m_1 + \cdots + m_{r-1} + m_{r+1} + \cdots + m_R ) \sum _{ \{i : y_i = r\} } \mathbb {I}[i \not \in {\mathscr {R}}] \Bigg ), \end{aligned}$$

where \(M = \sum _{r < s} m_r m_s\).
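
The combinatorial factor in this bound is what motivates selecting rare-class points first, as noted in the abstract: each point of level \(r\) left outside \({\mathscr {R}}\) contributes weight \(\sum _{s \ne r} m_s = m - m_r\), which is largest for the rarest levels, so a fixed budget \(|{\mathscr {R}}|\) reduces the bound most when spent on the rare classes. A small comparison with made-up class sizes:

```python
def bound_factor(class_sizes, selected_per_class):
    """sum_r (m - m_r) * (number of level-r points NOT included in the subset R)."""
    m = sum(class_sizes)
    return sum((size_r - kept_r) * (m - size_r)
               for size_r, kept_r in zip(class_sizes, selected_per_class))

class_sizes = [1000, 100, 10]        # toy multi-level data: level 2 is the rarest
# two ways of spending the same budget of |R| = 110 points
rare_first = [0, 100, 10]            # fill R with the rarest levels first
majority_first = [110, 0, 0]         # spend the whole budget on the majority level

print(bound_factor(class_sizes, rare_first))       # smaller factor -> tighter bound
print(bound_factor(class_sizes, majority_first))
```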

Therefore, we obtain

$$\begin{aligned} | f_{\phi _{\mathscr {R}}}(\mathbf {x}) - f_\phi (\mathbf {x}) |&= | \mathbf {w}_{\mathscr {R}}^T \phi _{\mathscr {R}}(\mathbf {x}) - \mathbf {w}^T \phi (\mathbf {x}) | \\&= | \mathbf {w}_{\mathscr {R}}^T \left( \phi (\mathbf {x}) - \phi _{\mathscr {R}}^\perp (\mathbf {x}) \right) - \mathbf {w}^T \phi (\mathbf {x}) | \\&= | \Delta \mathbf {w}^T \phi (\mathbf {x}) - \mathbf {w}_{\mathscr {R}}^T \phi _{\mathscr {R}}^\perp (\mathbf {x}) | \\&= | \Delta \mathbf {w}^T \phi (\mathbf {x}) | \\&\le \Vert \Delta \mathbf {w}\Vert \Vert \phi (\mathbf {x}) \Vert \\&\le \frac{ 2 \kappa }{\lambda \sqrt{M}} \Bigg [ \sum _{r=0}^R \Bigg ( (m_0 + m_1 + \cdots + m_{r-1} + m_{r+1} + \cdots + m_R )\\&\quad \sum _{ \{i : y_i = r\} } \mathbb {I}[i \not \in {\mathscr {R}}] \Bigg )\Bigg ]^{\frac{1}{2}}. \end{aligned}$$

where \(\mathbf {w}_{\mathscr {R}}^T \phi _{\mathscr {R}}^\perp (\mathbf {x}) = 0\) because \(\mathbf {w}_{\mathscr {R}}^* \in {\mathscr {S}}_{\mathscr {R}}\) is orthogonal to \(\phi _{\mathscr {R}}^\perp (\mathbf {x})\). This completes the proof. \(\square \)

About this article

Cite this article

Tayal, A., Coleman, T.F. & Li, Y. Bounding the difference between RankRC and RankSVM and application to multi-level rare class kernel ranking. Data Min Knowl Disc 32, 417–452 (2018). https://doi.org/10.1007/s10618-017-0540-z

  • DOI: https://doi.org/10.1007/s10618-017-0540-z
