Abstract
The rapid explosion in data accumulation has yielded large-scale data mining problems, many of which have intrinsically unbalanced or rare class distributions. Standard classification algorithms, which focus on overall classification accuracy, often perform poorly in these cases. Recently, Tayal et al. (IEEE Trans Knowl Data Eng 27(12):3347–3359, 2015) proposed a kernel method called RankRC for large-scale unbalanced learning. RankRC uses a ranking loss to overcome biases inherent in standard classification-based loss functions, while achieving computational efficiency by enforcing a rare class hypothesis representation. In this paper we derive a theoretical bound for RankRC by establishing an equivalence between instantiating a hypothesis using a subset of training points and instantiating a hypothesis using the full training set but with the feature mapping replaced by the orthogonal projection of the original mapping. This bound suggests that it is optimal to select points from the rare class first when choosing the subset of data points for a hypothesis representation. In addition, we show that for an arbitrary loss function, the Nyström kernel matrix approximation is equivalent to instantiating a hypothesis using a subset of data points. Consequently, a theoretical bound for the Nyström kernel SVM can be established based on a perturbation analysis of the orthogonal projection in the feature mapping. This generally leads to a tighter bound than perturbation analysis based on kernel matrix approximation. To further illustrate the computational effectiveness of RankRC, we apply a multi-level rare class kernel ranking method to the Heritage Provider Network's Health Prize competition problem and compare the performance of RankRC with other existing methods.
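As an informal illustration of this subset-representation view (a minimal numerical sketch, not the authors' code: the Gaussian kernel, the subset R, and all variable names below are hypothetical), the following checks that a hypothesis expanded over a subset of training points coincides with a linear hypothesis in the Nyström feature map built from that subset:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))        # hypothetical training inputs
R = np.arange(20)                        # hypothetical subset (e.g., rare class points)

def rbf(A, B, gamma=0.5):
    # Gaussian kernel matrix between the rows of A and B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

K_RR = rbf(X[R], X[R])                   # kernel among subset points
beta = rng.standard_normal(len(R))       # arbitrary coefficients on the subset

# Hypothesis expanded over the subset only: f(x) = sum_{i in R} beta_i k(x_i, x)
f_subset = rbf(X, X[R]) @ beta

# Nystrom feature map z(x) = K_RR^{-1/2} k_R(x); the same hypothesis is linear
# in z with weight vector w = K_RR^{1/2} beta
evals, evecs = np.linalg.eigh(K_RR)
K_half = evecs @ np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
Z = rbf(X, X[R]) @ np.linalg.pinv(K_half)
w = K_half @ beta
f_nystrom = Z @ w

print(np.allclose(f_subset, f_nystrom))  # True up to numerical tolerance

In RankRC the subset is taken from the rare class, which is the choice the bound derived in the paper favors.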
Notes
The competition ran for over two years, ending in April 2013, and was highly publicized due to its potential impact on US healthcare and a US$3 million prize. The authors participated in the competition, placing 4th out of more than 1600 teams. Our final submission also used additional dataset variants and results from other methods, which were combined using a model stacking approach.
References
Achlioptas D, McSherry F, Schölkopf B (2001) Sampling techniques for kernel methods. Adv Neural Inf Process Syst 14:335–342
Agarwal S, Niyogi P (2005) Stability and generalization of bipartite ranking algorithms. In: Proceedings of the 18th annual conference on learning theory, COLT’05, pp 32–47
Airola A, Pahikkala T, Salakoski T (2011a) On learning and cross-validation with decomposed Nyström approximation of kernel matrix. Neural Process Lett 33(1):17–30
Airola A, Pahikkala T, Salakoski T (2011b) Training linear ranking SVMs in linearithmic time using red-black trees. Pattern Recognit Lett 32(9):1328–1336
Baker CTH (1977) The numerical treatment of integral equations. Clarendon Press, Oxford
Bartlett PL, Jordan MI, McAuliffe JD (2006) Convexity, classification, and risk bounds. J Am Stat Assoc 101(473):138–156
Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20–29
Bordes A, Ertekin S, Weston J, Bottou L (2005) Fast kernel classifiers with online and active learning. J Mach Learn Res 6:1579–1619
Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on Computational learning theory. ACM, pp 144–152
Bousquet O, Elisseeff A (2002) Stability and generalization. J Mach Learn Res 2:499–526
Bradley AP (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit 30:1145–1159
Branch MA, Coleman TF, Li Y (1999) A subspace, interior, and conjugate gradient method for large-scale bound-constrained minimization problems. SIAM J Sci Comput 21(1):1–23
Chang C-C, Lin C-J (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2:27:1–27:27. http://www.csie.ntu.edu.tw/~cjlin/libsvm
Chapelle O (2007) Training a support vector machine in the primal. Neural Comput 19(5):1155–1178
Chapelle O, Keerthi SS (2010) Efficient algorithms for ranking with SVMs. Inf Retr J 13(3):201–215
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
Chawla NV, Japkowicz N, Kotcz A (2004) Editorial: special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1–6
Coleman TF, Li Y (1994) On the convergence of interior-reflective Newton methods for nonlinear minimization subject to bounds. Math Program 67(1–3):189–224
Collobert R, Sinz F, Weston J, Bottou L (2006) Trading convexity for scalability. In: Proceedings of the 23rd international conference on machine learning, pp 201–208
Cortes C, Mohri M, Talwalkar A (2010) On the impact of kernel approximation on learning accuracy. In: Proceedings of the 13th international workshop on artificial intelligence and statistics
Cotter A, Shalev-Shwartz S, Srebro N (2013) Learning optimally sparse support vector machines. In: Proceedings of the 30th international conference on machine learning, pp 266–274
DeLong ER, DeLong DM, Clarke-Pearson DL (1988) Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44(3):837–845
El Emam K, Arbuckle L, Koru G, Eze B, Gaudette L, Neri E, Rose S, Howard J, Gluck J (2012) De-identification methods for open health data: the case of the heritage health prize claims dataset. J Med Internet Res 14(1):e33
Ezawa K, Singh M, Norton SW (1996) Learning goal oriented Bayesian networks for telecommunications risk management. In: International conference on machine learning, pp 139–147
Farahat AK, Ghodsi A, Kamel MS (2011) A novel greedy algorithm for Nyström approximation. In: International conference on artificial intelligence and statistics, pp 269–277
Fine S, Scheinberg K (2002) Efficient SVM training using low-rank kernel representations. J Mach Learn Res 2:243–264
Fowlkes C, Belongie S, Chung F, Malik J (2004) Spectral grouping using the Nyström method. IEEE Trans Pattern Anal Mach Intell 26(2):214–225
Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1):29–36
He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284
Herbrich R, Graepel T, Obermayer K (2000) Large margin rank boundaries for ordinal regression. MIT Press, Cambridge
Heritage Provider Network Health Prize. http://www.heritagehealthprize.com/c/hhp. Accessed 31 Aug 2013
Japkowicz N, Stephen S (2002) The class imbalance problem: a systematic study. Intell Data Anal 6(5):429–449
Joachims T (2002) Optimizing search engines using clickthrough data. In: SIGKDD international conference on knowledge discovery and data mining, pp 133–142
Joachims T (2005) A support vector method for multivariate performance measures. In: International conference on machine learning, pp 377–384
Joachims T, Yu C-N (2009) Sparse kernel SVMs via cutting-plane training. Mach Learn 76(2–3):179–193
Karakoulas G, Shawe-Taylor J (1999) Optimizing classifiers for imbalanced training sets. In: Proceedings of the 11th international conference on neural information processing systems, pp 253–259
Kimeldorf G, Wahba G (1970) A correspondence between Bayesian estimation of stochastic processes and smoothing by splines. Ann Math Stat 41:495–502
Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: one-sided selection. In: International conference on machine learning, pp 179–186
Kumar S, Mohri M, Talwalkar A (2009) Sampling techniques for the Nyström method. In: International conference on artificial intelligence and statistics, pp 304–311
Kumar S, Mohri M, Talwalkar A (2012) Sampling methods for the Nyström method. J Mach Learn Res 13:981–1006
Kuo T-M, Lee C-P, Lin C-J (2014) Large-scale kernel rankSVM. In: Proceedings of SIAM international conference on data mining
Lee Y-J, Huang S-Y (2007) Reduced support vector machines: a statistical theory. IEEE Trans Neural Netw 18(1):1–13
Lin Y, Lee Y, Wahba G (2000) Support vector machines for classification in nonstandard situations. Mach Learn 46(1):192–202
Maloof MA (2003) Learning when data sets are imbalanced and when costs are unequal and unknown. In: International conference on machine learning
Metz CE (1978) Basic principles of ROC analysis. Semin Nuclear Med 8(4):283–298
Nguyen X, Wainwright MJ, Jordan MI (2009) On surrogate loss functions and f-divergences. Ann Stat 37(2):876–904
Platt JC (1999) Fast training of support vector machines using sequential minimal optimization. In: Advances in kernel methods-support vector learning, pp 185–208
Platt JC (2005) FastMap, MetricMap, and Landmark MDS are all Nyström algorithms. In: International conference on artificial intelligence and statistics, pp 261–268
Provost F, Fawcett T, Kohavi R (1997) The case against accuracy estimation for comparing induction algorithms. In: International conference on machine learning, pp 445–453
Raskutti B, Kowalczyk A (2004) Extreme rebalancing for SVMs: a case study. SIGKDD Explor Newsl 6(1):60–69
Rifkin R, Yeo G, Poggio T (2003) Regularized least-squares classification. Nato Sci Ser Sub Ser III Comput Syst Sci 190:131–154
Rosset S, Zhu J, Hastie T (2003) Margin maximizing loss functions. In: Advances in neural information processing systems, pp 1237–1244
Schölkopf B, Herbrich R, Smola AJ (2001) A generalized representer theorem. In: Proceedings of the 14th annual conference on computational learning theory, pp 416–426
Smola AJ, Schölkopf B (2000) Sparse greedy matrix approximation for machine learning. In: International conference on machine learning, pp 911–918
Sun Y, Kamel MS, Wong AKC, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognit 40(12):3358–3378
Talwalkar A (2010) Matrix approximation for large-scale learning. PhD thesis, Courant Institute of Mathematical Sciences, New York University, New York, NY
Tayal A, Coleman TF, Li Y (2015) RankRC: large-scale nonlinear rare class ranking. IEEE Trans Knowl Data Eng 27(12):3347–3359
Tsang IW, Kwok JT, Cheung PM (2005) Core vector machines: fast SVM training on very large datasets. J Mach Learn Res 6:364–392
Turney PD (2000) Types of cost in inductive concept learning. In: International conference on machine learning
Waegeman W, Baets BD, Boullart L (2006) A comparison of different ROC measures for ordinal regression. In: International conference on machine learning
Weiss GM (2004) Mining with rarity: a unifying framework. SIGKDD Explor Newsl 6(1):7–19
Williams C, Seeger M (2001) Using the Nyström method to speed up kernel machines. In: Proceedings of the 13th international conference on neural information processing systems, pp 682–688
Woods K, Doss C, Bowyer K, Solka J, Priebe C, Kegelmeyer P (1993) Comparative evaluation of pattern recognition techniques for detection of microcalcifications in mammography. Int J Pattern Recognit Artif Intell 7:1417–1436
Wu G, Chang EY (2003) Class-boundary alignment for imbalanced dataset learning. In: International conference on machine learning, pp 49–56
Zhang J, Mani I (2003) KNN approach to unbalanced data distributions: a case study involving information extraction. In: International conference on machine learning
Zhang K, Tsang IW, Kwok JT (2008) Improved Nyström low-rank approximation and error analysis. In: International conference on machine learning, pp 1232–1239
Zhang K, Lan L, Wang Z, Moerchen F (2012) Scaling up kernel SVM on limited resources: a low-rank linearization approach. Int Conf Artif Intell Stat 22:1425–1434
Zhu M, Su W, Chipman HA (2006) LAGO: a computationally efficient approach for statistical detection. Technometrics 48:193–205
Acknowledgements
The authors acknowledge the comments from the anonymous referees, which have significantly improved the presentation of the paper.
Additional information
Responsible editor: Chih-Jen Lin.
All three authors acknowledge funding from the Natural Sciences and Engineering Research Council of Canada. Thomas F. Coleman acknowledges funding from the Ophelia Lazaridis University Research Chair. The views expressed herein are solely those of the authors.
Appendix A: Proof Sketch for Theorem 5
Proof
Assume that \(\mathbf {w}^*\) and \(\mathbf {w}_{\mathscr {R}}^*\) are minimizers of (28) and (30), respectively, with total loss for multi-level RankSVM given by:
and
Let \(\Delta \mathbf {w}= \mathbf {w}_{\mathscr {R}}^* - \mathbf {w}^*\).
A convex function \(g\) satisfies \(g\bigl((1-t)\mathbf {v} + t\mathbf {u}\bigr) \le (1-t)\,g(\mathbf {v}) + t\,g(\mathbf {u})\) for all \(\mathbf {u}, \mathbf {v}\) and all \(t \in [0,1]\). Since \(\ell _h\) is convex, \(R_\phi \) and \(R_{\phi _{\mathscr {R}}}\) are convex. Then
for all \(t \in [0,1]\).
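For concreteness, a sketch of the two inequalities this convexity step yields, using \(\mathbf {w}^* + t\Delta \mathbf {w}= (1-t)\mathbf {w}^* + t\mathbf {w}_{\mathscr {R}}^*\) and \(\mathbf {w}_{\mathscr {R}}^* - t\Delta \mathbf {w}= (1-t)\mathbf {w}_{\mathscr {R}}^* + t\mathbf {w}^*\) (the numbered displays (A.3)–(A.4) themselves are not reproduced here):
\[
\begin{aligned}
R_\phi (\mathbf {w}^* + t\Delta \mathbf {w}) - R_\phi (\mathbf {w}^*) &\le t\bigl(R_\phi (\mathbf {w}_{\mathscr {R}}^*) - R_\phi (\mathbf {w}^*)\bigr),\\
R_{\phi _{\mathscr {R}}}(\mathbf {w}_{\mathscr {R}}^* - t\Delta \mathbf {w}) - R_{\phi _{\mathscr {R}}}(\mathbf {w}_{\mathscr {R}}^*) &\le t\bigl(R_{\phi _{\mathscr {R}}}(\mathbf {w}^*) - R_{\phi _{\mathscr {R}}}(\mathbf {w}_{\mathscr {R}}^*)\bigr).
\end{aligned}
\]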
Since \(\mathbf {w}^*\) and \(\mathbf {w}_{\mathscr {R}}^*\) are minimizers of \(F_\phi \) and \(F_{\phi _{\mathscr {R}}}\), for any \(t \in [0,1]\), we have
Summing (A.5) and (A.6), using \(F_\phi (\mathbf {w}) = R_\phi (\mathbf {w}) + \frac{\lambda }{2} \Vert \mathbf {w}\Vert _2^2 \) and the identity
we obtain
Substituting (A.3) and (A.4) into (A.7), dividing by \(\lambda t\), and taking the limit \(t \rightarrow 0\) gives
where the last inequality uses the multi-level RankSVM definitions of \(R_\phi \) and \(R_{\phi _{\mathscr {R}}}\). Since \(\ell _h(\cdot )\) is 1-Lipschitz, we obtain
where the last equality is obtained by factoring over common index sets \(\{i : y_i = r\}\).
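To illustrate the Lipschitz step (a sketch, assuming the multi-level ranking risks \(R_\phi \) and \(R_{\phi _{\mathscr {R}}}\) are sums of \(\ell _h\) over ordered pairs with \(y_i > y_j\); any normalization constant carries through unchanged): for any fixed \(\mathbf {w}\) and any such pair,
\[
\bigl|\ell _h\bigl(\langle \mathbf {w}, \phi (\mathbf {x}_i) - \phi (\mathbf {x}_j)\rangle \bigr) - \ell _h\bigl(\langle \mathbf {w}, \phi _{\mathscr {R}}(\mathbf {x}_i) - \phi _{\mathscr {R}}(\mathbf {x}_j)\rangle \bigr)\bigr|
\le \bigl|\langle \mathbf {w}, \phi _{\mathscr {R}}^\perp (\mathbf {x}_i) - \phi _{\mathscr {R}}^\perp (\mathbf {x}_j)\rangle \bigr|
\le \Vert \mathbf {w}\Vert \bigl(\Vert \phi _{\mathscr {R}}^\perp (\mathbf {x}_i)\Vert + \Vert \phi _{\mathscr {R}}^\perp (\mathbf {x}_j)\Vert \bigr),
\]
so each pairwise term in the difference of risks is controlled by the norms of the orthogonal components, which are then grouped by the common index sets \(\{i : y_i = r\}\).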
From \(\phi (\mathbf {x}) = \phi _{\mathscr {R}}(\mathbf {x}) + \phi _{\mathscr {R}}^\perp (\mathbf {x})\), where \(\phi _{\mathscr {R}}(\mathbf {x}) \in {\mathscr {S}}_{\mathscr {R}}\) and \(\phi _{\mathscr {R}}^\perp (\mathbf {x})\) lies in the space orthogonal to \({\mathscr {S}}_{\mathscr {R}}\), we have, for \(i=1,\ldots ,m\),
In addition, recall that RankSVM is equivalent to a 1-class SVM on an enlarged dataset with the set of points \(\mathscr {P} = \{ \phi (\mathbf {x}_i) - \phi (\mathbf {x}_j) : y_i > y_j,\; i,j=1,\ldots ,m\}\). Therefore \(\mathbf {w}\) can be expressed in terms of the dual variables \(0 \le \alpha _{ij}^* \le C\) of an SVM problem trained on \(\mathscr {P}\) with \(C = \frac{1}{\lambda m_+ m_-}\), as follows,
Since \(\Vert \phi (\mathbf {x}) \Vert \le \sqrt{\kappa }\) and \(C = \frac{1}{\lambda m_+ m_-}\), we get \( \Vert \mathbf {w}^*\Vert \le \sqrt{\kappa } C m_- m_+ + \sqrt{\kappa } C m_+ m_- = \frac{2\sqrt{\kappa }}{\lambda } \). Similarly, \(\Vert \phi _{\mathscr {R}}(\mathbf {x}) \Vert \le \Vert \phi (\mathbf {x}) \Vert \le \sqrt{\kappa }\) and \(\Vert \mathbf {w}_{\mathscr {R}}^* \Vert \le \frac{2\sqrt{\kappa }}{\lambda }\). Together with (A.9), we can then bound (A.8) by
where \(M = \sum _{r < s} m_r m_s\).
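As a check of the norm bound used above (a sketch based on the dual expansion of the preceding step; \(N_{\mathrm p}\) is a symbol introduced here for the number of ordered pairs in \(\mathscr {P}\) and is not part of the original notation):
\[
\Vert \mathbf {w}^*\Vert = \Bigl\Vert \sum _{y_i > y_j} \alpha _{ij}^* \bigl(\phi (\mathbf {x}_i) - \phi (\mathbf {x}_j)\bigr) \Bigr\Vert \le \sum _{y_i > y_j} \alpha _{ij}^* \bigl(\Vert \phi (\mathbf {x}_i)\Vert + \Vert \phi (\mathbf {x}_j)\Vert \bigr) \le 2\sqrt{\kappa }\, C N_{\mathrm p},
\]
which reduces to \(\frac{2\sqrt{\kappa }}{\lambda }\) with the pair counting and choice of \(C\) stated in the text; the same argument applied to \(\phi _{\mathscr {R}}\) gives \(\Vert \mathbf {w}_{\mathscr {R}}^*\Vert \le \frac{2\sqrt{\kappa }}{\lambda }\).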
Therefore, we obtain
This completes the proof. \(\square \)