Model averaging for support vector classifier by cross-validation

Zou, Jiahui; Yuan, Chaoxia; Zhang, Xinyu; Zou, Guohua; Wan, Alan T. K.

doi:10.1007/s11222-023-10284-6

Published: 08 August 2023

Volume 33, article number 117 (2023)

Statistics and Computing



Abstract

Support vector classification (SVC) is a well-known statistical technique for classification problems in machine learning and other fields. An important question for SVC is the selection of covariates (or features) for the model. Many studies have considered model selection methods. As is well-known, selecting one winning model over others can entail considerable instability in predictive performance due to model selection uncertainties. This paper advocates model averaging as an alternative approach, where estimates obtained from different models are combined in a weighted average. We propose a model weighting scheme and provide the theoretical underpinning for the proposed method. In particular, we prove that our proposed method yields a model average estimator that achieves the smallest hinge risk among all feasible combinations asymptotically. To remedy the computational burden due to a large number of feasible models, we propose a screening step to eliminate the uninformative features before combining the models. Results from real data applications and a simulation study show that the proposed method generally yields more accurate estimates than existing methods.
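The weighting idea in the abstract can be illustrated with a minimal sketch. It assumes that each candidate model has already produced out-of-fold decision values, and that the weights are chosen on the simplex by minimising a cross-validated hinge loss; the exact criterion of the paper is not shown in this excerpt, and the function names `hinge_cv_loss` and `average_weights`, the exponentiated-gradient solver, and the toy data are all illustrative assumptions.

```python
import math
import random


def hinge_cv_loss(weights, scores, labels):
    """Mean hinge loss of the weighted combination of out-of-fold scores."""
    total = 0.0
    for row, y in zip(scores, labels):
        combined = sum(w * f for w, f in zip(weights, row))
        total += max(0.0, 1.0 - y * combined)
    return total / len(labels)


def average_weights(scores, labels, steps=300, eta=0.5):
    """Exponentiated-gradient search over the simplex; returns the best iterate.

    This solver is an assumption made for illustration, not the paper's algorithm.
    """
    n_models = len(scores[0])
    w = [1.0 / n_models] * n_models
    best_w, best_loss = list(w), hinge_cv_loss(w, scores, labels)
    for _ in range(steps):
        grad = [0.0] * n_models
        for row, y in zip(scores, labels):
            combined = sum(wk * f for wk, f in zip(w, row))
            if 1.0 - y * combined > 0.0:  # only margin violations contribute
                for k, f in enumerate(row):
                    grad[k] -= y * f / len(labels)
        # Multiplicative update keeps the weights positive; renormalise to sum to 1.
        w = [wk * math.exp(-eta * g) for wk, g in zip(w, grad)]
        norm = sum(w)
        w = [wk / norm for wk in w]
        loss = hinge_cv_loss(w, scores, labels)
        if loss < best_loss:
            best_w, best_loss = list(w), loss
    return best_w, best_loss


# Toy out-of-fold scores (hypothetical): model 0 is informative,
# model 1 is pure noise, model 2 is weakly informative.
rng = random.Random(0)
labels = [rng.choice([-1, 1]) for _ in range(200)]
scores = [
    [y * 1.5 + rng.gauss(0, 0.5), rng.gauss(0, 1.0), y * 0.3 + rng.gauss(0, 1.0)]
    for y in labels
]
weights, loss = average_weights(scores, labels)
uniform_loss = hinge_cv_loss([1.0 / 3] * 3, scores, labels)
```

In this toy setting the search moves weight towards the informative model, and the returned hinge loss is never worse than that of equal weights because the best iterate is tracked from the uniform starting point.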


Notes

The parameter that minimises the population hinge loss is the “quasi-true” parameter when the working model is not identical to the true data generating process. If the two are identical, the “quasi-true” parameter is the true parameter.
We find that the number of folds generally has little effect on the performance of the method.

References

Ando, T., Li, K.-C.: A weight-relaxed model averaging approach for high-dimensional generalized linear models. Ann. Stat. 45, 2654–2679 (2017)
Becker, N., Toedt, G., Lichter, P., Benner, A.: Elastic scad as a novel penalization method for svm classification tasks in high-dimensional data. BMC Bioinform. 12, 138–151 (2011)
Borah, P., Gupta, D.: Affinity and transformed class probability-based fuzzy least squares support vector machines. Fuzzy Sets Syst. 443, 203–235 (2022)
Bradley, P.S., Mangasarian, O.L.: Feature selection via concave minimization and support vector machines. In: ICML 98, 82–90 (1998)
Breiman, L.: Bagging predictors. Mach. Learn. 24, 123–140 (1996)
Buckland, S.T., Burnham, K.P., Augustin, N.H.: Model selection: an integral part of inference. Biometrics 53, 603–618 (1997)
Bühlmann, P., van de Geer, S.: Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, New York (2011)
Burges, C.J.C.: A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Disc. 2, 121–167 (1998)
Claeskens, G., Croux, C., van Kerckhoven, J.: Variable selection for logistic regression using a prediction-focused information criterion. Biometrics 62, 972–979 (2006)
Claeskens, G., Croux, C., van Kerckhoven, J.: An information criterion for variable selection in support vector machines. J. Mach. Learn. Res. 9, 541–558 (2008)
Dua, D., Graff, C.: UCI Machine Learning Repository (2017). http://archive.ics.uci.edu/ml
Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 119–139 (1997)
Gorman, R.P., Sejnowski, T.J.: Analysis of hidden units in a layered network trained to classify sonar targets. Neural Netw. 1, 75–89 (1988)
Gupta, U., Gupta, D.: Least squares structural twin bounded support vector machine on class scatter. Appl. Intell. 53, 15321–15351 (2023)
Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene selection for cancer classification using support vector machines. Mach. Learn. 46, 389–422 (2002)
Hansen, B.E.: Least squares model averaging. Econometrica 75, 1175–1189 (2007)
Hansen, B.E., Racine, J.: Jackknife model averaging. J. Econom. 167, 38–46 (2012)
Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York (2001)
Hazarika, B.B., Gupta, D.: Affinity based fuzzy kernel ridge regression classifier for binary class imbalance learning. Eng. Appl. Artif. Intell. 117, 105544 (2023)
Hazarika, B.B., Gupta, D.: Improved twin bounded large margin distribution machines for binary classification. Multimedia Tools Appl. 83, 13341–13368 (2023)
Hoeting, J.A., Madigan, D., Raftery, A.E., Volinsky, C.T.: Bayesian model averaging: a tutorial. Stat. Sci. 14, 382–417 (1999)
Jagannathan, R., Ma, T.: Risk reduction in large portfolios: Why imposing the wrong constraints helps. J. Fin. 58, 1651–1683 (2003)
Kaufman, L.: Solving the Quadratic Programming Problem Arising in Support Vector Classification, pp. 147–167. MIT Press, USA (1999)
Koo, J.-Y., Lee, Y., Kim, Y., Park, C.: A Bahadur representation of the linear support vector machine. J. Mach. Learn. Res. 9, 1343–1368 (2008)
Lee, E.R., Noh, H., Park, B.U.: Model selection via Bayesian information criterion for quantile regression models. J. Am. Stat. Assoc. 109, 216–229 (2014)
Park, C., Kim, K.R., Myung, R., Koo, J.Y.: Oracle properties of scad-penalized support vector machine. J. Stat. Plan. Infer. 142, 2257–2270 (2012)
Schölkopf, B., Smola, A.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, USA (2001)
Sigillito, V.G., Wing, S.P., Hutton, L.V., Baker, K.B.: Classification of radar returns from the ionosphere using neural networks. J. Hopkins APL Tech. Dig. 10, 262–266 (1989)
Tsanas, A., Little, M.A., Fox, C., Ramig, L.O.: Objective automatic assessment of rehabilitative speech treatment in Parkinson’s disease. IEEE Trans. Neural Syst. Rehabil. Eng. 22, 181–190 (2014)
van de Geer, S.: Empirical Processes in M-Estimation. Cambridge University Press, Cambridge (2000)
van der Vaart, A.W., Wellner, J.A.: Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York (1996)
Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, New York (1995)
Wan, A.T.K., Zhang, X., Zou, G.: Least squares model averaging by Mallows criterion. J. Econom. 156, 277–283 (2010)
Wang, L., Zhu, J., Zou, H.: The doubly regularized support vector machine. Stat. Sin. 16, 589–615 (2006)
Wang, L., Wu, Y., Li, R.: Quantile regression for analyzing heterogeneity in ultra-high dimension. J. Am. Stat. Assoc. 107, 214–222 (2012)
Wegkamp, M., Yuan, M.: Support vector machines with a reject option. Bernoulli 17, 1368–1385 (2011)
Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., Vapnik, V.: Feature selection for SVMs. In: NIPS 12, 668–674 (2000)
White, H.: Maximum likelihood estimation of misspecified models. Econometrica 50, 1–25 (1982)
Yuan, Z., Yang, Y.: Combining linear regression models: when and how? J. Am. Stat. Assoc. 100, 1202–1214 (2005)
Zhang, H.H., Ahn, J., Lin, X., Park, C.: Gene selection using support vector machines with non-convex penalty. Bioinformatics 22, 88–95 (2006)
Zhang, X., Lu, Z., Zou, G.: Adaptively combined forecasting for discrete response time series. J. Econom. 176, 80–91 (2013)
Zhang, X., Wu, Y., Wang, L., Li, R.: Variable selection for support vector machines in moderately high dimensions. J. R. Stat. Soc. B 78, 53–76 (2016)
Zhang, X., Wu, Y., Wang, L., Li, R.: A consistent information criterion for support vector machines in diverging model spaces. J. Mach. Learn. Res. 17, 1–26 (2016)
Zhang, X., Yu, D., Zou, G., Liang, H.: Optimal model averaging estimation for generalized linear models and generalized linear mixed-effects models. J. Am. Stat. Assoc. 111, 1775–1790 (2016)
Zhang, X., Zou, G., Liang, H., Carroll, R.J.: Parsimonious model averaging with a diverging number of parameters. J. Am. Stat. Assoc. 115, 972–984 (2020)
Zhou, Z.-H.: Ensemble Methods: Foundations and Algorithms. CRC Press, USA (2012)
Zhu, J., Rosset, S., Hastie, T., Tibshirani, R.: 1-norm support vector machines. Adv. Neural. Inf. Process. Syst. 16, 49–56 (2004)
Zou, H., Yuan, M.: The $F_\infty $-norm support vector machine. Stat. Sin. 18, 379–398 (2008)
Zou, J., Wang, W., Zhang, X., Zou, G.: Optimal model averaging for divergent-dimensional Poisson regressions. Econom. Rev. 41, 775–805 (2022)


Acknowledgements

Jiahui Zou’s work was supported by the National Natural Science Foundation of China (Grant No. 12201431) and the Young Teacher Foundation from Capital University of Economics and Business (XRZ2022070, 00592254413070). Xinyu Zhang’s work was partially supported by the National Natural Science Foundation of China (Grant Nos. 71925007, 72091212 and 12288201) and the CAS Project for Young Scientists in Basic Research (YSBR-008). Guohua Zou’s work was partially supported by the National Natural Science Foundation of China (Grant Nos. 11971323 and 12031016). Wan’s work was supported by a General Research Fund from the Hong Kong Research Grants Council (No. CityU-11500419).

Author information

Authors and Affiliations

School of Statistics, Capital University of Economics and Business, Beijing, 100070, China
Jiahui Zou
School of Science, Shanghai Maritime University, Shanghai, 200090, China
Chaoxia Yuan
Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, China
Xinyu Zhang
School of Mathematical Sciences, Capital Normal University, Beijing, 100048, China
Guohua Zou
Department of Management Sciences and School of Data Science, City University of Hong Kong, Kowloon, 999077, Hong Kong
Alan T. K. Wan


Corresponding author

Correspondence to Jiahui Zou.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: The simulation for nonlinear boundary case

Here, we provide an example for a simple nonlinear boundary case to test the performance of SVCMA. The data generating process is described below.

DGP3: $\Pr (Y=1)=1-\Pr (Y=-1)=0.8I(\text {sign}(\textbf{x}^{\top }{\varvec{\beta }})=1)+0.2I(\text {sign}(\textbf{x}^{\top }{\varvec{\beta }})=-1)$, where $I(\cdot )$ is the indicator function, $\textbf{x}=(x_1, x_2, x_2^2, x_2^3, \ldots , x_2^{959}, x_3, x_4, \ldots , x_{p-958})$, $x_1\sim U(0,1)$, $x_2, \ldots , x_{p-958} {\mathop {\sim }\limits ^{i.i.d.}} U(-1, 1)$ and ${\varvec{\beta }}=(2, 0, -2, 0, -2, 0, 0, \ldots , 0)$.

In this example, the training data contain all the covariates used in DGP3, and we set the dimension $p=1000$, the training size $n=500$ and the testing size $n_{\text {test}}=10{,}000$. The error rates are shown in Fig. 13(a), and the fitted boundaries of all methods are plotted in Fig. 13(b). These two figures show that SVCMA performs best in the presence of a nonlinear boundary. Additionally, most of the methods fit a boundary close to the true one.
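As a rough illustration of DGP3, the sketch below draws one sample using only the Python standard library. With the stated ${\varvec{\beta }}$, the linear score reduces to $2x_1-2x_2^2-2x_2^4$, so the true boundary is $x_1=x_2^2+x_2^4$. The function name `draw_dgp3`, the seed handling, and the absence of an intercept are assumptions made here for illustration; a score of exactly zero has probability zero and is treated as the negative side.

```python
import random


def draw_dgp3(n, p=1000, seed=0):
    """Draw n observations from a DGP3-style design (illustrative sketch)."""
    rng = random.Random(seed)
    n_powers = 959               # x_2, x_2^2, ..., x_2^959
    n_extra = p - 958 - 2        # x_3, ..., x_{p-958}
    beta = [2.0, 0.0, -2.0, 0.0, -2.0] + [0.0] * (p - 5)
    X, y = [], []
    for _ in range(n):
        x1 = rng.uniform(0.0, 1.0)
        x2 = rng.uniform(-1.0, 1.0)
        row = [x1]
        power = 1.0
        for _ in range(n_powers):
            power *= x2          # successive powers of x_2
            row.append(power)
        row.extend(rng.uniform(-1.0, 1.0) for _ in range(n_extra))
        score = sum(b * v for b, v in zip(beta, row))
        prob_pos = 0.8 if score > 0 else 0.2
        y.append(1 if rng.random() < prob_pos else -1)
        X.append(row)
    return X, y, beta


X, y, beta = draw_dgp3(n=20)
```

Each row has $p=1000$ entries: one for $x_1$, 959 powers of $x_2$, and 40 further uniform covariates.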

For general nonlinear boundaries, SVCs with different kernels are useful tools. In future work, we plan to use model averaging to combine estimators based on different kernels, which avoids having to select a single kernel and may improve the performance of SVCs.

Appendix B: Proofs

In the following, all limiting processes correspond to $n\rightarrow \infty $ unless stated otherwise.

B.1 Proof of Lemma 1

Part of this proof follows Zhang et al. (2016b), although both the argument and the conclusion differ from theirs in places.

We will prove (9) first. Recall that ${\hat{{\varvec{\beta }}}}_{(s)}=\arg \min _{{\varvec{\beta }}_{(s)}}\{n^{-1}\sum _{i=1}^{n}(1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}_{(s)})_++2^{-1}\lambda _n\Vert {\varvec{\beta }}_{(s)}^+\Vert ^2\}$. We will show that, for any $0<\eta <1$, there exist a large constant $\triangle >0$ and an integer $N$ such that when $n>N$, we have

$$\begin{aligned}{} & {} \Pr \left\{ \min _{1\le s\le S_n}\inf _{\Vert \textbf{u}_{(s)}\Vert =\triangle } \left\{ l_s\left( {\varvec{\beta }}^*_{(s)}+\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}\right) \right. \right. \nonumber \\{} & {} \qquad \left. \left. -l_s({\varvec{\beta }}^*_{(s)})\right\}>0 \right\} >1-\eta , \end{aligned}$$

(B1)

where $\textbf{u}_{(s)}\in {\mathcal {R}}^{p_s}$ and $l_s({\varvec{\beta }}_{(s)})=n^{-1}\sum _{i=1}^{n}(1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}_{(s)})_++2^{-1}\lambda _n\Vert {\varvec{\beta }}_{(s)}^+\Vert ^2$. As the hinge loss is convex, this implies that with probability greater than $1-\eta $, $\max _{1\le s\le S_n}\Vert {\hat{{\varvec{\beta }}}}_{(s)}-{\varvec{\beta }}^*_{(s)}\Vert \le \triangle \sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})} $. Hence equation (9) in Lemma 1 holds.
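For concreteness, the objective $l_s$ can be evaluated as in the following sketch. It assumes that ${\varvec{\beta }}_{(s)}^+$ denotes the coefficient vector without the intercept and that the intercept is stored first, with a leading column of ones in the design; this excerpt does not define ${\varvec{\beta }}^+$, so that reading, like the function name `penalised_hinge`, is an assumption.

```python
def penalised_hinge(beta, X, y, lam):
    """Mean hinge loss plus (lam / 2) * squared norm of the non-intercept part.

    Assumes beta[0] is the intercept and every row of X starts with 1.0.
    """
    n = len(y)
    hinge = 0.0
    for row, label in zip(X, y):
        margin = label * sum(b * v for b, v in zip(beta, row))
        hinge += max(0.0, 1.0 - margin)
    penalty = 0.5 * lam * sum(b * b for b in beta[1:])  # intercept is not penalised
    return hinge / n + penalty


# Two separable points with a leading intercept column.
X = [[1.0, 2.0], [1.0, -2.0]]
y = [1, -1]
value_at_zero = penalised_hinge([0.0, 0.0], X, y, lam=0.1)  # hinge loss 1, no penalty
value_at_fit = penalised_hinge([0.0, 1.0], X, y, lam=0.1)   # zero hinge loss, penalty only
```

At the zero vector every margin is zero, so the objective equals one; at the second vector both margins exceed one and only the penalty term remains.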

Note that $l_s\left( {\varvec{\beta }}^*_{(s)}+\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}\right) -l_s\left( {\varvec{\beta }}^*_{(s)}\right) $ can be expressed as

$$\begin{aligned}{} & {} l_s\left( {\varvec{\beta }}^*_{(s)}+\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}\right) -l_s\left( {\varvec{\beta }}^*_{(s)}\right) \nonumber \\{} & {} \quad =n^{-1} \sum _{i=1}^{n}\left\{ \left( 1-y_i\textbf{x}_{{(s)},i}^{\top }({\varvec{\beta }}^*_{(s)}+\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}) \right) _+\right. \nonumber \\{} & {} \qquad \left. -\left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\right) _+ \right\} \nonumber \\{} & {} \qquad + 2^{-1}\lambda _n \left\| {\varvec{\beta }}^{*+}_{(s)}+\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}^+\right\| ^2\nonumber \\{} & {} \qquad -2^{-1}\lambda _n \Vert {\varvec{\beta }}^{*+}_{(s)}\Vert ^2. \end{aligned}$$

(B2)

It is shown that

$$\begin{aligned}{} & {} \max _{1\le s\le S_n}\sup _{\Vert \textbf{u}_{(s)}\Vert =\triangle }\bigg |\left\| {\varvec{\beta }}^{*+}_{(s)}+\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}^+ \right\| ^2-\left\| {\varvec{\beta }}^{*+}_{(s)}\right\| ^2 \bigg |\nonumber \\{} & {} \quad \le \max _{1\le s\le S_n}\sup _{\Vert \textbf{u}_{(s)}\Vert =\triangle }\left( \left\| {\varvec{\beta }}^{*+}_{(s)}+\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}^+ \right\| +\left\| {\varvec{\beta }}^{*+}_{(s)}\right\| \right) \nonumber \\{} & {} \qquad \times \left\| \sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}^+ \right\| \nonumber \\{} & {} \quad \le 2\triangle C_2{p_{\max }}\sqrt{n^{-1}\log ({p_{\max }})}+\triangle n^{-1}{p_{\max }}\log ({p_{\max }})\nonumber \\{} & {} \quad =O(\triangle {p_{\max }}\sqrt{n^{-1}\log ({p_{\max }})}), \end{aligned}$$

(B3)

where the last inequality is obtained from Condition 3. Hence the order of difference of penalty terms in (B2) is $O(\triangle \lambda _n{p_{\max }}\sqrt{n^{-1}\log ({p_{\max }})})$.

Denote

$$\begin{aligned} g_{s,i}(\textbf{u}_{(s)})&=\left( 1-y_i\textbf{x}_{{(s)},i}^{\top }({\varvec{\beta }}^*_{(s)}+\sqrt{n^{-1} {p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}) \right) _+\\&\quad -\left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\right) _+\\&\quad +\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}y_i\textbf{x}_{{(s)},i}^{\top }\textbf{u}_{(s)}{\textbf{1}}\left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\ge 0 \right) \\&\quad -\mathbb {E}\left[ \left( 1-y_i\textbf{x}_{{(s)},i}^{\top }({\varvec{\beta }}^*_{(s)}+\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)})\right) _+ \right] \\&\quad +\mathbb {E}\left[ \left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\right) _+ \right] . \end{aligned}$$

It can be verified that $\mathbb {E}[g_{s,i}(\textbf{u}_{(s)})]=0$, $s=1,2,\ldots ,S_n$, by the definition of ${\varvec{\beta }}^*_{(s)}$ and ${\textbf{J}}_{(s)}({\varvec{\beta }}^*_{(s)})=0$. Note that the first term on the right-hand side of (B2) can be further decomposed as

$$\begin{aligned}&n^{-1}\sum _{i=1}^{n}\left\{ \left( 1-y_i\textbf{x}_{{(s)},i}^{\top }({\varvec{\beta }}^*_{(s)}+\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}) \right) _+\right. \\&\quad \left. -\left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\right) _+ \right\} \\&\equiv n^{-1} (A_{s,n}+B_{s,n} ), \end{aligned}$$

where

$$\begin{aligned} A_{s,n}=\sum _{i=1}^{n}g_{s,i}(\textbf{u}_{(s)}) \end{aligned}$$

and

$$\begin{aligned} B_{s,n}&=\sum _{i=1}^{n}\Big [ -\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}y_i \textbf{x}_{{(s)},i}^{\top }\textbf{u}_{(s)}{\textbf{1}}\nonumber \\&\quad \left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\ge 0 \right) \nonumber \\&\quad + \mathbb {E}\left\{ \left( 1-y_i\textbf{x}_{{(s)},i}^{\top }({\varvec{\beta }}^*_{(s)}+\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}) \right) _+ \right\} \nonumber \\&\quad -\mathbb {E}\left\{ \left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\right) _+\right\} \Big ]. \end{aligned}$$

(B4)

The remainder of the proof consists of three steps. In Step 1, we demonstrate that

$$\begin{aligned} \max _{1\le s\le S_n}\sup _{\Vert \textbf{u}_{(s)}\Vert =\triangle }|A_{s,n}|=\triangle ^{3/2}{p_{\max }}o_{p}(1). \end{aligned}$$

(B5)

In Step 2, it is shown that $\min _{1\le s\le S_n}\inf _{\Vert \textbf{u}_{(s)}\Vert =\triangle }B_{s,n}$ dominates the terms of order $\triangle ^{3/2}{p_{\max }}o_{p}(1)$ and is larger than zero. In Step 3, we use the results from the previous steps to prove (B1).

Step 1: We use the covering number introduced by van der Vaart and Wellner (1996) to prove the uniform rate in (B5). It suffices to show, for any $\epsilon >0$, that

$$\begin{aligned} \Pr \left( \max _{1\le s\le S_n}\sup _{\Vert \textbf{u}_{(s)}\Vert =\triangle } p_s^{-1}\bigg |\sum _{i=1}^{n}g_{s,i}(\textbf{u}_{(s)}) \bigg |>\triangle ^{3/2}\epsilon \right) \rightarrow 0. \end{aligned}$$

(B6)

Note that the hinge loss satisfies the Lipschitz condition and $\max _{1\le i \le n}\Vert \textbf{x}_{{(s)},i}\Vert \le C_1\sqrt{p_s}$, $\max _{1\le i \le n}\mathbb {E}\Vert \textbf{x}_{{(s)},i}\Vert \le C_1\sqrt{p_s}$ from Condition 2. It is shown that

$$\begin{aligned} |g_{s,i}(\textbf{u}_{(s)})|&\le 3\triangle \sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})} \nonumber \\&\quad \max \left\{ \max _{1\le i \le n}\Vert \textbf{x}_{{(s)},i}\Vert ,\max _{1\le i \le n}\mathbb {E}\Vert \textbf{x}_{{(s)},i}\Vert \right\} \nonumber \\&\le 3C_1\triangle {p_{\max }}\sqrt{n^{-1}\log ({p_{\max }})} \end{aligned}$$

(B7)

and thus $\max _{1\le s\le S_n}\sup _{\Vert \textbf{u}_{(s)}\Vert =\triangle }p_s^{-1}|g_{s,i}(\textbf{u}_{(s)})|=o(1)$ by Condition 6. By Lemma 2.5 of van de Geer (2000), the ball $\{\textbf{u}_{(s)}:\Vert \textbf{u}_{(s)}\Vert \le \triangle \}$ in ${\mathcal {R}}^{p_s+1}$ can be covered by $N_s$ balls with radius $\zeta _s$, where $N_s\le \{(4\triangle +\zeta _s)/\zeta _s\}^{p_s+1}$. Denote ${\textbf {u}}_{(s)}^{1},\ldots ,{\textbf {u}}_{(s)}^{N_s}$ as the centers of the $N_s$ balls, let $\zeta _s=(nM_1)^{-1} p_s$ (for some large constant $M_1>0$ ) and denote $ {\mathcal {U}}_s^{k}=\{\textbf{u}_{(s)}: \Vert \textbf{u}_{(s)}-\textbf{u}_{(s)}^{k}\Vert \le \zeta _s \& \Vert \textbf{u}_{(s)}\Vert =\triangle \}$. For any $\epsilon >0$, we have

$$\begin{aligned}&\max _{1\le s\le S_n}\max _{1\le k\le N_s} \sup _{\textbf{u}_{(s)}\in {\mathcal {U}}_s^{(k)}} p_s^{-1}\bigg |\sum _{i=1}^{n}g_{s,i}(\textbf{u}_{(s)})-\sum _{i=1}^{n}g_{s,i}(\textbf{u}_{(s)}^k)\bigg |\nonumber \\&\quad \le \max _{1\le s\le S_n}\max _{1\le k\le N_s} \sup _{\textbf{u}_{(s)}\in {\mathcal {U}}_s^{k}} p_s^{-1}\sum _{i=1}^{n}\bigg |g_{s,i}(\textbf{u}_{(s)})- g_{s,i}(\textbf{u}_{(s)}^k)\bigg |\nonumber \\&\quad \le \max _{1\le s\le S_n}\max _{1\le k\le N_s} \sup _{\textbf{u}_{(s)}\in {\mathcal {U}}_s^{k}} np_s^{-1}\nonumber \\&\qquad \quad \Big \{2\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\Vert \textbf{x}_{{(s)},i}\Vert \Vert \textbf{u}_{(s)}-\textbf{u}_{(s)}^{k}\Vert \nonumber \\&\qquad +\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\Vert \textbf{u}_{(s)}-\textbf{u}_{(s)}^{k}\Vert \mathbb {E}\Vert \textbf{x}_{{(s)},i}\Vert \Big \} \nonumber \\&\quad \le \max _{1\le s\le S_n}3\triangle n p_s^{-1} \sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\nonumber \\&\qquad \max \left\{ \max _{1\le i \le n}\Vert \textbf{x}_{{(s)},i}\Vert ,\max _{1\le i \le n}\mathbb {E}\Vert \textbf{x}_{{(s)},i}\Vert \right\} \zeta _s\nonumber \\&\quad \le 3C_1 M_1^{-1}\triangle {p_{\max }}\sqrt{n^{-1}\log ({p_{\max }})}\nonumber \\&\quad =o( \triangle ^{3/2}p_{\min }\epsilon /2), \end{aligned}$$

(B8)

where the last equality follows from Condition 6. From (B8), it can be shown that

$$\begin{aligned}&\Pr \left( \max _{1\le s\le S_n}\sup _{\Vert \textbf{u}_{(s)}\Vert =\triangle } p_s^{-1}\bigg |\sum _{i=1}^{n}g_{s,i}(\textbf{u}_{(s)}) \bigg |>\triangle ^{3/2} \epsilon \right) \nonumber \\&\quad \le \Pr \Bigg (\max _{1\le s\le S_n}\max _{1\le k\le N_s} \sup _{\textbf{u}_{(s)}\in {\mathcal {U}}_s^{(k)}}p_s^{-1}\bigg |\sum _{i=1}^{n}g_{s,i}(\textbf{u}_{(s)})-\sum _{i=1}^{n}g_{s,i}(\textbf{u}_{(s)}^{k})\bigg |\nonumber \\&\qquad +\max _{1\le s\le S_n}\max _{1\le k \le N_s}p_s^{-1}\bigg |\sum _{i=1}^{n}g_{s,i}(\textbf{u}_{(s)}^{k}) \bigg |>\triangle ^{3/2}\epsilon \Bigg )\nonumber \\&\quad \le \Pr \Bigg (\max _{1\le s\le S_n}\max _{1\le k\le N_s} \sup _{\textbf{u}_{(s)}\in {\mathcal {U}}_s^{(k)}}\bigg |\sum _{i=1}^{n}g_{s,i}(\textbf{u}_{(s)})\nonumber \\&\qquad -\sum _{i=1}^{n}g_{s,i}(\textbf{u}_{(s)}^{k})\bigg |>\triangle ^{3/2}p_{\min }\epsilon /2\Bigg )\nonumber \\&\qquad +\sum _{s=1}^{S_n}\sum _{k=1}^{N_s}\Pr \left( \bigg |\sum _{i=1}^{n}g_{s,i}(\textbf{u}_{(s)}^{k}) \bigg |>\triangle ^{3/2}p_s\epsilon /2 \right) \nonumber \\&\quad = \sum _{s=1}^{S_n}\sum _{k=1}^{N_s}\Pr \left( \bigg |\sum _{i=1}^{n}g_{s,i}(\textbf{u}_{(s)}^{k}) \bigg |> \triangle ^{3/2}p_s\epsilon /2 \right) +o(1) \end{aligned}$$

(B9)

and $\sum _{i=1}^{n}g_{s,i}(\textbf{u}_{(s)}^{(k)})$ is the sum of independent zero-mean random variables.
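The covering-number bound used above, $N_s\le \{(4\triangle +\zeta _s)/\zeta _s\}^{p_s+1}$ with $\zeta _s=(nM_1)^{-1}p_s$, grows quickly, which is why it later enters (B13) on the log scale. A small sketch of that logarithm follows; the function name `log_covering_bound` and the numerical inputs are purely illustrative.

```python
import math


def log_covering_bound(delta, p_s, n, m1):
    """Log of the bound {(4 * delta + zeta) / zeta} ** (p_s + 1), zeta = p_s / (n * m1)."""
    zeta = p_s / (n * m1)
    return (p_s + 1) * math.log((4.0 * delta + zeta) / zeta)


# Hypothetical values: delta = 1, p_s = 4, n = 100, M_1 = 1, giving 5 * log(101).
bound = log_covering_bound(delta=1.0, p_s=4, n=100, m1=1.0)
```

The same quantity can be written as $(p_s+1)\log (4\triangle nM_1p_s^{-1}+1)$, which matches the form of the corresponding term in (B13).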

By the bounded conditional density under Conditions 1 and 4, and recognising that $\max _{1\le i \le n}\Vert \textbf{x}_{{(s)},i}\Vert \le C_1\sqrt{p_s}$, we have

$$\begin{aligned}&\Pr \left( |1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}|\le \sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})} \max _{1\le i \le n}\Vert \textbf{x}_{{(s)},i}\Vert \triangle \right) \nonumber \\&\quad =\Pr \Big (\pm 1-\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})} \max _{1\le i \le n}\Vert \textbf{x}_{{(s)},i}\Vert \triangle \le \textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\nonumber \\&\quad \le \sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})} \max _{1\le i \le n}\Vert \textbf{x}_{{(s)},i}\Vert \triangle \pm 1\Big \vert y_i=\pm 1\Big )\nonumber \\&\quad \le 2C_3\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\max _{1\le i \le n}\Vert \textbf{x}_{{(s)},i}\Vert \triangle \nonumber \\&\quad \le 2\triangle C_1 C_3\sqrt{n^{-1} p_s{p_{\max }}\log ({p_{\max }})}. \end{aligned}$$

(B10)

Note that when $1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}<\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}y_i\textbf{x}_{{(s)},i}^{\top }\textbf{u}_{(s)}$ and $\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}y_i\textbf{x}_{{(s)},i}^{\top }\textbf{u}_{(s)}<0$, or when $1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}>\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}y_i\textbf{x}_{{(s)},i}^{\top }\textbf{u}_{(s)}$ and $\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}y_i\textbf{x}_{{(s)},i}^{\top }\textbf{u}_{(s)}>0$,

$$\begin{aligned}&\left( 1-y_i\textbf{x}_{{(s)},i}^{\top }({\varvec{\beta }}^*_{(s)}+\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}) \right) _+ -(1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)})_+ \nonumber \\&\quad +\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})} y_i\textbf{x}_{{(s)},i}^{\top }\textbf{u}_{(s)}{\textbf{1}}(1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\ge 0)=0. \end{aligned}$$

(B11)

Furthermore, equation (B11) holds when $|1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}|>\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\max _{1\le i \le n}\Vert \textbf{x}_{{(s)},i}\Vert \triangle $ as $\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\max _{1\le i \le n}\Vert \textbf{x}_{{(s)},i}\Vert \triangle >\bigg |\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}y_i\textbf{x}_{{(s)},i}^{\top }\textbf{u}_{(s)}\bigg |$.
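The identity (B11) is easy to check numerically. In the sketch below, `a` stands for $1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}$ and `b` for $\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}y_i\textbf{x}_{{(s)},i}^{\top }\textbf{u}_{(s)}$; both are shorthand introduced here, so the left-hand side of (B11) becomes $(a-b)_+-(a)_++b{\textbf{1}}(a\ge 0)$. The random ranges are arbitrary illustrative choices.

```python
import random


def positive_part(t):
    """The hinge operator (t)_+ = max(t, 0)."""
    return max(t, 0.0)


def b11_left_side(a, b):
    """Left-hand side of (B11) in the shorthand a, b described above."""
    indicator = 1.0 if a >= 0.0 else 0.0
    return positive_part(a - b) - positive_part(a) + b * indicator


rng = random.Random(1)
max_residual = 0.0
for _ in range(10000):
    a = rng.uniform(-2.0, 2.0)
    b = rng.uniform(-1.0, 1.0)
    if abs(a) > abs(b):  # the sufficient condition stated in the text
        max_residual = max(max_residual, abs(b11_left_side(a, b)))

# When |a| <= |b| the identity can fail, e.g. a = 0.1 and b = 0.5.
counterexample = b11_left_side(0.1, 0.5)
```

The residual vanishes (up to rounding) whenever $|a|>|b|$, while the last line gives a case with $|a|\le |b|$ where the left-hand side equals $0.4$. This is why only observations with $|a|$ small contribute to the second-moment bound that follows.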

Hence we can write

$$\begin{aligned}&\sum _{i=1}^{n}\mathbb {E}\{g_{s,i}^2(\textbf{u}_{(s)}^{k})\}\nonumber \\&\quad \le \sum _{i=1}^{n}\mathbb {E}\Bigg [\Bigg \{\Big |\left( 1-y_i\textbf{x}_{{(s)},i}^{\top }({\varvec{\beta }}^*_{(s)}+\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}^{k}) \right) _+ -\left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\right) _+\Big |\nonumber \\&\qquad +\Big |\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}y_i\textbf{x}_{{(s)},i}^{\top }\textbf{u}_{(s)}^{k} \Big |\Bigg \}^2{\textbf{1}}\Big (|1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}|\le \sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\max _{1\le i \le n}\Vert \textbf{x}_{{(s)},i}\Vert \triangle \Big )\Bigg ]\nonumber \\&\quad \le \sum _{i=1}^{n}\mathbb {E}\Bigg \{ \left( 2\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{x}_{{(s)},i}^{\top }\textbf{u}_{(s)}^{k}\right) ^2{\textbf{1}}\Big (|1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}|\le \sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\max _{1\le i \le n}\Vert \textbf{x}_{{(s)},i}\Vert \triangle \Big )\Bigg \}\nonumber \\&\quad \le \left( 2\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\max _{1\le i \le n}\Vert \textbf{x}_{{(s)},i}\Vert \triangle \right) ^2\sum _{i=1}^{n}\mathbb {E}{\textbf{1}}\Big (|1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}|\le \sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\max _{1\le i \le n}\Vert \textbf{x}_{{(s)},i}\Vert \triangle \Big )\nonumber \\&\quad \le 4 C_1^2\triangle ^2 n^{-1} p_s {p_{\max }}\log ({p_{\max }})\sum _{i=1}^{n}\mathbb {E}{\textbf{1}}\Big (|1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}|\le \sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\max _{1\le i \le n}\Vert \textbf{x}_{{(s)},i}\Vert \triangle \Big )\nonumber \\&\quad \le 4 C_1^2\triangle ^2 n^{-1} p_s {p_{\max }}\log ({p_{\max }})\times 2n\triangle C_1 C_3\sqrt{n^{-1} p_s{p_{\max }}\log ({p_{\max }})}\nonumber \\&\quad = 8\triangle ^3 C_1^3C_3n^{-1/2}p_s^{3/2}p^{3/2}_{\max }\log ^{3/2}({p_{\max }}) , \end{aligned}$$

(B12)

where the second-to-last inequality arises from $\max _{1\le i \le n}\Vert \textbf{x}_{{(s)},i}\Vert \le C_1\sqrt{p_s}$ and the last inequality is from (B10). Finally, by Bernstein’s inequality and recognising (B7) and (B12), we can write

$$\begin{aligned}&\sum _{s=1}^{S_n}\sum _{k=1}^{N_s}\Pr \left( \bigg |\sum _{i=1}^{n}g_{s,i}(\textbf{u}_{(s)}^{k}) \bigg |>\triangle ^{3/2}p_s\epsilon /2 \right) \nonumber \\&\quad \le \sum _{s=1}^{S_n}\sum _{k=1}^{N_s}2\exp \left( -\frac{\triangle ^{3}p_s^2\epsilon ^2/4}{\sum _{i=1}^{n}\mathbb {E}\{g_{s,i}^2(\textbf{u}_{(s)}^{k})\}+3\triangle ^{5/2} C_1p_s{p_{\max }}\sqrt{n^{-1}\log ({p_{\max }})} \epsilon /2 }\right) \nonumber \\&\quad \le \sum _{s=1}^{S_n}\left( \frac{4\triangle +(nM_1)^{-1}p_s}{(nM_1)^{-1}p_s} \right) ^{p_s+1}\times \exp \left( -\frac{\triangle ^{3}p_s^2\epsilon ^2/4}{ 8\triangle ^3 C_1^3 C_3 n^{-1/2}p_s^{3/2}p^{3/2}_{\max }\log ^{3/2}({p_{\max }})+ 3\triangle ^{5/2} C_1p_s{p_{\max }}\sqrt{n^{-1}\log ({p_{\max }})} \epsilon /2 }\right) \nonumber \\&\quad \le S_n \left( \frac{4\triangle M_1n}{p_{\min }}+1 \right) ^{{p_{\max }}+1} \exp \left( -\frac{\triangle ^{3}p_{\min }^{1/2}\epsilon ^2/4}{ 16\triangle ^3 C_1^3 C_3 n^{-1/2}p^{3/2}_{\max }\log ^{3/2}({p_{\max }}) }\right) \nonumber \\&\quad =O(1)\exp \Big \{\log (S_n)+({p_{\max }}+1)\log (4\triangle nM_1p^{-1}_{\min }+1)-64^{-1}C_1^{-3}C_3^{-1}\epsilon ^2n^{1/2}p_{\min }^{1/2}p^{-3/2}_{\max }\log ^{-3/2}({p_{\max }})\Big \}\nonumber \\&\quad =o(1), \end{aligned}$$

(B13)

where the last equality is due to Condition 6 and $S_n=O\{\exp (n^{\tau })\}$ for $\tau \in (0, 1/2-3\kappa /2)$. The proof of (B6) is complete by combining (B9) and (B13).
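To give a feel for the concentration step, the sketch below compares a textbook form of Bernstein's inequality, $\Pr (|\sum _i X_i|>t)\le 2\exp [-t^2/\{2(\sum _i \textrm{Var}(X_i)+Mt/3)\}]$ for independent zero-mean variables with $|X_i|\le M$, against a Monte Carlo estimate. The constants in (B13) follow the version used by the authors and differ from this form; the uniform variables, sample sizes, and the function name `bernstein_bound` are illustrative assumptions.

```python
import math
import random


def bernstein_bound(t, variance_sum, max_abs):
    """Textbook two-sided Bernstein bound for a sum of bounded zero-mean variables."""
    return 2.0 * math.exp(-t * t / (2.0 * (variance_sum + max_abs * t / 3.0)))


rng = random.Random(2)
n, t, reps = 50, 10.0, 2000
exceed = 0
for _ in range(reps):
    total = sum(rng.uniform(-1.0, 1.0) for _ in range(n))  # each term has variance 1/3
    if abs(total) > t:
        exceed += 1
empirical = exceed / reps
bound = bernstein_bound(t, variance_sum=n / 3.0, max_abs=1.0)  # equals 2 * exp(-2.5)
```

The empirical tail frequency sits well below the bound, as expected for an inequality that holds for all bounded distributions. In the proof, the exponential decay must be strong enough to offset both the covering number $N_s$ and the number of candidate models $S_n$.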

Step 2: Let us rewrite $B_{s,n}$ as $B_{s,n}\equiv B_{s,n1}+B_{s,n2}$, where

$$\begin{aligned} B_{s,n1}&= -\sum _{i=1}^{n}\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}y_i \textbf{x}_{{(s)},i}^{\top }\textbf{u}_{(s)} {\textbf{1}}\left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\ge 0 \right) \end{aligned}$$

and

$$\begin{aligned} B_{s,n2}&= \sum _{i=1}^{n}\Big [ \mathbb {E}\left\{ \left( 1-y_i\textbf{x}_{{(s)},i}^{\top }({\varvec{\beta }}^*_{(s)}+\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}) \right) _+ \right\} -\mathbb {E}\left\{ \left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\right) _+\right\} \Big ]. \end{aligned}$$

To analyse $B_{s,n1}$, we observe that

$$\begin{aligned}&\bigg |\sum _{i=1}^{n}y_i\textbf{x}_{{(s)},i}^{\top }\textbf{u}_{(s)}{\textbf{1}}\left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\ge 0 \right) \bigg |\nonumber \\&\quad =\bigg |\sum _{j=0}^{p_s}\sum _{i=1}^{n}y_i x_{{(s)},ij}u_{{(s)},j}{\textbf{1}}\left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\ge 0 \right) \bigg |\nonumber \\&\quad \le \sum _{j=0}^{p_s}|u_{{(s)},j}|\max _{0\le j\le p_s}\bigg |\sum _{i=1}^{n}y_i x_{{(s)},ij}{\textbf{1}}\left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\ge 0 \right) \bigg |\nonumber \\&\quad \le \sqrt{\sum _{j=0}^{p_s}u_{{(s)},j}^2}\sqrt{\sum _{j=0}^{p_s}1}\max _{0\le j\le p_s}\bigg |\sum _{i=1}^{n}y_ix_{{(s)},ij}{\textbf{1}}\left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\ge 0 \right) \bigg |\nonumber \\&\quad \le \sqrt{p_s+1} \triangle \max _{0\le j\le p_s}\bigg |\sum _{i=1}^{n}y_i x_{{(s)},ij}{\textbf{1}}\left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\ge 0 \right) \bigg |. \end{aligned}$$

(B14)
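
The chain in (B14) relies only on the bound $|\sum_j u_j a_j|\le \Vert \textbf{u}\Vert_1\max_j|a_j|$ followed by the Cauchy–Schwarz inequality. The following is a minimal numerical sketch of that chain, not part of the original proof; the vectors `u` and `a`, the dimension `p_s` and the radius `delta` are hypothetical stand-ins, with `a` playing the role of the column sums $\sum_i y_i x_{(s),ij}\textbf{1}(\cdot)$.

```python
import numpy as np

def b14_chain(u, a):
    """Return the three successive quantities in the chain (B14)."""
    lhs = abs(np.dot(u, a))                          # |sum_j u_j a_j|
    l1_bound = np.sum(np.abs(u)) * np.max(np.abs(a)) # ||u||_1 * max_j |a_j|
    cs_bound = np.linalg.norm(u) * np.sqrt(u.size) * np.max(np.abs(a))
    return lhs, l1_bound, cs_bound

rng = np.random.default_rng(0)
p_s, delta = 7, 2.0                  # hypothetical model size and radius
u = rng.normal(size=p_s + 1)
u *= delta / np.linalg.norm(u)       # enforce ||u|| = delta
a = rng.normal(size=p_s + 1)         # stand-in for the column sums

lhs, l1_bound, cs_bound = b14_chain(u, a)
# Last line of (B14): sqrt(p_s + 1) * delta * max_j |a_j|
final = np.sqrt(p_s + 1) * delta * np.max(np.abs(a))
print(lhs, l1_bound, cs_bound, final)
```

Because $\Vert \textbf{u}\Vert =\triangle$ is enforced, the Cauchy–Schwarz bound coincides with the last line of (B14).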

By the definition of ${\textbf{J}}_s({\varvec{\beta }}^*_{(s)})$, note that $\mathbb {E}\left[ y_ix_{{(s)},ij}{\textbf{1}}\left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\ge 0 \right) \right] =0$ for $0\le j \le p_s$. By Lemma 14.24 in Bühlmann and van de Geer (2011) (the Nemirovski moment inequality),

$$\begin{aligned}&\mathbb {E}\left\{ \max _{0\le j\le p_s}\bigg |\sum _{i=1}^{n}y_ix_{{(s)},ij}{\textbf{1}}\left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\ge 0 \right) \bigg |\right\} \nonumber \\&\quad \le \sqrt{8\log (2p_s+2)}\mathbb {E}\left( \max _{0\le j\le p_s}\sum _{i=1}^{n}y_i^2x^2_{{(s)},ij}\right) ^{1/2}\nonumber \\&\quad \le \sqrt{ 8\log (2p_s+2)}\sqrt{nC_1^2}\nonumber \\&\quad =O(\sqrt{n\log (p_s)}), \end{aligned}$$

(B15)

where the last inequality is established by Condition 2. Additionally, using Markov’s inequality and by (B15), we obtain

$$\begin{aligned}&\max _{0\le j\le p_s}\bigg |\sum _{i=1}^{n}y_ix_{{(s)},ij}{\textbf{1}}\left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\ge 0 \right) \bigg |\nonumber \\&\quad = O_{p}(\sqrt{n \log (p_s)}). \end{aligned}$$

(B16)
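
The order $\sqrt{n\log (p_s)}$ in (B15) and (B16) can be checked by simulation. The sketch below is an illustration under simplifying assumptions that are not from the paper: the summands are replaced by i.i.d. Rademacher variables (mean zero, bounded by $C_1=1$), and `p` stands for the number of coordinates $p_s+1$. It compares a Monte Carlo estimate of the expected maximum with the right-hand side of (B15).

```python
import numpy as np

def mean_max_abs_column_sum(n, p, reps, rng):
    """Monte Carlo estimate of E max_j |sum_i z_ij| for i.i.d. Rademacher z_ij."""
    maxima = np.empty(reps)
    for r in range(reps):
        z = rng.choice([-1.0, 1.0], size=(n, p))   # mean-zero, |z_ij| = 1
        maxima[r] = np.abs(z.sum(axis=0)).max()
    return maxima.mean()

rng = np.random.default_rng(1)
n, p, reps = 400, 50, 200            # hypothetical sizes; p plays the role of p_s + 1
estimate = mean_max_abs_column_sum(n, p, reps, rng)

# Right-hand side of (B15) with C_1 = 1: sqrt(8 log(2p)) * sqrt(n)
nemirovski_bound = np.sqrt(8 * np.log(2 * p)) * np.sqrt(n)
print(estimate, nemirovski_bound)
```

Markov's inequality then turns the bound on the expectation into the $O_p$ statement (B16).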

Combining (B14) and (B16), we have

$$\begin{aligned}&\max _{1\le s\le S_n}\sup _{\Vert \textbf{u}_{(s)}\Vert =\triangle }|B_{s,n1}|\nonumber \\&\quad {=}\max _{1\le s\le S_n}\sup _{\Vert \textbf{u}_{(s)}\Vert =\triangle }\bigg |\sum _{i=1}^{n}{-}\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}y_i \textbf{x}_{{(s)},i}^{\top }\textbf{u}_{(s)}{\textbf{1}}\nonumber \\&\qquad \left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\ge 0 \right) \bigg |\nonumber \\&\quad =\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\max _{1\le s\le S_n}\sup _{\Vert \textbf{u}_{(s)}\Vert =\triangle }\bigg |\sum _{i=1}^{n}y_i\textbf{x}_{{(s)},i}^{\top }\textbf{u}_{(s)}{\textbf{1}}\nonumber \\&\qquad \left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\ge 0 \right) \bigg |\nonumber \\&\quad \le \sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})} \triangle O_p\left\{ \sqrt{{p_{\max }}+1}\sqrt{n\log ({p_{\max }})}\right\} \nonumber \\&\quad = O_{p}(\triangle {p_{\max }}\log ({p_{\max }})). \end{aligned}$$

(B17)

Turning to $B_{s,n2}$, under Conditions 5 and 6 and according to Koo et al. (2008), ${\textbf{H}}_{(s)}({\varvec{\beta }}_{(s)})$ is element-wise continuous at ${\varvec{\beta }}^*_{(s)}$. By Taylor expansion of the hinge loss at ${\varvec{\beta }}^*_{(s)}$, we have

$$\begin{aligned} {\textbf{H}}_{(s)}\left( {\varvec{\beta }}^*_{(s)}+t\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}\right) ={\textbf{H}}_{(s)}({\varvec{\beta }}^*_{(s)})+o(1). \end{aligned}$$

(B18)

Hence, it is shown that

$$\begin{aligned}&\min _{1\le s\le S_n}\inf _{\Vert \textbf{u}_{(s)}\Vert =\triangle }B_{s,n2}\nonumber \\&\quad =\min _{1\le s\le S_n}\inf _{\Vert \textbf{u}_{(s)}\Vert =\triangle }\sum _{i=1}^{n}\nonumber \\&\qquad \Bigg [\mathbb {E}\left\{ \left( 1-y_i\textbf{x}_{{(s)},i}^{\top }({\varvec{\beta }}^*_{(s)}+\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}) \right) _+ \right\} \nonumber \\&\qquad -\mathbb {E}\left\{ (1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)})_+\right\} \Bigg ]\nonumber \\&\quad = \min _{1\le s\le S_n}\inf _{\Vert \textbf{u}_{(s)}\Vert =\triangle } 2^{-1}{p_{\max }}\log ({p_{\max }}) \textbf{u}_{(s)}^{\top }{\textbf{H}}_{(s)}\nonumber \\&\qquad \left( {\varvec{\beta }}^*_{(s)}+t\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}\right) \textbf{u}_{(s)}\nonumber \\&\quad \ge 2^{-1}\triangle ^2 c_0{p_{\max }}\log ({p_{\max }}), \end{aligned}$$

(B19)

for some $0<t<1$, where the last inequality is due to (B18) and Condition 5. It can be readily shown by (B4), (B17), (B19) and Condition 6 that when $\triangle $ is sufficiently large, $2^{-1}\triangle ^2 c_0{p_{\max }}\log ({p_{\max }})(>0)$ dominates the other terms in $B_{s,n}$. This completes the proof of Step 2.
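
The last step of (B19) is a quadratic-form lower bound. Assuming, as the role of $c_0$ in (B19) suggests, that Condition 5 bounds the smallest eigenvalue of ${\textbf{H}}_{(s)}$ from below by $c_0$, the inequality $\textbf{u}^{\top }{\textbf{H}}\textbf{u}\ge c_0\triangle ^2$ holds on the sphere $\Vert \textbf{u}\Vert =\triangle $. The sketch below checks this for a hypothetical positive-definite matrix; the matrix `hessian` is randomly generated for illustration and is not derived from any SVC model.

```python
import numpy as np

rng = np.random.default_rng(2)
dim, delta = 6, 3.0                          # hypothetical dimension and radius

# Hypothetical positive-definite stand-in for H_(s).
a = rng.normal(size=(dim, dim))
hessian = a @ a.T + 0.5 * np.eye(dim)
c0 = np.linalg.eigvalsh(hessian).min()       # smallest eigenvalue, plays the role of c_0

# Random directions on the sphere ||u|| = delta.
u = rng.normal(size=(1000, dim))
u *= delta / np.linalg.norm(u, axis=1, keepdims=True)

# Quadratic form u^T H u for each sampled direction.
quad = np.einsum("ij,jk,ik->i", u, hessian, u)
print(quad.min(), c0 * delta**2)
```

The quadratic form grows like $\triangle ^2$, whereas the bound (B17) on $B_{s,n1}$ is linear in $\triangle $, which is why a sufficiently large $\triangle $ makes the quadratic term dominate.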

Step 3: Combining (B3), (B6), (B17) and (B19), when n and $\triangle $ are sufficiently large, we have

$$\begin{aligned}&\max _{1\le s\le S_n}\inf _{\Vert \textbf{u}_{(s)}\Vert =\triangle } \left\{ l_s\left( {\varvec{\beta }}^*_{(s)}+\sqrt{n^{-1}p_{\max }\log ({p_{\max }})}\textbf{u}_{(s)}\right) -l_s({\varvec{\beta }}^*_{(s)})\right\} \nonumber \\&\quad =\max _{1\le s\le S_n}\inf _{\Vert \textbf{u}_{(s)}\Vert =\triangle }\Big \{ n^{-1}(A_{s,n}+B_{s,n})+2^{-1}\lambda _n \left\| {\varvec{\beta }}^{*+}_{(s)}\right. \nonumber \\&\qquad \left. +\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}^+\right\| ^2 -2^{-1}\lambda _n \Vert {\varvec{\beta }}^{*+}_{(s)}\Vert ^2\Big \}\nonumber \\&\quad \ge \max _{1\le s\le S_n}\inf _{\Vert \textbf{u}_{(s)}\Vert =\triangle }\Big \{ n^{-1}B_{s,n}-n^{-1}|A_{s,n}|-2^{-1}\lambda _n \Big |\left\| {\varvec{\beta }}^{*+}_{(s)}\right. \nonumber \\&\qquad \left. +\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}^+\right\| ^2- \Vert {\varvec{\beta }}^{*+}_{(s)}\Vert ^2\Big |\Big \}\nonumber \\&\quad \ge \min _{1\le s\le S_n}\inf _{\Vert \textbf{u}_{(s)}\Vert =\triangle }n^{-1} B_{s,n2}-\max _{1\le s\le S_n}\sup _{\Vert \textbf{u}_{(s)}\Vert =\triangle }n^{-1}|B_{s,n1}|\nonumber \\&\qquad -\max _{1\le s\le S_n}\sup _{\Vert \textbf{u}_{(s)}\Vert =\triangle }n^{-1}|A_{s,n}|\nonumber \\&\qquad -2^{-1}\lambda _n \max _{1\le s\le S_n}\sup _{\Vert \textbf{u}_{(s)}\Vert =\triangle } \Big |\left\| {\varvec{\beta }}^{*+}_{(s)}+\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}^+\right\| ^2\nonumber \\&\qquad - \Vert {\varvec{\beta }}^{*+}_{(s)}\Vert ^2\Big |\nonumber \\&\quad =2^{-1}n^{-1}\triangle ^2 c_0{p_{\max }}\log ({p_{\max }})- O_{p}(\triangle n^{-1} {p_{\max }}\log ({p_{\max }}))\nonumber \\&\qquad -\triangle ^{3/2}n^{-1}{p_{\max }}o_{p}(1)\nonumber \\&\qquad -2^{-1}\triangle \lambda _n{p_{\max }}\sqrt{n^{-1}\log ({p_{\max }})}\nonumber \\&>0, \end{aligned}$$

(B20)

where the last inequality is obtained from Conditions 5–6 and $\lambda _n=O(\sqrt{n^{-1}\log ({p_{\max }})})$. This completes the proof of (B1).

Equation (10) can be proved in a similar way. Note that $n-\lfloor n/J \rfloor \sim n$ and each sample from ${\mathcal {D}}_n$ is drawn independently from an identical distribution. Hence ${\widetilde{{\varvec{\beta }}}}^{[-j]}_{(s)}$ converges to ${\varvec{\beta }}^*_{(s)}$ in the same order as ${\hat{{\varvec{\beta }}}}_{(s)}$ for each $j=1,2,\ldots ,J$, i.e.,

$$\begin{aligned} \max _{1\le j\le J} \max _{1\le s\le S_n}\left\| {\widetilde{{\varvec{\beta }}}}_{(s)}^{[-j]}-{\varvec{\beta }}^*_{(s)}\right\| =O_p\left( \sqrt{\frac{{p_{\max }}\log ({p_{\max }})}{n}}\right) . \end{aligned}$$

(B21)

B.2 Proof of Theorem 1

Let us introduce Lemma 2 that facilitates the proof of Theorem 1.

Lemma 2

Assume that Condition 7 and

$$\begin{aligned} \sup _{\textbf{w}\in {{\mathcal {W}}}}\bigg |\frac{\textrm{CV}(\textbf{w})-R_n(\textbf{w})}{R_n(\textbf{w})}\bigg |=o_p(1) \end{aligned}$$

(B22)

hold. Then

$$\begin{aligned} \frac{R_n({\hat{\textbf{w}}})}{\inf _{\textbf{w}\in {\mathcal {W}}} R_n(\textbf{w})}\rightarrow 1 \end{aligned}$$

(B23)

in probability, where ${\hat{\textbf{w}}}$ is the optimal solution from (8).

Proof of Lemma 2

By the definition of infimum, there exist a sequence $\vartheta _n$ and a vector sequence $\textbf{w}_n\in {{\mathcal {W}}}$ such that as $n\rightarrow \infty $, $\vartheta _n\rightarrow 0$ and

$$\begin{aligned} \inf _{\textbf{w}\in {{\mathcal {W}}}}R_n(\textbf{w})=R_n(\textbf{w}_n)-\vartheta _n. \end{aligned}$$

(B24)

From Condition 7, we have

$$\begin{aligned} \frac{R_n(\textbf{w}_n)}{\inf _{\textbf{w}\in {{\mathcal {W}}}}R_n(\textbf{w})}&\ge \frac{\inf _{\textbf{w}\in {{\mathcal {W}}}}R_n(\textbf{w})}{\inf _{\textbf{w}\in {{\mathcal {W}}}}R_n(\textbf{w})}=1, \end{aligned}$$

(B25)

and

$$\begin{aligned} \frac{\vartheta _n}{\inf _{\textbf{w}\in {{\mathcal {W}}}}R_n(\textbf{w})}&=o_p(1). \end{aligned}$$

(B26)

Taking (B22), (B25) and (B26) together, for any $\delta >0$,

$$\begin{aligned}&\Pr \left\{ \bigg |\frac{\inf _{\textbf{w}\in {\mathcal {W}}} R_n(\textbf{w})}{R_n({\hat{\textbf{w}}})}-1\bigg |>\delta \right\} \nonumber \\&\quad =\Pr \left\{ \frac{R_n({\hat{\textbf{w}}})-\inf _{\textbf{w}\in {\mathcal {W}}} R_n(\textbf{w})}{R_n({\hat{\textbf{w}}})}>\delta \right\} \nonumber \\&\quad =\Pr \left\{ \frac{R_n({\hat{\textbf{w}}})-{\text {CV}}({\hat{\textbf{w}}})+{\text {CV}}({\hat{\textbf{w}}})-R_n(\textbf{w}_n)+\vartheta _n}{R_n({\hat{\textbf{w}}})}>\delta \right\} \nonumber \\&\quad \le \Pr \left\{ \frac{R_n({\hat{\textbf{w}}})-{\text {CV}}({\hat{\textbf{w}}})+{\text {CV}}(\textbf{w}_n)-R_n(\textbf{w}_n)+\vartheta _n}{R_n({\hat{\textbf{w}}})}>\delta \right\} \nonumber \\&\quad \le \Pr \left\{ \frac{|R_n({\hat{\textbf{w}}})-{\text {CV}}({\hat{\textbf{w}}})|}{R_n({\hat{\textbf{w}}})} +\frac{|{\text {CV}}(\textbf{w}_n)-R_n(\textbf{w}_n)|}{\inf _{\textbf{w}\in {{\mathcal {W}}}}R_n(\textbf{w})} +\frac{\vartheta _n}{\inf _{\textbf{w}\in {{\mathcal {W}}}}R_n(\textbf{w})}>\delta \right\} \nonumber \\&\quad \le \Pr \left\{ \sup _{\textbf{w}\in {{\mathcal {W}}}}\bigg |\frac{R_n(\textbf{w})-{\text {CV}}(\textbf{w})}{R_n(\textbf{w})}\bigg | +\frac{|{\text {CV}}(\textbf{w}_n)-R_n(\textbf{w}_n)|/R_n(\textbf{w}_n)}{\inf _{\textbf{w}\in {{\mathcal {W}}}}R_n(\textbf{w})/R_n(\textbf{w}_n)}+\frac{\vartheta _n}{\inf _{\textbf{w}\in {{\mathcal {W}}}}R_n(\textbf{w})}>\delta \right\} \nonumber \\&\quad \le \Pr \Bigg \{\sup _{\textbf{w}\in {{\mathcal {W}}}}\bigg |\frac{R_n(\textbf{w})-{\text {CV}}(\textbf{w})}{R_n(\textbf{w})}\bigg | +\sup _{\textbf{w}\in {{\mathcal {W}}}}\bigg |\frac{{\text {CV}}(\textbf{w})-R_n(\textbf{w})}{R_n(\textbf{w})}\bigg |\frac{R_n(\textbf{w}_n)}{\inf _{\textbf{w}\in {{\mathcal {W}}}}R_n(\textbf{w})} +\frac{\vartheta _n}{\inf _{\textbf{w}\in {{\mathcal {W}}}}R_n(\textbf{w})}>\delta \Bigg \}\nonumber \\&\quad \rightarrow 0, \end{aligned}$$

(B27)

which implies that (B23) is valid.
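
The mechanism behind Lemma 2 can be illustrated on a finite grid. If a criterion satisfies $|\textrm{CV}(\textbf{w})-R(\textbf{w})|\le \eta R(\textbf{w})$ for every $\textbf{w}$, then its minimiser ${\hat{\textbf{w}}}$ satisfies $R({\hat{\textbf{w}}})/\min _{\textbf{w}}R(\textbf{w})\le (1+\eta )/(1-\eta )$, which tends to 1 as $\eta \rightarrow 0$. The sketch below is a deterministic toy version under stated assumptions: the array `risk` and the perturbations are hypothetical random numbers, not risks of any fitted classifier, and the infimum is attained because the grid is finite.

```python
import numpy as np

def risk_ratio_after_cv(risk, rel_error):
    """Minimise a perturbed criterion and return R(w_hat) / min_w R(w)."""
    cv = risk * (1.0 + rel_error)     # criterion with relative error rel_error
    w_hat = np.argmin(cv)             # index of the selected weight vector
    return risk[w_hat] / risk.min()

rng = np.random.default_rng(3)
risk = rng.uniform(0.2, 1.0, size=500)    # hypothetical positive risks on a weight grid

ratios = {}
for eta in (0.2, 0.05, 0.01):
    # Relative error bounded by eta, mimicking the uniform condition (B22).
    rel_error = rng.uniform(-eta, eta, size=risk.size)
    ratios[eta] = risk_ratio_after_cv(risk, rel_error)
print(ratios)
```

As the uniform relative error shrinks, the ratio is squeezed towards 1, which is the content of (B23).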

Proof of Theorem 1

Let

$$\begin{aligned} T_n(\textbf{w})=\frac{1}{n}\sum _{i=1}^{n}\left( 1-y_i\textbf{x}_i^{\top }{\hat{{\varvec{\beta }}}}(\textbf{w})\right) _+. \end{aligned}$$

(B28)

By Lemma 2 and the triangle inequality, it suffices to verify that

$$\begin{aligned} \sup _{\textbf{w}\in {{\mathcal {W}}}}\frac{|{\text {CV}}(\textbf{w})-T_n(\textbf{w})|}{R_n(\textbf{w})}=o_p(1), \end{aligned}$$

(B29)

and

$$\begin{aligned} \sup _{\textbf{w}\in {{\mathcal {W}}}}\frac{|T_n(\textbf{w})-R_n(\textbf{w})|}{R_n(\textbf{w})}=o_p(1). \end{aligned}$$

(B30)

For (B29), we have

$$\begin{aligned}&|{\text {CV}}(\textbf{w})-T_n(\textbf{w})|=\bigg |\frac{1}{n}\sum _{j=1}^{J}\sum _{i\in \mathcal{A}(j)}\left\{ (1-y_i\textbf{x}_i^{\top }{\widetilde{{\varvec{\beta }}}}^{[-j]}(\textbf{w}))_+\right. \nonumber \\&\qquad \left. -(1-y_i\textbf{x}_i^{\top }{\hat{{\varvec{\beta }}}}(\textbf{w}))_+\right\} \bigg |\nonumber \\&\quad \le \frac{1}{n}\sum _{j=1}^{J}\sum _{i\in \mathcal{A}(j)}\bigg |\int _{y_i\textbf{x}_i^{\top }{\widetilde{{\varvec{\beta }}}}^{[-j]}(\textbf{w})}^{y_i\textbf{x}_i^{\top }{\hat{{\varvec{\beta }}}}(\textbf{w})}I(t\le 1)\textrm{d}t \bigg |\nonumber \\&\quad \le \frac{1}{n}\sum _{j=1}^{J}\sum _{i\in \mathcal{A}(j)}\bigg |y_i\textbf{x}_i^{\top }\left( {\widetilde{{\varvec{\beta }}}}^{[-j]}(\textbf{w})-{\hat{{\varvec{\beta }}}}(\textbf{w})\right) \bigg |\nonumber \\&\quad \le \frac{1}{n}\sum _{j=1}^{J}\sum _{i\in \mathcal{A}(j)}\sum _{s=1}^{S_n}w_s\Vert \textbf{x}_{{(s)},i}\Vert \left\| {\widetilde{{\varvec{\beta }}}}_{(s)}^{[-j]}-{\hat{{\varvec{\beta }}}}_{(s)}\right\| \nonumber \\&\quad \le \frac{1}{n}\sum _{i=1}^{n}\max _{1\le s\le S_n}\Vert \textbf{x}_{{(s)},i}\Vert \max _{1\le j\le J}\max _{1\le s\le S_n}\left\| {\widetilde{{\varvec{\beta }}}}_{(s)}^{[-j]}-{\hat{{\varvec{\beta }}}}_{(s)}\right\| \nonumber \\&\quad \le C_1\sqrt{{p_{\max }}}\max _{1\le j\le J}\max _{1\le s\le S_n}\left\| {\widetilde{{\varvec{\beta }}}}_{(s)}^{[-j]}-{\hat{{\varvec{\beta }}}}_{(s)}\right\| \nonumber \\&\quad =O_p\left( \frac{{p_{\max }}\sqrt{\log ({p_{\max }})}}{\sqrt{n}}\right) \nonumber \\&\quad =o_p(1), \end{aligned}$$

(B31)

where the second-to-last equality follows from Lemma 1, and the last equality follows from Condition 6. Combining (B31) with Condition 7, we obtain (B29).
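
The first steps of (B31) use the fact that the hinge loss is 1-Lipschitz, so the gap between $\textrm{CV}(\textbf{w})$ and $T_n(\textbf{w})$ is controlled by the distance between the fold estimates and the full-sample estimates. The sketch below computes both criteria and checks this bound. It rests on assumptions that are not from the paper: the coefficient arrays `beta_hat` and `beta_folds` are random stand-ins rather than fitted SVC estimates, each candidate model's coefficients are zero-padded to the full dimension, and the bound uses the full-vector norm $\Vert \textbf{x}_i\Vert $, which is at least as large as any sub-vector norm $\Vert \textbf{x}_{(s),i}\Vert $.

```python
import numpy as np

def hinge(margin):
    """Hinge loss (1 - margin)_+ applied element-wise."""
    return np.maximum(0.0, 1.0 - margin)

def t_n(w, X, y, beta_hat):
    """T_n(w) of (B28); beta_hat has one zero-padded row per candidate model."""
    return hinge(y * (X @ (w @ beta_hat))).mean()

def cv(w, X, y, beta_folds, folds):
    """CV(w) as in (B31); beta_folds[j] holds the estimates fitted without fold j."""
    total = 0.0
    for j, idx in enumerate(folds):
        total += hinge(y[idx] * (X[idx] @ (w @ beta_folds[j]))).sum()
    return total / len(y)

rng = np.random.default_rng(4)
n, p, S, J = 60, 5, 3, 4                                    # hypothetical sizes
X = rng.normal(size=(n, p))
y = rng.choice([-1.0, 1.0], size=n)
beta_hat = rng.normal(size=(S, p))                          # stand-in for full-sample fits
beta_folds = beta_hat + 0.05 * rng.normal(size=(J, S, p))   # stand-in for fold fits
folds = np.array_split(np.arange(n), J)                     # index sets A(j)
w = np.array([0.5, 0.3, 0.2])                               # weights on the simplex

gap = abs(cv(w, X, y, beta_folds, folds) - t_n(w, X, y, beta_hat))
# max_i ||x_i|| * max_{j,s} ||beta_folds[j, s] - beta_hat[s]||
bound = np.linalg.norm(X, axis=1).max() * np.linalg.norm(
    beta_folds - beta_hat, axis=2).max()
print(gap, bound)
```

In the proof, the second factor of `bound` is what Lemma 1 controls at the rate $\sqrt{{p_{\max }}\log ({p_{\max }})/n}$.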

To prove (B30), note that

$$\begin{aligned}&|T_n(\textbf{w})-R_n(\textbf{w})|=\bigg |\frac{1}{n}\sum _{i=1}^{n}\left( 1-y_i\textbf{x}_i^{\top }{\hat{{\varvec{\beta }}}}(\textbf{w}) \right) _+\nonumber \\&\qquad -\mathbb {E}\left\{ (1-y\textbf{x}^{\top }{\hat{{\varvec{\beta }}}}(\textbf{w}) )_+\mid {\mathcal {D}}_n\right\} \bigg |\nonumber \\&\quad \le \bigg |\frac{1}{n}\sum _{i=1}^{n}\left( 1{-}y_i\textbf{x}_i^{\top }{\hat{{\varvec{\beta }}}}(\textbf{w}) \right) _+{-}\frac{1}{n}\sum _{i=1}^{n}\left( 1{-}y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*(\textbf{w}) \right) _+\bigg |\nonumber \\&\qquad +\bigg |\frac{1}{n}\sum _{i=1}^{n}\left( 1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*(\textbf{w}) \right) _+-\mathbb {E}\left( 1-y\textbf{x}^{\top }{\varvec{\beta }}^*(\textbf{w})\right) _+\bigg |\nonumber \\&\qquad {+}\bigg |\mathbb {E}\left( 1{-}y\textbf{x}^{\top }{\varvec{\beta }}^*(\textbf{w})\right) _+{-}\mathbb {E}\left\{ (1{-}y\textbf{x}^{\top }{\hat{{\varvec{\beta }}}}(\textbf{w}))_+ \mid {\mathcal {D}}_n\right\} \bigg |\nonumber \\&\quad \equiv |\Omega _1(\textbf{w})|+|\Omega _2(\textbf{w})|+|\Omega _3(\textbf{w})|. \end{aligned}$$

(B32)

Given the decomposition (B32), and by Lemma 1 and Conditions 3 and 6, it can be shown that

$$\begin{aligned}&\sup _{\textbf{w}\in {{\mathcal {W}}}}|\Omega _1(\textbf{w})|\le \sup _{\textbf{w}\in {{\mathcal {W}}}}\frac{1}{n}\sum _{i=1}^{n}\nonumber \\&\qquad \bigg |\left( 1-y_i\textbf{x}_i^{\top }{\hat{{\varvec{\beta }}}}(\textbf{w}) \right) _+- \left( 1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*(\textbf{w}) \right) _+\bigg |\nonumber \\&\quad \le \sup _{\textbf{w}\in {{\mathcal {W}}}}\frac{1}{n}\sum _{i=1}^{n}\bigg |y_i\textbf{x}_i^{\top }\left( {\varvec{\beta }}^*(\textbf{w})-{\hat{{\varvec{\beta }}}}(\textbf{w})\right) \bigg |\nonumber \\&\quad \le \sup _{\textbf{w}\in {{\mathcal {W}}}}\frac{1}{n}\sum _{i=1}^{n}\sum _{s=1}^{S_n}w_s \Vert \textbf{x}_{{(s)},i}\Vert \left\| {\varvec{\beta }}^*_{(s)}-{\hat{{\varvec{\beta }}}}_{(s)}\right\| \nonumber \\&\quad \le \max _{1\le s\le S_n}\left\| {\varvec{\beta }}^*_{(s)}-{\hat{{\varvec{\beta }}}}_{(s)}\right\| \max _{1\le i \le n}\max _{1\le s\le S_n}\Vert \textbf{x}_{{(s)},i}\Vert \nonumber \\&\quad =O_p\left( \frac{{p_{\max }}\sqrt{\log ({p_{\max }})}}{\sqrt{n}}\right) \nonumber \\&\quad =o_p(1). \end{aligned}$$

(B33)

Define

$$\begin{aligned} |\textbf{w}-\textbf{w}'|_1=\sum _{s=1}^{S_n}|w_s-w'_s|, \end{aligned}$$

(B34)

for any $\textbf{w}=(w_1,\ldots ,w_{S_n})\in {{\mathcal {W}}}$ and $\textbf{w}'=(w'_1,\ldots ,w'_{S_n})\in {{\mathcal {W}}}$. Let $h_n=1/( {p_{\max }}\log n)$ and create grids using regions of the form ${{\mathcal {W}}}^{(l)}=\{\textbf{w}:|\textbf{w}-\textbf{w}^{(l)}|_1\le h_n\}$. By the notion of the $\epsilon $-covering number introduced by van der Vaart and Wellner (1996), ${{\mathcal {W}}}$ can be covered with $N=O(1/h_n^{S_n-1})$ regions ${{\mathcal {W}}}^{(l)}$, $l=1,\ldots ,N.$
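
To see how the covering number enters the later exponent, note that with $h_n=1/({p_{\max }}\log n)$ the quantity $\log (1/h_n^{S_n-1})$ equals $(S_n-1)\{\log ({p_{\max }})+\log \log (n)\}$, which is the form appearing in (B39). The following small sketch evaluates both expressions; the values of `S_n`, `p_max` and `n` are hypothetical and the $O(1)$ constant in $N$ is dropped.

```python
import math

def log_cover_size(S_n, p_max, n):
    """log of (1/h_n)^(S_n - 1) with h_n = 1 / (p_max * log n), constants dropped."""
    h_n = 1.0 / (p_max * math.log(n))
    return (S_n - 1) * math.log(1.0 / h_n)

S_n, p_max, n = 10, 20, 10_000       # hypothetical values for illustration
log_N = log_cover_size(S_n, p_max, n)

# The same quantity split as it appears inside the exponent of (B39).
split = (S_n - 1) * (math.log(p_max) + math.log(math.log(n)))
print(log_N, split)
```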

Note that

$$\begin{aligned}&\sup _{\textbf{w}\in {{\mathcal {W}}}^{(l)}}|\Omega _2(\textbf{w})-\Omega _2(\textbf{w}^{(l)})|\nonumber \\&\quad \le \sup _{\textbf{w}\in {{\mathcal {W}}}^{(l)}}\bigg |\frac{1}{n}\sum _{i=1}^{n}\left( 1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*(\textbf{w}) \right) _+\nonumber \\&\qquad -\frac{1}{n}\sum _{i=1}^{n}\left( 1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*(\textbf{w}^{(l)}) \right) _+\bigg |\nonumber \\&\qquad +\sup _{\textbf{w}\in {{\mathcal {W}}}^{(l)}}\bigg |\mathbb {E}\left( 1-y\textbf{x}^{\top }{\varvec{\beta }}^*(\textbf{w})\right) _+\nonumber \\&\qquad -\mathbb {E}\left( 1-y\textbf{x}^{\top }{\varvec{\beta }}^*(\textbf{w}^{(l)})\right) _+\bigg |\nonumber \\&\quad \le \sup _{\textbf{w}\in {{\mathcal {W}}}^{(l)}}\frac{1}{n}\sum _{i=1}^{n}\bigg |y_i\textbf{x}_i^{\top }\{{\varvec{\beta }}^*(\textbf{w}^{(l)})-{\varvec{\beta }}^*(\textbf{w})\} \bigg |\nonumber \\&\qquad +\sup _{\textbf{w}\in {{\mathcal {W}}}^{(l)}}\mathbb {E}\bigg |y\textbf{x}^{\top }\{{\varvec{\beta }}^*(\textbf{w}^{(l)})-{\varvec{\beta }}^*(\textbf{w})\} \bigg |\nonumber \\&\quad \le \sup _{\textbf{w}\in {{\mathcal {W}}}^{(l)}}\frac{1}{n}\sum _{i=1}^{n}\sum _{s=1}^{S_n}|w_s-w^{(l)}_s|\bigg |\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\bigg |\nonumber \\&\qquad +\sup _{\textbf{w}\in {{\mathcal {W}}}^{(l)}}\sum _{s=1}^{S_n}|w_s-w^{(l)}_s|\mathbb {E}\bigg |\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\bigg |\nonumber \\&\quad \le \sup _{\textbf{w}\in {{\mathcal {W}}}^{(l)}}|\textbf{w}-\textbf{w}^{(l)}|_1\max _{1\le s\le S_n}\Vert {\varvec{\beta }}^*_{(s)}\Vert \nonumber \\&\qquad \left( \max _{1\le i \le n}\max _{1\le s\le S_n}\Vert \textbf{x}_{{(s)},i}\Vert + \max _{1\le i \le n}\max _{1\le s\le S_n}\mathbb {E}\Vert \textbf{x}_{{(s)},i}\Vert \right) \nonumber \\&\quad \le \frac{C_2\sqrt{{p_{\max }}}}{{p_{\max }}\log (n)}2C_1\sqrt{{p_{\max }}}\nonumber \\&\quad =O_p( \log ^{-1}(n))\nonumber \\&\quad =o_p(1), \end{aligned}$$

(B35)

where the bound holds uniformly in $l$. Hence we have

$$\begin{aligned} \sup _{\textbf{w}\in {{\mathcal {W}}}}|\Omega _2(\textbf{w})|&=\max _{1\le l\le N}\sup _{\textbf{w}\in {{\mathcal {W}}}^{(l)}}|\Omega _2(\textbf{w})|\nonumber \\&\le \max _{1\le l\le N}|\Omega _2(\textbf{w}^{(l)})|\nonumber \\&\quad +\max _{1\le l\le N}\sup _{\textbf{w}\in {{\mathcal {W}}}^{(l)}}|\Omega _2(\textbf{w})-\Omega _2(\textbf{w}^{(l)})|\nonumber \\&=\max _{1\le l\le N}|\Omega _2({\textbf{w}}^{(l)})|+o_p(1). \end{aligned}$$

(B36)

Furthermore, for any $\epsilon >0$,

$$\begin{aligned}&\Pr \left\{ \max _{1\le l\le N}|\Omega _2({\textbf{w}}^{(l)})|> 3\epsilon \right\} \nonumber \\&\quad =\Pr \Bigg [\max _{1\le l\le N}\Big \vert \frac{1}{n}\sum _{i=1}^{n}\left( 1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})\right) _+{\textbf{1}}\nonumber \\&\qquad \left( |1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})|<{p_{\max }}n^{0.1}\right) \nonumber \\&\qquad +\frac{1}{n}\sum _{i=1}^{n}\left( 1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})\right) _+\nonumber \\&\qquad {\textbf{1}}\left( |1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})|\ge {p_{\max }}n^{0.1}\right) \nonumber \\&\qquad -\mathbb {E}\left\{ (1-y\textbf{x}^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)}))_+{\textbf{1}}\right. \nonumber \\&\qquad \left. \left( |1-y\textbf{x}^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})|< {p_{\max }}n^{0.1}\right) \right\} \nonumber \\&\qquad -\mathbb {E}\left\{ (1-y\textbf{x}^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)}))_+{\textbf{1}}\right. \nonumber \\&\qquad \left. \left( |1-y\textbf{x}^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})|\ge {p_{\max }}n^{0.1}\right) \right\} \Big \vert> 3\epsilon \Bigg ]\nonumber \\&\quad \le \Pr \Bigg [\max _{1\le l\le N}\Big \vert \frac{1}{n}\sum _{i=1}^{n}\left( 1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})\right) _+{\textbf{1}}\nonumber \\&\qquad \left( |1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})|<{p_{\max }}n^{0.1}\right) \nonumber \\&\qquad -\mathbb {E}\left\{ (1-y\textbf{x}^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)}))_+{\textbf{1}}\right. \nonumber \\&\qquad \left. 
\left( |1-y\textbf{x}^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})|< {p_{\max }}n^{0.1}\right) \right\} \Big \vert>\epsilon \Bigg ]\nonumber \\&\qquad +\Pr \Bigg [\max _{1\le l\le N}\frac{1}{n}\sum _{i=1}^{n}\left( 1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})\right) _+{\textbf{1}}\nonumber \\&\qquad \left( |1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})|\ge {p_{\max }}n^{0.1}\right)>\epsilon \Bigg ]\nonumber \\&\qquad +\Pr \Bigg [\max _{1\le l\le N}\mathbb {E}\left\{ (1-y\textbf{x}^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)}))_+{\textbf{1}}\right. \nonumber \\&\qquad \left. \left( |1-y\textbf{x}^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})|\ge {p_{\max }}n^{0.1}\right) \right\} > \epsilon \Bigg ]\nonumber \\&\quad \equiv \Xi _1+\Xi _2+\Xi _3. \end{aligned}$$

(B37)

Clearly,

$$\begin{aligned}&\sum _{i=1}^{n}\mathbb {E}\left\{ (1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)}))_+\right\} ^2\nonumber \\&\quad \le \sum _{i=1}^{n}\mathbb {E}\bigg |1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})\bigg |^2\nonumber \\&\quad \le \sum _{i=1}^{n}\mathbb {E}\left( 1+2|\textbf{x}_i^{\top }{\varvec{\beta }}^*(\textbf{w}^{(l)})|+\textbf{x}_i^{\top }{\varvec{\beta }}^*(\textbf{w}^{(l)}) {\varvec{\beta }}^{*\top }(\textbf{w}^{(l)})\textbf{x}_i\right) \nonumber \\&\quad \le \sum _{i=1}^{n}\mathbb {E}\left( 1+2\max _{1\le s\le S_n}\Vert \textbf{x}_{{(s)},i}\Vert \Vert {\varvec{\beta }}^*_{(s)}\Vert \right. \nonumber \\&\quad \left. +\max _{1\le s\le S_n}\Vert {\varvec{\beta }}^*_{(s)}\Vert ^2\Vert \textbf{x}_{{(s)},i}\Vert ^2 \right) \nonumber \\&\quad \le 4C^2_1C^2_2 n p^2_{\max }. \end{aligned}$$

(B38)

Using Boole’s and Bernstein’s inequalities and by (B38),

$$\begin{aligned} \Xi _1&\le \sum _{l=1}^N\Pr \Bigg [\Big \vert \frac{1}{n}\sum _{i=1}^{n}\left( 1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})\right) _+{\textbf{1}}\nonumber \\&\quad \left( |1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})|<{p_{\max }}n^{0.1}\right) \nonumber \\&\quad -\mathbb {E}\left\{ (1-y\textbf{x}^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)}))_+{\textbf{1}}\right. \nonumber \\&\quad \left. \left( |1-y\textbf{x}^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})|< {p_{\max }}n^{0.1}\right) \right\} \Big \vert >\epsilon \Bigg ]\nonumber \\&\le N\exp \left( - \frac{n^2\epsilon ^2/2}{4C_1^2C_2^2 np_{\max }^2+\epsilon {p_{\max }}n^{0.1}/3}\right) \nonumber \\&\le ({p_{\max }}\log n)^{S_n-1}\exp \left( {-} \frac{n^2\epsilon ^2/2}{4C_1^2C_2^2 np_{\max }^2{+}\epsilon {p_{\max }}n^{0.1}/3}\right) \nonumber \\&=O\left\{ \exp \left( -\epsilon ^2 n p^{-2}_{\max }{+} S_n \log ({p_{\max }})+S_n\log \log (n)\right) \right\} \nonumber \\&=o(1), \end{aligned}$$

(B39)

where the last equality is established from Condition 6 and the condition that $S_n=O(n^\tau )$ for $\tau \in (0,1-2\kappa )$. Additionally, we can write

$$\begin{aligned}&\Xi _2=\Pr \Bigg \{\max _{1\le l\le N}\frac{1}{n}\sum _{i=1}^{n}\left( 1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})\right) _+{\textbf{1}}\nonumber \\&\qquad \left( |1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})|\ge {p_{\max }}n^{0.1}\right) >\epsilon \Bigg \}\nonumber \\&\quad \le \Pr \left( \max _{1\le l\le N}\max _{1\le i \le n}|1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})|\ge {p_{\max }}n^{0.1}\right) \nonumber \\&\quad \le \Pr \left\{ \max _{1\le l\le N}\max _{1\le i \le n}\sum _{s=1}^{S_n}{w^{(l)}_s}\left( 1+\Vert \textbf{x}_{{(s)},i}\Vert \Vert {\varvec{\beta }}^*_{(s)}\Vert \right) \right. \nonumber \\&\quad \left. \ge {p_{\max }}n^{0.1}\right\} \nonumber \\&\quad \le \Pr \left\{ \left( 1+\max _{1\le i \le n}\max _{1\le s\le S_n}\Vert \textbf{x}_{{(s)},i}\Vert \max _{1\le s\le S_n}\Vert {\varvec{\beta }}^*_{(s)}\Vert \right) \right. \nonumber \\&\quad \left. \ge {p_{\max }}n^{0.1}\right\} \nonumber \\&\quad = o(1), \end{aligned}$$

(B40)

where the last equality holds because of Conditions 2 and 3. Similarly,

$$\begin{aligned} \Xi _3&=\Pr \left[ \max _{1\le l\le N}\mathbb {E}\left\{ (1-y\textbf{x}^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)}))_+{\textbf{1}}\right. \right. \nonumber \\&\quad \left. \left. \left( |1-y\textbf{x}^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})|\ge {p_{\max }}n^{0.1}\right) \right\} > \epsilon \right] \nonumber \\&\le \Pr \left( \max _{1\le l\le N}\mathbb {E}|1-y\textbf{x}^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})|\ge {p_{\max }}n^{0.1}\right) \nonumber \\&\le \Pr \left\{ \left( 1+\max _{1\le s\le S_n}\mathbb {E}\Vert \textbf{x}_{(s)}\Vert \max _{1\le s\le S_n}\Vert {\varvec{\beta }}^*_{(s)}\Vert \right) \ge {p_{\max }}n^{0.1}\right\} \nonumber \\&=o(1). \end{aligned}$$

(B41)

Together with (B37), (B39)–(B41), we obtain $\max _{1\le l\le N}|\Omega _2({\textbf{w}}^{(l)})|=o_p(1)$. Then, by (B36), we have

$$\begin{aligned} \sup _{\textbf{w}\in {{\mathcal {W}}}}|\Omega _2(\textbf{w})|=o_p(1). \end{aligned}$$

(B42)
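
The union-bound and Bernstein step (B39) that feeds into (B42) can be sanity-checked numerically. The sketch below evaluates the logarithm of the bound $({p_{\max }}\log n)^{S_n-1}\exp (\cdot )$ from (B39) along a growing sample size. All numerical choices are assumptions made for illustration only: ${p_{\max }}=n^{\kappa }$ and $S_n=n^{\tau }$ with $\kappa =0.1$ and $\tau =0.3$ (so that $\tau <1-2\kappa $), $\epsilon =0.5$, and $C_1=C_2=1$.

```python
import math

def b39_log_bound(n, kappa, tau, eps, C1=1.0, C2=1.0):
    """log of the bound in (B39): log N minus the Bernstein exponent."""
    p_max = n ** kappa                   # assumed growth rate of p_max
    S_n = n ** tau                       # assumed growth rate of S_n
    log_N = (S_n - 1) * math.log(p_max * math.log(n))
    bernstein = (n**2 * eps**2 / 2) / (
        4 * C1**2 * C2**2 * n * p_max**2 + eps * p_max * n**0.1 / 3)
    return log_N - bernstein

# Hypothetical rates and constants; smaller values mean a tighter bound.
values = [b39_log_bound(n, kappa=0.1, tau=0.3, eps=0.5) for n in (1e4, 1e6, 1e8)]
print(values)
```

Under these assumed rates the log-bound is negative and decreases rapidly with $n$, in line with $\Xi _1=o(1)$; for other constants the bound may only become informative at larger $n$.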

Finally, note that $(y,\textbf{x})$ and $({\tilde{y}}, {\tilde{\textbf{x}}})$ are independently and identically distributed, and under Lemma 1, we have

$$\begin{aligned}&\sup _{\textbf{w}\in {{\mathcal {W}}}}|\Omega _3(\textbf{w})|=\sup _{\textbf{w}\in {{\mathcal {W}}}}\bigg |\mathbb {E}\left( 1-y\textbf{x}^{\top }{\varvec{\beta }}^*(\textbf{w})\right) _+\nonumber \\&\qquad -\mathbb {E}\left\{ (1-{\tilde{y}}{\tilde{\textbf{x}}}^{\top }{\hat{{\varvec{\beta }}}}(\textbf{w}))_+ \mid {\mathcal {D}}_n\right\} \bigg |\nonumber \\&\quad =\sup _{\textbf{w}\in {{\mathcal {W}}}}\bigg |\mathbb {E}\left( 1-{\tilde{y}}{\tilde{\textbf{x}}}^{\top }{\varvec{\beta }}^*(\textbf{w})\right) _+\nonumber \\&\qquad -\mathbb {E}\left\{ (1-{\tilde{y}}{\tilde{\textbf{x}}}^{\top }{\hat{{\varvec{\beta }}}}(\textbf{w}))_+ \mid {\mathcal {D}}_n\right\} \bigg |\nonumber \\&\quad = \sup _{\textbf{w}\in {{\mathcal {W}}}}\bigg |\mathbb {E}\left\{ \int _{{\tilde{y}}{\tilde{\textbf{x}}}^{\top }{\varvec{\beta }}^*(\textbf{w})}^{{\tilde{y}}{\tilde{\textbf{x}}}^{\top }{\hat{{\varvec{\beta }}}}(\textbf{w})}I(t\le 1)\textrm{d}t\,\Big \vert \,{\mathcal {D}}_n\right\} \bigg |\nonumber \\&\quad \le \sup _{\textbf{w}\in {{\mathcal {W}}}}\mathbb {E}\left\{ \bigg |{\tilde{y}}{\tilde{\textbf{x}}}^{\top }\left( {\hat{{\varvec{\beta }}}}(\textbf{w})-{\varvec{\beta }}^*(\textbf{w})\right) \bigg |\big \vert {\mathcal {D}}_n\right\} \nonumber \\&\quad \le \sup _{\textbf{w}\in {{\mathcal {W}}}}\sum _{s=1}^{S_n}w_s\mathbb {E}\left\{ \bigg |{\tilde{\textbf{x}}}_{(s)}^{\top }\left( {\hat{{\varvec{\beta }}}}_{(s)}-{\varvec{\beta }}^*_{(s)}\right) \bigg |\big \vert {\mathcal {D}}_n\right\} \nonumber \\&\quad \le \max _{1\le s\le S_n}\left\| {\hat{{\varvec{\beta }}}}_{(s)}-{\varvec{\beta }}^*_{(s)}\right\| \max _{1\le s\le S_n}\mathbb {E}\Vert {\tilde{\textbf{x}}}_{(s)}\Vert \nonumber \\&\quad =O_p\left( \frac{ {p_{\max }}\sqrt{\log ({p_{\max }})}}{\sqrt{n}}\right) \nonumber \\&\quad =o_p(1), \end{aligned}$$

(B43)

where the last equality holds due to Condition 6. Putting (B32), (B33), (B42) and (B43) together, we complete the proof of (B30).

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Zou, J., Yuan, C., Zhang, X. et al. Model averaging for support vector classifier by cross-validation. Stat Comput 33, 117 (2023). https://doi.org/10.1007/s11222-023-10284-6


Received: 02 January 2023
Accepted: 19 July 2023
Published: 08 August 2023
DOI: https://doi.org/10.1007/s11222-023-10284-6
