Model averaging for support vector classifier by cross-validation

  • Original Paper
  • Journal: Statistics and Computing

Abstract

Support vector classification (SVC) is a well-known statistical technique for classification problems in machine learning and other fields. An important question for SVC is the selection of covariates (or features) for the model. Many studies have considered model selection methods. As is well known, selecting a single winning model can entail considerable instability in predictive performance because of model selection uncertainty. This paper advocates model averaging as an alternative approach, where estimates obtained from different models are combined in a weighted average. We propose a model weighting scheme and provide the theoretical underpinning for the proposed method. In particular, we prove that our proposed method yields a model average estimator that asymptotically achieves the smallest hinge risk among all feasible combinations. To reduce the computational burden caused by a large number of feasible models, we propose a screening step to eliminate the uninformative features before combining the models. Results from real data applications and a simulation study show that the proposed method generally yields more accurate estimates than existing methods.
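
The idea of combining candidate classifiers with weights can be illustrated with a small sketch. This is not the paper's algorithm, which is not reproduced on this page. It assumes two hypothetical candidate models whose held-out decision values are already available (`model_a`, `model_b` and the grid search are illustrative choices), and it picks the weight on the unit simplex that minimises the average hinge loss of the combined decision value.

```python
def hinge(margin):
    """Hinge loss (1 - margin)_+ for a single margin y * f(x)."""
    return max(0.0, 1.0 - margin)


def averaged_hinge_loss(weights, decision_values, labels):
    """Mean hinge loss of the weighted average of candidate decision values.

    decision_values[m][i] is the held-out decision value of model m on case i.
    """
    total = 0.0
    for i, y in enumerate(labels):
        combined = sum(w * f[i] for w, f in zip(weights, decision_values))
        total += hinge(y * combined)
    return total / len(labels)


def best_two_model_weight(decision_values, labels, grid_size=101):
    """Grid search over w in [0, 1]; the weight vector is (w, 1 - w)."""
    best_w, best_loss = 0.0, float("inf")
    for k in range(grid_size):
        w = k / (grid_size - 1)
        loss = averaged_hinge_loss((w, 1.0 - w), decision_values, labels)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss


# Hypothetical held-out decision values from two candidate models.
labels = [1, 1, -1, -1]
model_a = [2.0, -0.5, -2.0, 0.5]
model_b = [-0.5, 2.0, 0.5, -2.0]
w, loss = best_two_model_weight([model_a, model_b], labels)
```

In this toy example each model alone misclassifies half of the cases with a small margin, so an interior weight gives a lower hinge loss than either model on its own. A grid search is used only because there are two models; with more candidates a convex solver over the simplex would be the natural replacement.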

Notes

  1. The parameter that minimises the population hinge loss is the “quasi-true” parameter when the working model is not identical to the true data generating process. If the two are identical, the “quasi-true” parameter is the true parameter.

  2. We find that the number of folds generally has little effect on the performance of the method.

References

  • Ando, T., Li, K.-C.: A weight-relaxed model averaging approach for high-dimensional generalized linear models. Ann. Stat. 45, 2654–2679 (2017)

  • Becker, N., Toedt, G., Lichter, P., Benner, A.: Elastic SCAD as a novel penalization method for SVM classification tasks in high-dimensional data. BMC Bioinform. 12, 138–151 (2011)

  • Borah, P., Gupta, D.: Affinity and transformed class probability-based fuzzy least squares support vector machines. Fuzzy Sets Syst. 443, 203–235 (2022)

  • Bradley, P.S., Mangasarian, O.L.: Feature selection via concave minimization and support vector machines. In: ICML 98, 82–90 (1998)

  • Breiman, L.: Bagging predictors. Mach. Learn. 24, 123–140 (1996)

  • Buckland, S.T., Burnham, K.P., Augustin, N.H.: Model selection: an integral part of inference. Biometrics 53, 603–618 (1997)

  • Bühlmann, P., van de Geer, S.: Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, New York (2011)

  • Burges, C.J.C.: A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Disc. 2, 121–167 (1998)

  • Claeskens, G., Croux, C., van Kerckhoven, J.: Variable selection for logistic regression using a prediction-focused information criterion. Biometrics 62, 972–979 (2006)

  • Claeskens, G., Croux, C., van Kerckhoven, J.: An information criterion for variable selection in support vector machines. J. Mach. Learn. Res. 9, 541–558 (2008)

  • Dua, D., Graff, C.: UCI Machine Learning Repository (2017). http://archive.ics.uci.edu/ml

  • Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 119–139 (1997)

  • Gorman, R.P., Sejnowski, T.J.: Analysis of hidden units in a layered network trained to classify sonar targets. Neural Netw. 1, 75–89 (1988)

  • Gupta, U., Gupta, D.: Least squares structural twin bounded support vector machine on class scatter. Appl. Intell. 53, 15321–15351 (2023)

  • Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene selection for cancer classification using support vector machines. Mach. Learn. 46, 389–422 (2002)

  • Hansen, B.E.: Least squares model averaging. Econometrica 75, 1175–1189 (2007)

  • Hansen, B.E., Racine, J.: Jackknife model averaging. J. Econom. 167, 38–46 (2012)

  • Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York (2001)

  • Hazarika, B.B., Gupta, D.: Affinity based fuzzy kernel ridge regression classifier for binary class imbalance learning. Eng. Appl. Artif. Intell. 117, 105544 (2023)

  • Hazarika, B.B., Gupta, D.: Improved twin bounded large margin distribution machines for binary classification. Multimedia Tools Appl. 83, 13341–13368 (2023)

  • Hoeting, J.A., Madigan, D., Raftery, A.E., Volinsky, C.T.: Bayesian model averaging: a tutorial. Stat. Sci. 14, 382–417 (1999)

  • Jagannathan, R., Ma, T.: Risk reduction in large portfolios: Why imposing the wrong constraints helps. J. Fin. 58, 1651–1683 (2003)

  • Kaufman, L.: Solving the Quadratic Programming Problem Arising in Support Vector Classification, pp. 147–167. MIT Press, USA (1999)

  • Koo, J.-Y., Lee, Y., Kim, Y., Park, C.: A Bahadur representation of the linear support vector machine. J. Mach. Learn. Res. 9, 1343–1368 (2008)

  • Lee, E.R., Noh, H., Park, B.U.: Model selection via Bayesian information criterion for quantile regression models. J. Am. Stat. Assoc. 109, 216–229 (2014)

  • Park, C., Kim, K.R., Myung, R., Koo, J.Y.: Oracle properties of SCAD-penalized support vector machine. J. Stat. Plan. Infer. 142, 2257–2270 (2012)

  • Schölkopf, B., Smola, A.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, USA (2001)

  • Sigillito, V.G., Wing, S.P., Hutton, L.V., Baker, K.B.: Classification of radar returns from the ionosphere using neural networks. J. Hopkins APL Tech. Dig. 10, 262–266 (1989)

  • Tsanas, A., Little, M.A., Fox, C., Ramig, L.O.: Objective automatic assessment of rehabilitative speech treatment in Parkinson’s disease. IEEE Trans. Neural Syst. Rehabil. Eng. 22, 181–190 (2014)

  • van de Geer, S.: Empirical Processes in M-Estimation. Cambridge University Press, Cambridge (2000)

  • van der Vaart, A.W., Wellner, J.A.: Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York (1996)

  • Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, New York (1995)

  • Wan, A.T.K., Zhang, X., Zou, G.: Least squares model averaging by Mallows criterion. J. Econom. 156, 277–283 (2010)

  • Wang, L., Zhu, J., Zou, H.: The doubly regularized support vector machine. Stat. Sin. 16, 589–615 (2006)

  • Wang, L., Wu, Y., Li, R.: Quantile regression for analyzing heterogeneity in ultra-high dimension. J. Am. Stat. Assoc. 107, 214–222 (2012)

  • Wegkamp, M., Yuan, M.: Support vector machines with a reject option. Bernoulli 17, 1368–1385 (2011)

  • Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., Vapnik, V.: Feature selection for SVMs. In: NIPS 12, 668–674 (2000)

  • White, H.: Maximum likelihood estimation of misspecified models. Econometrica 50, 1–25 (1982)

  • Yuan, Z., Yang, Y.: Combining linear regression models: when and how? J. Am. Stat. Assoc. 100, 1202–1214 (2005)

  • Zhang, H.H., Ahn, J., Lin, X., Park, C.: Gene selection using support vector machines with non-convex penalty. Bioinformatics 22, 88–95 (2006)

  • Zhang, X., Lu, Z., Zou, G.: Adaptively combined forecasting for discrete response time series. J. Econom. 176, 80–91 (2013)

  • Zhang, X., Wu, Y., Wang, L., Li, R.: Variable selection for support vector machines in moderately high dimensions. J. R. Stat. Soc. B 75, 53–76 (2016)

  • Zhang, X., Wu, Y., Wang, L., Li, R.: A consistent information criterion for support vector machines in diverging model spaces. J. Mach. Learn. Res. 17, 1–26 (2016)

  • Zhang, X., Yu, D., Zou, G., Liang, H.: Optimal model averaging estimation for generalized linear models and generalized linear mixed-effects models. J. Am. Stat. Assoc. 111, 1775–1790 (2016)

  • Zhang, X., Zou, G., Liang, H., Carroll, R.J.: Parsimonious model averaging with a diverging number of parameters. J. Am. Stat. Assoc. 115, 972–984 (2020)

  • Zhou, Z.-H.: Ensemble Methods: Foundations and Algorithms. CRC Press, USA (2012)

  • Zhu, J., Rosset, S., Hastie, T., Tibshirani, R.: 1-norm support vector machines. Adv. Neural. Inf. Process. Syst. 16, 49–56 (2004)

  • Zou, H., Yuan, M.: The \(F_\infty \)-norm support vector machine. Stat. Sin. 18, 379–398 (2008)

  • Zou, J., Wang, W., Zhang, X., Zou, G.: Optimal model averaging for divergent-dimensional Poisson regressions. Econom. Rev. 41, 775–805 (2022)

Acknowledgements

Jiahui Zou’s work was supported by the National Natural Science Foundation of China (Grant No. 12201431) and the Young Teacher Foundation from Capital University of Economics and Business (XRZ2022070, 00592254413070). Xinyu Zhang’s work was partially supported by the National Natural Science Foundation of China (Grant Nos. 71925007, 72091212 and 12288201) and the CAS Project for Young Scientists in Basic Research (YSBR-008). Guohua Zou’s work was partially supported by the National Natural Science Foundation of China (Grant Nos. 11971323 and 12031016). Wan’s work was supported by a General Research Fund from the Hong Kong Research Grants Council (No. CityU-11500419).

Author information

Corresponding author

Correspondence to Jiahui Zou.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: The simulation for nonlinear boundary case

Here, we provide an example for a simple nonlinear boundary case to test the performance of SVCMA. The data generating process is described below.

DGP3: \(\Pr (Y=1)=1-\Pr (Y=-1)=0.8I(\text {sign}(\textbf{x}^{\top }{\varvec{\beta }})=1)+0.2I(\text {sign}(\textbf{x}^{\top }{\varvec{\beta }})=-1)\), where \(I(\cdot )\) is the indicator function, \(\textbf{x}=(x_1, x_2, x_2^2, x_2^3, \ldots , x_2^{959}, x_3, x_4, \ldots , x_{p-958})\), \(x_1\sim U(0,1)\), \(x_2, \ldots , x_{p-958} {\mathop {\sim }\limits ^{i.i.d.}} U(-1, 1)\) and \({\varvec{\beta }}=(2, 0, -2, 0, -2, 0, 0, \ldots , 0)\).
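
To make the structure of DGP3 concrete, the following is a minimal simulation sketch. It reads the covariate vector exactly as listed above and assumes that no separate intercept enters \(\textbf{x}^{\top }{\varvec{\beta }}\), so the linear predictor reduces to \(2x_1-2x_2^2-2x_2^4\). The function name `draw_dgp3` and its arguments are illustrative and do not come from the paper.

```python
import random


def draw_dgp3(p=1000, rng=None):
    """Draw one (x, y, score) triple from DGP3 as read from the text.

    Assumption: x has no separate intercept entry, so score = x' beta.
    """
    rng = rng or random.Random()
    n_powers = 959                       # x_2, x_2^2, ..., x_2^959
    n_extra = p - 1 - n_powers           # x_3, ..., x_{p-958}
    x1 = rng.uniform(0.0, 1.0)
    x2 = rng.uniform(-1.0, 1.0)
    powers = [x2 ** k for k in range(1, n_powers + 1)]
    extra = [rng.uniform(-1.0, 1.0) for _ in range(n_extra)]
    x = [x1] + powers + extra
    beta = [2.0, 0.0, -2.0, 0.0, -2.0] + [0.0] * (p - 5)
    score = sum(b * v for b, v in zip(beta, x))
    prob_pos = 0.8 if score > 0 else 0.2  # score == 0 has probability zero
    y = 1 if rng.random() < prob_pos else -1
    return x, y, score


x, y, score = draw_dgp3(rng=random.Random(0))
```

Under this reading the true boundary \(2x_1=2x_2^2+2x_2^4\) is nonlinear in \((x_1,x_2)\) but linear in the expanded covariates, and each label is flipped with probability 0.2.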

In this example, the training data contain all the covariates used in DGP3, and we set the dimension \(p=1000\), the training size \(n=500\) and the testing size \(n_{\text {test}}=10{,}000\). The error rates are shown in Fig. 13(a), and the fitted boundaries of all methods are plotted in Fig. 13(b). These two figures show that SVCMA performs best in the presence of a nonlinear boundary. In addition, most of the methods fit a boundary that is close to the true one.

For general nonlinear boundaries, SVCs with different kernels are useful tools. In future work, we plan to use model averaging to combine estimators based on different kernels, which avoids having to select a single kernel and may improve the performance of SVCs.

Fig. 13 An example for nonlinear boundaries

Appendix B: Proofs

In the following, all limiting processes correspond to \(n\rightarrow \infty \) unless stated otherwise.

B.1 Proof of Lemma 1

Part of this proof follows Zhang et al. (2016b), although the argument differs in places and the conclusion is also different.

We prove (9) first. Recall that \({\hat{{\varvec{\beta }}}}_{(s)}=\arg \min _{{\varvec{\beta }}_{(s)}}\{n^{-1}\sum _{i=1}^{n}(1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}_{(s)})_++2^{-1}\lambda _n\Vert {\varvec{\beta }}_{(s)}^+\Vert ^2\}\). We will show that, for any \(0<\eta <1\), there exist a large constant \(\triangle >0\) and an integer N such that when \(n>N\), we have

$$\begin{aligned}&\Pr \left\{ \min _{1\le s\le S_n}\inf _{\Vert \textbf{u}_{(s)}\Vert =\triangle } \left\{ l_s\left( {\varvec{\beta }}^*_{(s)}+\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}\right) -l_s({\varvec{\beta }}^*_{(s)})\right\}>0 \right\} >1-\eta , \end{aligned}$$
(B1)

where \(\textbf{u}_{(s)}\in {\mathcal {R}}^{p_s}\) and \(l_s({\varvec{\beta }}_{(s)})=n^{-1}\sum _{i=1}^{n}(1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}_{(s)})_++2^{-1}\lambda _n\Vert {\varvec{\beta }}_{(s)}^+\Vert ^2\). As the hinge loss is convex, this implies that with probability greater than \(1-\eta \), \(\max _{1\le s\le S_n}\Vert {\hat{{\varvec{\beta }}}}_{(s)}-{\varvec{\beta }}^*_{(s)}\Vert \le \triangle \sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})} \). Hence equation (9) in Lemma 1 holds.
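
The objective \(l_s\) is the average hinge loss plus a ridge penalty. The sketch below evaluates it for a small example. It assumes that the first coordinate of each covariate vector is a constant 1, that `beta[0]` is the corresponding intercept, and that \({\varvec{\beta }}^+\) denotes the coefficient vector without the intercept; this page does not define \({\varvec{\beta }}^+\), so that reading is an assumption.

```python
def penalized_hinge(beta, xs, ys, lam):
    """l_s(beta) = mean hinge loss + (lam / 2) * ||beta^+||^2.

    Assumption: x[0] == 1 for every row, beta[0] is the intercept,
    and beta^+ = beta[1:] (the intercept is not penalised).
    """
    loss = 0.0
    for x, y in zip(xs, ys):
        margin = y * sum(b * v for b, v in zip(beta, x))
        loss += max(0.0, 1.0 - margin)
    penalty = 0.5 * lam * sum(b * b for b in beta[1:])
    return loss / len(ys) + penalty


xs = [[1.0, 2.0], [1.0, -1.0]]
ys = [1, -1]
value = penalized_hinge([0.0, 0.5], xs, ys, lam=0.1)
```

Because both terms are convex in `beta`, a strictly positive increase of the objective on a sphere around \({\varvec{\beta }}^*_{(s)}\) forces the minimiser to lie inside that sphere, which is the step used in the proof.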

Note that \(l_s\left( {\varvec{\beta }}^*_{(s)}+\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}\right) -l_s\left( {\varvec{\beta }}^*_{(s)}\right) \) can be expressed as

$$\begin{aligned}&l_s\left( {\varvec{\beta }}^*_{(s)}+\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}\right) -l_s\left( {\varvec{\beta }}^*_{(s)}\right) \\&\quad =n^{-1} \sum _{i=1}^{n}\left\{ \left( 1-y_i\textbf{x}_{{(s)},i}^{\top }({\varvec{\beta }}^*_{(s)}+\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}) \right) _+ -\left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\right) _+ \right\} \\&\qquad + 2^{-1}\lambda _n \left\| {\varvec{\beta }}^{*+}_{(s)}+\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}^+\right\| ^2 -2^{-1}\lambda _n \Vert {\varvec{\beta }}^{*+}_{(s)}\Vert ^2. \end{aligned}$$
(B2)

It can be shown that

$$\begin{aligned}&\max _{1\le s\le S_n}\sup _{\Vert \textbf{u}_{(s)}\Vert =\triangle }\bigg |\left\| {\varvec{\beta }}^{*+}_{(s)}+\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}^+ \right\| ^2-\left\| {\varvec{\beta }}^{*+}_{(s)}\right\| ^2 \bigg |\\&\quad \le \max _{1\le s\le S_n}\sup _{\Vert \textbf{u}_{(s)}\Vert =\triangle }\left( \left\| {\varvec{\beta }}^{*+}_{(s)}+\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}^+ \right\| +\left\| {\varvec{\beta }}^{*+}_{(s)}\right\| \right) \left\| \sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}^+ \right\| \\&\quad \le 2\triangle C_2{p_{\max }}\sqrt{n^{-1}\log ({p_{\max }})}+\triangle ^2 n^{-1}{p_{\max }}\log ({p_{\max }})\\&\quad =O(\triangle {p_{\max }}\sqrt{n^{-1}\log ({p_{\max }})}), \end{aligned}$$
(B3)

where the last inequality is obtained from Condition 3. Hence the order of difference of penalty terms in (B2) is \(O(\triangle \lambda _n{p_{\max }}\sqrt{n^{-1}\log ({p_{\max }})})\).
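
The two inequalities in (B3) can be spelled out as follows. This is a sketch under two assumptions that are not stated on this page: \(c_n=\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\) is shorthand introduced here only for brevity, and Condition 3 is taken to give \(\Vert {\varvec{\beta }}^{*+}_{(s)}\Vert \le C_2\sqrt{p_{\max }}\), which is what the constant \(C_2\) in the bound suggests. Write \(\textbf{a}={\varvec{\beta }}^{*+}_{(s)}\) and \(\textbf{b}=c_n\textbf{u}_{(s)}^+\), so that \(\Vert \textbf{b}\Vert \le c_n\triangle \).

$$\begin{aligned} \big |\Vert \textbf{a}+\textbf{b}\Vert ^2-\Vert \textbf{a}\Vert ^2\big |&=\big (\Vert \textbf{a}+\textbf{b}\Vert +\Vert \textbf{a}\Vert \big )\,\big |\Vert \textbf{a}+\textbf{b}\Vert -\Vert \textbf{a}\Vert \big |\le \big (\Vert \textbf{a}+\textbf{b}\Vert +\Vert \textbf{a}\Vert \big )\Vert \textbf{b}\Vert \\&\le \big (2\Vert \textbf{a}\Vert +c_n\triangle \big )c_n\triangle \le 2\triangle C_2{p_{\max }}\sqrt{n^{-1}\log ({p_{\max }})}+\triangle ^2n^{-1}{p_{\max }}\log ({p_{\max }}). \end{aligned}$$

The first line uses the reverse triangle inequality and the second uses the triangle inequality. For fixed \(\triangle \), the quadratic term is of smaller order than the cross term whenever \(n^{-1}\log ({p_{\max }})\rightarrow 0\), which gives the stated rate.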

Denote

$$\begin{aligned} g_{s,i}(\textbf{u}_{(s)})&=\left( 1-y_i\textbf{x}_{{(s)},i}^{\top }({\varvec{\beta }}^*_{(s)}+\sqrt{n^{-1} {p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}) \right) _+\\&\quad -\left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\right) _+\\&\quad +\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}y_i\textbf{x}_{{(s)},i}^{\top }\textbf{u}_{(s)}{\textbf{1}}\left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\ge 0 \right) \\&\quad -\mathbb {E}\left[ \left( 1-y_i\textbf{x}_{{(s)},i}^{\top }({\varvec{\beta }}^*_{(s)}+\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)})\right) _+ \right] \\&\quad +\mathbb {E}\left[ \left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\right) _+ \right] . \end{aligned}$$

It can be verified that \(\mathbb {E}[g_{s,i}(\textbf{u}_{(s)})]=0\), \(s=1,2,\ldots ,S_n\), by the definition of \({\varvec{\beta }}^*_{(s)}\) and \({\textbf{J}}_{(s)}({\varvec{\beta }}^*_{(s)})=0\). Note that (B2) can be further decomposed as

$$\begin{aligned}&n^{-1}\sum _{i=1}^{n}\left\{ \left( 1-y_i\textbf{x}_{{(s)},i}^{\top }({\varvec{\beta }}^*_{(s)}+\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}) \right) _+\right. \\&\quad \left. -\left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\right) _+ \right\} \\&\equiv n^{-1} (A_{s,n}+B_{s,n} ), \end{aligned}$$

where

$$\begin{aligned} A_{s,n}=\sum _{i=1}^{n}g_{s,i}(\textbf{u}_{(s)}) \end{aligned}$$

and

$$\begin{aligned} B_{s,n}&=\sum _{i=1}^{n}\Big [ -\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}y_i \textbf{x}_{{(s)},i}^{\top }\textbf{u}_{(s)}{\textbf{1}}\nonumber \\&\quad \left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\ge 0 \right) \nonumber \\&\quad + \mathbb {E}\left\{ \left( 1-y_i\textbf{x}_{{(s)},i}^{\top }({\varvec{\beta }}^*_{(s)}+\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}) \right) _+ \right\} \nonumber \\&\quad -\mathbb {E}\left\{ \left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\right) _+\right\} \Big ]. \end{aligned}$$
(B4)

The remainder of the proof consists of three steps. In Step 1, we demonstrate that

$$\begin{aligned} \max _{1\le s\le S_n}\sup _{\Vert \textbf{u}_{(s)}\Vert =\triangle }|A_{s,n}|=\triangle ^{3/2}{p_{\max }}o_{p}(1). \end{aligned}$$
(B5)

In Step 2, it is shown that \(\min _{1\le s\le S_n}\inf _{\Vert \textbf{u}_{(s)}\Vert =\triangle }B_{s,n}\) dominates the terms of order \(\triangle ^{3/2}{p_{\max }}o_{p}(1)\) and is larger than zero. In Step 3, we use the results from the previous steps to prove (B1).

Step 1: We use the covering number introduced by van der Vaart and Wellner (1996) to prove the uniform rate in (B5). It suffices to show, for any \(\epsilon >0\), that

$$\begin{aligned} \Pr \left( \max _{1\le s\le S_n}\sup _{\Vert \textbf{u}_{(s)}\Vert =\triangle } p_s^{-1}\bigg |\sum _{i=1}^{n}g_{s,i}(\textbf{u}_{(s)}) \bigg |>\triangle ^{3/2}\epsilon \right) \rightarrow 0. \end{aligned}$$
(B6)

Note that the hinge loss satisfies the Lipschitz condition and that \(\max _{1\le i \le n}\Vert \textbf{x}_{{(s)},i}\Vert \le C_1\sqrt{p_s}\) and \(\max _{1\le i \le n}\mathbb {E}\Vert \textbf{x}_{{(s)},i}\Vert \le C_1\sqrt{p_s}\) by Condition 2. It follows that

$$\begin{aligned} |g_{s,i}(\textbf{u}_{(s)})|&\le 3\triangle \sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})} \nonumber \\&\quad \max \left\{ \max _{1\le i \le n}\Vert \textbf{x}_{{(s)},i}\Vert ,\max _{1\le i \le n}\mathbb {E}\Vert \textbf{x}_{{(s)},i}\Vert \right\} \nonumber \\&\le 3C_1\triangle {p_{\max }}\sqrt{n^{-1}\log ({p_{\max }})} \end{aligned}$$
(B7)

and thus \(\max _{1\le s\le S_n}\sup _{\Vert \textbf{u}_{(s)}\Vert =\triangle }p_s^{-1}|g_{s,i}(\textbf{u}_{(s)})|=o(1)\) by Condition 6. By Lemma 2.5 of van de Geer (2000), the ball \(\{\textbf{u}_{(s)}:\Vert \textbf{u}_{(s)}\Vert \le \triangle \}\) in \({\mathcal {R}}^{p_s+1}\) can be covered by \(N_s\) balls with radius \(\zeta _s\), where \(N_s\le \{(4\triangle +\zeta _s)/\zeta _s\}^{p_s+1}\). Denote by \(\textbf{u}_{(s)}^{1},\ldots ,\textbf{u}_{(s)}^{N_s}\) the centers of the \(N_s\) balls, let \(\zeta _s=(nM_1)^{-1} p_s\) (for some large constant \(M_1>0\)) and denote \({\mathcal {U}}_s^{k}=\{\textbf{u}_{(s)}: \Vert \textbf{u}_{(s)}-\textbf{u}_{(s)}^{k}\Vert \le \zeta _s \text { and } \Vert \textbf{u}_{(s)}\Vert =\triangle \}\). For any \(\epsilon >0\), we have

$$\begin{aligned}&\max _{1\le s\le S_n}\max _{1\le k\le N_s} \sup _{\textbf{u}_{(s)}\in {\mathcal {U}}_s^{(k)}} p_s^{-1}\bigg |\sum _{i=1}^{n}g_{s,i}(\textbf{u}_{(s)})-\sum _{i=1}^{n}g_{s,i}(\textbf{u}_{(s)}^k)\bigg |\nonumber \\&\quad \le \max _{1\le s\le S_n}\max _{1\le k\le N_s} \sup _{\textbf{u}_{(s)}\in {\mathcal {U}}_s^{k}} p_s^{-1}\sum _{i=1}^{n}\bigg |g_{s,i}(\textbf{u}_{(s)})- g_{s,i}(\textbf{u}_{(s)}^k)\bigg |\nonumber \\&\quad \le \max _{1\le s\le S_n}\max _{1\le k\le N_s} \sup _{\textbf{u}_{(s)}\in {\mathcal {U}}_s^{k}} np_s^{-1}\nonumber \\&\qquad \quad \Big \{2\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\Vert \textbf{x}_{{(s)},i}\Vert \Vert \textbf{u}_{(s)}-\textbf{u}_{(s)}^{k}\Vert \nonumber \\&\qquad +\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\Vert \textbf{u}_{(s)}-\textbf{u}_{(s)}^{k}\Vert \mathbb {E}\Vert \textbf{x}_{{(s)},i}\Vert \Big \} \nonumber \\&\quad \le \max _{1\le s\le S_n}3\triangle n p_s^{-1} \sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\nonumber \\&\qquad \max \left\{ \max _{1\le i \le n}\Vert \textbf{x}_{{(s)},i}\Vert ,\max _{1\le i \le n}\mathbb {E}\Vert \textbf{x}_{{(s)},i}\Vert \right\} \zeta _s\nonumber \\&\quad \le 3C_1 M_1^{-1}\triangle {p_{\max }}\sqrt{n^{-1}\log ({p_{\max }})}\nonumber \\&\quad =o( \triangle ^{3/2}p_{\min }\epsilon /2), \end{aligned}$$
(B8)

where the final equality follows from Condition 6. From (B8), it can be shown that

$$\begin{aligned}&\Pr \left( \max _{1\le s\le S_n}\sup _{\Vert \textbf{u}_{(s)}\Vert =\triangle } p_s^{-1}\bigg |\sum _{i=1}^{n}g_{s,i}(\textbf{u}_{(s)}) \bigg |>\triangle ^{3/2} \epsilon \right) \nonumber \\&\quad \le \Pr \Bigg (\max _{1\le s\le S_n}\max _{1\le k\le N_s} \sup _{\textbf{u}_{(s)}\in {\mathcal {U}}_s^{(k)}}p_s^{-1}\bigg |\sum _{i=1}^{n}g_{s,i}(\textbf{u}_{(s)})-\sum _{i=1}^{n}g_{s,i}(\textbf{u}_{(s)}^{k})\bigg |\nonumber \\&\qquad +\max _{1\le s\le S_n}\max _{1\le k \le N_s}p_s^{-1}\bigg |\sum _{i=1}^{n}g_{s,i}(\textbf{u}_{(s)}^{k}) \bigg |>\triangle ^{3/2}\epsilon \Bigg )\nonumber \\&\quad \le \Pr \Bigg (\max _{1\le s\le S_n}\max _{1\le k\le N_s} \sup _{\textbf{u}_{(s)}\in {\mathcal {U}}_s^{(k)}}\bigg |\sum _{i=1}^{n}g_{s,i}(\textbf{u}_{(s)})\nonumber \\&\qquad -\sum _{i=1}^{n}g_{s,i}(\textbf{u}_{(s)}^{k})\bigg |>\triangle ^{3/2}p_{\min }\epsilon /2\Bigg )\nonumber \\&\qquad +\sum _{s=1}^{S_n}\sum _{k=1}^{N_s}\Pr \left( \bigg |\sum _{i=1}^{n}g_{s,i}(\textbf{u}_{(s)}^{k}) \bigg |>\triangle ^{3/2}p_s\epsilon /2 \right) \nonumber \\&\quad = \sum _{s=1}^{S_n}\sum _{k=1}^{N_s}\Pr \left( \bigg |\sum _{i=1}^{n}g_{s,i}(\textbf{u}_{(s)}^{k}) \bigg |> \triangle ^{3/2}p_s\epsilon /2 \right) +o(1) \end{aligned}$$
(B9)

and \(\sum _{i=1}^{n}g_{s,i}(\textbf{u}_{(s)}^{k})\) is a sum of independent zero-mean random variables.
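
The covering argument replaces a supremum over the sphere by a maximum over finitely many centers, at the cost of a union bound over \(N_s\) events. The sketch below evaluates the logarithm of the bound on \(N_s\) quoted above with \(\zeta _s=(nM_1)^{-1}p_s\). The numerical values of `n`, `p_s`, `delta` and `m1` are arbitrary illustrations, not values from the paper.

```python
import math


def log_covering_bound(radius, zeta, dim):
    """log of ((4 * radius + zeta) / zeta) ** dim, the quoted bound on N_s."""
    return dim * math.log((4.0 * radius + zeta) / zeta)


# Illustrative values only.
n, p_s, delta, m1 = 500, 10, 2.0, 5.0
zeta_s = p_s / (n * m1)
log_n_s = log_covering_bound(delta, zeta_s, p_s + 1)
```

With this choice of \(\zeta _s\) the log-cardinality equals \((p_s+1)\log (4\triangle nM_1/p_s+1)\), so it grows roughly like \(p_s\log n\). The exponential tail bound for each center therefore has to decay faster than this for the union bound to vanish.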

By the bounded conditional density, under Conditions 1 and 4, recognising that \(\max _{1\le i \le n}\Vert \textbf{x}_{{(s)},i}\Vert \le C_1\sqrt{p_s}\), we have

$$\begin{aligned}&\Pr \left( |1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}|\le \sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})} \max _{1\le i \le n}\Vert \textbf{x}_{{(s)},i}\Vert \triangle \right) \nonumber \\&\quad =\Pr \Big (\pm 1-\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})} \max _{1\le i \le n}\Vert \textbf{x}_{{(s)},i}\Vert \triangle \le \textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\nonumber \\&\quad \le \sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})} \max _{1\le i \le n}\Vert \textbf{x}_{{(s)},i}\Vert \triangle \pm 1\Big \vert y_i=\pm 1\Big )\nonumber \\&\quad \le 2C_3\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\max _{1\le i \le n}\Vert \textbf{x}_{{(s)},i}\Vert \triangle \nonumber \\&\quad \le 2\triangle C_1 C_3\sqrt{n^{-1} p_s{p_{\max }}\log ({p_{\max }})}. \end{aligned}$$
(B10)

Note that when \(1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}<\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}y_i\textbf{x}_{{(s)},i}^{\top }\textbf{u}_{(s)}\) and \(\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}y_i\textbf{x}_{{(s)},i}^{\top }\textbf{u}_{(s)}<0\), or when \(1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}>\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}y_i\textbf{x}_{{(s)},i}^{\top }\textbf{u}_{(s)}\) and \(\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}y_i\textbf{x}_{{(s)},i}^{\top }\textbf{u}_{(s)}>0\),

$$\begin{aligned}&\left( 1-y_i\textbf{x}_{{(s)},i}^{\top }({\varvec{\beta }}^*_{(s)}+\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}) \right) _+ -\left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\right) _+\\&\quad +\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\, y_i\textbf{x}_{{(s)},i}^{\top }\textbf{u}_{(s)}{\textbf{1}}\left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\ge 0\right) =0. \end{aligned}$$
(B11)

Furthermore, equation (B11) holds when \(|1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}|>\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\max _{1\le i \le n}\Vert \textbf{x}_{{(s)},i}\Vert \triangle \) as \(\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\max _{1\le i \le n}\Vert \textbf{x}_{{(s)},i}\Vert \triangle >\bigg |\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}y_i\textbf{x}_{{(s)},i}^{\top }\textbf{u}_{(s)}\bigg |\).
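
Identity (B11) says that the hinge loss is exactly linear under a perturbation that does not carry the margin across the kink at zero. A quick numerical check, writing `a` for \(1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\) and `d` for \(\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}y_i\textbf{x}_{{(s)},i}^{\top }\textbf{u}_{(s)}\) (both names are shorthand introduced only for this sketch):

```python
def positive_part(t):
    """(t)_+ = max(t, 0)."""
    return max(0.0, t)


def linearisation_residual(a, d):
    """(a - d)_+ - (a)_+ + d * 1(a >= 0), the left-hand side of (B11)."""
    indicator = 1.0 if a >= 0 else 0.0
    return positive_part(a - d) - positive_part(a) + d * indicator


away_from_kink = [linearisation_residual(a, d)
                  for a, d in [(0.5, 0.25), (-0.5, 0.25), (0.5, -0.25)]]
near_kink = linearisation_residual(0.1, 0.3)
```

The residual vanishes whenever `abs(a) > abs(d)` and can be nonzero only when `a` is within `abs(d)` of zero. This is why only observations in the narrow band bounded in (B10) contribute to the second moment of \(g_{s,i}\).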

Hence we can write

$$\begin{aligned}&\sum _{i=1}^{n}\mathbb {E}\{g_{s,i}^2(\textbf{u}_{(s)}^{k})\}\\&\quad \le \sum _{i=1}^{n}\mathbb {E}\Bigg [\Bigg \{\Big |\left( 1-y_i\textbf{x}_{{(s)},i}^{\top }({\varvec{\beta }}^*_{(s)}+\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}^{k}) \right) _+ -\left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\right) _+\Big | +\Big |\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}y_i\textbf{x}_{{(s)},i}^{\top }\textbf{u}_{(s)}^{k} \Big |\Bigg \}^2\\&\qquad \times {\textbf{1}}\Big (|1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}|\le \sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\max _{1\le i \le n}\Vert \textbf{x}_{{(s)},i}\Vert \triangle \Big )\Bigg ]\\&\quad \le \sum _{i=1}^{n}\mathbb {E}\Bigg \{ \left( 2\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{x}_{{(s)},i}^{\top }\textbf{u}_{(s)}^{k}\right) ^2{\textbf{1}}\Big (|1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}|\le \sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\max _{1\le i \le n}\Vert \textbf{x}_{{(s)},i}\Vert \triangle \Big )\Bigg \}\\&\quad \le \left( 2\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\max _{1\le i \le n}\Vert \textbf{x}_{{(s)},i}\Vert \triangle \right) ^2\sum _{i=1}^{n}\mathbb {E}{\textbf{1}}\Big (|1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}|\le \sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\max _{1\le i \le n}\Vert \textbf{x}_{{(s)},i}\Vert \triangle \Big )\\&\quad \le 4 C_1^2\triangle ^2 n^{-1} p_s {p_{\max }}\log ({p_{\max }})\sum _{i=1}^{n}\mathbb {E}{\textbf{1}}\Big (|1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}|\le \sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\max _{1\le i \le n}\Vert \textbf{x}_{{(s)},i}\Vert \triangle \Big )\\&\quad \le 4 C_1^2\triangle ^2 n^{-1} p_s {p_{\max }}\log ({p_{\max }})\times 2n\triangle C_1 C_3\sqrt{n^{-1} p_s{p_{\max }}\log ({p_{\max }})}\\&\quad = 8\triangle ^3 C_1^3C_3n^{-1/2}p_s^{3/2}p^{3/2}_{\max }\log ^{3/2}({p_{\max }}), \end{aligned}$$
(B12)

where the second-to-last inequality arises from \(\max _{1\le i \le n}\Vert \textbf{x}_{{(s)},i}\Vert \le C_1\sqrt{p_s}\) and the last inequality is from (B10). Finally, by Bernstein’s inequality and recognising (B7) and (B12), we can write

$$\begin{aligned}&\sum _{s=1}^{S_n}\sum _{k=1}^{N_s}\Pr \left( \bigg |\sum _{i=1}^{n}g_{s,i}(\textbf{u}_{(s)}^{k}) \bigg |>\triangle ^{3/2}p_s\epsilon /2 \right) \\&\quad \le \sum _{s=1}^{S_n}\sum _{k=1}^{N_s}2\exp \left( -\frac{\triangle ^{3}p_s^2\epsilon ^2/4}{\sum _{i=1}^{n}\mathbb {E}\{g_{s,i}^2(\textbf{u}_{(s)}^{k})\}+3\triangle ^{5/2} C_1p_s{p_{\max }}\sqrt{n^{-1}\log ({p_{\max }})} \epsilon /2 }\right) \\&\quad \le \sum _{s=1}^{S_n}\left( \frac{4\triangle +(nM_1)^{-1}p_s}{(nM_1)^{-1}p_s} \right) ^{p_s+1}\exp \left( -\frac{\triangle ^{3}p_s^2\epsilon ^2/4}{ 8\triangle ^3 C_1^3 C_3 n^{-1/2}p_s^{3/2}p^{3/2}_{\max }\log ^{3/2}({p_{\max }})+ 3\triangle ^{5/2} C_1p_s{p_{\max }}\sqrt{n^{-1}\log ({p_{\max }})} \epsilon /2 }\right) \\&\quad \le S_n \left( \frac{4\triangle M_1n}{p_{\min }}+1 \right) ^{{p_{\max }}+1} \exp \left( -\frac{\triangle ^{3}p_{\min }^{1/2}\epsilon ^2/4}{ 16\triangle ^3 C_1^3 C_3 n^{-1/2}p^{3/2}_{\max }\log ^{3/2}({p_{\max }}) }\right) \\&\quad =O(1)\exp \Big \{\log (S_n)+({p_{\max }}+1)\log (4\triangle nM_1p^{-1}_{\min }+1)\\&\qquad -64^{-1}C_1^{-3}C_3^{-1}\epsilon ^2n^{1/2}p_{\min }^{1/2}p^{-3/2}_{\max }\log ^{-3/2}({p_{\max }})\Big \}\\&\quad =o(1), \end{aligned}$$
(B13)

where the last equality is due to Condition 6 and \(S_n=O\{\exp (n^{\tau })\}\) for \(\tau \in (0, 1/2-3\kappa /2)\). The proof of (B6) is complete by combining (B9) and (B13).
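
The last part of Step 1 combines Bernstein's inequality with a union bound over all models and all covering centers. The sketch below uses the textbook two-sided form of Bernstein's inequality for independent zero-mean variables bounded by `bound` in absolute value. Its constants differ slightly from those displayed in (B13), and the function names are illustrative.

```python
import math


def bernstein_tail_bound(t, variance_sum, bound):
    """Textbook Bernstein bound on P(|sum X_i| >= t):
    2 * exp(-t^2 / (2 * (variance_sum + bound * t / 3)))."""
    return 2.0 * math.exp(-t * t / (2.0 * (variance_sum + bound * t / 3.0)))


def log_union_bound(log_count, t, variance_sum, bound):
    """log of (count * tail bound); the union bound vanishes when this tends to -inf."""
    return log_count + math.log(bernstein_tail_bound(t, variance_sum, bound))


tail = bernstein_tail_bound(10.0, 5.0, 1.0)
```

In the proof the count is at most \(S_n\max _s N_s\), so on the log scale the requirement is that \(\log (S_n)+({p_{\max }}+1)\log (4\triangle nM_1p_{\min }^{-1}+1)\) is dominated by the exponent of the tail bound, which is where Condition 6 and the growth rate of \(S_n\) enter.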

Step 2: Let us rewrite \(B_{s,n}\) as \(B_{s,n}\equiv B_{s,n1}+B_{s,n2}\), where

$$\begin{aligned} B_{s,n1}&= -\sum _{i=1}^{n}\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}y_i \textbf{x}_{{(s)},i}^{\top }\textbf{u}_{(s)} {\textbf{1}}\left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\ge 0 \right) ,\\ \text {and} \\ B_{s,n2}&= \sum _{i=1}^{n}\Big [\mathbb {E}\left\{ \left( 1-y_i\textbf{x}_{{(s)},i}^{\top }({\varvec{\beta }}^*_{(s)}+\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}) \right) _+ \right\} -\mathbb {E}\left\{ \left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\right) _+\right\} \Big ]. \end{aligned}$$

To analyse \(B_{s,n1}\), we observe that

$$\begin{aligned}&\bigg |\sum _{i=1}^{n}y_i\textbf{x}_{{(s)},i}^{\top }\textbf{u}_{(s)}{\textbf{1}}\left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\ge 0 \right) \bigg |\nonumber \\&\quad =\bigg |\sum _{j=0}^{p_s}\sum _{i=1}^{n}y_i\textbf{x}_{{(s)},ij}u_{{(s)},j}{\textbf{1}}\left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\ge 0 \right) \bigg |\nonumber \\&\quad {\le }\sum _{j=0}^{p_s}|u_{{(s)},j}|\max _{0\le j\le p_s}\bigg |\sum _{i=1}^{n}y_i x_{{(s)},ij}{\textbf{1}}\left( 1{-}y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\ge 0 \right) \bigg |\nonumber \\&\quad \le \sqrt{\sum _{j=0}^{p_s}u_{{(s)},j}^2}\sqrt{\sum _{j=0}^{p_s}1}\max _{0\le j\le p_s}\bigg |\sum _{i=1}^{n}y_ix_{{(s)},ij}{\textbf{1}}\left( 1\right. \nonumber \\&\qquad \left. -y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\ge 0 \right) \bigg |\nonumber \\&\quad \le \sqrt{p_s+1} \triangle \max _{0\le j\le p_s}\bigg |\sum _{i=1}^{n}y_i x_{{(s)},ij}{\textbf{1}}\left( 1\right. \nonumber \\&\qquad \left. -y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\ge 0 \right) \bigg |. \end{aligned}$$
(B14)

By the definition of \({\textbf{J}}_s({\varvec{\beta }}^*_{(s)})\), note that \(\mathbb {E}\left[ y_ix_{{(s)},ij}{\textbf{1}}\left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\ge 0 \right) \right] =0\) for \(0\le j \le p_s\). By Lemma 14.24 in Bühlmann and van de Geer (2011) (the Nemirovski moment inequality),

$$\begin{aligned}&\mathbb {E}\left\{ \max _{0\le j\le p_s}\bigg |\sum _{i=1}^{n}y_ix_{{(s)},ij}{\textbf{1}}\left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\ge 0 \right) \bigg |\right\} \nonumber \\&\quad \le \sqrt{8\log (2p_s+2)}\mathbb {E}\left( \max _{0\le j\le p_s}\sum _{i=1}^{n}y_i^2x^2_{{(s)},ij}\right) ^{1/2}\nonumber \\&\quad \le \sqrt{ 8\log (2p_s+2)}\sqrt{nC_1^2}\nonumber \\&\quad =O(\sqrt{n\log (p_s)}), \end{aligned}$$
(B15)

where the last inequality is established by Condition 2. Additionally, using Markov’s inequality and by (B15), we obtain

$$\begin{aligned}&\max _{0\le j\le p_s}\bigg |\sum _{i=1}^{n}y_ix_{{(s)},ij}{\textbf{1}}\left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\ge 0 \right) \bigg |\nonumber \\&\quad = O_{p}(\sqrt{n \log (p_s)}). \end{aligned}$$
(B16)
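
The passage from (B15) to (B16) is a direct application of Markov's inequality. As a brief sketch, write \(Z_s\) for the maximum on the left-hand side of (B15) and \(K\) for the constant implied by the \(O(\cdot )\) bound there; both symbols are introduced here only for illustration. Then, for any \(M>0\),

$$\begin{aligned} \Pr \left\{ Z_s>M\sqrt{n\log (p_s)}\right\} \le \frac{\mathbb {E}(Z_s)}{M\sqrt{n\log (p_s)}}\le \frac{K}{M}, \end{aligned}$$

which can be made arbitrarily small by taking \(M\) large. This is exactly the statement \(Z_s=O_p(\sqrt{n\log (p_s)})\).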

Combining (B14) and (B16), we have

$$\begin{aligned}&\max _{1\le s\le S_n}\sup _{\Vert \textbf{u}_{(s)}\Vert =\triangle }|B_{s,n1}|\nonumber \\&\quad {=}\max _{1\le s\le S_n}\sup _{\Vert \textbf{u}_{(s)}\Vert =\triangle }\bigg |\sum _{i=1}^{n}{-}\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}y_i \textbf{x}_{{(s)},i}^{\top }\textbf{u}_{(s)}{\textbf{1}}\nonumber \\&\qquad \left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\ge 0 \right) \bigg |\nonumber \\&\quad =\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\max _{1\le s\le S_n}\sup _{\Vert \textbf{u}_{(s)}\Vert =\triangle }\bigg |\sum _{i=1}^{n}y_i\textbf{x}_{{(s)},i}^{\top }\textbf{u}_{(s)}{\textbf{1}}\nonumber \\&\qquad \left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\ge 0 \right) \bigg |\nonumber \\&\quad \le \sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})} \triangle O_p\left\{ \sqrt{{p_{\max }}+1}\sqrt{n\log ({p_{\max }})}\right\} \nonumber \\&\quad = O_{p}(\triangle {p_{\max }}\log ({p_{\max }})). \end{aligned}$$
(B17)

Turning to \(B_{s,n2}\), under Conditions 5 and 6 and according to Koo et al. (2008), \({\textbf{H}}_{(s)}({\varvec{\beta }}_{(s)})\) is element-wise continuous at \({\varvec{\beta }}^*_{(s)}\). By Taylor expansion of the hinge loss at \({\varvec{\beta }}^*_{(s)}\), we have

$$\begin{aligned} {\textbf{H}}_{(s)}\left( {\varvec{\beta }}^*_{(s)}+t\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}\right) ={\textbf{H}}_{(s)}({\varvec{\beta }}^*_{(s)})+o(1). \end{aligned}$$
(B18)

Hence, it is shown that

$$\begin{aligned}&\min _{1\le s\le S_n}\inf _{\Vert \textbf{u}_{(s)}\Vert =\triangle }B_{s,n2}\nonumber \\&\quad =\min _{1\le s\le S_n}\inf _{\Vert \textbf{u}_{(s)}\Vert =\triangle }\sum _{i=1}^{n}\nonumber \\&\qquad \Bigg [\mathbb {E}\left\{ \left( 1-y_i\textbf{x}_{{(s)},i}^{\top }({\varvec{\beta }}^*_{(s)}+\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}) \right) _+ \right\} \nonumber \\&\qquad -\mathbb {E}\left\{ (1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)})_+\right\} \Bigg ]\nonumber \\&\quad = \min _{1\le s\le S_n}\inf _{\Vert \textbf{u}_{(s)}\Vert =\triangle } 2^{-1}{p_{\max }}\log ({p_{\max }}) \textbf{u}_{(s)}^{\top }{\textbf{H}}_{(s)}\nonumber \\&\qquad \left( {\varvec{\beta }}^*_{(s)}+t\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}\right) \textbf{u}_{(s)}\nonumber \\&\quad \ge 2^{-1}\triangle ^2 c_0{p_{\max }}\log ({p_{\max }}), \end{aligned}$$
(B19)

for some \(0<t<1\), where the last inequality is due to (B18) and Condition 5. It can be readily shown by (B4), (B17), (B19) and Condition 6 that when \(\triangle \) is sufficiently large, \(2^{-1}\triangle ^2 c_0{p_{\max }}\log ({p_{\max }})\,(>0)\) dominates the other terms in \(B_{s,n}\). This completes the proof of Step 2.
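
To see where the quadratic form in (B19) comes from, the expansion can be sketched as follows. Here \(\delta _n=\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\) is shorthand introduced only for this illustration, and we assume that \({\textbf{H}}_{(s)}\) is the Hessian of the expected hinge loss and that Condition 5 bounds its smallest eigenvalue at \({\varvec{\beta }}^*_{(s)}\) from below by \(c_0\), as the last line of (B19) suggests. The first-order term vanishes because the expected gradient is zero at \({\varvec{\beta }}^*_{(s)}\), as noted before (B15), so that

$$\begin{aligned}&\sum _{i=1}^{n}\Big [\mathbb {E}\left\{ \left( 1-y_i\textbf{x}_{{(s)},i}^{\top }({\varvec{\beta }}^*_{(s)}+\delta _n\textbf{u}_{(s)}) \right) _+ \right\} -\mathbb {E}\left\{ \left( 1-y_i\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\right) _+\right\} \Big ]\nonumber \\&\quad =n\cdot \frac{\delta _n^2}{2}\textbf{u}_{(s)}^{\top }{\textbf{H}}_{(s)}\left( {\varvec{\beta }}^*_{(s)}+t\delta _n\textbf{u}_{(s)}\right) \textbf{u}_{(s)}\nonumber \\&\quad =2^{-1}{p_{\max }}\log ({p_{\max }})\textbf{u}_{(s)}^{\top }{\textbf{H}}_{(s)}\left( {\varvec{\beta }}^*_{(s)}+t\delta _n\textbf{u}_{(s)}\right) \textbf{u}_{(s)}, \end{aligned}$$

using \(n\delta _n^2={p_{\max }}\log ({p_{\max }})\). With \(\Vert \textbf{u}_{(s)}\Vert =\triangle \), the eigenvalue bound then gives the factor \(\triangle ^2c_0\).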

Step 3: Combining (B3), (B6), (B17) and (B19), when n and \(\triangle \) are sufficiently large, we have

$$\begin{aligned}&\max _{1\le s\le S_n}\inf _{\Vert \textbf{u}_{(s)}\Vert =\triangle } \left\{ l_s\left( {\varvec{\beta }}^*_{(s)}+\sqrt{n^{-1}p_{\max }\log ({p_{\max }})}\textbf{u}_{(s)}\right) -l_s({\varvec{\beta }}^*_{(s)})\right\} \nonumber \\&\quad =\max _{1\le s\le S_n}\inf _{\Vert \textbf{u}_{(s)}\Vert =\triangle }\Big \{ n^{-1}(A_{s,n}+B_{s,n})+2^{-1}\lambda _n \left\| {\varvec{\beta }}^{*+}_{(s)}\right. \nonumber \\&\qquad \left. +\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}^+\right\| ^2 -2^{-1}\lambda _n \Vert {\varvec{\beta }}^{*+}_{(s)}\Vert ^2\Big \}\nonumber \\&\quad \ge \max _{1\le s\le S_n}\inf _{\Vert \textbf{u}_{(s)}\Vert =\triangle }\Big \{ n^{-1}B_{s,n}-n^{-1}|A_{s,n}|-2^{-1}\lambda _n \Big |\left\| {\varvec{\beta }}^{*+}_{(s)}\right. \nonumber \\&\qquad \left. +\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}^+\right\| ^2- \Vert {\varvec{\beta }}^{*+}_{(s)}\Vert ^2\Big |\Big \}\nonumber \\&\quad \ge \min _{1\le s\le S_n}\inf _{\Vert \textbf{u}_{(s)}\Vert =\triangle }n^{-1} B_{s,n2}-\max _{1\le s\le S_n}\sup _{\Vert \textbf{u}_{(s)}\Vert =\triangle }n^{-1}|B_{s,n1}|\nonumber \\&\qquad -\max _{1\le s\le S_n}\sup _{\Vert \textbf{u}_{(s)}\Vert =\triangle }n^{-1}|A_{s,n}|\nonumber \\&\qquad -2^{-1}\lambda _n \max _{1\le s\le S_n}\sup _{\Vert \textbf{u}_{(s)}\Vert =\triangle } \Big |\left\| {\varvec{\beta }}^{*+}_{(s)}+\sqrt{n^{-1}{p_{\max }}\log ({p_{\max }})}\textbf{u}_{(s)}^+\right\| ^2\nonumber \\&\qquad - \Vert {\varvec{\beta }}^{*+}_{(s)}\Vert ^2\Big |\nonumber \\&\quad =2^{-1}n^{-1}\triangle ^2 c_0{p_{\max }}\log ({p_{\max }})- O_{p}(\triangle n^{-1} {p_{\max }}\log ({p_{\max }}))\nonumber \\&\qquad -\triangle ^{3/2}n^{-1}{p_{\max }}o_{p}(1)\nonumber \\&\qquad -2^{-1}\triangle \lambda _n{p_{\max }}\sqrt{n^{-1}\log ({p_{\max }})}\nonumber \\&>0, \end{aligned}$$
(B20)

where the last inequality is obtained from Conditions 5 and 6 and \(\lambda _n=O(\sqrt{n^{-1}\log ({p_{\max }})})\). This completes the proof of (B1).

Equation (10) can be proved in a similar way. Note that \(n-\lfloor n/J \rfloor \sim n\) and each sample from \({\mathcal {D}}_n\) is drawn independently from an identical distribution. Hence \({\widetilde{{\varvec{\beta }}}}^{[-j]}_{(s)}\) converges to \({\varvec{\beta }}^*_{(s)}\) in the same order as \({\hat{{\varvec{\beta }}}}_{(s)}\) for each \(j=1,2,\ldots ,J\), i.e.,

$$\begin{aligned} \max _{1\le j\le J} \max _{1\le s\le S_n}\left\| {\widetilde{{\varvec{\beta }}}}_{(s)}^{[-j]}-{\varvec{\beta }}^*_{(s)}\right\| =O_p\left( \sqrt{\frac{{p_{\max }}\log ({p_{\max }})}{n}}\right) . \end{aligned}$$
(B21)

B.2 Proof of Theorem 1

Let us introduce Lemma 2 that facilitates the proof of Theorem 1.

Lemma 2

Assume that Condition 7 and

$$\begin{aligned} \sup _{\textbf{w}\in {{\mathcal {W}}}}\bigg |\frac{\textrm{CV}(\textbf{w})-R_n(\textbf{w})}{R_n(\textbf{w})}\bigg |=o_p(1) \end{aligned}$$
(B22)

hold. Then

$$\begin{aligned} \frac{R_n({\hat{\textbf{w}}})}{\inf _{\textbf{w}\in {\mathcal {W}}} R_n(\textbf{w})}\rightarrow 1 \end{aligned}$$
(B23)

in probability, where \({\hat{\textbf{w}}}\) is the optimal solution from (8).

Proof of Lemma 2

By the definition of infimum, there exist a sequence \(\vartheta _n\) and a vector sequence \(\textbf{w}_n\in {{\mathcal {W}}}\) such that as \(n\rightarrow \infty \), \(\vartheta _n\rightarrow 0\) and

$$\begin{aligned} \inf _{\textbf{w}\in {{\mathcal {W}}}}R_n(\textbf{w})=R_n(\textbf{w}_n)-\vartheta _n. \end{aligned}$$
(B24)

From Condition 7, we have

$$\begin{aligned} \frac{R_n(\textbf{w}_n)}{\inf _{\textbf{w}\in {{\mathcal {W}}}}R_n(\textbf{w})}&>\frac{\inf _{\textbf{w}\in {{\mathcal {W}}}}R_n(\textbf{w})}{\inf _{\textbf{w}\in {{\mathcal {W}}}}R_n(\textbf{w})}=1, \end{aligned}$$
(B25)

and

$$\begin{aligned} \frac{\vartheta _n}{\inf _{\textbf{w}\in {{\mathcal {W}}}}R_n(\textbf{w})}&=o_p(1). \end{aligned}$$
(B26)

Taking (B22), (B25) and (B26) together, for any \(\delta >0\),

$$\begin{aligned}&\Pr \left\{ \bigg |\frac{\inf _{\textbf{w}\in {\mathcal {W}}} R_n(\textbf{w})}{R_n({\hat{\textbf{w}}})}-1\bigg |>\delta \right\} \nonumber \\&\quad =\Pr \left\{ \frac{R_n({\hat{\textbf{w}}})-\inf _{\textbf{w}\in {\mathcal {W}}} R_n(\textbf{w})}{R_n({\hat{\textbf{w}}})}>\delta \right\} \nonumber \\&\quad =\Pr \left\{ \frac{R_n({\hat{\textbf{w}}})-{\text {CV}}({\hat{\textbf{w}}})+{\text {CV}}({\hat{\textbf{w}}})-R_n(\textbf{w}_n)+\vartheta _n}{R_n({\hat{\textbf{w}}})}>\delta \right\} \nonumber \\&\quad \le \Pr \left\{ \frac{R_n({\hat{\textbf{w}}})-{\text {CV}}({\hat{\textbf{w}}})+{\text {CV}}(\textbf{w}_n)-R_n(\textbf{w}_n)+\vartheta _n}{R_n({\hat{\textbf{w}}})}>\delta \right\} \nonumber \\&\quad \le \Pr \left\{ \frac{|R_n({\hat{\textbf{w}}})-{\text {CV}}({\hat{\textbf{w}}})|}{R_n({\hat{\textbf{w}}})} +\frac{|{\text {CV}}(\textbf{w}_n)-R_n(\textbf{w}_n)|}{\inf _{\textbf{w}\in {{\mathcal {W}}}}R_n(\textbf{w})}\right. \nonumber \\&\qquad \left. +\frac{\vartheta _n}{\inf _{\textbf{w}\in {{\mathcal {W}}}}R_n(\textbf{w})}>\delta \right\} \nonumber \\&\quad \le \Pr \left\{ \sup _{\textbf{w}\in {{\mathcal {W}}}}\bigg |\frac{R_n(\textbf{w})-{\text {CV}}(\textbf{w})}{R_n(\textbf{w})}\bigg |\right. \nonumber \\&\qquad \left. 
+\frac{|{\text {CV}}(\textbf{w}_n)-R_n(\textbf{w}_n)|/R_n(\textbf{w}_n)}{\inf _{\textbf{w}\in {{\mathcal {W}}}}R_n(\textbf{w})/R_n(\textbf{w}_n)}+\frac{\vartheta _n}{\inf _{\textbf{w}\in {{\mathcal {W}}}}R_n(\textbf{w})}>\delta \right\} \nonumber \\&\quad \le \Pr \Bigg \{\sup _{\textbf{w}\in {{\mathcal {W}}}}\bigg |\frac{R_n(\textbf{w})-{\text {CV}}(\textbf{w})}{R_n(\textbf{w})}\bigg |\nonumber \\&\qquad +\sup _{\textbf{w}\in {{\mathcal {W}}}}\bigg |\frac{{\text {CV}}(\textbf{w})-R_n(\textbf{w})}{R_n(\textbf{w})}\bigg |\frac{R_n(\textbf{w}_n)}{\inf _{\textbf{w}\in {{\mathcal {W}}}}R_n(\textbf{w})}\nonumber \\&\qquad +\frac{\vartheta _n}{\inf _{\textbf{w}\in {{\mathcal {W}}}}R_n(\textbf{w})}>\delta \Bigg \}\nonumber \\&\quad \rightarrow 0, \end{aligned}$$
(B27)

which implies that (B23) is valid.
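
Two ingredients drive the chain in (B27), and they can be sketched in one display. We assume here that (8) defines \({\hat{\textbf{w}}}\) as a minimiser of \(\textrm{CV}(\textbf{w})\) over \({{\mathcal {W}}}\), so that \(\textrm{CV}({\hat{\textbf{w}}})\le \textrm{CV}(\textbf{w}_n)\). Combined with (B24), this gives

$$\begin{aligned} 0\le R_n({\hat{\textbf{w}}})-\inf _{\textbf{w}\in {{\mathcal {W}}}}R_n(\textbf{w})&=\{R_n({\hat{\textbf{w}}})-\textrm{CV}({\hat{\textbf{w}}})\}+\{\textrm{CV}({\hat{\textbf{w}}})-R_n(\textbf{w}_n)\}+\vartheta _n\nonumber \\&\le \{R_n({\hat{\textbf{w}}})-\textrm{CV}({\hat{\textbf{w}}})\}+\{\textrm{CV}(\textbf{w}_n)-R_n(\textbf{w}_n)\}+\vartheta _n. \end{aligned}$$

Dividing by \(R_n({\hat{\textbf{w}}})\ge \inf _{\textbf{w}\in {{\mathcal {W}}}}R_n(\textbf{w})\), the first two terms are controlled by (B22) and the third by (B26). The ratio \(R_n(\textbf{w}_n)/\inf _{\textbf{w}\in {{\mathcal {W}}}}R_n(\textbf{w})\) equals \(1+o_p(1)\) by (B24) and (B26).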

Proof of Theorem 1

Let

$$\begin{aligned} T_n(\textbf{w})=\frac{1}{n}\sum _{i=1}^{n}\left( 1-y_i\textbf{x}_i^{\top }{\hat{{\varvec{\beta }}}}(\textbf{w})\right) _+. \end{aligned}$$
(B28)

By Lemma 2 and the triangle inequality, it suffices to verify that

$$\begin{aligned} \sup _{\textbf{w}\in {{\mathcal {W}}}}\frac{|{\text {CV}}(\textbf{w})-T_n(\textbf{w})|}{R_n(\textbf{w})}=o_p(1), \end{aligned}$$
(B29)

and

$$\begin{aligned} \sup _{\textbf{w}\in {{\mathcal {W}}}}\frac{|T_n(\textbf{w})-R_n(\textbf{w})|}{R_n(\textbf{w})}=o_p(1). \end{aligned}$$
(B30)

For (B29), we have

$$\begin{aligned}&|{\text {CV}}(\textbf{w})-T_n(\textbf{w})|=\bigg |\frac{1}{n}\sum _{j=1}^{J}\sum _{i\in \mathcal{A}(j)}\left\{ (1-y_i\textbf{x}_i^{\top }{\widetilde{{\varvec{\beta }}}}^{[-j]}(\textbf{w}))_+\right. \nonumber \\&\qquad \left. -(1-y_i\textbf{x}_i^{\top }{\hat{{\varvec{\beta }}}}(\textbf{w}))_+\right\} \bigg |\nonumber \\&\quad \le \frac{1}{n}\sum _{j=1}^{J}\sum _{i\in \mathcal{A}(j)}\bigg |\int _{y_i\textbf{x}_i^{\top }{\widetilde{{\varvec{\beta }}}}^{[-j]}(\textbf{w})}^{y_i\textbf{x}_i^{\top }{\hat{{\varvec{\beta }}}}(\textbf{w})}I(t\le 1)\textrm{d}t \bigg |\nonumber \\&\quad \le \frac{1}{n}\sum _{j=1}^{J}\sum _{i\in \mathcal{A}(j)}\bigg |y_i\textbf{x}_i^{\top }\left( {\widetilde{{\varvec{\beta }}}}^{[-j]}(\textbf{w})-{\hat{{\varvec{\beta }}}}(\textbf{w})\right) \bigg |\nonumber \\&\quad \le \frac{1}{n}\sum _{j=1}^{J}\sum _{i\in \mathcal{A}(j)}\sum _{s=1}^{S_n}w_s\Vert \textbf{x}_{{(s)},i}\Vert \left\| {\widetilde{{\varvec{\beta }}}}_{(s)}^{[-j]}-{\hat{{\varvec{\beta }}}}_{(s)}\right\| \nonumber \\&\quad \le \frac{1}{n}\sum _{i=1}^{n}\max _{1\le s\le S_n}\Vert \textbf{x}_{{(s)},i}\Vert \max _{1\le j\le J}\max _{1\le s\le S_n}\left\| {\widetilde{{\varvec{\beta }}}}_{(s)}^{[-j]}-{\hat{{\varvec{\beta }}}}_{(s)}\right\| \nonumber \\&\quad \le C_1\sqrt{{p_{\max }}}\max _{1\le j\le J}\max _{1\le s\le S_n}\left\| {\widetilde{{\varvec{\beta }}}}_{(s)}^{[-j]}-{\hat{{\varvec{\beta }}}}_{(s)}\right\| \nonumber \\&\quad =O_p\left( \frac{{p_{\max }}\sqrt{\log ({p_{\max }})}}{\sqrt{n}}\right) \nonumber \\&\quad =o_p(1), \end{aligned}$$
(B31)

where the second-to-last equality follows from Lemma 1, and the last equality follows from Condition 6. Combining (B31) with Condition 7, we obtain (B29).
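
The second and third lines of (B31) rest on the fact that the hinge loss is 1-Lipschitz in the margin, that is, \(|(1-a)_+-(1-b)_+|\le |a-b|\). A minimal numerical sketch of that property is given below; the helper names `hinge` and `lipschitz_gap` are hypothetical and chosen only for this illustration.

```python
import random


def hinge(margin: float) -> float:
    """Hinge loss (1 - margin)_+ evaluated at a margin y * x^T beta."""
    return max(0.0, 1.0 - margin)


def lipschitz_gap(a: float, b: float) -> float:
    """Return |a - b| - |hinge(a) - hinge(b)|; non-negative when the bound holds."""
    return abs(a - b) - abs(hinge(a) - hinge(b))


# Check the bound on many random pairs of margins (illustrative range only).
random.seed(0)
pairs = [(random.uniform(-5.0, 5.0), random.uniform(-5.0, 5.0)) for _ in range(10_000)]
worst_gap = min(lipschitz_gap(a, b) for a, b in pairs)
```

The smallest gap stays non-negative up to floating-point rounding, which is what allows the difference of hinge losses to be replaced by the difference of the linear predictors.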

To prove (B30), note that

$$\begin{aligned}&|T_n(\textbf{w})-R_n(\textbf{w})|=\bigg |\frac{1}{n}\sum _{i=1}^{n}\left( 1-y_i\textbf{x}_i^{\top }{\hat{{\varvec{\beta }}}}(\textbf{w}) \right) _+\nonumber \\&\qquad -\mathbb {E}\left\{ (1-y\textbf{x}^{\top }{\hat{{\varvec{\beta }}}}(\textbf{w}) )_+\mid {\mathcal {D}}_n\right\} \bigg |\nonumber \\&\quad \le \bigg |\frac{1}{n}\sum _{i=1}^{n}\left( 1{-}y_i\textbf{x}_i^{\top }{\hat{{\varvec{\beta }}}}(\textbf{w}) \right) _+{-}\frac{1}{n}\sum _{i=1}^{n}\left( 1{-}y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*(\textbf{w}) \right) _+\bigg |\nonumber \\&\qquad +\bigg |\frac{1}{n}\sum _{i=1}^{n}\left( 1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*(\textbf{w}) \right) _+-\mathbb {E}\left( 1-y\textbf{x}^{\top }{\varvec{\beta }}^*(\textbf{w})\right) _+\bigg |\nonumber \\&\qquad {+}\bigg |\mathbb {E}\left( 1{-}y\textbf{x}^{\top }{\varvec{\beta }}^*(\textbf{w})\right) _+{-}\mathbb {E}\left\{ (1{-}y\textbf{x}^{\top }{\hat{{\varvec{\beta }}}}(\textbf{w}))_+ \mid {\mathcal {D}}_n\right\} \bigg |\nonumber \\&\quad \equiv |\Omega _1(\textbf{w})|+|\Omega _2(\textbf{w})|+|\Omega _3(\textbf{w})|. \end{aligned}$$
(B32)

Recognising the above, Lemma 1 and Conditions 3 and 6, it can be shown that

$$\begin{aligned}&\sup _{\textbf{w}\in {{\mathcal {W}}}}|\Omega _1(\textbf{w})|\le \sup _{\textbf{w}\in {{\mathcal {W}}}}\frac{1}{n}\sum _{i=1}^{n}\nonumber \\&\qquad \bigg |\left( 1-y_i\textbf{x}_i^{\top }{\hat{{\varvec{\beta }}}}(\textbf{w}) \right) _+- \left( 1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*(\textbf{w}) \right) _+\bigg |\nonumber \\&\quad \le \sup _{\textbf{w}\in {{\mathcal {W}}}}\frac{1}{n}\sum _{i=1}^{n}\bigg |y_i\textbf{x}_i^{\top }\left( {\varvec{\beta }}^*(\textbf{w})-{\hat{{\varvec{\beta }}}}(\textbf{w})\right) \bigg |\nonumber \\&\quad \le \sup _{\textbf{w}\in {{\mathcal {W}}}}\frac{1}{n}\sum _{i=1}^{n}\sum _{s=1}^{S_n}w_s \Vert \textbf{x}_{{(s)},i}\Vert \left\| {\varvec{\beta }}^*_{(s)}-{\hat{{\varvec{\beta }}}}_{(s)}\right\| \nonumber \\&\quad \le \max _{1\le s\le S_n}\left\| {\varvec{\beta }}^*_{(s)}-{\hat{{\varvec{\beta }}}}_{(s)}\right\| \max _{1\le i \le n}\max _{1\le s\le S_n}\Vert \textbf{x}_{{(s)},i}\Vert \nonumber \\&\quad =O_p\left( \frac{{p_{\max }}\sqrt{\log ({p_{\max }})}}{\sqrt{n}}\right) \nonumber \\&\quad =o_p(1). \end{aligned}$$
(B33)

Define

$$\begin{aligned} |\textbf{w}-\textbf{w}'|_1=\sum _{s=1}^{S_n}|w_s-w'_s|, \end{aligned}$$
(B34)

for any \(\textbf{w}=(w_1,\ldots ,w_{S_n})\in {{\mathcal {W}}}\) and \(\textbf{w}'=(w'_1,\ldots ,w'_{S_n})\in {{\mathcal {W}}}\). Let \(h_n=1/( {p_{\max }}\log n)\) and create grids using regions of the form \({{\mathcal {W}}}^{(l)}=\{\textbf{w}:|\textbf{w}-\textbf{w}^{(l)}|_1\le h_n\}\). By the notion of the \(\epsilon \)-covering number introduced by van der Vaart and Wellner (1996), \({{\mathcal {W}}}\) can be covered with \(N=O(1/h_n^{S_n-1})\) regions \({{\mathcal {W}}}^{(l)}\), \(l=1,\ldots ,N.\)
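
The covering argument can be made concrete with a small numerical sketch. The grid below (weights rounded to multiples of 1/m) is one convenient construction chosen for illustration and is not necessarily the one the authors have in mind. The names `round_to_grid`, `S`, `m` and `h` are hypothetical. Under this rounding rule, every point of the weight simplex lies within \(\ell _1\)-distance S/(2m) of a grid point, and for a fixed number of models S the number of grid points grows like \((1/h)^{S-1}\).

```python
import math
import random


def round_to_grid(w, m):
    """Round a simplex point w to a grid point whose coordinates are multiples of 1/m."""
    scaled = [m * ws for ws in w]
    counts = [math.floor(v) for v in scaled]
    leftover = m - sum(counts)  # integer number of 1/m units still unassigned
    # Hand the leftover units to the coordinates with the largest fractional parts.
    by_fraction = sorted(range(len(w)), key=lambda s: scaled[s] - counts[s], reverse=True)
    for s in by_fraction[:leftover]:
        counts[s] += 1
    return [k / m for k in counts]


def l1_distance(w, v):
    """The distance |w - v|_1 defined in (B34)."""
    return sum(abs(a - b) for a, b in zip(w, v))


def random_simplex_point(size, rng):
    """Draw a weight vector with non-negative entries summing to one."""
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


S, m = 4, 20                              # illustrative number of models and grid resolution
h = S / (2 * m)                           # covering radius guaranteed by the rounding rule
n_centres = math.comb(m + S - 1, S - 1)   # number of grid points on the simplex

rng = random.Random(0)
worst = max(
    l1_distance(w, round_to_grid(w, m))
    for w in (random_simplex_point(S, rng) for _ in range(5000))
)
```

In this toy setting `worst` never exceeds `h`, so the `n_centres` balls of radius `h` centred at the grid points cover the simplex.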

Note that

$$\begin{aligned}&\sup _{\textbf{w}\in {{\mathcal {W}}}^{(l)}}|\Omega _2(\textbf{w})-\Omega _2(\textbf{w}^{(l)})|\nonumber \\&\quad \le \sup _{\textbf{w}\in {{\mathcal {W}}}^{(l)}}\bigg |\frac{1}{n}\sum _{i=1}^{n}\left( 1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*(\textbf{w}) \right) _+\nonumber \\&\qquad -\frac{1}{n}\sum _{i=1}^{n}\left( 1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*(\textbf{w}^{(l)}) \right) _+\bigg |\nonumber \\&\qquad +\sup _{\textbf{w}\in {{\mathcal {W}}}^{(l)}}\bigg |\mathbb {E}\left( 1-y\textbf{x}^{\top }{\varvec{\beta }}^*(\textbf{w})\right) _+\nonumber \\&\qquad -\mathbb {E}\left( 1-y\textbf{x}^{\top }{\varvec{\beta }}^*(\textbf{w}^{(l)})\right) _+\bigg |\nonumber \\&\quad \le \sup _{\textbf{w}\in {{\mathcal {W}}}^{(l)}}\frac{1}{n}\sum _{i=1}^{n}\bigg |y_i\textbf{x}_i^{\top }\{{\varvec{\beta }}^*(\textbf{w}^{(l)})-{\varvec{\beta }}^*(\textbf{w})\} \bigg |\nonumber \\&\qquad +\sup _{\textbf{w}\in {{\mathcal {W}}}^{(l)}}\mathbb {E}\bigg |y\textbf{x}^{\top }\{{\varvec{\beta }}^*(\textbf{w}^{(l)})-{\varvec{\beta }}^*(\textbf{w})\} \bigg |\nonumber \\&\quad \le \sup _{\textbf{w}\in {{\mathcal {W}}}^{(l)}}\frac{1}{n}\sum _{i=1}^{n}\sum _{s=1}^{S_n}|w_s-w^{(l)}_s|\bigg |\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\bigg |\nonumber \\&\qquad +\sup _{\textbf{w}\in {{\mathcal {W}}}^{(l)}}\sum _{s=1}^{S_n}|w_s-w^{(l)}_s|\mathbb {E}\bigg |\textbf{x}_{{(s)},i}^{\top }{\varvec{\beta }}^*_{(s)}\bigg |\nonumber \\&\quad =\sup _{\textbf{w}\in {{\mathcal {W}}}^{(l)}}|\textbf{w}-\textbf{w}^{(l)}|_1\max _{1\le s\le S_n}\Vert {\varvec{\beta }}^*_{(s)}\Vert \nonumber \\&\qquad \left( \max _{1\le i \le n}\max _{1\le s\le S_n}\Vert \textbf{x}_{{(s)},i}\Vert + \max _{1\le i \le n}\max _{1\le s\le S_n}\mathbb {E}\Vert \textbf{x}_{{(s)},i}\Vert \right) \nonumber \\&\quad \le \frac{C_2\sqrt{{p_{\max }}}}{{p_{\max }}\log (n)}2C_1\sqrt{{p_{\max }}}\nonumber \\&\quad =O_p( \log ^{-1}(n))\nonumber \\&\quad =o_p(1), \end{aligned}$$
(B35)

where the bound holds uniformly in \(l\). Hence we have

$$\begin{aligned} \sup _{\textbf{w}\in {{\mathcal {W}}}}|\Omega _2(\textbf{w})|&=\max _{1\le l\le N}\sup _{\textbf{w}\in {{\mathcal {W}}}^{(l)}}|\Omega _2(\textbf{w})|\nonumber \\&\le \max _{1\le l\le N}|\Omega _2(\textbf{w}^{(l)})|\nonumber \\&\quad +\max _{1\le l\le N}\sup _{\textbf{w}\in {{\mathcal {W}}}^{(l)}}|\Omega _2(\textbf{w})-\Omega _2(\textbf{w}^{(l)})|\nonumber \\&=\max _{1\le l\le N}|\Omega _2({\textbf{w}}^{(l)})|+o_p(1). \end{aligned}$$
(B36)

Furthermore, for any \(\epsilon >0\),

$$\begin{aligned}&\Pr \left\{ \max _{1\le l\le N}|\Omega _2({\textbf{w}}^{(l)})|> 3\epsilon \right\} \nonumber \\&\quad =\Pr \Bigg [\max _{1\le l\le N}\Big \vert \frac{1}{n}\sum _{i=1}^{n}\left( 1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})\right) _+{\textbf{1}}\nonumber \\&\qquad \left( |1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})|<{p_{\max }}n^{0.1}\right) \nonumber \\&\qquad +\frac{1}{n}\sum _{i=1}^{n}\left( 1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})\right) _+\nonumber \\&\qquad {\textbf{1}}\left( |1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})|\ge {p_{\max }}n^{0.1}\right) \nonumber \\&\qquad -\mathbb {E}\left\{ (1-y\textbf{x}^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)}))_+{\textbf{1}}\right. \nonumber \\&\qquad \left. \left( |1-y\textbf{x}^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})|< {p_{\max }}n^{0.1}\right) \right\} \nonumber \\&\qquad -\mathbb {E}\left\{ (1-y\textbf{x}^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)}))_+{\textbf{1}}\right. \nonumber \\&\qquad \left. \left( |1-y\textbf{x}^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})|\ge {p_{\max }}n^{0.1}\right) \right\} \Big \vert> 3\epsilon \Bigg ]\nonumber \\&\quad \le \Pr \Bigg [\max _{1\le l\le N}\Big \vert \frac{1}{n}\sum _{i=1}^{n}\left( 1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})\right) _+{\textbf{1}}\nonumber \\&\qquad \left( |1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})|<{p_{\max }}n^{0.1}\right) \nonumber \\&\qquad -\mathbb {E}\left\{ (1-y\textbf{x}^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)}))_+{\textbf{1}}\right. \nonumber \\&\qquad \left. 
\left( |1-y\textbf{x}^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})|< {p_{\max }}n^{0.1}\right) \right\} \Big \vert>\epsilon \Bigg ]\nonumber \\&\qquad +\Pr \Bigg [\max _{1\le l\le N}\frac{1}{n}\sum _{i=1}^{n}\left( 1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})\right) _+{\textbf{1}}\nonumber \\&\qquad \left( |1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})|\ge {p_{\max }}n^{0.1}\right)>\epsilon \Bigg ]\nonumber \\&\qquad +\Pr \Bigg [\max _{1\le l\le N}\mathbb {E}\left\{ (1-y\textbf{x}^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)}))_+{\textbf{1}}\right. \nonumber \\&\qquad \left. \left( |1-y\textbf{x}^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})|\ge {p_{\max }}n^{0.1}\right) \right\} > \epsilon \Bigg ]\nonumber \\&\quad \equiv \Xi _1+\Xi _2+\Xi _3. \end{aligned}$$
(B37)

Clearly,

$$\begin{aligned}&\sum _{i=1}^{n}\mathbb {E}\left\{ (1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)}))_+\right\} ^2\nonumber \\&\quad \le \sum _{i=1}^{n}\mathbb {E}\bigg |1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})\bigg |^2\nonumber \\&\quad \le \sum _{i=1}^{n}\mathbb {E}\left( 1+2|\textbf{x}_i^{\top }{\varvec{\beta }}^*(\textbf{w}^{(l)})|+\textbf{x}_i^{\top }{\varvec{\beta }}^*(\textbf{w}^{(l)}) {\varvec{\beta }}^{*\top }(\textbf{w}^{(l)})\textbf{x}_i\right) \nonumber \\&\quad \le \sum _{i=1}^{n}\mathbb {E}\left( 1+2\max _{1\le s\le S_n}\Vert \textbf{x}_{{(s)},i}\Vert \Vert {\varvec{\beta }}^*_{(s)}\Vert \right. \nonumber \\&\quad \left. +\max _{1\le s\le S_n}\Vert {\varvec{\beta }}^*_{(s)}\Vert ^2\Vert \textbf{x}_{{(s)},i}\Vert ^2 \right) \nonumber \\&\quad \le 4C^2_1C^2_2 n p^2_{\max }. \end{aligned}$$
(B38)

Using Boole’s and Bernstein’s inequalities and by (B38),

$$\begin{aligned} \Xi _1&\le \sum _{l=1}^N\Pr \Bigg [\Big \vert \frac{1}{n}\sum _{i=1}^{n}\left( 1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})\right) _+{\textbf{1}}\nonumber \\&\quad \left( |1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})|<{p_{\max }}n^{0.1}\right) \nonumber \\&\quad -\mathbb {E}\left\{ (1-y\textbf{x}^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)}))_+{\textbf{1}}\right. \nonumber \\&\quad \left. \left( |1-y\textbf{x}^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})|< {p_{\max }}n^{0.1}\right) \right\} \Big \vert >\epsilon \Bigg ]\nonumber \\&\le N\exp \left( - \frac{n^2\epsilon ^2/2}{4C_1^2C_2^2 np_{\max }^2+\epsilon {p_{\max }}n^{0.1}/3}\right) \nonumber \\&\le ({p_{\max }}\log n)^{S_n-1}\exp \left( {-} \frac{n^2\epsilon ^2/2}{4C_1^2C_2^2 np_{\max }^2{+}\epsilon {p_{\max }}n^{0.1}/3}\right) \nonumber \\&=O\left\{ \exp \left( -\epsilon ^2 n p^{-2}_{\max }{+} S_n \log ({p_{\max }})+S_n\log \log (n)\right) \right\} \nonumber \\&=o(1), \end{aligned}$$
(B39)

where the last equality is established from Condition 6 and the condition that \(S_n=O(n^\tau )\) for \(\tau \in (0,1-2\kappa )\). Additionally, we can write
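
For reference, the standard form of Bernstein's inequality behind a bound of this type is sketched below. The symbols \(X_i\), \(M\) and \(t\) are generic and introduced here only for illustration; they are not notation from the paper. For independent, mean-zero random variables \(X_1,\ldots ,X_n\) with \(|X_i|\le M\) and any \(t>0\),

$$\begin{aligned} \Pr \left( \bigg |\sum _{i=1}^{n}X_i\bigg |>t\right) \le 2\exp \left( -\frac{t^2/2}{\sum _{i=1}^{n}\mathbb {E}(X_i^2)+Mt/3}\right) . \end{aligned}$$

In the setting of (B39), the \(X_i\) play the role of the centred truncated hinge losses, the truncation at \({p_{\max }}n^{0.1}\) supplies the bound \(M\) up to a constant, \(t\) is proportional to \(n\epsilon \), and (B38) controls the sum of second moments. Boole's inequality then multiplies the resulting bound by the number \(N\) of grid points.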

$$\begin{aligned}&\Xi _2=\Pr \Bigg \{\max _{1\le l\le N}\frac{1}{n}\sum _{i=1}^{n}\left( 1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})\right) _+{\textbf{1}}\nonumber \\&\qquad \left( |1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})|\ge {p_{\max }}n^{0.1}\right) >\epsilon \Bigg \}\nonumber \\&\quad \le \Pr \left( \max _{1\le l\le N}\max _{1\le i \le n}|1-y_i\textbf{x}_i^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})|\ge {p_{\max }}n^{0.1}\right) \nonumber \\&\quad \le \Pr \left\{ \max _{1\le l\le N}\max _{1\le i \le n}\sum _{s=1}^{S_n}{w^{(l)}_s}\left( 1+\Vert \textbf{x}_{{(s)},i}\Vert \Vert {\varvec{\beta }}^*_{(s)}\Vert \right) \right. \nonumber \\&\quad \left. \ge {p_{\max }}n^{0.1}\right\} \nonumber \\&\quad \le \Pr \left\{ \left( 1+\max _{1\le i \le n}\max _{1\le s\le S_n}\Vert \textbf{x}_{{(s)},i}\Vert \max _{1\le s\le S_n}\Vert {\varvec{\beta }}^*_{(s)}\Vert \right) \right. \nonumber \\&\quad \left. \ge {p_{\max }}n^{0.1}\right\} \nonumber \\&\quad = o(1), \end{aligned}$$
(B40)

where the last equality holds because of Conditions 2 and 3. Similarly,

$$\begin{aligned} \Xi _3&=\Pr \left[ \max _{1\le l\le N}\mathbb {E}\left\{ (1-y\textbf{x}^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)}))_+{\textbf{1}}\right. \right. \nonumber \\&\quad \left. \left. \left( |1-y\textbf{x}^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})|\ge {p_{\max }}n^{0.1}\right) \right\} > \epsilon \right] \nonumber \\&\le \Pr \left( \max _{1\le l\le N}\mathbb {E}|1-y\textbf{x}^{\top }{\varvec{\beta }}^*({\textbf{w}}^{(l)})|\ge {p_{\max }}n^{0.1}\right) \nonumber \\&\le \Pr \left\{ \left( 1+\max _{1\le s\le S_n}\mathbb {E}\Vert \textbf{x}_{(s)}\Vert \max _{1\le s\le S_n}\Vert {\varvec{\beta }}^*_{(s)}\Vert \right) \ge {p_{\max }}n^{0.1}\right\} \nonumber \\&=o(1). \end{aligned}$$
(B41)

Combining (B37) and (B39)–(B41), we obtain \(\max _{1\le l\le N}|\Omega _2({\textbf{w}}^{(l)})|=o_p(1)\). Then, by (B36), we have

$$\begin{aligned} \sup _{\textbf{w}\in {{\mathcal {W}}}}|\Omega _2(\textbf{w})|=o_p(1). \end{aligned}$$
(B42)

Finally, note that \((y,\textbf{x})\) and \(({\tilde{y}}, {\tilde{\textbf{x}}})\) are independently and identically distributed. By Lemma 1, we have

$$\begin{aligned}&\sup _{\textbf{w}\in {{\mathcal {W}}}}|\Omega _3(\textbf{w})|=\sup _{\textbf{w}\in {{\mathcal {W}}}}\bigg |\mathbb {E}\left( 1-y\textbf{x}^{\top }{\varvec{\beta }}^*(\textbf{w})\right) _+\nonumber \\&\qquad -\mathbb {E}\left\{ (1-{\tilde{y}}{\tilde{\textbf{x}}}^{\top }{\hat{{\varvec{\beta }}}}(\textbf{w}))_+ \mid {\mathcal {D}}_n\right\} \bigg |\nonumber \\&\quad =\sup _{\textbf{w}\in {{\mathcal {W}}}}\bigg |\mathbb {E}\left( 1-{\tilde{y}}{\tilde{\textbf{x}}}^{\top }{\varvec{\beta }}^*(\textbf{w})\right) _+\nonumber \\&\qquad -\mathbb {E}\left\{ (1-{\tilde{y}}{\tilde{\textbf{x}}}^{\top }{\hat{{\varvec{\beta }}}}(\textbf{w}))_+ \mid {\mathcal {D}}_n\right\} \bigg |\nonumber \\&\quad = \sup _{\textbf{w}\in {{\mathcal {W}}}}\bigg |\mathbb {E}\left\{ \int _{{\tilde{y}}{\tilde{\textbf{x}}}^{\top }{\varvec{\beta }}^*(\textbf{w})}^{{\tilde{y}}{\tilde{\textbf{x}}}^{\top }{\hat{{\varvec{\beta }}}}(\textbf{w})}I(t\le 1)\textrm{d}t\,\Big \vert \,{\mathcal {D}}_n\right\} \bigg |\nonumber \\&\quad \le \sup _{\textbf{w}\in {{\mathcal {W}}}}\mathbb {E}\left\{ \bigg |{\tilde{y}}{\tilde{\textbf{x}}}^{\top }\left( {\hat{{\varvec{\beta }}}}(\textbf{w})-{\varvec{\beta }}^*(\textbf{w})\right) \bigg |\,\Big \vert \,{\mathcal {D}}_n\right\} \nonumber \\&\quad \le \sup _{\textbf{w}\in {{\mathcal {W}}}}\sum _{s=1}^{S_n}w_s\mathbb {E}\left\{ \bigg |{\tilde{\textbf{x}}}_{(s)}^{\top }\left( {\hat{{\varvec{\beta }}}}_{(s)}-{\varvec{\beta }}^*_{(s)}\right) \bigg |\,\Big \vert \,{\mathcal {D}}_n\right\} \nonumber \\&\quad \le \max _{1\le s\le S_n}\left\| {\hat{{\varvec{\beta }}}}_{(s)}-{\varvec{\beta }}^*_{(s)}\right\| \max _{1\le s\le S_n}\mathbb {E}\Vert {\tilde{\textbf{x}}}_{(s)}\Vert \nonumber \\&\quad =O_p\left( \frac{ {p_{\max }}\sqrt{\log ({p_{\max }})}}{\sqrt{n}}\right) \nonumber \\&\quad =o_p(1), \end{aligned}$$
(B43)

where the last equality holds due to Condition 6. Putting (B32), (B33), (B42) and (B43) together, we complete the proof of (B30).

About this article

Cite this article

Zou, J., Yuan, C., Zhang, X. et al. Model averaging for support vector classifier by cross-validation. Stat Comput 33, 117 (2023). https://doi.org/10.1007/s11222-023-10284-6


Keywords