License: arXiv.org perpetual non-exclusive license
arXiv:2207.10199v2 [cs.LG] 15 Jan 2024

Provably tuning the ElasticNet across instances

Maria-Florina Balcan  Mikhail Khodak  Dravyansh Sharma  Ameet Talwalkar
Carnegie Mellon University
Correspondence: ninamf@cs.cmu.edu, khodak@cmu.edu, dravyans@cs.cmu.edu, talwalkar@cmu.edu
Abstract

An important unresolved challenge in the theory of regularization is to set the regularization coefficients of popular techniques like the ElasticNet with general provable guarantees. We consider the problem of tuning the regularization parameters of Ridge regression, LASSO, and the ElasticNet across multiple problem instances, a setting that encompasses both cross-validation and multi-task hyperparameter optimization. We obtain a novel structural result for the ElasticNet which characterizes the loss as a function of the tuning parameters as a piecewise-rational function with algebraic boundaries. We use this to bound the structural complexity of the regularized loss functions and show generalization guarantees for tuning the ElasticNet regression coefficients in the statistical setting. We also consider the more challenging online learning setting, where we show vanishing average expected regret relative to the optimal parameter pair. We further extend our results to tuning classification algorithms obtained by thresholding regression fits regularized by Ridge, LASSO, or ElasticNet. Our results are the first general learning-theoretic guarantees for this important class of problems that avoid strong assumptions on the data distribution. Furthermore, our guarantees hold for both validation and popular information criterion objectives.

1 Introduction

Ridge regression [HK70, TA77], LASSO [Tib96], and their generalization the ElasticNet [HTF09] are among the most popular algorithms in machine learning and statistics, with applications to linear classification, regression, data analysis, and feature selection [Cha92, ZY07, HTF09, DW18, FDSC+19]. Given a supervised dataset $(X,y)\in\mathbb{R}^{m\times p}\times\mathbb{R}^{m}$ with $m$ datapoints and $p$ features, these algorithms compute the linear predictor

$$\hat{\beta}_{\lambda_{1},\lambda_{2}}^{(X,y)}=\operatorname*{argmin}_{\beta\in\mathbb{R}^{p}}\|y-X\beta\|_{2}^{2}+\lambda_{1}\|\beta\|_{1}+\lambda_{2}\|\beta\|_{2}^{2}. \tag{1}$$

Here $\lambda_{1},\lambda_{2}\geq 0$ are regularization coefficients constraining the $\ell_{1}$ and $\ell_{2}$ norms, respectively, of the model $\beta$. For general $\lambda_{1}$ and $\lambda_{2}$ the above algorithm is the ElasticNet, while setting $\lambda_{1}=0$ recovers Ridge and setting $\lambda_{2}=0$ recovers LASSO.
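As a concrete illustration of equation (1), the following is a minimal sketch of a cyclic coordinate-descent solver for exactly this objective (no intercept, no $1/m$ scaling of the squared error). Coordinate descent is a standard technique swapped in here for illustration only; it is not a method prescribed by this paper, and the names `soft_threshold` and `elastic_net_cd` are hypothetical.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def elastic_net_cd(X, y, lam1, lam2, n_sweeps=500, tol=1e-12):
    """Minimise ||y - X b||_2^2 + lam1 * ||b||_1 + lam2 * ||b||_2^2
    by cyclic coordinate descent. X is a list of m rows of p floats."""
    m, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta; equals y while beta is zero
    col_sq = [sum(X[i][j] ** 2 for i in range(m)) for j in range(p)]
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            denom = col_sq[j] + lam2
            if denom == 0.0:
                continue  # zero column and no ridge term: keep beta_j = 0
            # Inner product of column j with the residual that excludes beta_j.
            rho = sum(X[i][j] * resid[i] for i in range(m)) + col_sq[j] * beta[j]
            # Exact one-dimensional minimiser of the objective in coordinate j.
            new_bj = soft_threshold(rho, lam1 / 2.0) / denom
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(m):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta


# Orthonormal design: each coordinate is S(y_j, lam1 / 2) / (1 + lam2).
print(elastic_net_cd([[1.0, 0.0], [0.0, 1.0]], [3.0, -1.0], 2.0, 1.0))  # [1.0, 0.0]
```

With `lam1 = 0` the iterates converge to the Ridge solution $(X^{\top}X+\lambda_{2}I)^{-1}X^{\top}y$, and with `lam2 = 0` the update reduces to the usual LASSO coordinate step, mirroring the two special cases described above.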

These coefficients play a crucial role across fields: in machine learning controlling the norm of $\beta$ implies provable generalization guarantees and prevents over-fitting in practice [MRT12], in data analysis their combined use yields parsimonious and interpretable models [HTF09], and in Bayesian statistics they correspond to imposing specific priors on $\beta$ [Mur12, LL10]. In practice, $\lambda_{2}$ regularizes $\beta$ by uniformly shrinking all coefficients, while $\lambda_{1}$ encourages the model vector to be sparse. This means that while they do yield learning-theoretic and statistical benefits, setting them too high will cause models to under-fit the data. The question of how to set the regularization coefficients becomes even less clear in the case of the ElasticNet, as one must juggle trade-offs between sparsity, feature correlation, and bias when setting both $\lambda_{1}$ and $\lambda_{2}$ simultaneously. As a result, there has been intense empirical and theoretical effort devoted to automatically tuning these parameters. Yet the state of the art is quite unsatisfactory: proposed work consists of either heuristics without formal guarantees [Gib81, KKM15], approaches that optimize over a finite grid or random set instead of the full continuous domain [CLW16], or analyses that involve very strong theoretical assumptions [Zha09].

In this work, we study a variant on the above well-established and intensely studied formulation. The key distinction is that instead of a single dataset $(X,y)$, we consider a collection of datasets or instances of the same underlying regression problem $(X^{(i)},y^{(i)})$ and would like to learn a pair $(\lambda_{1},\lambda_{2})$ that selects a model in equation (1) that has low loss on a validation dataset. This can be useful to model practical settings, for example where new supervised data is obtained several times or where the set of features may change frequently [DP14]. We do not require all examples across datasets to be i.i.d. draws from the same data distribution, and can capture more general data generation scenarios like cross-validation and multi-task learning [ZY21]. Despite these advantages, we remark that our problem formulation is quite different from the standard single dataset setting, where all examples in the dataset are typically assumed to be drawn independently from the same distribution. Our formulation treats the selection of regularization coefficients as data-driven algorithm design, which is often used to study combinatorial problems [GR17, Bal20].

Our main contribution is a new structural result for the ElasticNet Regression problem, which implies generalization guarantees for selecting ElasticNet Regression coefficients in the multiple-instance setting. In particular, Ridge and LASSO regressions are special cases. We extend our results to obtain low regret in the online learning setting, and to tuning related linear classification algorithms. In summary, we make the following key contributions:

  • We formulate the problem of tuning the ElasticNet as a question of learning $\lambda_{1}$ and $\lambda_{2}$ simultaneously across multiple problem instances, either generated statistically or coming online. Our formulation captures relevant settings like cross-validation and multi-task learning.

  • We provide a novel structural result (Theorem 2.2) that characterizes the loss of the ElasticNet fit. We show that the hyperparameter space can be partitioned by polynomial curves of bounded degrees into pieces where the loss is a bivariate rational function. The result holds for both the usual ElasticNet validation objective and when it is augmented with information criteria like the AIC or BIC.

  • An important consequence of our structural result is a bound on the pseudo-dimension (Definition 5) for the loss function class, which yields strong generalization bounds for tuning $\lambda_{1}$ and $\lambda_{2}$ simultaneously in the statistical learning setting (Theorem 3.2). Informally, for ElasticNet regression problems with at most $p$ parameters, for any problem distribution $\mathcal{D}$, we show that $O\left(\frac{1}{\epsilon^{2}}\left(p^{2}\log\frac{1}{\epsilon}+\log\frac{1}{\delta}\right)\right)$ problems (datasets) are sufficient to learn an $\epsilon$-approximation to the best $(\lambda_{1},\lambda_{2})$, with probability at least $1-\delta$.

  • In the online setting, we show under very mild data assumptions (much weaker than prior work) that the problem satisfies a dispersion condition [BDV18, BDS20]. As a result we can tune all parameters across a sequence of instances appearing online and obtain vanishing regret relative to the optimal parameter in hindsight over the sequence (Theorem 3.3) at the rate $\tilde{O}(1/\sqrt{T})$ with respect to the length $T$ of the sequence. (The soft-O notation is used to emphasize dependence on $T$, and suppresses other factors as well as logarithmic terms.)

  • We show how to extend our results to regularized classifiers that perform thresholding on Ridge, LASSO or ElasticNet regression estimates, again providing strong generalization and online learning guarantees (Theorems 4.1, 4.2).

We include a couple of remarks to emphasize the generality and significance of our results. First, in our multiple-instance formulation the different problem instances need not have the same number of examples, or even the same set of features. This allows us to handle practical scenarios where the set of features changes across datasets, and we can learn parameters that work well on average across multiple different but related regression tasks. Second, by generating problem instances i.i.d. from a fixed (training + validation) dataset, we can obtain iterations (training/validation splits) of popular cross-validation techniques (including leave-one-out and Monte Carlo CV), and our result implies that $\tilde{O}(p^{2}/\epsilon^{2})$ iterations are enough to determine an ElasticNet parameter $\hat{\lambda}$ with loss within $\epsilon$ (with high probability) of the optimal parameter $\lambda^{*}$ over the distribution induced by the cross-validation splits.
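The second remark can be made concrete with a short sketch of how Monte Carlo cross-validation turns one fixed dataset into i.i.d. problem instances. The function name `monte_carlo_cv_instances`, the `val_fraction` parameter, and the row-list data layout are illustrative assumptions, not notation from this paper.

```python
import random


def monte_carlo_cv_instances(X, y, n_instances, val_fraction=0.25, seed=0):
    """Draw i.i.d. random train/validation splits of one fixed dataset.

    Each returned tuple (X_train, y_train, X_val, y_val) plays the role of
    one problem instance in the multiple-instance formulation.
    """
    rng = random.Random(seed)
    m = len(y)
    n_val = max(1, int(round(val_fraction * m)))
    instances = []
    for _ in range(n_instances):
        idx = list(range(m))
        rng.shuffle(idx)  # a fresh uniformly random permutation per instance
        val_idx, train_idx = idx[:n_val], idx[n_val:]
        instances.append((
            [X[i] for i in train_idx], [y[i] for i in train_idx],
            [X[i] for i in val_idx], [y[i] for i in val_idx],
        ))
    return instances


# Eight examples with one feature each; five random 6/2 splits.
X = [[float(i)] for i in range(8)]
y = [float(i) for i in range(8)]
instances = monte_carlo_cv_instances(X, y, n_instances=5)
```

Because every split is drawn independently from the same splitting rule, the resulting instances are i.i.d. draws from the distribution induced by the cross-validation procedure, which is the distribution the guarantee above refers to.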

Key challenges and insights. A major challenge in learning the ElasticNet parameters is that the variation of the solution path as a function of $\lambda_{2}$ is hard to characterize. Indeed the original ElasticNet paper [ZH05] suggests using the heuristic of grid search to learn a good $\lambda_{2}$, even though $\lambda_{1}$ may be exactly optimized by computing full solution paths (for each $\lambda_{2}$). We approach this indirectly by utilizing a characterization of the LASSO solution by [Tib13], which is based on the KKT (Karush–Kuhn–Tucker) optimality conditions, to arrive at a precise piecewise structure for the problem. In more detail, we use these conditions to come up with a set of algebraic curves (polynomial equations in $\lambda_{1}$ and $\lambda_{2}$) of bounded degrees, such that the set of possible discontinuities lies along these curves, and the loss function behaves well (as a bounded-degree rational function) in each piece of the partition of the parameter domain induced by these curves. This characterization is crucial in establishing a bound on the structural complexity needed to provide strong generalization guarantees. We further show additional structure on these algebraic curves that (roughly speaking) implies that the curves do not concentrate in any region of the domain, allowing us to use the powerful recipe of [BDP20] for online learning.
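To see why rational pieces arise, it can help to write out the stationarity condition of equation (1) on a region where the support and signs of the solution do not change. The derivation below is a sketch under stated assumptions: the symbols $\mathcal{E}$ (set of nonzero coordinates) and $s\in\{-1,+1\}^{|\mathcal{E}|}$ (their signs) are illustrative notation, $X_{\mathcal{E}}$ denotes the columns of $X$ indexed by $\mathcal{E}$, and the matrix being inverted is assumed nonsingular (which holds, for instance, whenever $\lambda_{2}>0$).

```latex
% Sketch: \mathcal{E} and s are illustrative notation, assumed fixed on the region.
\begin{align*}
0 &= -2X_{\mathcal{E}}^{\top}\bigl(y - X_{\mathcal{E}}\hat{\beta}_{\mathcal{E}}\bigr)
     + \lambda_{1}s + 2\lambda_{2}\hat{\beta}_{\mathcal{E}}, \\
\hat{\beta}_{\mathcal{E}} &= \bigl(X_{\mathcal{E}}^{\top}X_{\mathcal{E}} + \lambda_{2}I\bigr)^{-1}
     \Bigl(X_{\mathcal{E}}^{\top}y - \tfrac{\lambda_{1}}{2}s\Bigr),
     \qquad \hat{\beta}_{j} = 0 \ \text{for } j\notin\mathcal{E}.
\end{align*}
```

By Cramer's rule, each entry of $\hat{\beta}_{\mathcal{E}}$ is a polynomial in $(\lambda_{1},\lambda_{2})$ divided by $\det(X_{\mathcal{E}}^{\top}X_{\mathcal{E}}+\lambda_{2}I)$, so any squared loss of the resulting predictor is a rational function of $(\lambda_{1},\lambda_{2})$ for as long as $\mathcal{E}$ and $s$ stay fixed.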

1.1 Related work

Model selection for Ridge regression, LASSO or ElasticNet is typically done by selecting the regularization parameter $\lambda$ that works well for given data, although some parameter-free techniques for variable selection have been recently proposed [LM15]. Choosing ‘optimal’ parameters for tuning the regularization has been a subject of extensive theoretical and applied research. Much of this effort is heuristic [Gib81, KKM15] or focused on developing tuning objectives beyond validation accuracy like AIC or BIC [Aka74, Sch78] without providing procedures for provably optimizing them. The standard approach given a tuning objective is to optimize it over a grid or random set of parameters, for which there are guarantees [CLW16], but this does not ensure optimality over the entire continuous tuning domain, especially since objectives such as 0-1 validation error or information criteria can have many discontinuities. Selecting a grid that is too fine or too coarse can result in either very inefficient or highly inaccurate estimates (respectively) for good parameters. Other guarantees make strong assumptions on the data distribution such as sub-Gaussian noise [Zha09, CLC21] or depend on unknown parameters that are hard to quantify in practice [FL10]. Recent work has shown asymptotic consistency of cross-validation for ridge regression, even in the limiting case $\lambda_{2}\rightarrow 0$ which is particularly interesting for the overparameterized regime [HMRT22, PWRT21]. A successful line of work has focused on efficiently obtaining models for different values of $\lambda_{1}$ using regularization paths [EHJT04], but the guarantees are computational rather than learning-theoretic or statistical.
In contrast, we provide principled approaches that guarantee near-optimality of selected parameters with high confidence over the entire continuous domain of parameters.

Data-driven algorithm design has proved successful for tuning parameters for a variety of combinatorial problems like clustering, integer programming, auction design and graph-based learning [BDL19, BPSV21, BSV16, BS21]. We provide an application of these techniques to parameter tuning in a problem that is not inherently combinatorial by revealing a novel discrete structure. We identify the underlying piecewise structure of the ElasticNet loss function which is extremely effective in establishing learning-theoretic guarantees [BDD+21]. To exploit this piecewise structure, we analyze the learning-theoretic complexity of rational algebraic function classes and infer generalization guarantees. Follow-up work [BNS23] improves on our generalization guarantees and extends the results to regularized logistic regression. We also employ and extend general tools and techniques for online data-driven learning from [BDP20, BS21] to rational functions in order to prove our online learning guarantees for regularization coefficient tuning.

2 Preliminaries and a Key Structural Result

Given data $(X,y)$ with $X\in\mathbb{R}^{m\times p}$ and $y\in\mathbb{R}^{m}$, consisting of $m$ labeled examples with $p$ features, we seek estimators $\beta\in\mathbb{R}^{p}$ which minimize the regularized loss. Popular regularization methods like LASSO and ElasticNet can be expressed as computing the solution of an optimization problem given by

$$\hat{\beta}_{\lambda,f}^{(X,y)}\in\operatorname*{argmin}_{\beta\in\mathbb{R}^{p}}\left\lVert y-X\beta\right\rVert_{2}^{2}+\langle\lambda,f(\beta)\rangle,$$

where $f:\mathbb{R}^{p}\rightarrow\mathbb{R}_{\geq 0}^{d}$ gives the regularization penalty for estimator $\beta$, $\lambda\in\mathbb{R}_{\geq 0}^{d}$ is the regularization parameter, and $d$ is the number of regularization parameters. $d=1$ for Ridge and LASSO, and $d=2$ for the ElasticNet. Setting $f=f_{2}$ with $f_{2}(\beta)=\left\lVert\beta\right\rVert_{2}^{2}$ yields Ridge regression, and setting $f(\beta)=f_{1}(\beta):=\left\lVert\beta\right\rVert_{1}$ corresponds to LASSO. Also using $f_{\text{EN}}(\beta):=(f_{1}(\beta),f_{2}(\beta))$ gives the ElasticNet with regularization parameter $\lambda=(\lambda_{1},\lambda_{2})$.
Note that we use the same $\lambda$ (with some notational overloading) to denote the regularization parameters for ridge, LASSO, or ElasticNet. We write $\hat{\beta}_{\lambda,f}^{(X,y)}$ as simply $\hat{\beta}_{\lambda,f}$ when the dataset $(X,y)$ is clear from context. On any instance $x\in\mathbb{R}^{p}$ from the feature space, the prediction of the regularized estimator is given by the dot product $\langle x,\hat{\beta}_{\lambda,f}\rangle$. The average squared loss over a dataset $(X',y')$ with $X'\in\mathbb{R}^{m'\times p}$ and $y'\in\mathbb{R}^{m'}$ is given by

$$l_{r}(\hat{\beta}_{\lambda,f},(X',y'))=\frac{1}{m'}\left\lVert y'-X'\hat{\beta}_{\lambda,f}\right\rVert_{2}^{2}.$$

By setting $(X',y')$ to be the training data $(X,y)$, we get the training loss $l_{r}(\hat{\beta}_{\lambda,f},(X,y))$. We use $(X_{\text{val}},y_{\text{val}})$ to denote a validation split.
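The fit-then-validate pipeline defined above can be sketched in a few lines. To keep the example exact and short it is restricted to a single feature ($p=1$), where the ElasticNet minimizer has a closed form; this restriction, the tuple layout of an instance, and the function names are assumptions made for illustration.

```python
def fit_elastic_net_1d(x, y, lam1, lam2):
    """Exact minimiser of sum_i (y_i - b*x_i)^2 + lam1*|b| + lam2*b^2 (p = 1)."""
    rho = sum(xi * yi for xi, yi in zip(x, y))
    denom = sum(xi * xi for xi in x) + lam2
    if denom == 0.0:
        return 0.0
    shrunk = max(abs(rho) - lam1 / 2.0, 0.0)  # soft-thresholding of x^T y
    sign = 1.0 if rho >= 0 else -1.0
    return sign * shrunk / denom


def validation_loss(b, x_val, y_val):
    """Average squared loss l_r of the one-dimensional predictor b."""
    return sum((yi - b * xi) ** 2 for xi, yi in zip(x_val, y_val)) / len(y_val)


def instance_loss(lam, instance):
    """Fit on the training part of an instance, score on its validation part."""
    x, y, x_val, y_val = instance
    b = fit_elastic_net_1d(x, y, lam[0], lam[1])
    return validation_loss(b, x_val, y_val)


def average_loss(lam, instances):
    """Mean validation loss of one (lam1, lam2) pair across several instances."""
    return sum(instance_loss(lam, P) for P in instances) / len(instances)


# Two instances with different numbers of training and validation examples.
P1 = ([1.0, 2.0], [2.0, 4.0], [3.0], [6.0])
P2 = ([1.0], [1.0], [1.0, 2.0], [1.0, 2.0])
print(average_loss((0.0, 0.0), [P1, P2]))  # 0.0: unregularized fit is exact here
```

The average over instances is the empirical quantity that a learned pair $(\lambda_{1},\lambda_{2})$ is meant to make small; note that the instances are allowed to differ in size.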

Distributional and Online Settings. In the distributional or statistical setting, we receive a collection of $n$ instances of the regression problem

$$P^{(i)}=(X^{(i)},y^{(i)},X_{\text{val}}^{(i)},y_{\text{val}}^{(i)})\in\mathcal{R}_{m_{i},p_{i},m_{i}'}:=\mathbb{R}^{m_{i}\times p_{i}}\times\mathbb{R}^{m_{i}}\times\mathbb{R}^{m'_{i}\times p_{i}}\times\mathbb{R}^{m'_{i}},$$

for $i\in[n]$ generated i.i.d. from some problem distribution $\mathcal{D}$. The problems are in the problem space given by $\Pi_{m,p}=\bigcup_{m_{1}\geq 0,\,m_{2}\leq m,\,p_{1}\leq p}\mathcal{R}_{m_{1},p_{1},m_{2}}$ (note that the problem distribution $\mathcal{D}$ is over $\Pi_{m,p}$).
On any given instance $P^{(i)}$ the loss is given by the squared loss on the validation set, $\ell_{\text{EN}}(\lambda,P^{(i)})=l_{r}(\hat{\beta}_{\lambda,f_{\text{EN}}}^{(X^{(i)},y^{(i)})},(X_{\text{val}}^{(i)},y_{\text{val}}^{(i)}))$.
On the other hand, in the online setting, we receive a sequence of $T$ instances of the ElasticNet regression problem $P^{(i)}=(X^{(i)},y^{(i)},X_{\text{val}}^{(i)},y_{\text{val}}^{(i)})\in\Pi_{m,p}$ for $i\in[T]$ online. On any given instance $P^{(i)}$, the online learner is required to select the regularization parameter $\lambda^{(i)}$ without observing $y_{\text{val}}^{(i)}$, and experiences loss given by $\ell(\lambda^{(i)},P^{(i)})=l_{c}(\hat{\beta}_{\lambda^{(i)},f_{\text{EN}}}^{(X^{(i)},y^{(i)})},(X_{\text{val}}^{(i)},y_{\text{val}}^{(i)}))$. The goal is to minimize the regret w.r.t. choosing the best fixed parameter in hindsight for the same problem sequence, i.e.

$$R_{T}=\sum_{i=1}^{T}\ell(\lambda^{(i)},P^{(i)})-\min_{\lambda}\sum_{i=1}^{T}\ell(\lambda,P^{(i)}).$$

We also define average regret as $\frac{1}{T}R_{T}$ and expected regret as $\mathbb{E}[R_{T}]$, where the expectation is over both the randomness of the loss functions and any random coins used by the online algorithm.
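As a small illustration of the regret definition above, the following sketch computes $R_T$ and the average regret from a table of losses. It assumes, purely for illustration, that the comparator's minimum is taken over a finite grid of candidate parameters rather than over the continuous parameter domain; the function name `regret` and the toy numbers are hypothetical.

```python
from typing import Sequence


def regret(loss_table: Sequence[Sequence[float]], chosen: Sequence[int]) -> float:
    """Regret against the best fixed grid parameter in hindsight.

    loss_table[i][k] is the loss of the k-th candidate parameter on instance i
    (the finite grid is an assumption of this sketch); chosen[i] is the index
    the learner picked on instance i.
    """
    num_candidates = len(loss_table[0])
    # Cumulative loss actually incurred by the learner.
    incurred = sum(row[k] for row, k in zip(loss_table, chosen))
    # Cumulative loss of each fixed candidate, then the best one in hindsight.
    best_fixed = min(
        sum(row[k] for row in loss_table) for k in range(num_candidates)
    )
    return incurred - best_fixed


# Toy example: T = 3 instances, 2 candidate parameters.
losses = [[1.0, 3.0], [2.0, 1.0], [0.5, 2.5]]
picks = [1, 1, 0]
R_T = regret(losses, picks)
average_regret = R_T / len(losses)
```

Here the learner incurs total loss 4.5 while the best fixed candidate incurs 3.5, so the regret is 1.0 and the average regret is 1/3.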

Given a class of regularization algorithms $\mathcal{A}$ parameterized by regularization parameter $\lambda$ over a set of problem instances $\mathcal{X}$, and given loss function $\ell:\mathcal{A}\times\mathcal{X}\rightarrow\mathbb{R}$ which measures the loss of any algorithm in $\mathcal{A}$ on any fixed problem instance, consider the set of functions $\mathcal{H}_{\mathcal{A}}=\{\ell(A,\cdot)\mid A\in\mathcal{A}\}$. For example, for the ElasticNet we have $\ell_{\text{EN}}(\lambda,P)=l_{r}(\hat{\beta}^{(X_{P},y_{P})}_{\lambda,f_{\text{EN}}},(X^{\prime}_{P},y^{\prime}_{P}))$, where $(X_{P},y_{P})$ and $(X^{\prime}_{P},y^{\prime}_{P})$ are the training and validation sets associated with problem $P\in\mathcal{X}$ respectively. Bounding the pseudo-dimension of $\mathcal{H}_{\mathcal{A}}$ gives a bound on the sample complexity for uniform convergence guarantees, i.e. a bound on the sample size $n$ for which the algorithm $\hat{A}_{S}\in\mathcal{A}$ which minimizes the average loss on any sample $S$ of size $n$ drawn i.i.d. from any problem distribution $\mathcal{D}$ is guaranteed to be near-optimal with high probability [Dud67]. See Appendix A for the relevant classic definitions and results. Define the dual class $\mathcal{H}^{*}$ of a set of real-valued functions $\mathcal{H}\subseteq 2^{\mathcal{X}}$ as $\mathcal{H}^{*}=\{h^{*}_{x}:\mathcal{H}\rightarrow\mathbb{R}\mid x\in\mathcal{X}\}$ where $h^{*}_{x}(h)=h(x)$. In the context of regression problems $\mathcal{X}$, for each fixed problem instance $x\in\mathcal{X}$ there is a dual function $h^{*}_{x}$ that computes the loss $\ell(A,x)$ for any (primal) function $h_{A}=\ell(A,\cdot)\in\mathcal{H}_{\mathcal{A}}$. For a function class $\mathcal{H}$, showing that dual class $\mathcal{H}^{*}$ is piecewise-structured in the sense of Definition 1 and bounding the complexity of the duals of boundary and piece functions of $\mathcal{H}^{*}$ are useful to understand the learnability of $\mathcal{H}$ [BDD+21].
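The primal/dual distinction can be made concrete with a toy sketch. Here one-dimensional ridge regression stands in for the regularized algorithm, and each primal function is identified with its parameter $\lambda$; the `Instance` layout and all function names are hypothetical and chosen only for illustration.

```python
from typing import Callable, Tuple

# A toy instance: one training point and one validation point (hypothetical).
Instance = Tuple[float, float, float, float]  # (x_train, y_train, x_val, y_val)


def loss(lam: float, instance: Instance) -> float:
    """Validation squared loss of 1-D ridge regression (a stand-in for ell)."""
    x_tr, y_tr, x_val, y_val = instance
    # Minimizer of (y_tr - x_tr * b)^2 + lam * b^2 over b.
    beta = x_tr * y_tr / (x_tr**2 + lam)
    return (y_val - x_val * beta) ** 2


def primal(lam: float) -> Callable[[Instance], float]:
    """h_A = ell(A, .): fix the algorithm (its parameter), vary the instance."""
    return lambda instance: loss(lam, instance)


def dual(instance: Instance) -> Callable[[float], float]:
    """h*_x = ell(., x): fix the instance, vary the algorithm's parameter."""
    return lambda lam: loss(lam, instance)


P = (1.0, 2.0, 1.0, 1.0)
h_lam = primal(1.0)
h_star_P = dual(P)
```

Both views evaluate the same number, `h_lam(P) == h_star_P(1.0)`; the structural results concern the dual view, i.e. the loss on a fixed instance as a function of the parameter.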

Definition 1 (Piecewise structured functions, [BDD+21]).

A function class $H\subseteq\mathbb{R}^{\mathcal{X}}$ that maps a domain $\mathcal{X}$ to $\mathbb{R}$ is $(F,G,k)$-piecewise decomposable for a class $G\subseteq\{0,1\}^{\mathcal{X}}$ of boundary functions and a class $F\subseteq\mathbb{R}^{\mathcal{X}}$ of piece functions if the following holds: for every $h\in H$, there are $k$ boundary functions $g_{1},\dots,g_{k}\in G$ and a piece function $f_{\mathbf{b}}\in F$ for each bit vector $\mathbf{b}\in\{0,1\}^{k}$ such that for all $x\in\mathcal{X}$, $h(x)=f_{\mathbf{b}_{x}}(x)$ where $\mathbf{b}_{x}=(g_{1}(x),\dots,g_{k}(x))\in\{0,1\}^{k}$.
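To make the definition concrete, the sketch below evaluates a piecewise decomposable function by first computing the bit vector $\mathbf{b}_x$ from the boundary functions and then dispatching to the matching piece function. The two linear thresholds and the four piece functions are hypothetical examples and are not taken from the paper.

```python
from typing import Callable, Dict, Sequence, Tuple

Point = Tuple[float, float]


def piecewise_eval(
    x: Point,
    boundaries: Sequence[Callable[[Point], int]],
    pieces: Dict[Tuple[int, ...], Callable[[Point], float]],
) -> float:
    """Evaluate h(x) = f_{b_x}(x) with b_x = (g_1(x), ..., g_k(x))."""
    b_x = tuple(g(x) for g in boundaries)
    return pieces[b_x](x)


# Hypothetical example with k = 2 linear-threshold boundary functions on R^2.
g1 = lambda x: int(x[0] < 1.0)
g2 = lambda x: int(x[0] + x[1] < 2.0)
pieces = {
    (0, 0): lambda x: 0.0,          # constant piece
    (0, 1): lambda x: x[1],         # linear piece
    (1, 0): lambda x: x[0],         # linear piece
    (1, 1): lambda x: x[0] * x[1],  # polynomial piece
}

value = piecewise_eval((0.5, 1.0), [g1, g2], pieces)
```

With $k$ boundary functions there are at most $2^k$ bit vectors, so the dictionary has one entry per possible sign pattern; some patterns may correspond to empty regions.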

Intuitively, a real-valued function is piecewise-structured if the domain can be divided into pieces by a finite number of boundary functions (say linear or polynomial thresholds) and the function value over each piece is easy to characterize (e.g. constant, linear, polynomial). To state and understand our structural insights into the ElasticNet problem we will also need the definition of equicorrelation sets, the subset of features with maximum absolute correlation for any fixed $\lambda_{1}$, useful for characterizing LASSO/ElasticNet solutions. For any subset $\mathcal{E}\subseteq[p]$ of the features, we define $X_{\mathcal{E}}=\begin{pmatrix}\dots X_{*i}\dots\end{pmatrix}_{i\in\mathcal{E}}$ as the $m\times|\mathcal{E}|$ matrix of columns $X_{*i}$ of $X$ corresponding to indices $i\in\mathcal{E}$. Similarly $\beta_{\mathcal{E}}\in\mathbb{R}^{|\mathcal{E}|}$ is the subset of estimators in $\beta$ corresponding to indices in $\mathcal{E}$. We will assume all the feature matrices $X$ (for training datasets) are in general position (Definition 6).

Definition 2 (Equicorrelation sets, [Tib13]).

Let $\beta^{*}\in\operatorname*{argmin}_{\beta\in\mathbb{R}^{p}}\lVert y-X\beta\rVert_{2}^{2}+\lambda_{1}\lVert\beta\rVert_{1}$. The equicorrelation set corresponding to $\beta^{*}$, $\mathcal{E}=\{j\in[p]\mid|\bm{x}_{j}^{T}(y-X\beta^{*})|=\lambda_{1}\}$, is simply the set of covariates with maximum absolute correlation. We also define the equicorrelation sign vector for $\beta^{*}$ as $s=\textup{sign}(X_{\mathcal{E}}^{T}(y-X\beta^{*}))\in\{\pm 1\}$.
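The following is a minimal sketch of how an equicorrelation set and its sign vector can be computed in one easy special case. It assumes a design matrix with orthonormal columns and the common $\frac{1}{2}$-scaled squared loss, under which the LASSO solution is given by soft-thresholding and the residual correlation on the active set equals $\lambda_1$ exactly. Both the scaling and the function name `equicorrelation_orthonormal` are assumptions of this sketch, not the paper's general procedure.

```python
import numpy as np


def equicorrelation_orthonormal(X, y, lam1, tol=1e-10):
    """Equicorrelation set and signs for LASSO when X has orthonormal columns.

    Assumes the objective (1/2)||y - X b||_2^2 + lam1 ||b||_1 (an assumption
    of this sketch), for which the solution is soft-thresholding of X^T y.
    """
    z = X.T @ y
    beta = np.sign(z) * np.maximum(np.abs(z) - lam1, 0.0)  # soft-thresholding
    corr = X.T @ (y - X @ beta)  # residual correlations x_j^T (y - X beta)
    # Features whose absolute residual correlation attains lam1.
    E = np.flatnonzero(np.abs(np.abs(corr) - lam1) <= tol)
    s = np.where(corr[E] >= 0, 1, -1)  # convention: sign(0) = +1
    return beta, E, s


X = np.eye(3)  # trivially orthonormal design
y = np.array([3.0, -0.5, -2.0])
beta, E, s = equicorrelation_orthonormal(X, y, lam1=1.0)
```

In this toy instance the first and third features attain the maximum absolute correlation $\lambda_1 = 1$ with signs $+1$ and $-1$, while the second feature has correlation $-0.5$ and stays out of the set.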

Here $\textup{sign}(x)=1$ if $x\geq 0$, and $\textup{sign}(x)=-1$ otherwise. Consider the class of algorithms consisting of ElasticNet regressors for different values of $\lambda=(\lambda_{1},\lambda_{2})\in(0,\infty)\times(0,\infty)$. We assume $\lambda_{1}>0$ for technical simplicity (cf. [Tib13]). We seek to solve problems of the form $P=(X,y,X_{\text{val}},y_{\text{val}})\in\Pi_{m,p}$, where $(X,y)$ is the training set, $(X_{\text{val}},y_{\text{val}})$ is the validation set with the same set of features, and $m,p$ are upper bounds on the number of examples and features respectively in any dataset. Let $\mathcal{H}_{\text{EN}}=\{\ell_{\text{EN}}(\lambda,\cdot)\mid\lambda\in(0,\infty)\times(0,\infty)\}$ denote the set of loss functions for the class of algorithms consisting of ElasticNet regressors for different values of $\lambda\in\mathbb{R}^{+}\times\mathbb{R}^{+}$. Additionally, we will consider information criterion based loss functions, $\ell_{\text{EN}}^{\textsf{AIC}}(\lambda,P)=\ell_{\text{EN}}(\lambda,P)+2\lVert\hat{\beta}_{\lambda,f_{\text{EN}}}^{(X,y)}\rVert_{0}$ and $\ell_{\text{EN}}^{\textsf{BIC}}(\lambda,P)=\ell_{\text{EN}}(\lambda,P)+2\lVert\hat{\beta}_{\lambda,f_{\text{EN}}}^{(X,y)}\rVert_{0}\log m$ [Aka74, Sch78]. Let $\mathcal{H}_{\text{EN}}^{\textsf{AIC}}$ and $\mathcal{H}_{\text{EN}}^{\textsf{BIC}}$ denote the corresponding sets of loss functions. These criteria are typically computed with the squared loss on the training set, as alternatives to cross-validation. We do not make any assumption on the relation between training and validation sets in our formulation, so our analysis can capture these settings as well.
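The two information-criterion objectives above simply add a sparsity penalty to the base loss. A minimal sketch follows, mirroring the formulas as stated in the text; the function names and the zero tolerance `tol` are hypothetical choices.

```python
import math
from typing import Sequence


def l0_norm(beta: Sequence[float], tol: float = 0.0) -> int:
    """Number of coefficients whose magnitude exceeds tol."""
    return sum(1 for b in beta if abs(b) > tol)


def aic_loss(base_loss: float, beta: Sequence[float]) -> float:
    """Base loss plus 2 * ||beta||_0."""
    return base_loss + 2 * l0_norm(beta)


def bic_loss(base_loss: float, beta: Sequence[float], m: int) -> float:
    """Base loss plus 2 * ||beta||_0 * log(m), with m the number of examples."""
    return base_loss + 2 * l0_norm(beta) * math.log(m)


beta_hat = [1.5, 0.0, -0.2]  # two nonzero coefficients
aic = aic_loss(0.3, beta_hat)
bic = bic_loss(0.3, beta_hat, m=100)
```

Because the penalty depends on $\lambda$ only through the number of nonzero coefficients, it is constant wherever the support of the fitted coefficients is fixed.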

2.1 Piecewise structure of the ElasticNet loss

We will now establish a piecewise structure of the dual class loss functions (Definition 1). A key observation is that if the signed equicorrelation set $(\mathcal{E},s)$ (i.e. a subset of features $\mathcal{E}\subseteq[p]$ with the same maximum absolute correlation, assigned a fixed sign pattern $\{-1,+1\}^{|\mathcal{E}|}$, see Definition 2) is fixed, then the ElasticNet coefficients may be characterized (Lemma C.1) and the loss is a fixed rational polynomial piece function of the parameters $\lambda_{1},\lambda_{2}$. We then show the existence of a set of boundary function curves $\mathcal{G}$, such that any region of the parameter space located on a fixed side of all the curves in $\mathcal{G}$ (more formally, for a fixed sign pattern in Definition 1) has the same signed equicorrelation set. The boundary functions are a collection of possible curves at which a covariate may enter or leave the set $\mathcal{E}$ and correspond to polynomial thresholds. We make repeated use of the following lemma, which provides useful properties of the piece functions as well as the boundary functions of the dual class loss functions.


Figure 1: An illustration of the piecewise structure of the ElasticNet loss, as a function of the regularization parameters, for a fixed problem instance. Pieces are regions where some bounded degree polynomials ($r_{1},r_{2}$) have a fixed sign pattern (one of $\pm 1,\pm 1$), and in each piece the loss is a fixed (rational) function.

Lemma 2.1.

Let $A$ be an $r\times s$ matrix. Consider the matrix $B(\lambda)=(A^{T}A+\lambda I_{s})^{-1}$ for $\lambda>0$.

  1. Each entry of $B(\lambda)$ is a rational polynomial $P_{ij}(\lambda)/Q(\lambda)$ for $i,j\in[s]$ with each $P_{ij}$ of degree at most $s-1$, and $Q$ of degree $s$.

  2. Further, for $i=j$, $P_{ij}$ has degree $s-1$ and leading coefficient 1, and for $i\neq j$, $P_{ij}$ has degree at most $s-2$. Also, $Q(\lambda)$ has leading coefficient $1$.
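The claims of the lemma can be checked numerically on a small random matrix. The sketch below is a sanity check rather than a proof: it takes $Q(\lambda)=\det(A^{T}A+\lambda I_s)$, fits polynomials of degree $s-1$ to the entries of $Q(\lambda)B(\lambda)$, and inspects their leading coefficients. The random seed, the matrix sizes, and the sample points are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
r, s = 4, 3
A = rng.standard_normal((r, s))
M = A.T @ A
identity = np.eye(s)

# Q(lam) = det(M + lam I) = prod_k (lam + mu_k), with mu_k the eigenvalues of M.
mu = np.linalg.eigvalsh(M)
Q_coeffs = np.poly(-mu)  # coefficients, highest degree first


def scaled_inverse(lam):
    """Return Q(lam) * B(lam), whose entries should be the polynomials P_ij."""
    return np.polyval(Q_coeffs, lam) * np.linalg.inv(M + lam * identity)


# Least-squares fit of degree s-1 using more points than coefficients, so a
# good fit is evidence that the entries really are polynomials of that degree.
lams = np.linspace(0.5, 3.0, 2 * s)
samples = np.array([scaled_inverse(lam).ravel() for lam in lams])
P_coeffs = np.polyfit(lams, samples, s - 1)  # shape (s, s*s)
leading = P_coeffs[0].reshape(s, s)  # coefficient of lam^(s-1) per entry

# Compare the fitted polynomials with Q * B at a point not used in the fit.
lam_test = 5.0
fitted = np.array(
    [np.polyval(P_coeffs[:, k], lam_test) for k in range(s * s)]
).reshape(s, s)
max_gap = np.max(np.abs(fitted - scaled_inverse(lam_test)))
```

Up to rounding error, `leading` should be the identity matrix (diagonal $P_{ii}$ monic of degree $s-1$, off-diagonal $P_{ij}$ of degree at most $s-2$), `Q_coeffs` should have $s+1$ entries with leading coefficient 1, and `max_gap` should be negligible.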

The proof is straightforward and deferred to Appendix C. We will now formally state and prove our key structural result which is needed to establish our generalization and online regret guarantees in Section 3.

Theorem 2.2.

Let $\mathcal{L}$ be a set of functions $\{l_{\lambda}:\Pi_{m,p}\rightarrow\mathbb{R}_{\geq 0}\mid\lambda\in\mathbb{R}^{+}\times\mathbb{R}^{+}\}$ that map a regression problem instance $P\in\Pi_{m,p}$ to the validation loss $\ell_{\text{EN}}(\lambda,P)$ of ElasticNet trained with regularization parameter $\lambda=(\lambda_{1},\lambda_{2})$. The dual class $\mathcal{L}^{*}$ is $(\mathcal{F},\mathcal{G},p3^{p})$-piecewise decomposable, with $\mathcal{F}=\{f_{q}:\mathcal{L}\rightarrow\mathbb{R}\}$ consisting of rational polynomial functions $f_{q_{1},q_{2}}:l_{\lambda}\mapsto\frac{q_{1}(\lambda_{1},\lambda_{2})}{q_{2}(\lambda_{2})}$, where $q_{1},q_{2}$ have degrees at most $2p$, and $\mathcal{G}=\{g_{r}:\mathcal{L}\rightarrow\{0,1\}\}$ consists of polynomial threshold functions $g_{r}:l_{\lambda}\mapsto\mathbb{I}\{r(\lambda_{1},\lambda_{2})<0\}$, where $r$ is a polynomial of degree 1 in $\lambda_{1}$ and at most $p$ in $\lambda_{2}$.

Proof.

Let $P=(X,y,X_{\text{val}},y_{\text{val}})\in\Pi_{m,p}$ be a regression problem instance. By using the standard reduction to LASSO [ZH05] and the well-known characterization of the LASSO solution in terms of equicorrelation sets, we can characterize the solution $\hat{\beta}_{\lambda,f_{EN}}$ of the ElasticNet as follows (Lemma C.1):

$$\hat{\beta}_{\lambda,f_{EN}}=(X_{\mathcal{E}}^{T}X_{\mathcal{E}}+\lambda_{2}I_{|\mathcal{E}|})^{-1}X_{\mathcal{E}}^{T}y-\lambda_{1}(X_{\mathcal{E}}^{T}X_{\mathcal{E}}+\lambda_{2}I_{|\mathcal{E}|})^{-1}s,$$

for some $\mathcal{E}\subseteq[p]$ and $s\in\{-1,1\}^{p}$. Thus for any $\lambda=(\lambda_{1},\lambda_{2})$, the prediction $\hat{y}$ on any validation example with features $\bm{x}\in\mathbb{R}^{p}$ satisfies (for some $(\mathcal{E},s)\in 2^{[p]}\times\{-1,1\}^{p}$)

$$\hat{y}_{j}=\bm{x}\hat{\beta}_{\lambda,f_{EN}}=\bm{x}(X_{\mathcal{E}}^{T}X_{\mathcal{E}}+\lambda_{2}I_{|\mathcal{E}|})^{-1}X_{\mathcal{E}}^{T}y-\lambda_{1}\bm{x}(X_{\mathcal{E}}^{T}X_{\mathcal{E}}+\lambda_{2}I_{|\mathcal{E}|})^{-1}s.$$
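The closed form above is straightforward to evaluate once a signed set $(\mathcal{E},s)$ is given. The sketch below does exactly that and nothing more: it does not verify that $(\mathcal{E},s)$ is the actual equicorrelation set at the given $(\lambda_1,\lambda_2)$, and it assumes that the validation features are restricted to $\mathcal{E}$ (with $s$ given as a vector of length $|\mathcal{E}|$) so that the dimensions match. The function name and the toy data are hypothetical.

```python
import numpy as np


def elasticnet_coef_given_support(X, y, E, s, lam1, lam2):
    """Evaluate the displayed closed form on a given signed set (E, s).

    Returns the |E| coefficients on the features in E; all other coefficients
    are zero. Whether (E, s) is the true equicorrelation set at (lam1, lam2)
    is NOT checked here.
    """
    X_E = X[:, E]
    G = X_E.T @ X_E + lam2 * np.eye(len(E))
    # G^{-1} X_E^T y - lam1 * G^{-1} s, computed with a single linear solve.
    return np.linalg.solve(G, X_E.T @ y - lam1 * np.asarray(s, dtype=float))


X = np.array([[1.0, 0.0], [1.0, 2.0]])
y = np.array([2.0, 4.0])
E, s = [0], [1]
beta_E = elasticnet_coef_given_support(X, y, E, s, lam1=1.0, lam2=2.0)

x_val = np.array([2.0, 5.0])
y_hat = x_val[E] @ beta_E  # prediction uses only the features in E
```

In this toy case $X_{\mathcal{E}}^{T}X_{\mathcal{E}}+\lambda_2 I = 4$ and $X_{\mathcal{E}}^{T}y = 6$, so the single coefficient is $(6-1)/4 = 1.25$ and the prediction is $2.5$.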

For any subset $R\subseteq\mathbb{R}^{2}$, if the signed equicorrelation set $(\mathcal{E},s)$ is fixed over $R$, then the above observation, together with Lemma 2.1, implies that the loss function $\ell_{\text{EN}}(\lambda,P)$ is a rational function of the form $\frac{q_{1}(\lambda_{1},\lambda_{2})}{q_{2}(\lambda_{2})}$, where $q_{1}$ is a bivariate polynomial with degree at most $2|\mathcal{E}|$ and $q_{2}$ is univariate with degree $2|\mathcal{E}|$.
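One consequence of this form is easy to check numerically: since the denominator depends only on $\lambda_2$ and the coefficients are affine in $\lambda_1$, the loss for a fixed $(\mathcal{E},s)$ and fixed $\lambda_2$ is a quadratic polynomial in $\lambda_1$. The sketch below assumes a mean squared validation loss and uses random toy data and an arbitrary signed set; it is a sanity check under those assumptions, not part of the proof.

```python
import numpy as np


def val_loss_fixed_support(X, y, X_val, y_val, E, s, lam1, lam2):
    """Mean squared validation loss of the closed-form fit on a fixed (E, s)."""
    X_E = X[:, E]
    G = X_E.T @ X_E + lam2 * np.eye(len(E))
    beta_E = np.linalg.solve(G, X_E.T @ y - lam1 * np.asarray(s, dtype=float))
    resid = y_val - X_val[:, E] @ beta_E
    return float(np.mean(resid**2))


rng = np.random.default_rng(1)
X, y = rng.standard_normal((6, 3)), rng.standard_normal(6)
X_val, y_val = rng.standard_normal((5, 3)), rng.standard_normal(5)
E, s, lam2 = [0, 2], [1, -1], 0.7  # arbitrary signed set, held fixed

lam1_grid = 0.1 + 0.25 * np.arange(5)  # equally spaced values of lam1
losses = np.array(
    [val_loss_fixed_support(X, y, X_val, y_val, E, s, l1, lam2) for l1 in lam1_grid]
)
# Third finite differences vanish for a quadratic sampled on a uniform grid.
third_diff = np.diff(losses, n=3)
```

The entries of `third_diff` should be zero up to rounding error, consistent with $q_1$ having degree at most 2 in $\lambda_1$ once $\lambda_2$ is held fixed.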

To show the piecewise structure, we need to demonstrate a set of boundary functions $\mathcal{G}=\{g_{1},\dots,g_{k}\}$ such that for any sign pattern $\mathbf{b}\in\{0,1\}^{k}$, the signed equicorrelation set $(\mathcal{E},s)$ for the region with sign pattern $\mathbf{b}$ is fixed. To this end, based on the observation above, we will consider the conditions (on $\lambda$) under which a covariate may enter or leave the equicorrelation set. We will show that this can happen only at one of a finite number of algebraic curves (with bounded degrees).

Condition for joining $\mathcal{E}$. Fix $\mathcal{E},s$. Also fix $j\notin\mathcal{E}$. If covariate $j$ enters the equicorrelation set, the KKT conditions (Lemma B.1) applied to the LASSO problem corresponding to the ElasticNet (Lemma C.1) imply

$$(\bm{x}_{j}^{*})^{T}(y^{*}-X_{\mathcal{E}}^{*}(c_{1}-c_{2}\lambda_{1}^{*}))=\pm\lambda_{1}^{*},$$

where $c_{1}=({X^{*}_{\mathcal{E}}}^{T}X^{*}_{\mathcal{E}})^{-1}{X^{*}_{\mathcal{E}}}^{T}y^{*}$, $c_{2}=({X^{*}_{\mathcal{E}}}^{T}X^{*}_{\mathcal{E}})^{-1}s$, $X^{*}=\frac{1}{\sqrt{1+\lambda_{2}}}\begin{pmatrix}X\\ \sqrt{\lambda_{2}}I_{p}\end{pmatrix}$, $y^{*}=\begin{pmatrix}y\\ 0\end{pmatrix}$, and $\lambda_{1}^{*}=\frac{\lambda_{1}}{\sqrt{1+\lambda_{2}}}$. Rearranging and simplifying, we get

\begin{align*}
\lambda_1^* &= \frac{(\bm{x}_j^*)^T X^*_{\mathcal{E}}({X^*_{\mathcal{E}}}^T X^*_{\mathcal{E}})^{-1}(X^*_{\mathcal{E}})^T y^* - (\bm{x}^*_j)^T y^*}{(\bm{x}_j^*)^T X^*_{\mathcal{E}}({X^*_{\mathcal{E}}}^T X^*_{\mathcal{E}})^{-1}s \pm 1}, \text{ or}\\
\lambda_1 &= \frac{\bm{x}_j^T X_{\mathcal{E}}({X_{\mathcal{E}}}^T X_{\mathcal{E}} + \lambda_2 I_{|\mathcal{E}|})^{-1}{X_{\mathcal{E}}}^T y - \bm{x}_j^T y}{\bm{x}_j^T X_{\mathcal{E}}({X_{\mathcal{E}}}^T X_{\mathcal{E}} + \lambda_2 I_{|\mathcal{E}|})^{-1}s \pm 1}.
\end{align*}

Note that the terms $(\bm{x}_j^*)^T X^*_{\mathcal{E}} = \bm{x}_j^T X_{\mathcal{E}}$, $(X^*_{\mathcal{E}})^T y^* = X_{\mathcal{E}}^T y$, and $(\bm{x}^*_j)^T y^* = \bm{x}_j^T y$ do not depend on $\lambda_1$ or $\lambda_2$ (the $\lambda_2$ terms are zeroed out since $j \notin \mathcal{E}$).
Moreover, $({X^*_{\mathcal{E}}}^T X^*_{\mathcal{E}})^{-1} = ({X_{\mathcal{E}}}^T X_{\mathcal{E}} + \lambda_2 I_{|\mathcal{E}|})^{-1}$. Using Lemma 2.1, we get an algebraic curve $r_{j,\mathcal{E},s}(\lambda_1,\lambda_2) = 0$ with degree 1 in $\lambda_1$ and $|\mathcal{E}|$ in $\lambda_2$ corresponding to addition of $j \notin \mathcal{E}$ given $\mathcal{E}, s$.
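As an illustration of the entering condition, the closed form for $\lambda_1$ above can be evaluated numerically for a fixed $\lambda_2$. The following is a minimal sketch, not the authors' implementation: the function name `entering_lambda1` and the argument `sign` (selecting the $+1$ or $-1$ branch of the $\pm 1$ term) are hypothetical, and the code only evaluates the displayed formula for given $(\mathcal{E}, s, j)$.

```python
import numpy as np

def entering_lambda1(X, y, E, s, j, lam2, sign=1.0):
    """Evaluate the lambda_1 expression for feature j (not in E) entering,
    given a fixed equicorrelation set E, sign vector s, and lambda_2.

    `sign` is +1.0 or -1.0 and picks one branch of the "+/- 1" term.
    All names here are illustrative assumptions.
    """
    XE = X[:, E]                                   # columns in the equicorrelation set
    G = XE.T @ XE + lam2 * np.eye(len(E))          # X_E^T X_E + lambda_2 I
    xj = X[:, j]
    # Numerator: x_j^T X_E G^{-1} X_E^T y - x_j^T y
    num = xj @ XE @ np.linalg.solve(G, XE.T @ y) - xj @ y
    # Denominator: x_j^T X_E G^{-1} s +/- 1
    den = xj @ XE @ np.linalg.solve(G, np.asarray(s, dtype=float)) + sign
    return num / den
```

Sweeping `lam2` over a grid and plotting the returned values would trace one branch of the curve $r_{j,\mathcal{E},s}(\lambda_1,\lambda_2)=0$ for that choice of $(\mathcal{E}, s, j)$.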

Condition for leaving $\mathcal{E}$. Now consider a fixed $j' \in \mathcal{E}$, given fixed $\mathcal{E}, s$. The coefficient of $j'$ will be zero for $\lambda_1^* = \frac{(c_1)_{j'}}{(c_2)_{j'}}$, which simplifies to $\lambda_1(({X_{\mathcal{E}}}^T X_{\mathcal{E}} + \lambda_2 I_{|\mathcal{E}|})^{-1}s)_{j'} = (({X_{\mathcal{E}}}^T X_{\mathcal{E}} + \lambda_2 I_{|\mathcal{E}|})^{-1}{X_{\mathcal{E}}}^T y)_{j'}$. Again by Lemma 2.1, we get an algebraic curve $r_{j',\mathcal{E},s}(\lambda_1,\lambda_2) = 0$ with degree 1 in $\lambda_1$ and at most $|\mathcal{E}|$ in $\lambda_2$ corresponding to removal of $j' \in \mathcal{E}$ given $\mathcal{E}, s$.

Putting the two together, we get $\sum_{i=0}^{p} 2^i {p \choose i}\left((p-i)+i\right) = p3^p$ algebraic curves of degree 1 in $\lambda_1$ and at most $p$ in $\lambda_2$, across which the signed equicorrelation set may change. These curves characterize the complete set of points $(\lambda_1,\lambda_2)$ at which $(\mathcal{E},s)$ may possibly change. Thus by setting these $p3^p$ curves as the set of boundary functions $\mathcal{G}$, $\mathcal{E},s$ is guaranteed to be fixed for each sign pattern, and the corresponding loss takes the rational function form shown above. ∎
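The count of boundary curves can be checked directly. The sketch below (with the hypothetical helper name `num_boundary_curves`) sums, over equicorrelation sets of size $i$ with $2^i$ sign vectors each, the $(p-i)$ entering curves and $i$ leaving curves, and compares the total against the closed form $p3^p$, which follows from the binomial theorem $\sum_i 2^i \binom{p}{i} = 3^p$.

```python
from math import comb

def num_boundary_curves(p):
    """Sum over |E| = i: 2^i sign vectors, C(p, i) sets,
    (p - i) entering curves plus i leaving curves each."""
    return sum(2**i * comb(p, i) * ((p - i) + i) for i in range(p + 1))

# Compare against the closed form p * 3^p for a few small p.
for p in range(1, 9):
    assert num_boundary_curves(p) == p * 3**p
```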

The exact same piecewise structure can be established for the dual function classes for loss functions $\ell_{\text{EN}}^{\textsf{AIC}}(\lambda,\cdot)$ and $\ell_{\text{EN}}^{\textsf{BIC}}(\lambda,\cdot)$. This is evident from the proof of Theorem 2.2, since any dual piece has a fixed equicorrelation set, and therefore $\|\beta\|_0$ is fixed. Given this piecewise structure, a challenge to learning values of $\lambda$ that minimize the loss function is that the function may not be differentiable (or may even be discontinuous, for the information criteria based losses) at the piece boundaries, making well-known gradient-based (local) optimization techniques inapplicable here. In the following (specifically Algorithm 1) we will show that techniques from data-driven design may be used to overcome this optimization challenge.

3 Learning to Regularize the ElasticNet

We will consider the problem of learning provably good ElasticNet parameters for a given problem domain, from multiple datasets (problem instances) either available as a collection (Section 3.1), or arriving online (Section 3.2). Our parameter tuning techniques also apply to simpler regression techniques typically used for variable selection, like LARS and LASSO, which are reasonable choices if the features are not multicollinear. Additional proof details for the results in this section are located in Appendix C.

3.1 Distributional Setting

Our main result in this section is the following upper bound on the pseudo-dimension of the classes of loss functions for the ElasticNet, which implies that in our distributional setting it is possible to learn near-optimal values of $\lambda$ with polynomially many problem instances.

Theorem 3.1.

$\textsc{Pdim}(\mathcal{H}_{\text{EN}}) = O(p^2)$. Further, $\textsc{Pdim}(\mathcal{H}_{\text{EN}}^{\textsf{AIC}}) = O(p^2)$ and $\textsc{Pdim}(\mathcal{H}_{\text{EN}}^{\textsf{BIC}}) = O(p^2)$.

Proof Sketch. The crucial ingredient is the $(\mathcal{F},\mathcal{G},p3^p)$-piecewise decomposable structure for the dual class function $\mathcal{H}_{\text{EN}}^*$ established in Theorem 2.2, where $\mathcal{F}$ is a class of bivariate rational functions and $\mathcal{G}$ consists of polynomial thresholds, both with bounded degrees. We then bound the complexity of the corresponding dual class functions $\mathcal{F}^*$ and $\mathcal{G}^*$, in order to use the following powerful general result due to [BDD+21] (Theorem C.2 in the appendix):

\[
\textsc{Pdim}(\mathcal{H}) = O\big((\textsc{Pdim}(\mathcal{F}^*) + d_{\mathcal{G}^*})\log(\textsc{Pdim}(\mathcal{F}^*) + d_{\mathcal{G}^*}) + d_{\mathcal{G}^*}\log k\big).
\]

In more detail, we can bound the pseudo-dimension of the dual class of piece functions $\mathcal{F}^*$ (a class of bivariate rational functions) by $O(\log p)$ (Lemma C.4 in the appendix), by giving an upper bound of $O(k^3d^3)$ on the number of sign patterns over $\mathbb{R}^2$ induced by $k$ algebraic curves of degree at most $d$. We can also bound the VC dimension of the dual class of boundary functions $\mathcal{G}^*$ (polynomial thresholds in two variables) by $O(p)$ using a standard linearization argument (Lemma C.5). Finally, the above result from [BDD+21] allows us to bound the pseudo-dimension of $\mathcal{H}$ by combining the above bounds:

\[
\textsc{Pdim}(\mathcal{H}) = O(p\log p + p\log(p3^p)) = O(p^2).
\]
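The arithmetic behind this last step can be spelled out term by term. The following is only a sketch that substitutes the quantities stated in the text, namely $\textsc{Pdim}(\mathcal{F}^*) = O(\log p)$, $d_{\mathcal{G}^*} = O(p)$ and $k = p3^p$, into the general bound; it adds no new assumptions beyond those.

```latex
\begin{align*}
(\textsc{Pdim}(\mathcal{F}^*) + d_{\mathcal{G}^*})\log(\textsc{Pdim}(\mathcal{F}^*) + d_{\mathcal{G}^*})
  &= O\big((\log p + p)\log(\log p + p)\big) = O(p\log p),\\
d_{\mathcal{G}^*}\log k
  &= O\big(p\log(p3^p)\big) = O\big(p(\log p + p\log 3)\big) = O(p^2),\\
\textsc{Pdim}(\mathcal{H})
  &= O(p\log p + p^2) = O(p^2).
\end{align*}
```

The dominant contribution is therefore the $d_{\mathcal{G}^*}\log k$ term, driven by the exponential number $p3^p$ of boundary curves.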

The dual classes $(\mathcal{H}_{\text{EN}}^{\textsf{AIC}})^*$ and $(\mathcal{H}_{\text{EN}}^{\textsf{BIC}})^*$ also follow the same piecewise decomposable structure given by Theorem 2.2. This is because in each piece the equicorrelation set $\mathcal{E}$, and therefore $\|\beta\|_0 = |\mathcal{E}|$, is fixed (Lemma B.2). The above argument implies an identical upper bound on the pseudo-dimensions of $\mathcal{H}_{\text{EN}}^{\textsf{AIC}}$ and $\mathcal{H}_{\text{EN}}^{\textsf{BIC}}$. See Appendix C for further proof details, including the technical lemmas. □

The upper bound above implies a guarantee on the sample complexity of learning the ElasticNet tuning parameter, using standard learning-theoretic results [AB99], under mild boundedness assumptions on the data and hyperparameter search space.

Assumption 1 (Boundedness).

The predicted variable and all feature values are bounded by an absolute constant $R$, i.e. $\max\{\|X^{(i)}\|_{\infty,\infty}, \|y^{(i)}\|_{\infty}, \|X_{\text{val}}^{(i)}\|_{\infty,\infty}, \|y_{\text{val}}^{(i)}\|_{\infty}\} \leq R$. Furthermore, the regularization coefficients are bounded, $(\lambda_1,\lambda_2) \in [\lambda_{\min},\lambda_{\max}]^2$ for $0 < \lambda_{\min} < \lambda_{\max} < \infty$.

In our setting of learning from multiple problem instances, each sample is a dataset instance, so the sample complexity is simply the number of regression problem instances needed to learn the tuning parameters to any given approximation and confidence level.

Theorem 3.2 (Sample complexity of tuning the ElasticNet).

Suppose Assumption 1 holds. Let $\mathcal{D}$ be an arbitrary distribution over the problem space $\Pi_{m,p}$. There is an algorithm which, given $n = O\left(\frac{H^2}{\epsilon^2}\left(p^2 + \log\frac{1}{\delta}\right)\right)$ problem samples drawn from $\mathcal{D}$, for any $\epsilon > 0$ and $\delta \in (0,1)$ and some constant $H$, outputs a regularization parameter $\hat{\lambda}$ for the ElasticNet such that with probability at least $1-\delta$ over the draw of the problem samples, we have that

\[
\Big\lvert \mathbb{E}_{P\sim\mathcal{D}}[\ell_{\text{EN}}(\hat{\lambda},P)] - \min_{\lambda}\mathbb{E}_{P\sim\mathcal{D}}[\ell_{\text{EN}}(\lambda,P)] \Big\rvert \leq \epsilon.
\]
Proof.

We use Lemma C.6 to conclude that the validation loss is uniformly bounded by some constant $H$ under Assumption 1. The result then follows from substituting our result in Theorem 3.1 into the well-known generalization guarantee for function classes with bounded pseudo-dimension (Theorem A.1). ∎
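To get a feel for how the bound in Theorem 3.2 scales, the expression inside the $O(\cdot)$ can be evaluated for concrete values. This is a rough sketch only: the multiplicative constant `c` hidden by the $O(\cdot)$ notation is not given in the text and is set to 1 here purely as a placeholder, and the value of `H` is likewise an assumed input, so the returned numbers indicate scaling rather than actual sample sizes.

```python
import math

def sample_complexity(p, eps, delta, H=1.0, c=1.0):
    """Evaluate c * (H^2 / eps^2) * (p^2 + log(1/delta)), rounded up.

    `c` stands in for the unspecified constant in the O(.) bound and
    `H` for the loss bound; both are assumptions of this sketch.
    """
    if eps <= 0 or not (0 < delta < 1):
        raise ValueError("need eps > 0 and 0 < delta < 1")
    return math.ceil(c * (H**2 / eps**2) * (p**2 + math.log(1.0 / delta)))

# Halving eps roughly quadruples the bound; delta enters only logarithmically.
n_coarse = sample_complexity(p=10, eps=0.2, delta=0.05)
n_fine = sample_complexity(p=10, eps=0.1, delta=0.05)
```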

Discussion and applications. Computing the parameters which minimize the loss on the problem samples (aka Empirical Risk Minimization, or ERM) achieves the sample complexity bound in Theorem 3.2. Even though we only need polynomially many samples to guarantee the selection of nearly-optimal parameters, it is not clear how to implement the ERM efficiently. Note that we do not assume the set of features is the same across problem instances, so our approach can handle feature reset, i.e. different problem instances can differ in not only the number of examples but also the number of features. Moreover, as a special case application, we consider the commonly used techniques of leave-one-out cross validation (LOOCV) and Monte Carlo cross validation (repeated random test-validation splits, typically independent and in a fixed proportion). Given a dataset of size $m_{tr}$, LOOCV would require $m_{tr}$ regression fits, which can be inefficient for large datasets.
Alternately, we can consider draws from a distribution $\mathcal{D}_{LOO}$ which generates problem instances $P$ from a fixed dataset $(X,y) \in \mathbb{R}^{(m+1)\times p} \times \mathbb{R}^{m+1}$ by uniformly selecting $j \in [m+1]$ and setting $P = (X_{-j*}, y_{-j}, X_{j*}, y_j)$. Theorem 3.2 now implies that $\tilde{O}(p^2/\epsilon^2)$ iterations are enough to determine an ElasticNet parameter $\hat{\lambda}$ with loss within $\epsilon$ (with high probability) of the parameter $\lambda^*$ obtained from running the full LOOCV. Similarly, we can define a distribution $\mathcal{D}_{MC}$ to capture the Monte Carlo cross validation procedure and determine the number of iterations sufficient to get an $\epsilon$-approximation of the loss corresponding to parameter selection with an arbitrarily large number of runs of the procedure.
Thus, in a very precise sense, our results answer the question of how much cross-validation is enough to effectively implement the above techniques.
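The sampling view of LOOCV described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the paper's procedure: `sample_loo_instance` draws one instance $P = (X_{-j*}, y_{-j}, X_{j*}, y_j)$ with $j$ uniform, `estimate_loo_loss` averages a user-supplied loss over a number of draws, and `ridge_val_loss` is a stand-in loss for the Ridge special case ($\lambda_1 = 0$) using the objective $\|y - X\beta\|_2^2 + \lambda_2\|\beta\|_2^2$, a scaling assumed here for illustration. All three names are hypothetical.

```python
import numpy as np

def sample_loo_instance(X, y, rng):
    """Draw one problem instance from D_LOO: hold out a uniformly random row j."""
    j = rng.integers(X.shape[0])
    mask = np.ones(X.shape[0], dtype=bool)
    mask[j] = False
    return X[mask], y[mask], X[j:j + 1], y[j:j + 1]

def ridge_val_loss(lam2):
    """Stand-in validation loss (Ridge, i.e. lambda_1 = 0); assumed objective scaling."""
    def loss(X_tr, y_tr, X_val, y_val):
        p = X_tr.shape[1]
        beta = np.linalg.solve(X_tr.T @ X_tr + lam2 * np.eye(p), X_tr.T @ y_tr)
        return float(np.mean((X_val @ beta - y_val) ** 2))
    return loss

def estimate_loo_loss(X, y, loss_fn, n_draws, seed=0):
    """Monte Carlo estimate of the LOOCV loss from n_draws sampled instances."""
    rng = np.random.default_rng(seed)
    draws = [loss_fn(*sample_loo_instance(X, y, rng)) for _ in range(n_draws)]
    return float(np.mean(draws))
```

In this sketch the number of regression fits is `n_draws` rather than the dataset size, which is the trade-off the paragraph above quantifies.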

Remark 1.

While our result implies polynomial sample complexity, the question of learning the provably near-optimal parameter efficiently (even in output polynomial time) is left open. For the special cases of LASSO ($\lambda_2 = 0$) and Ridge ($\lambda_1 = 0$), the piece boundaries of the piecewise polynomial dual class (loss) function may be computed efficiently (using the LARS-LASSO algorithm of [EHJT04] for LASSO, and solving linear systems and locating roots of polynomials for Ridge). This applies to online and classification settings in the following sections as well.

3.2 Online Learning

We will now extend our results to learning the regularization coefficients given an online sequence of regression problems, such as when one needs to solve a new regression problem each day. Unlike the distributional setting above, we will not assume any problem distribution and our results will hold for an adversarial sequence of problem instances. We will need very mild assumptions on the data, namely boundedness of feature and prediction values and ‘smoothness’ of predictions (formally stated as Assumptions 1 and 2), while our distributional results above hold for worst-case problem datasets.

We will need two mild assumptions on the datasets in our problem instances for our results to hold. Our first assumption is that all feature values and predictions are bounded, for training as well as validation examples (Assumption 1 above). We will need the following definition to state our second assumption. Roughly speaking the definition below captures smoothness of a distribution.

Definition 3.

A continuous probability distribution is said to be $\kappa$-bounded if the probability density function $p(x)$ satisfies $p(x) \leq \kappa$ for any $x$ in the sample space.

For example, the normal distribution $\mathcal{N}(\mu,\sigma^2)$ with mean $\mu$ and standard deviation $\sigma$ is $\frac{1}{\sigma\sqrt{2\pi}}$-bounded. We assume that the predicted variable $y$ in the training set comes from a $\kappa$-bounded (i.e. smooth) distribution, which does not require the strong tail decay of sub-Gaussian distributions [Zha09, CP09]. Moreover, the online adversary is allowed to change the distribution as long as it is $\kappa$-bounded. Note that our assumption also captures common data preprocessing steps, for example the jitter parameter in the popular Python library scikit-learn [P+11] adds uniform noise to the $y$ values to help model stability. The assumption is formally stated as follows:

Assumption 2 (Smooth predictions).

The predicted variables $y^{(i)}$ in the training set are drawn from a joint $\kappa$-bounded distribution, i.e. for each $i$, the variables $y^{(i)}$ have a joint distribution with probability density bounded by $\kappa$.
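To make $\kappa$-boundedness concrete in one dimension, the sketch below computes the smallest valid $\kappa$ for two simple cases: a Gaussian, whose density peaks at its mean with value $\frac{1}{\sigma\sqrt{2\pi}}$, and a uniform distribution on an interval of width $w$, whose density is the constant $1/w$. The helper names are hypothetical, and the uniform case is included only as an analogy for additive uniform noise, not as a description of any particular library's behavior.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def gaussian_kappa(sigma):
    """Smallest kappa for which N(mu, sigma^2) is kappa-bounded."""
    return 1.0 / (sigma * math.sqrt(2.0 * math.pi))

def uniform_kappa(width):
    """Smallest kappa for a uniform distribution on an interval of this width."""
    return 1.0 / width

# The Gaussian density never exceeds gaussian_kappa(sigma) on a test grid.
sigma = 0.5
grid = [-3.0 + 0.01 * i for i in range(601)]
max_density = max(gaussian_pdf(x, 0.0, sigma) for x in grid)
```

Smaller $\sigma$ or narrower uniform noise gives a larger $\kappa$, i.e. a less smooth distribution in the sense of Definition 3.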

Under these assumptions, we can show that it is possible to learn the ElasticNet parameters with sublinear expected regret when the problem instances arrive online. The learning algorithm (Algorithm 1) that achieves this regret is a continuous variant of the classic Exponential Weights algorithm [CBL06, BDV18]. It samples points in the domain with probability inversely proportional to the exponentiated loss. To formally state our result, we will need the following definition of dispersed loss functions. Informally speaking, it captures how amenable a set of non-Lipschitz functions is to online learning by measuring the worst rate of occurrence of non-Lipschitzness (or discontinuities) between any pair of points in the domain. [BDV18, BDS20, BDP20] show that dispersion is necessary and sufficient for learning piecewise Lipschitz functions.
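The sampling rule described above can be illustrated with a finite-grid approximation. The sketch below is not Algorithm 1 (which operates on the continuous domain and is not reproduced in this section); it is a discretized Exponential Weights forecaster in which each grid point $\rho$ is sampled with probability proportional to $\exp(-\eta \cdot \text{cumulative loss}(\rho))$. The function name `exp_weights_grid`, the step size `eta`, and the grid itself are assumptions made for illustration.

```python
import numpy as np

def exp_weights_grid(loss_fns, grid, eta, seed=0):
    """Discretized exponential weights over candidate (lambda_1, lambda_2) pairs.

    loss_fns: sequence of callables, one per round, mapping a grid point to a loss.
    grid:     array of shape (K, 2) of candidate parameter pairs (an assumption;
              the algorithm in the text works on the continuous domain).
    Returns the points played and the sampling distribution of the last round.
    """
    rng = np.random.default_rng(seed)
    cum_loss = np.zeros(len(grid))
    probs = np.full(len(grid), 1.0 / len(grid))
    chosen = []
    for loss in loss_fns:
        # Sample with probability proportional to exp(-eta * cumulative loss).
        logits = -eta * cum_loss
        probs = np.exp(logits - logits.max())   # subtract max for numerical stability
        probs /= probs.sum()
        idx = rng.choice(len(grid), p=probs)
        chosen.append(grid[idx])
        # Full-information update: observe this round's loss at every grid point.
        cum_loss += np.array([loss(pt) for pt in grid])
    return chosen, probs
```

A grid makes the sketch easy to run but loses the guarantees discussed in the text, which rely on sampling from the continuous domain together with dispersion of the loss sequence.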

Definition 4 (Dispersion [BDP20]).

The sequence of random loss functions $l_{1},\dots,l_{T}$ is $\beta$-dispersed for the Lipschitz constant $L$ if, for all $T$ and for all $\epsilon\geq T^{-\beta}$, we have that, in expectation, at most $\tilde{O}(\epsilon T)$ functions (the soft-O notation suppresses dependence on quantities beside $\epsilon,T$ and $\beta$, as well as logarithmic terms) are not $L$-Lipschitz for any pair of points at distance $\epsilon$ in the domain $\mathcal{C}$. That is, for all $T$ and for all $\epsilon\geq T^{-\beta}$,

$$\mathbb{E}\Big[\max_{\substack{\rho,\rho^{\prime}\in\mathcal{C}\\ \lVert\rho-\rho^{\prime}\rVert_{2}\leq\epsilon}}\big\lvert\{t\in[T]\mid l_{t}(\rho)-l_{t}(\rho^{\prime})>L\lVert\rho-\rho^{\prime}\rVert_{2}\}\big\rvert\Big]\leq\tilde{O}(\epsilon T).$$
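To make the definition concrete, the following is a minimal sketch for a toy one-dimensional case that is not taken from the paper: each loss is a step function $l_t(\rho)=\mathbb{I}\{\rho\geq c_t\}$ with an independent uniform threshold $c_t\in[0,1]$, and $L=0$. In that case $l_t$ violates the Lipschitz condition between $\rho^{\prime}<\rho$ exactly when $c_t\in(\rho^{\prime},\rho]$, so the quantity inside the expectation is the largest number of thresholds in any window of length $\epsilon$. The function names are hypothetical, and the closed window used here can only overcount the half-open one.

```python
import numpy as np

def max_points_in_window(points, eps):
    """Largest number of points falling in any closed interval [a, a + eps]."""
    pts = np.sort(np.asarray(points, dtype=float))
    # The maximum is attained by a window whose left end sits on a point,
    # so it suffices to count points in [pts[j], pts[j] + eps] for each j.
    right = np.searchsorted(pts, pts + eps, side="right")
    return int(np.max(right - np.arange(len(pts))))

def dispersion_ratio(T, beta, rng):
    """Worst-case discontinuity count divided by eps * T, at eps = T^(-beta)."""
    eps = T ** (-beta)
    thresholds = rng.uniform(0.0, 1.0, size=T)  # one jump location per loss l_t
    return max_points_in_window(thresholds, eps) / (eps * T)
```

For uniformly random thresholds one would expect `dispersion_ratio(T, 0.5, rng)` to stay within a modest factor of 1 as `T` grows, which is the behaviour the $\tilde{O}(\epsilon T)$ bound describes; this is an empirical illustration only, not a proof.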

Our key contribution is to show that the loss sequence is dispersed (Definition 4) under the above assumptions. This involves establishing additional structure for the problem, specifically about the location of boundary functions in the piecewise structure from Theorem 2.2. This stronger characterization coupled with results from [BDP20] on dispersion of algebraic discontinuities completes the proof.

Theorem 3.3.

Suppose Assumptions 1 and 2 hold. Let $l_{1},\dots,l_{T}:(0,\lambda_{\max})^{2}\rightarrow\mathbb{R}_{\geq 0}$ denote an independent sequence of losses (e.g. fresh randomness is used to generate the validation set features in each round) as a function of the ElasticNet regularization parameter $\lambda=(\lambda_{1},\lambda_{2})$, $l_{i}(\lambda)=l_{r}(\hat{\beta}^{(X^{(i)},y^{(i)})}_{\lambda,f_{EN}},(X_{\text{val}}^{(i)},y_{\text{val}}^{(i)}))$. The sequence of functions is $\frac{1}{2}$-dispersed, and there is an online algorithm with $\tilde{O}(\sqrt{T})$ expected regret (the $\tilde{O}(\cdot)$ notation hides dependence on logarithmic terms, as well as on quantities other than $T$). The result also holds for loss functions adjusted by information criteria AIC and BIC.

Proof Sketch. We start with the $(\mathcal{F},\mathcal{G},p3^{p})$-piecewise decomposable structure for the dual class function $\mathcal{H}_{\text{EN}}^{*}$ from Theorem 2.2. Observe that the rational piece functions in $\mathcal{F}$ do not introduce any new discontinuities since the denominator polynomials do not have positive roots. For each of two types of boundary functions in $\mathcal{G}$ (corresponding to leaving/entering the equicorrelation set) we show that the discontinuities between any pair of points $\lambda,\lambda^{\prime}$ lie along the roots of polynomials with non-leading coefficients bounded and smoothly distributed (bounded joint density). This allows us to use results from [BDP20] to establish dispersion, and therefore online learnability. $\square$

Algorithm 1 Data-driven Regularization ($\zeta$)
  Input: Problems $(X^{(i)},y^{(i)})$ and regularization penalty function $f$.
  Hyperparameter: step size parameter $\zeta\in(0,1]$.
  Output: Regularization parameters $(\lambda_{i})_{i\in[T]}$ with each $\lambda_{i}\in C$, where $C\subset\mathbb{R}^{+}$ (LASSO/Ridge) or $C\subset(\mathbb{R}^{+})^{2}$ (ElasticNet).
  Set $w_{1}(\lambda)=1$ for all $\lambda\in C$.
  for $i=1,2,\dots,T$ do
     $W_{i}:=\int_{C}w_{i}(\lambda)\,d\lambda$.
     Sample $\lambda$ with probability $p_{i}(\lambda)=\frac{w_{i}(\lambda)}{W_{i}}$, output as $\lambda_{i}$.
     Compute average loss function $l_{i}(\lambda)=\frac{1}{|y^{(i)}|}l(\hat{\beta}_{\lambda,f},(X^{(i)},y^{(i)}))$.
     For each $\lambda\in C$, update weights $w_{i+1}(\lambda)=e^{\zeta(1-l_{i}(\lambda))}w_{i}(\lambda)$.
  end for
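The update rule of Algorithm 1 can be sketched in a few lines of Python under simplifying assumptions that are ours rather than the paper's: the continuous domain $C$ is replaced by a finite grid (so the integral $W_i$ becomes a sum), the penalty is Ridge with its closed-form solution, the loss is the mean squared error on a validation split as in Theorem 3.3, and losses are clipped to $[0,1]$. All function names are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form Ridge solution: argmin_b ||y - X b||^2 + lam * ||b||^2."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def validation_loss(X, y, X_val, y_val, lam):
    """Mean squared validation error, clipped to [0, 1].
    The clipping is an assumption made here to keep the loss bounded."""
    beta = ridge_fit(X, y, lam)
    return float(min(1.0, np.mean((y_val - X_val @ beta) ** 2)))

def exponential_weights(problems, grid, zeta, rng):
    """Discretized sketch of Algorithm 1 over the candidate values in `grid`."""
    log_w = np.zeros(len(grid))             # w_1(lambda) = 1 for every grid point
    chosen = []
    for X, y, X_val, y_val in problems:
        probs = np.exp(log_w - log_w.max()) # subtract the max for numerical stability
        probs /= probs.sum()                # p_i(lambda) = w_i(lambda) / W_i
        idx = rng.choice(len(grid), p=probs)
        chosen.append(grid[idx])            # lambda_i is committed before the loss is seen
        losses = np.array(
            [validation_loss(X, y, X_val, y_val, lam) for lam in grid]
        )
        log_w += zeta * (1.0 - losses)      # w_{i+1} = exp(zeta * (1 - l_i)) * w_i
    return np.array(chosen), log_w
```

Weights are kept in log space because the multiplicative update would otherwise overflow for long sequences. The grid is only a stand-in: the guarantees in the text concern sampling from the continuous density over $C$.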

We remark that the above result holds for arbitrary training features and validation sets in the problem sequence that satisfy our assumptions, in particular the losses are only assumed to be independent but not identically distributed. In contrast, the results in the previous section needed them to be drawn from the same distribution. Also the parameters need to be selected online, and cannot be changed for already seen instances. This setting captures interesting practical settings where the set of features (including feature dimensions) and the relevant training set (including training set size) may change over the online sequence. It is not clear how usual model selection techniques like cross-validation may be adapted to these challenging settings.

4 Extension to Regularized Least Squares Classification

Regression techniques can also be used to train binary classifiers by using an appropriate threshold on top of the regression estimate. Intuitively, regression learns a linear mapping which projects the datapoints onto a one-dimensional space, i.e. a real number, after which a threshold may be applied to classify the points. The use of thresholds to make discrete classifications adds discontinuities to the empirical loss function. Thus, in general, the classification setting is more challenging as it already includes the piecewise structure in the regression loss. We provide statistical and online learning guarantees for Ridge and LASSO. For the ElasticNet we present the extensions needed to the arguments from the previous sections to obtain results in the classification setting.

More formally, we will restrict $y$ to $\{0,1\}^{m}$. The estimator $\hat{\beta}_{\lambda,f}$ is obtained as before, and the prediction on a test instance $x$ may be obtained by taking the sign of a thresholded regression estimate, $\mathsf{sgn}(\langle x,\hat{\beta}_{\lambda,f}\rangle-\tau)$, where $\mathsf{sgn}:\mathbb{R}\rightarrow\{0,1\}$ maps $x\in\mathbb{R}$ to $\mathbb{I}\{x\geq 0\}$ and $\tau\in\mathbb{R}$ is the threshold. The threshold $\tau$ corresponds to the intercept or bias of the learned linear classifier; here we will treat it as a tunable hyperparameter (in addition to $\lambda_{1},\lambda_{2}$). We can still have a problem-instance-specific bias in $\beta$ using the standard trick of adding a unit feature to $X$, thus we generalize the common practice of using a fixed threshold. For example, the RidgeClassifier implementation in Python library scikit-learn 1.1.1 [P+11] assumes $y\in\{-1,+1\}^{m}$ and sets $\tau=0$.
The average 0-1 loss over the dataset $(X,y)$ is given by $l_{c}(\hat{\beta}_{\lambda,f},(X,y),\tau)=\frac{1}{m}\sum_{i=1}^{m}|y_{i}-\mathsf{sgn}(\langle X_{i},\hat{\beta}_{\lambda,f}\rangle-\tau)|$ (squared loss and 0-1 loss are identical in this setting). Proofs from this section are in Appendix D.
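The thresholded prediction and the average 0-1 loss defined above can be written out directly. This is a minimal sketch with hypothetical function names; it takes the coefficient vector as given rather than fitting it.

```python
import numpy as np

def threshold_predict(X, beta, tau):
    """sgn(<x, beta> - tau), with the convention sgn(z) = 1 if z >= 0 else 0."""
    return (X @ beta - tau >= 0).astype(int)

def zero_one_loss(X, y, beta, tau):
    """Average 0-1 loss l_c for labels y in {0, 1}^m."""
    return float(np.mean(np.abs(y - threshold_predict(X, beta, tau))))
```

Because the predictions only change when $\tau$ crosses one of the scores $\langle X_i,\hat{\beta}_{\lambda,f}\rangle$, the loss is piecewise constant in $\tau$ for a fixed estimator, which is the source of the extra discontinuities discussed in this section.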

4.1 Distributional setting

The problem setting is the same as in Section 3.1, except that the labels $y$ are binary and we use a threshold for prediction. We bound the pseudo-dimension for classification loss on these problem instances, which as before (cf. Theorems 3.1 and 3.2) implies that polynomially many problem samples are sufficient to generalize well over the problem distribution $\mathcal{D}$. For Ridge and LASSO we upper bound the number of discontinuities of the piecewise constant classification loss by determining the values of $\lambda$ where any prediction changes.

Theorem 4.1.

Let $\mathcal{H}_{\text{Ridge}}^{c}$, $\mathcal{H}_{\text{LASSO}}^{c}$ and $\mathcal{H}_{\text{EN}}^{c}$ denote the set of loss functions for classification problems with at most $m$ examples and $p$ features, for linear classifiers regularized using Ridge, LASSO and ElasticNet regression respectively.

  • (i) $\textsc{Pdim}(\mathcal{H}_{\text{Ridge}}^{c})=O(\log mp)$.

  • (ii) $\textsc{Pdim}(\mathcal{H}_{\text{LASSO}}^{c})=O(p\log m)$. Further, in the overparameterized regime ($p\gg m$), we have that $\textsc{Pdim}(\mathcal{H}_{\text{LASSO}}^{c})=O(m\log\frac{p}{m})$.

  • (iii) $\textsc{Pdim}(\mathcal{H}_{\text{EN}}^{c})=O(p^{2}+p\log m)$.

The key difference with the bound for the regression loss in Theorem 3.1 is the additional $O(p\log m)$ term which corresponds to discontinuities induced by the thresholding in the regression based classifiers. We can establish a structure similar to Theorem 2.2 in this case (Lemma D.1).

4.2 Online setting

As in Section 3.2, we can define an online learning setting for classification. Note that the smoothness of the predicted variable is not meaningful here, since $y$ is a binary vector. Instead we will assume that the validation examples have smooth feature values. Intuitively this means that small perturbations to the feature values do not meaningfully change the problem.

Assumption 3 (Smooth validation features).

The feature values $(X_{\text{val}}^{(i)})_{jk}$ in the validation examples are drawn from a joint $\kappa$-bounded distribution.

Under this assumption, we show that we can learn the regularization parameters online, for each of the Ridge, LASSO and ElasticNet estimators. The proofs are straightforward extensions of the structural results developed in the previous sections, with minor technical changes to use the above validation set feature smoothness instead of Assumption 2, and are deferred to the appendix.

Theorem 4.2.

Suppose Assumptions 1 and 3 hold. Let $l_{1},\dots,l_{T}:(0,H]^{d}\times[-H,H]\rightarrow\mathbb{R}$ denote an independent sequence of losses as a function of the regularization parameter $\lambda$, $l_{i}(\lambda,\tau)=l_{c}(\hat{\beta}_{\lambda,f},(X^{(i)},y^{(i)}),\tau)$. If $f$ is given by $f_{1}$ (LASSO), $f_{2}$ (Ridge), or $f_{EN}$ (ElasticNet) then the sequence of functions is $\frac{1}{2}$-dispersed and there is an online algorithm with $\tilde{O}(\sqrt{T})$ expected regret.

5 Conclusions and Future Work

We obtain a novel structural result for the ElasticNet loss as a function of the tuning parameters. Our characterization is useful in giving upper bounds for the sample complexity of learning the parameters from multiple regression problem instances (i.e. different datasets, possibly corresponding to different tasks) from the same problem domain. Efficient algorithms are immediate from our results for Ridge and LASSO. For the ElasticNet we show generalization and online regret guarantees, but efficient implementation of the algorithms is an interesting question for further work. Also we show general learning-theoretic guarantees, i.e. without any significant restrictions on the data-generating distribution, in learning from multiple problems. The problems may be drawn i.i.d. from an arbitrary problem distribution, or even arrive in an online sequence but with some smoothness properties. It is unclear if such general guarantees may be given for tuning parameters for the more standard setting of tuning over a single training set generated by i.i.d. draws from an example distribution, or how such guarantees can be combined with our results.

Acknowledgments

This material is based on work supported by the National Science Foundation under grants CCF-1910321, IIS-1705121, IIS-1838017, IIS-1901403, IIS-2046613, IIS-2112471, and SES-1919453; the Defense Advanced Research Projects Agency under cooperative agreement HR00112020003; a Simons Investigator Award; an AWS Machine Learning Research Award; an Amazon Research Award; a Bloomberg Research Grant; a Microsoft Research Faculty Fellowship; funding from Meta, Morgan Stanley, and Amazon; and a Facebook PhD Fellowship. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of any of these funding agencies.

References

  • [AB99] Martin Anthony and Peter Bartlett. Neural network learning: Theoretical foundations, volume 9. Cambridge University Press, 1999.
  • [Aka74] Hirotugu Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, 1974.
  • [Bal20] Maria-Florina Balcan. Book chapter Data-Driven Algorithm Design. In Beyond Worst Case Analysis of Algorithms, Tim Roughgarden (Ed). Cambridge University Press, 2020.
  • [BDD+21] Maria-Florina Balcan, Dan DeBlasio, Travis Dick, Carl Kingsford, Tuomas Sandholm, and Ellen Vitercik. How much data is sufficient to learn high-performing algorithms? Generalization guarantees for data-driven algorithm design. In Symposium on Theory of Computing (STOC), pages 919–932, 2021.
  • [BDL19] Maria-Florina Balcan, Travis Dick, and Manuel Lang. Learning to link. In International Conference on Learning Representations (ICLR), 2019.
  • [BDP20] Maria-Florina Balcan, Travis Dick, and Wesley Pegden. Semi-bandit optimization in the dispersed setting. In Conference on Uncertainty in Artificial Intelligence (UAI), pages 909–918. PMLR, 2020.
  • [BDS20] Maria-Florina Balcan, Travis Dick, and Dravyansh Sharma. Learning piecewise Lipschitz functions in changing environments. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 3567–3577. PMLR, 2020.
  • [BDV18] Maria-Florina Balcan, Travis Dick, and Ellen Vitercik. Dispersion for data-driven algorithm design, online learning, and private optimization. In Foundations of Computer Science (FOCS), pages 603–614. IEEE, 2018.
  • [BNS23] Nina Balcan, Anh Tuan Nguyen, and Dravyansh Sharma. New bounds for hyperparameter tuning of regression problems across instances. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
  • [BPSV21] Maria-Florina Balcan, Siddharth Prasad, Tuomas Sandholm, and Ellen Vitercik. Sample complexity of tree search configuration: Cutting planes and beyond. Advances in Neural Information Processing Systems (NeurIPS), 34, 2021.
  • [BS21] Maria-Florina Balcan and Dravyansh Sharma. Data driven semi-supervised learning. Advances in Neural Information Processing Systems (NeurIPS), 34, 2021.
  • [BSV16] Maria-Florina Balcan, Tuomas Sandholm, and Ellen Vitercik. Sample complexity of automated mechanism design. Advances in Neural Information Processing Systems (NeurIPS), 29, 2016.
  • [CBL06] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
  • [Cha92] John Chambers. Linear models (book chapter). In Statistical models in S, Trevor Hastie (Ed). Wadsworth & Brooks, 1992.
  • [CLC21] Denis Chetverikov, Zhipeng Liao, and Victor Chernozhukov. On cross-validated Lasso in high dimensions. The Annals of Statistics, 49(3):1300–1317, 2021.
  • [CLW16] Michael Chichignoud, Johannes Lederer, and Martin Wainwright. A practical scheme and fast algorithm to tune the lasso with optimality guarantees. The Journal of Machine Learning Research (JMLR), 17:8162–8181, 2016.
  • [Cov65] Thomas Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, 3:326–334, 1965.
  • [CP09] Emmanuel Candès and Yaniv Plan. Near-ideal model selection by L1 minimization. The Annals of Statistics, 37(5A):2145–2177, 2009.
  • [DP14] Amit Dhurandhar and Marek Petrik. Efficient and accurate methods for updating generalized linear models with multiple feature additions. The Journal of Machine Learning Research (JMLR), 15(1):2607–2627, 2014.
  • [Dud67] Richard Dudley. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1(3):290–330, 1967.
  • [DW18] Edgar Dobriban and Stefan Wager. High-dimensional asymptotics of prediction: Ridge regression and classification. The Annals of Statistics, 46(1):247–279, 2018.
  • [EHJT04] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.
  • [FDSC+19] Manuel Fernández-Delgado, Manisha Sanjay Sirsat, Eva Cernadas, Sadi Alawadi, Senén Barro, and Manuel Febrero-Bande. An extensive experimental survey of regression methods. Neural Networks, 111:11–34, 2019.
  • [FL10] Jianqing Fan and Jinchi Lv. A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20(1):101, 2010.
  • [Fuc05] Jean-Jacques Fuchs. Recovery of exact sparse representations in the presence of bounded noise. IEEE Transactions on Information Theory, 51(10):3601–3608, 2005.
  • [Gib81] Diane Galarneau Gibbons. A simulation study of some ridge estimators. Journal of the American Statistical Association (JASA), 76(373):131–139, 1981.
  • [GR17] Rishi Gupta and Tim Roughgarden. A PAC approach to application-specific algorithm selection. SIAM Journal on Computing (SICOMP), 46(3):992–1017, 2017.
  • [HK70] Arthur Hoerl and Robert Kennard. Ridge regression: applications to nonorthogonal problems. Technometrics, 12(1):69–82, 1970.
  • [HMRT22] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. The Annals of Statistics, 50(2):949–986, 2022.
  • [HTF09] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, volume 2. Springer, 2009.
  • [KKM15] Lisa-Ann Kirkland, Frans Kanfer, and Sollie Millard. LASSO tuning parameter selection. In South African Statistical Association Conference (SASA), volume 2015, pages 49–56. South African Statistical Association (SASA), 2015.
  • [LL10] Qing Li and Nan Lin. The Bayesian elastic net. Bayesian analysis, 5(1):151–170, 2010.
  • [LM15] Johannes Lederer and Christian Müller. Don’t fall for tuning parameters: tuning-free variable selection in high dimensions with the TREX. In AAAI Conference on Artificial Intelligence, volume 29, 2015.
  • [MRT12] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
  • [Mur12] Kevin Murphy. Machine learning: a Probabilistic Perspective. MIT press, 2012.
  • [P+11] Fabian Pedregosa et al. Scikit-learn: Machine Learning in Python. The Journal of Machine Learning Research (JMLR), 12:2825–2830, 2011.
  • [Pol12] David Pollard. Convergence of stochastic processes. Springer Science & Business Media, 2012.
  • [PWRT21] Pratik Patil, Yuting Wei, Alessandro Rinaldo, and Ryan Tibshirani. Uniform consistency of cross-validation estimators for high-dimensional ridge regression. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 3178–3186. PMLR, 2021.
  • [Sch78] Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, pages 461–464, 1978.
  • [TA77] Andrey Tikhonov and Vasily Arsenin. Solutions of ill-posed problems. Winston, 1977.
  • [Tib96] Robert Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
  • [Tib13] Ryan Tibshirani. The Lasso problem and uniqueness. Electronic Journal of Statistics, 7:1456–1490, 2013.
  • [ZH05] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.
  • [Zha09] Tong Zhang. Some sharp performance bounds for least squares regression with L1 regularization. The Annals of Statistics, 37(5A):2109–2144, 2009.
  • [ZY07] Peng Zhao and Bin Yu. Stagewise Lasso. The Journal of Machine Learning Research (JMLR), 8:2701–2726, 2007.
  • [ZY21] Yu Zhang and Qiang Yang. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, 2021.

Appendix

Appendix A A classic Generalization Bound

The pseudo-dimension (also known as the Pollard dimension) is a generalization of the VC-dimension to real-valued functions, and may be defined as follows.

Definition 5 (Pseudo-dimension [Pol12]).

Let $\mathcal{H}$ be a set of real valued functions from input space $\mathcal{X}$. We say that $C=(x_{1},\dots,x_{n})\in\mathcal{X}^{n}$ is pseudo-shattered by $\mathcal{H}$ if there exists a vector $r=(r_{1},\dots,r_{n})\in\mathbb{R}^{n}$ (called “witness”) such that for all $b=(b_{1},\dots,b_{n})\in\{\pm 1\}^{n}$ there exists $h_{b}\in\mathcal{H}$ such that $\text{sign}(h_{b}(x_{i})-r_{i})=b_{i}$. Pseudo-dimension of $\mathcal{H}$, denoted by $\textsc{Pdim}(\mathcal{H})$, is the cardinality of the largest set pseudo-shattered by $\mathcal{H}$.
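For a finite function class and a fixed witness, the pseudo-shattering condition can be checked by brute force. The sketch below assumes the class is given as a table of values $h_k(x_i)$; the function name and this tabular representation are illustrative choices, not part of the paper.

```python
import numpy as np

def is_pseudo_shattered(values, witness):
    """Check Definition 5 for a finite class and a given witness.

    values[k, i] holds h_k(x_i); witness[i] holds r_i. Returns True when
    every sign pattern b in {-1, +1}^n is realized by some row.
    """
    values = np.asarray(values, dtype=float)
    diff = values - np.asarray(witness, dtype=float)
    # sign(h(x_i) - r_i) must be +1 or -1, so rows touching the witness are unusable.
    usable = diff[np.all(diff != 0, axis=1)]
    patterns = {tuple(row) for row in (usable > 0).tolist()}
    return len(patterns) == 2 ** values.shape[1]
```

As a usage example, threshold functions $h_c(x)=\mathbb{I}\{x\geq c\}$ evaluated at two points can never produce the pattern "above the witness at the smaller point, below it at the larger point" for a witness in $(0,1)^2$, so the check fails for the corresponding table, while any single point is pseudo-shattered.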

The following theorem connects the sample complexity of uniform learning over a class of real-valued functions to the pseudo-dimension of the class. Let $h^{*}:\mathcal{X}\rightarrow\{0,1\}$ denote the target concept. We say $\mathcal{H}$ is $(\epsilon,\delta)$-uniformly learnable with sample complexity $n$ if, for every distribution $\mathcal{D}$, given a sample $S$ of size $n$, with probability $1-\delta$, $\big\lvert\frac{1}{n}\sum_{s\in S}|h(s)-h^{*}(s)|-\mathbb{E}_{s\sim\mathcal{D}}[|h(s)-h^{*}(s)|]\big\rvert<\epsilon$ for every $h\in\mathcal{H}$. (Note that $(\epsilon,\delta)$-uniform learnability with $n$ samples implies $(\epsilon,\delta)$-PAC learnability with $n$ samples.)

Theorem A.1 ([AB99]).

Suppose $\mathcal{H}$ is a class of real-valued functions with range in $[0,H]$ and pseudo-dimension $\textsc{Pdim}(\mathcal{H})$. For every $\epsilon>0,\delta\in(0,1)$, the sample complexity of $(\epsilon,\delta)$-uniformly learning the class $\mathcal{H}$ is $O\left(\left(\frac{H}{\epsilon}\right)^2\left(\textsc{Pdim}(\mathcal{H})\ln\frac{H}{\epsilon}+\ln\frac{1}{\delta}\right)\right)$.

Appendix B Known characterization of LASSO solutions

We will review some properties of LASSO solutions from prior work that are useful in proving our results. Let $(X,y)$ with $X=[\bm{x}_1,\dots,\bm{x}_p]\in\mathbb{R}^{m\times p}$ and $y\in\mathbb{R}^m$ denote a (training) dataset consisting of $m$ labeled examples with $p$ features. As noted in Section 2, LASSO regularization may be formulated as the following optimization problem.

\[\min_{\beta\in\mathbb{R}^p}\lVert y-X\beta\rVert_2^2+\lambda_1\lVert\beta\rVert_1,\]

where $\lambda_1\in\mathbb{R}^+$ is the regularization parameter. Dealing with the case $\lambda_1=0$ (i.e. Ordinary Least Squares) is not difficult, but is omitted here to keep the statements of the definitions and results simple. We will use the following well-known facts about the solution of the LASSO optimization problem [Fuc05, Tib13]. Applying the Karush-Kuhn-Tucker (KKT) optimality conditions to the problem gives the following.

Lemma B.1 (KKT Optimality Conditions for LASSO).

$\beta^*\in\operatorname*{argmin}_{\beta\in\mathbb{R}^p}\lVert y-X\beta\rVert_2^2+\lambda_1\lVert\beta\rVert_1$ iff for all $j\in[p]$,

\[\begin{aligned}
\bm{x}_j^T(y-X\beta^*) &= \lambda_1\,\textup{sign}(\beta^*_j), && \text{if }\beta^*_j\neq 0,\\
|\bm{x}_j^T(y-X\beta^*)| &\leq \lambda_1, && \text{otherwise.}
\end{aligned}\]

Here $\bm{x}_j^T(y-X\beta^*)$ is simply the correlation of the $j$-th covariate with the residual $y-X\beta^*$ (when $y,X$ have been standardized). This motivates the definition of equicorrelation sets of covariates (Definition 2).
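The conditions of Lemma B.1 can be checked numerically on a candidate solution. The sketch below is illustrative only: it assumes the squared loss is scaled by $1/2$ (the convention of [Tib13]), under which the conditions hold with exactly $\lambda_1$ on the right-hand side, whereas with an unscaled squared loss the threshold would be $\lambda_1/2$. The coordinate-descent solver `lasso_cd` and the helper `kkt_violation` are hypothetical names introduced here, not part of the paper.

```python
import numpy as np

def soft_threshold(z, t):
    """Scalar soft-thresholding operator."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=2000):
    """Coordinate descent for 0.5*||y - X b||_2^2 + lam*||b||_1 (assumed scaling)."""
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    resid = y.copy()
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]      # remove coordinate j from the fit
            beta[j] = soft_threshold(X[:, j] @ resid, lam) / col_sq[j]
            resid -= X[:, j] * beta[j]      # add the updated coordinate back
    return beta

def kkt_violation(X, y, beta, lam):
    """Largest violation of the two conditions in Lemma B.1."""
    corr = X.T @ (y - X @ beta)             # x_j^T (y - X beta) for every j
    active = beta != 0
    v_active = np.abs(corr[active] - lam * np.sign(beta[active]))
    v_inactive = np.maximum(np.abs(corr[~active]) - lam, 0.0)
    return max(v_active.max(initial=0.0), v_inactive.max(initial=0.0))

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(30)
lam1 = 5.0
beta_hat = lasso_cd(X, y, lam1)
print("max KKT violation:", kkt_violation(X, y, beta_hat, lam1))
```

A violation near machine precision indicates that the iterate satisfies both conditions; the all-zeros vector, by contrast, fails the inequality whenever some $|\bm{x}_j^Ty|$ exceeds $\lambda_1$.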

In terms of the equicorrelation set and the equicorrelation sign vector, the characterization of the LASSO solution in Lemma B.1 implies

\[X_{\mathcal{E}}^T(y-X_{\mathcal{E}}\beta^*_{\mathcal{E}})=\lambda_1 s.\]

This implies a necessary and sufficient condition for the uniqueness of the LASSO solution, namely that $X_{\mathcal{E}}$ is full rank for all equicorrelation sets $\mathcal{E}$ [Tib13]. Our results will hold if the dataset $X$ satisfies this condition, but we will use a simpler (and possibly more natural) sufficient condition involving the general position.

Definition 6.

A matrix $X\in\mathbb{R}^{m\times p}$ is said to have its columns in the general position if the affine span of any $k\leq m$ points $(\sigma_i\bm{x}_{j_i})_{i\in[k],\{j_i\}_i=J\subseteq[p]}$, for arbitrary signs $\sigma_{[k]}\in\{-1,1\}^k$ and subset $J$ of the columns of size $k$, does not contain any element of $\{\bm{x}_i\mid i\notin J\}$.

Finally, we state the following useful characterization of the LASSO solutions in terms of the equicorrelation sets and sign vectors.

Lemma B.2 ([Tib13], Lemma 3).

If the columns of $X$ are in general position, then for any $y$ and $\lambda_1>0$, the LASSO solution is unique and is given by

\[\beta^*_{\mathcal{E}}=(X_{\mathcal{E}}^TX_{\mathcal{E}})^{-1}(X_{\mathcal{E}}^Ty-\lambda_1 s),\quad\beta^*_{[p]\setminus\mathcal{E}}=0.\]

We remark that Lemma B.2 does not give a way to compute $\beta^*$ for a given value of $\lambda_1$, since $\mathcal{E}$ and $s$ depend on $\beta^*$, but still gives a property of $\beta^*$ that is convenient to use. In particular, since we have at most $3^p$ possible choices for $(\mathcal{E},s)$, this implies that the LASSO solution $\beta^*(\lambda_1)$ is a piecewise linear function of $\lambda_1$, with at most $3^p$ pieces (for $\lambda_1>0$). Following popular terminology, we will refer to this function as a solution path of LASSO for the given dataset $(X,y)$. LARS-LASSO of [ZH05] is an efficient algorithm for computing the solution path of LASSO.
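The counting argument can be made concrete for very small $p$ by trying every support and sign pattern, evaluating the closed form of Lemma B.2 for each, and keeping the candidate that satisfies the KKT conditions. This is a brute-force sketch for intuition only, not an efficient algorithm. It assumes the $1/2$-scaled squared loss of [Tib13], for which the closed form and the conditions hold with $\lambda_1$ exactly as written, and it uses the support of the candidate in place of the equicorrelation set. The function name `lasso_by_enumeration` is hypothetical.

```python
import itertools
import numpy as np

def lasso_by_enumeration(X, y, lam, tol=1e-9):
    """Try all 3^p (support, sign) pairs and return the KKT-feasible candidate."""
    p = X.shape[1]
    for signs in itertools.product((-1, 0, 1), repeat=p):
        s_full = np.array(signs, dtype=float)
        E = np.flatnonzero(s_full)              # candidate support
        beta = np.zeros(p)
        if E.size:
            XE = X[:, E]
            # Closed form on the support: (X_E^T X_E)^{-1} (X_E^T y - lam * s)
            beta[E] = np.linalg.solve(XE.T @ XE, XE.T @ y - lam * s_full[E])
            if np.any(np.sign(beta[E]) != s_full[E]):
                continue                        # signs disagree with the assumed s
        corr = X.T @ (y - X @ beta)
        off = np.setdiff1d(np.arange(p), E)
        if np.all(np.abs(corr[off]) <= lam + tol):
            return beta, E, s_full[E]           # all KKT conditions hold
    raise RuntimeError("no KKT-feasible candidate found")

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = X @ np.array([3.0, -2.0, 0.0]) + 0.1 * rng.standard_normal(20)
beta, E, s = lasso_by_enumeration(X, y, lam=1.0)
print("support:", E, "signs:", s, "beta:", beta)
```

For a fixed pair $(\mathcal{E},s)$ the candidate is affine in $\lambda_1$, which is the source of the piecewise-linear solution path.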

Corollary B.3.

Let $X$ be a matrix with columns in the general position. If the unique LASSO solution for the dataset $(X,y)$ is given by the function $\beta^*:\mathbb{R}^+\rightarrow\mathbb{R}^p$, then $\beta^*$ is piecewise linear with at most $3^p$ pieces given by Lemma B.2.

Appendix C Lemmas and proof details for Section 3

We start with a helper lemma that characterizes the solution of the ElasticNet in terms of equicorrelation sets and sign vectors.

Lemma C.1.

Let $X$ be a matrix with columns in the general position, and $\lambda=(\lambda_1,\lambda_2)\in(0,\infty)\times(0,\infty)$. Then the ElasticNet solution $\hat{\beta}_{\lambda,f_{EN}}\in\operatorname*{argmin}_{\beta\in\mathbb{R}^p}\lVert y-X\beta\rVert_2^2+\langle\lambda,f_{\text{EN}}(\beta)\rangle$ is unique for any dataset $(X,y)$ and satisfies

\[\hat{\beta}_{\lambda,f_{EN}}=(X_{\mathcal{E}}^TX_{\mathcal{E}}+\lambda_2I_{|\mathcal{E}|})^{-1}X_{\mathcal{E}}^Ty-\lambda_1(X_{\mathcal{E}}^TX_{\mathcal{E}}+\lambda_2I_{|\mathcal{E}|})^{-1}s\]

for some $\mathcal{E}\subseteq[p]$ and $s\in\{-1,1\}^{|\mathcal{E}|}$.

Proof.

We start with the well-known characterization of the ElasticNet solution as the solution of a LASSO problem on a transformed dataset, obtained using simple algebra [ZH05]. Given any dataset $(X,y)$, the ElasticNet coefficients $\hat{\beta}_{\lambda,f_{EN}}$ are given by $\hat{\beta}_{\lambda,f_{EN}}=\frac{1}{\sqrt{1+\lambda_2}}\hat{\beta}^*_{\lambda}$ (Footnote 7: This corresponds to the "naive" ElasticNet solution in the terminology of [ZH05]. They also define an ElasticNet 'estimate' given by $\sqrt{1+\lambda_2}\hat{\beta}^*_{\lambda}$ with nice properties, to which our analysis is easily adapted.), where $\hat{\beta}^*_{\lambda}$ is the solution for a LASSO problem on a modified dataset $(X^*,y^*)$

\[\hat{\beta}^*_{\lambda}=\operatorname*{argmin}_{\beta}\lVert y^*-X^*\beta\rVert_2^2+\lambda_1^*f_1(\beta),\]

with $X^*=\frac{1}{\sqrt{1+\lambda_2}}\begin{pmatrix}X\\ \sqrt{\lambda_2}I_p\end{pmatrix}$, $y^*=\begin{pmatrix}y\\ 0\end{pmatrix}$, and $\lambda_1^*=\frac{\lambda_1}{\sqrt{1+\lambda_2}}$.
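The reduction can be sanity-checked numerically: for any $\beta$, the ElasticNet objective at $\beta$ should equal the LASSO objective on $(X^*,y^*)$ with parameter $\lambda_1^*$ at the rescaled point $\sqrt{1+\lambda_2}\,\beta$. The sketch below assumes $f_{\text{EN}}(\beta)=(\lVert\beta\rVert_1,\lVert\beta\rVert_2^2)$ and $f_1(\beta)=\lVert\beta\rVert_1$, which are defined outside this appendix; the function names are illustrative.

```python
import numpy as np

def elasticnet_objective(X, y, beta, lam1, lam2):
    # Assumes <lambda, f_EN(beta)> = lam1 * ||beta||_1 + lam2 * ||beta||_2^2.
    return np.sum((y - X @ beta) ** 2) + lam1 * np.abs(beta).sum() + lam2 * np.sum(beta ** 2)

def lasso_objective(X, y, beta, lam1):
    return np.sum((y - X @ beta) ** 2) + lam1 * np.abs(beta).sum()

def augmented_lasso(X, y, lam1, lam2):
    """Build (X*, y*, lambda_1*) from (X, y, lambda_1, lambda_2)."""
    p = X.shape[1]
    scale = 1.0 / np.sqrt(1.0 + lam2)
    X_star = scale * np.vstack([X, np.sqrt(lam2) * np.eye(p)])
    y_star = np.concatenate([y, np.zeros(p)])
    return X_star, y_star, lam1 * scale

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 4))
y = rng.standard_normal(10)
lam1, lam2 = 0.8, 0.5
X_star, y_star, lam1_star = augmented_lasso(X, y, lam1, lam2)

beta = rng.standard_normal(4)                   # arbitrary test point
lhs = elasticnet_objective(X, y, beta, lam1, lam2)
rhs = lasso_objective(X_star, y_star, np.sqrt(1.0 + lam2) * beta, lam1_star)
print(lhs, rhs)
```

Because the two objectives agree at every corresponding pair of points, their minimizers are related by the same rescaling, which matches the factor $\frac{1}{\sqrt{1+\lambda_2}}$ in the text.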

If the columns of $X$ are in general position (Definition 6), then the same is true of $X^*$. For $\mathcal{E}\subseteq[p]$, note that ${X^*_{\mathcal{E}}}^TX^*_{\mathcal{E}}=\frac{1}{1+\lambda_2}(X_{\mathcal{E}}^TX_{\mathcal{E}}+\lambda_2I_{|\mathcal{E}|})$ and ${X^*_{\mathcal{E}}}^Ty^*=\frac{1}{\sqrt{1+\lambda_2}}X_{\mathcal{E}}^Ty$.
By Lemma B.2, if $\mathcal{E}$ denotes the equicorrelation set of covariates and $s\in\{-1,1\}^{|\mathcal{E}|}$ the equicorrelation sign vector for the LASSO problem, then the ElasticNet solution is given by

\[\hat{\beta}_{\lambda,f_{EN}}=c_1-c_2\lambda_1,\]

where

\[c_1=\frac{1}{\sqrt{1+\lambda_2}}({X^*_{\mathcal{E}}}^TX^*_{\mathcal{E}})^{-1}{X^*_{\mathcal{E}}}^Ty^*=(X_{\mathcal{E}}^TX_{\mathcal{E}}+\lambda_2I_{|\mathcal{E}|})^{-1}X_{\mathcal{E}}^Ty,\]

and

\[c_2=\frac{1}{1+\lambda_2}({X^*_{\mathcal{E}}}^TX^*_{\mathcal{E}})^{-1}s=(X_{\mathcal{E}}^TX_{\mathcal{E}}+\lambda_2I_{|\mathcal{E}|})^{-1}s.\]
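The two identities for $c_1$ and $c_2$ are purely algebraic and hold for any index set and sign vector, so they can be checked on random data. In the sketch below the set `E` and the vector `s` are arbitrary placeholders chosen for illustration, not the equicorrelation set of any particular problem, and the function name is hypothetical.

```python
import numpy as np

def c1_c2_both_ways(X, y, E, s, lam2):
    """Evaluate c_1 and c_2 via the augmented data and via the ridge-style form."""
    p = X.shape[1]
    X_star = np.vstack([X, np.sqrt(lam2) * np.eye(p)]) / np.sqrt(1.0 + lam2)
    y_star = np.concatenate([y, np.zeros(p)])
    XE, XsE = X[:, E], X_star[:, E]

    gram_star_inv = np.linalg.inv(XsE.T @ XsE)
    ridge_inv = np.linalg.inv(XE.T @ XE + lam2 * np.eye(len(E)))

    c1_aug = gram_star_inv @ XsE.T @ y_star / np.sqrt(1.0 + lam2)
    c1_ridge = ridge_inv @ XE.T @ y
    c2_aug = gram_star_inv @ s / (1.0 + lam2)
    c2_ridge = ridge_inv @ s
    return c1_aug, c1_ridge, c2_aug, c2_ridge

rng = np.random.default_rng(1)
X = rng.standard_normal((12, 5))
y = rng.standard_normal(12)
E = np.array([0, 2, 3])             # placeholder index set
s = np.array([1.0, -1.0, 1.0])      # placeholder sign vector
c1_aug, c1_ridge, c2_aug, c2_ridge = c1_c2_both_ways(X, y, E, s, lam2=0.7)
print(np.abs(c1_aug - c1_ridge).max(), np.abs(c2_aug - c2_ridge).max())
```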

The following lemma helps determine the dependence of ElasticNet solutions on $\lambda_2$.

Lemma 2.1 (restated). Let $A$ be an $r\times s$ matrix. Consider the matrix $B(\lambda)=(A^TA+\lambda I_s)^{-1}$ for $\lambda>0$.

  1. Each entry of $B(\lambda)$ is a rational polynomial $P_{ij}(\lambda)/Q(\lambda)$ for $i,j\in[s]$ with each $P_{ij}$ of degree at most $s-1$, and $Q$ of degree $s$.

  2. Further, for $i=j$, $P_{ij}$ has degree $s-1$ and leading coefficient $1$, and for $i\neq j$, $P_{ij}$ has degree at most $s-2$. Also, $Q(\lambda)$ has leading coefficient $1$.

Proof.

Let $G=A^TA$ be the Gramian matrix. $G$ is symmetric and therefore diagonalizable, and the diagonalization gives the eigendecomposition $G=E\Lambda E^{-1}$. Thus we have

\[(A^TA+\lambda I_s)^{-1}=(E\Lambda E^{-1}+\lambda EE^{-1})^{-1}=E(\Lambda+\lambda I_s)^{-1}E^{-1}.\]

But $\Lambda$ is the diagonal matrix $\textsf{Diag}(\Lambda_{11},\dots,\Lambda_{ss})$, and therefore $(\Lambda+\lambda I_s)^{-1}=\textsf{Diag}((\Lambda_{11}+\lambda)^{-1},\dots,(\Lambda_{ss}+\lambda)^{-1})$. This implies the desired characterization, with $Q(\lambda)=\prod_{i\in[s]}(\Lambda_{ii}+\lambda)$ and

\[P_{ij}(\lambda)=Q(\lambda)\sum_{k=1}^{s}\frac{E_{ik}(E^{-1})_{kj}}{\Lambda_{kk}+\lambda}=\sum_{k=1}^{s}\left(E_{ik}(E^{-1})_{kj}\prod_{\ell\in[s]\setminus\{k\}}(\Lambda_{\ell\ell}+\lambda)\right),\]

with coefficient of $\lambda^{s-1}$ in $P_{ij}(\lambda)$ equal to $\sum_{k=1}^{s}E_{ik}(E^{-1})_{kj}=\mathbb{I}\{i=j\}$. ∎
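The construction in this proof translates directly into a numerical check. The sketch below builds the coefficient arrays of $Q$ and every $P_{ij}$ from an eigendecomposition and compares the resulting rational functions with a direct matrix inverse. It uses `numpy.linalg.eigh`, so $E$ is orthogonal and $E^{-1}=E^T$; the function name `ridge_inverse_polynomials` is illustrative.

```python
import numpy as np

def ridge_inverse_polynomials(A):
    """Return coefficients (highest degree first) of P_ij and Q from Lemma 2.1."""
    G = A.T @ A
    evals, E = np.linalg.eigh(G)            # G = E diag(evals) E^T
    s = G.shape[0]
    Q = np.poly(-evals)                     # prod_i (lam + evals[i]), monic, degree s
    P = np.zeros((s, s, s))                 # P[i, j] has s coefficients (degree s-1)
    for k in range(s):
        # prod over l != k of (lam + evals[l]), monic of degree s-1
        partial = np.poly(-np.delete(evals, k))
        P += np.outer(E[:, k], E[:, k])[:, :, None] * partial
    return P, Q

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))
P, Q = ridge_inverse_polynomials(A)

lam = 0.3
B_direct = np.linalg.inv(A.T @ A + lam * np.eye(3))
B_poly = np.array([[np.polyval(P[i, j], lam) for j in range(3)]
                   for i in range(3)]) / np.polyval(Q, lam)
print(np.abs(B_direct - B_poly).max())
print(np.round(P[:, :, 0], 12))             # leading coefficients of the P_ij
```

The leading-coefficient slice `P[:, :, 0]` should be close to the identity matrix, reflecting that diagonal entries have degree $s-1$ with leading coefficient $1$ while off-diagonal entries have degree at most $s-2$.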

C.1 Tuning the ElasticNet – Distributional setting

We first present some terminology from algebraic geometry which will be useful in our proofs.

Definition 7 (Semialgebraic sets, Algebraic curves.).

A semialgebraic subset of $\mathbb{R}^n$ is a finite union of sets of the form $\{x\in\mathbb{R}^n\mid p_i(x)\geq 0\text{ for each }i\in[m]\}$, where $p_1,\dots,p_m$ are polynomials. An algebraic curve is the zero set of a polynomial in two dimensions.

The result of Theorem 2.2 motivates the following results for bounding the complexity of dual piece functions and dual boundary functions, which can be used to bound the pseudo-dimension of $\mathcal{H}_{\text{EN}}$ (Theorem 3.1) using the following remarkable result from [BDD+21].

Theorem C.2 ([BDD+21]).

If the dual function class $\mathcal{H}^*$ is $(\mathcal{F},\mathcal{G},k)$-piecewise decomposable, then the pseudo-dimension of $\mathcal{H}$ may be bounded as

\[\textsc{Pdim}(\mathcal{H})=O((\textsc{Pdim}(\mathcal{F}^*)+d_{\mathcal{G}^*})\log(\textsc{Pdim}(\mathcal{F}^*)+d_{\mathcal{G}^*})+d_{\mathcal{G}^*}\log k),\]

where $d_{\mathcal{G}^*}$ denotes the VC dimension of the dual boundary function class $\mathcal{G}^*$.

We will first prove a useful lemma that bounds the number of pieces into which a finite set of algebraic curves with bounded degrees may partition $\mathbb{R}^{2}$.

Lemma C.3.

Let $\mathcal{H}$ be a collection of $k$ functions $h_{i}:\mathbb{R}^{2}\rightarrow\mathbb{R}$ that map $(x,y)\mapsto q_{i}(x,y)$, where $q_{i}$ is a bivariate polynomial of degree at most $d$, for $i\in[k]$. Then $\mathbb{R}^{2}\setminus\{(x,y)\mid q_{i}(x,y)=0\text{ for some }i\in[k]\}$ may be partitioned into at most $(kd+1)\left(d^{2}{k\choose 2}+2kd(d-1)+1\right)=O(d^{3}k^{3})$ disjoint sets such that the sign pattern $(\mathbb{I}\{q_{i}(x,y)>0\})_{i\in[k]}$ is fixed over any set in the partition.

Proof.

Assume WLOG that the curves are in general position. Simple applications of Bézout's theorem (which states that, in general, two algebraic curves of degrees $m$ and $n$ intersect in at most $mn$ points) imply that there are at most $d^{2}{k\choose 2}$ points where any pair of curves from the set $\{q_{i}(x,y)\}_{i\in[k]}$ may intersect, and at most $2kd(d-1)$ points of extrema (i.e. points $p_{0}=(x_{0},y_{0})$ on a curve $f$ such that there is an open neighborhood $N$ around $p_{0}$ in which $x_{0}\in\operatorname*{argmin}_{(x,y)\in N\cap f}x$, or $x_{0}\in\operatorname*{argmax}_{(x,y)\in N\cap f}x$, or $y_{0}\in\operatorname*{argmin}_{(x,y)\in N\cap f}y$, or $y_{0}\in\operatorname*{argmax}_{(x,y)\in N\cap f}y$) for the $k$ algebraic curves. Let $\mathcal{P}$ denote the set of these $\leq d^{2}{k\choose 2}+2kd(d-1)$ points.

Now a horizontal line $y=c$ will have the exact same set of intersections with all the curves in $\mathcal{H}$ as a line $y=c^{\prime}$, and in the same order (including multiplicities), if none of the points in $\mathcal{P}$ lie between these lines. There are thus at most $|\mathcal{P}|+1$ distinct sequences of the $k$ curves that may correspond to the intersection sequence of any horizontal line. Moreover, any such horizontal line may intersect any curve in the set at most $d$ times (since a polynomial of degree $d$ has at most $d$ zeros), and hence has at most $kd$ intersections with all the curves. Summing up over the distinct intersection sequences, we have at most $(kd+1)(|\mathcal{P}|+1)$ distinct sign patterns induced by the set of curves. ∎
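To get a feel for the size of the bound in Lemma C.3, the following minimal Python sketch evaluates $(kd+1)\left(d^{2}{k\choose 2}+2kd(d-1)+1\right)$ for a few values of $k$ and $d$ and compares it with $d^{3}k^{3}$. The function name `piece_bound` is ours and purely illustrative; it is not part of the original analysis.

```python
from math import comb


def piece_bound(k: int, d: int) -> int:
    """Upper bound from Lemma C.3 on the number of sign-pattern pieces
    induced by k algebraic curves of degree at most d in the plane."""
    # |P| <= d^2 * C(k, 2) pairwise intersections + 2kd(d-1) extremal points
    num_critical_points = d * d * comb(k, 2) + 2 * k * d * (d - 1)
    # at most kd + 1 intervals along a horizontal line, for each of the
    # at most |P| + 1 distinct intersection sequences
    return (k * d + 1) * (num_critical_points + 1)


if __name__ == "__main__":
    for k, d in [(1, 1), (2, 2), (5, 3), (10, 4)]:
        b = piece_bound(k, d)
        ratio = b / (d ** 3 * k ** 3)
        print(f"k={k:2d} d={d}  bound={b:8d}  bound / (d^3 k^3) = {ratio:.3f}")
```

For a single line ($k=d=1$) the bound evaluates to 2, matching the two open half-planes on either side of a line.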

We will now use Lemma C.3 to bound the pseudo-dimension of the relevant function classes (Theorem 2.2).

Lemma C.4.

Let $\mathcal{F}^{*}=\{f^{*}_{q_{1},q_{2}}:\mathbb{R}^{2}\rightarrow\mathbb{R}\}$ be a function class consisting of rational polynomial functions $f^{*}_{q_{1},q_{2}}:(\lambda_{1},\lambda_{2})\mapsto\frac{q_{1}(\lambda_{1},\lambda_{2})}{q_{2}(\lambda_{1},\lambda_{2})}$, where $q_{1},q_{2}$ have degrees at most $d$. Then $\textsc{Pdim}(\mathcal{F}^{*})=O(\log d)$.

Proof.

Suppose that $\textsc{Pdim}(\mathcal{F}^{*})=N$. Then there exist functions $f^{*}_{1},\dots,f^{*}_{N}\in\mathcal{F}^{*}$ and real-valued witnesses $(r_{1},\dots,r_{N})\in\mathbb{R}^{N}$ such that for every subset $T\subseteq[N]$, there exists a parameter setting $\lambda_{T}=(\lambda_{1},\lambda_{2})\in\mathbb{R}^{2}$ such that $f^{*}_{i}(\lambda_{T})\geq r_{i}$ if and only if $i\in T$. In other words, we have a set of $2^{N}$ parameters (indexed by $T$) that induce all possible labelings of the binary vector $(\mathbb{I}\{f^{*}_{i}(\lambda_{T})\geq r_{i}\})_{i\in[N]}$.

But $f^{*}_{i}(\lambda)\geq r_{i}$ are semi-algebraic sets bounded by $N$ algebraic curves of degree at most $d$. By Lemma C.3, there are at most $O(d^{3}N^{3})$ different sign-patterns induced by $N$ algebraic curves over all possible values of $\lambda\in\mathbb{R}^{2}$. In particular, the number of distinct sign patterns over $\lambda\in\{\lambda_{T}\}_{T\subseteq[N]}$ is also $O(d^{3}N^{3})$. Thus, we conclude $2^{N}=O(d^{3}N^{3})$, or $N=O(\log d)$. ∎
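The final step, from $2^{N}=O(d^{3}N^{3})$ to $N=O(\log d)$, can be checked numerically. The sketch below is an illustration under our own assumptions: it replaces the $O(\cdot)$ expression with the explicit bound of Lemma C.3 at $k=N$, and the helper names `piece_bound` and `max_shatterable` are hypothetical. It finds the largest $N$ (up to a cap) for which $2^{N}$ does not exceed that bound, and prints it next to $\log_{2}d$.

```python
from math import comb, log2


def piece_bound(k: int, d: int) -> int:
    """Explicit Lemma C.3 bound for k curves of degree at most d."""
    return (k * d + 1) * (d * d * comb(k, 2) + 2 * k * d * (d - 1) + 1)


def max_shatterable(d: int, n_cap: int = 256) -> int:
    """Largest N <= n_cap with 2^N <= piece_bound(N, d).

    N points can only be shattered if all 2^N labelings fit among the
    available sign patterns, so this value caps the pseudo-dimension
    (for degrees d small enough that the cap n_cap is not reached).
    """
    return max(n for n in range(1, n_cap + 1) if 2 ** n <= piece_bound(n, d))


if __name__ == "__main__":
    for d in (1, 2, 16, 1024):
        print(f"d={d:5d}  log2(d)={log2(d):5.1f}  largest feasible N={max_shatterable(d)}")
```

The cap on $N$ grows only additively when $d$ is multiplied by a constant factor, which is the logarithmic dependence claimed in the lemma.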

Lemma C.5.

Let $R[x_{1},x_{2},\dots,x_{d}]_{D}$ denote the set of all real polynomials in $d$ variables of degree at most $D$ in $x_{1}$, and degree at most 1 in $x_{2},\dots,x_{d}$. Further, let $P_{d,D}=\{\{x\in\mathbb{R}^{d}:p(x)\geq 0\}\mid p\in R[x_{1},x_{2},\dots,x_{d}]_{D}\}$. The VC-dimension of the set system $(\mathbb{R}^{d},P_{d,D})$ is $O(dD)$.

Proof.

We will employ a standard linearization argument [Cov65] that reduces the problem to bounding the VC dimension of halfspaces in higher dimensions. Let $M$ be the set of all possible non-constant monomials of degree at most $D$ in $x_{1}$, and at most one in $x_{2},\dots,x_{d}$. For example, when $d=3$ and $D=2$, we have $M=\{x_{1},x_{2},x_{3},x_{1}x_{2},x_{1}x_{3},x_{1}^{2},x_{1}^{2}x_{2},x_{1}^{2}x_{3}\}$. Note that $|M|=(D+1)d-1$. Indeed for each $x_{1}^{i}$ for $i=0,\dots,D$ we obtain a monomial by multiplying with each of $\{1,x_{2},\dots,x_{d}\}$. Excluding the constant monomial gives the result. The linearization we use is a map $\phi:\mathbb{R}^{d}\rightarrow\mathbb{R}^{|M|}$ which indexes the coordinates by monomials in $M$. For example when $d=3$ and $D=2$, $\phi(x_{1},x_{2},x_{3})=(x_{1},x_{2},x_{3},x_{1}x_{2},x_{1}x_{3},x_{1}^{2},x_{1}^{2}x_{2},x_{1}^{2}x_{3})$.

Now, if $S\subseteq\mathbb{R}^{d}$ is shattered by $P_{d,D}$, then $\phi(S)$ is shattered by half-spaces in $\mathbb{R}^{|M|}$. Indeed, suppose $p=p_{0}+\langle\mathbf{p},\phi(x_{1},\dots,x_{d})\rangle\in R[x_{1},x_{2},\dots,x_{d}]_{D}$ (for $\mathbf{p}\in\mathbb{R}^{|M|}$) is a polynomial that is positive over some $T\subseteq S$ and negative over $S\setminus T$. Define the halfspace $h_{p}\subseteq\mathbb{R}^{|M|}$ as $\{y\in\mathbb{R}^{|M|}\mid p_{0}+\langle\mathbf{p},y\rangle\geq 0\}$. Clearly $h_{p}\cap\phi(S)=\phi(T)$, and in general $\phi(S)$ is shattered by halfspaces in $\mathbb{R}^{|M|}$. Using the well-known result for the VC-dimension of halfspaces, we have that the VC-dimension of $P_{d,D}$ over $\mathbb{R}^{d}$ is at most $|M|+1=(D+1)d=O(dD)$. ∎
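The linearization map used in the proof of Lemma C.5 can be sketched in a few lines of Python. The encoding of monomials as pairs `(i, j)` and the helper names `monomials`, `phi` and `affine_value` are our own illustrative choices, and the coordinate order differs from the example in the proof (the order is immaterial to the argument). The sketch checks that $|M|=(D+1)d-1$ and that a polynomial in $R[x_{1},\dots,x_{d}]_{D}$ becomes an affine function of $\phi(x)$.

```python
def monomials(d: int, D: int):
    """Non-constant monomials x1^i * z with 0 <= i <= D and z in {1, x2, ..., xd}.

    A monomial is encoded as (i, j): i is the power of x1, and j is the index
    of the extra linear factor (j = 0: no extra factor, j = 2..d: factor x_j).
    """
    factors = [0] + list(range(2, d + 1))
    return [(i, j) for i in range(D + 1) for j in factors if (i, j) != (0, 0)]


def phi(x, D: int):
    """Linearization map: evaluate every monomial in M at the point x."""
    return [x[0] ** i * (1.0 if j == 0 else x[j - 1]) for (i, j) in monomials(len(x), D)]


def affine_value(p0: float, coeffs: dict, x, D: int) -> float:
    """p0 + <p, phi(x)>, with p given as a sparse dict {monomial: coefficient}."""
    M = monomials(len(x), D)
    return p0 + sum(coeffs.get(mono, 0.0) * v for mono, v in zip(M, phi(x, D)))


if __name__ == "__main__":
    d, D = 3, 2
    print("|M| =", len(monomials(d, D)), " (D+1)d-1 =", (D + 1) * d - 1)
    # p(x) = x1^2 x2 - 3 x1 x3 + x2 - 0.5, written in the monomial basis
    coeffs = {(2, 2): 1.0, (1, 3): -3.0, (0, 2): 1.0}
    x = (1.0, 2.0, 3.0)
    direct = x[0] ** 2 * x[1] - 3 * x[0] * x[2] + x[1] - 0.5
    print("direct:", direct, " via phi:", affine_value(-0.5, coeffs, x, D))
```

Since the sign of $p(x)$ equals the sign of the affine function at $\phi(x)$, membership of $x$ in $\{p\geq 0\}$ is the same as membership of $\phi(x)$ in the corresponding halfspace.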

Theorem 3.1 (restated). $\textsc{Pdim}(\mathcal{H}_{\text{EN}})=O(p^{2})$. Further, $\textsc{Pdim}(\mathcal{H}_{\text{EN}}^{\textsf{AIC}})=O(p^{2})$ and $\textsc{Pdim}(\mathcal{H}_{\text{EN}}^{\textsf{BIC}})=O(p^{2})$.

Proof.

By Theorem 2.2, the dual class $\mathcal{H}_{\text{EN}}^{*}$ of $\mathcal{H}_{\text{EN}}$ is $(\mathcal{F},\mathcal{G},p3^{p})$-piecewise decomposable, with $\mathcal{F}=\{f_{q_{1},q_{2}}:\mathcal{L}\rightarrow\mathbb{R}\}$ consisting of rational polynomial functions $f_{q_{1},q_{2}}:l_{\lambda}\mapsto\frac{q_{1}(\lambda_{1},\lambda_{2})}{q_{2}(\lambda_{2})}$, where $q_{1},q_{2}$ have degrees at most $2p$, and $\mathcal{G}=\{g_{r}:\mathcal{L}\rightarrow\{0,1\}\}$ consisting of semi-algebraic sets bounded by algebraic curves $g_{r}:u_{\lambda}\mapsto\mathbb{I}\{r(\lambda_{1},\lambda_{2})<0\}$, where $r$ is a polynomial of degree 1 in $\lambda_{1}$ and at most $p$ in $\lambda_{2}$.

Now by Lemma C.4, we have $\textsc{Pdim}(\mathcal{F}^{*})=O(\log p)$, and by Lemma C.5 the VC dimension of the dual boundary class is $d_{\mathcal{G}^{*}}=O(p)$. A straightforward application of Theorem C.2 yields

$$\textsc{Pdim}(\mathcal{H})=O(p\log p+p\log(p3^{p}))=O(p^{2}).$$

The dual classes $(\mathcal{H}_{\text{EN}}^{\textsf{AIC}})^{*}$ and $(\mathcal{H}_{\text{EN}}^{\textsf{BIC}})^{*}$ also follow the same piecewise decomposable structure given by Theorem 2.2. This is because in each piece the equicorrelation set $\mathcal{E}$, and therefore $\|\beta\|_{0}=|\mathcal{E}|$ (by Lemma B.2), is fixed. Thus we can keep the same boundary functions, and the function value in each piece only changes by a constant (in $\lambda$) and is therefore also a rational function with the same degrees. The above argument then implies an identical upper bound on the pseudo-dimensions. ∎
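For completeness, the arithmetic behind the application of Theorem C.2 in the proof of Theorem 3.1 can be written out by substituting $\textsc{Pdim}(\mathcal{F}^{*})=O(\log p)$, $d_{\mathcal{G}^{*}}=O(p)$ and $k=p3^{p}$. The expansion below is our own spelling-out of that step, not a quotation from the original proof.

```latex
\begin{align*}
\textsc{Pdim}(\mathcal{H}_{\text{EN}})
  &= O\big((\log p + p)\log(\log p + p) + p\log(p\,3^{p})\big) \\
  &= O\big(p\log p + p\,(\log p + p\log 3)\big) \\
  &= O\big(p\log p + p^{2}\big) \;=\; O(p^{2}).
\end{align*}
```

The dominant term comes from $d_{\mathcal{G}^{*}}\log k$: the number of pieces $k=p3^{p}$ is exponential in $p$, so its logarithm is linear in $p$.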

The following lemma shows that under mild boundedness assumptions on the data and the search space of hyperparameters, the validation loss function class $\mathcal{H}_{\text{EN}}$ is uniformly bounded by some constant $H>0$.

Lemma C.6.

Under Assumption 1, there exists a constant $H>0$ so that for all $h(\lambda,\cdot)\in\mathcal{H}_{\text{EN}}=\{h(\lambda,\cdot):\Pi_{m,p}\rightarrow\mathbb{R}_{\geq 0}\mid\lambda\in[\lambda_{\min},\lambda_{\max}]^{2}\}$, we have $\|h(\lambda,\cdot)\|_{\infty}=\sup_{P\in\Pi_{m,p}}h(\lambda,P)\leq H$.

Proof.

For any problem instance $P=(X,y,X_{\text{val}},y_{\text{val}})\in\Pi_{m,p}$, and for any $\lambda=(\lambda_{1},\lambda_{2})\in[\lambda_{\min},\lambda_{\max}]^{2}$, consider the optimization problem for training set $(X,y)$

$$\operatorname*{argmin}_{\beta}F(\beta),\tag{2}$$

where $F(\beta)=\frac{1}{2m}\|y-X\beta\|_{2}^{2}+\lambda_{1}\|\beta\|_{1}+\lambda_{2}\|\beta\|_{2}^{2}$. If we set $\beta=\vec{0}$, we have

$$F(\vec{0})=\frac{1}{2m}\|y\|_{2}^{2}\leq\frac{1}{2}R^{2},$$

for some absolute constant $R$, due to Assumption 1. Letting $\hat{\beta}_{(X,y)}(\lambda)$ be the optimal solution of (2), we have

$$\frac{1}{2}R^{2}\geq F(\hat{\beta}_{(X,y)}(\lambda))\geq\lambda_{1}\|\hat{\beta}_{(X,y)}(\lambda)\|_{1}+\lambda_{2}\|\hat{\beta}_{(X,y)}(\lambda)\|_{2}^{2}.$$

Therefore, for any problem instance $P$, the solution of the training optimization problem $\hat{\beta}_{(X,y)}(\lambda)$ has bounded norm, i.e. $\|\hat{\beta}_{(X,y)}(\lambda)\|_{1},\|\hat{\beta}_{(X,y)}(\lambda)\|_{2}^{2}\leq\frac{R^{2}}{2\lambda_{\min}}$, which implies

$$h(\lambda,P)=\frac{1}{2m}\|y_{\text{val}}-\hat{\beta}_{(X,y)}(\lambda)X_{\text{val}}\|_{2}^{2}\leq\frac{1}{2m}\|y_{\text{val}}\|_{2}^{2}+\frac{1}{2m}\|\hat{\beta}_{(X,y)}(\lambda)X_{\text{val}}\|_{2}^{2}\leq H,$$

for some constant $H$ (depending only on $R$ and $\lambda_{\min}$). ∎

C.2 Tuning the ElasticNet – Online learning

At a high level, the plan is to show dispersion (Definition 4) using the general recipe developed in [BDP20], which may be summarized as follows.

  • S1.

    Bound the probability density of the random set of discontinuities of the loss functions. Intuitively, this corresponds to computing the average number of loss functions that may be discontinuous along a path connecting any two points within distance $\epsilon$ in the domain.

  • S2.

    Use a VC-dimension based uniform convergence argument to transform this into a bound on the dispersion of the loss functions.

Formally, we have the following theorems from [BDP20], which show how to use this technique when the discontinuities are roots of a random polynomial with bounded coefficients. The theorems implement steps S1 and S2 of the above recipe respectively.

Theorem C.7 ([BDP20]).

Consider a random degree $d$ polynomial $\phi$ with leading coefficient 1 and subsequent coefficients which are real of absolute value at most $R$, whose joint density is at most $\kappa$. There is an absolute constant $K_0$ depending only on $d$ and $R$ such that every interval $I$ of length $\leq\epsilon$ satisfies $\Pr(\phi\text{ has a root in }I)\leq\kappa\epsilon/K_0$.

Theorem C.8 ([BDP20]).

Let $l_1,\dots,l_T:\mathbb{R}\rightarrow\mathbb{R}$ be independent piecewise $L$-Lipschitz functions, each having at most $K$ discontinuities. Let $D(T,\epsilon,\rho)=|\{1\leq t\leq T\mid l_t\text{ is not }L\text{-Lipschitz on }[\rho-\epsilon,\rho+\epsilon]\}|$ be the number of functions that are not $L$-Lipschitz on the ball $[\rho-\epsilon,\rho+\epsilon]$. Then we have $E[\max_{\rho\in\mathbb{R}}D(T,\epsilon,\rho)]\leq\max_{\rho\in\mathbb{R}}E[D(T,\epsilon,\rho)]+O(\sqrt{T\log(TK)})$.
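To make the quantity $D(T,\epsilon,\rho)$ concrete, the following Python sketch computes $\max_{\rho}D(T,\epsilon,\rho)$ for a given collection of discontinuity locations. It is an illustration only and not part of [BDP20]: the function name `max_violations` and the toy inputs are hypothetical, and it assumes for simplicity that a piecewise Lipschitz function fails to be $L$-Lipschitz on a ball exactly when one of its discontinuities lies in that ball.

```python
def max_violations(discontinuities, eps):
    """Max over rho of the number of functions with a discontinuity in [rho - eps, rho + eps].

    discontinuities: one list of discontinuity locations per function l_t.
    """
    events = []
    for points in discontinuities:
        # For one function, the set of "bad" centres rho is a union of intervals
        # [d - eps, d + eps]; merge overlaps so each function is counted once.
        merged = []
        for d in sorted(points):
            lo, hi = d - eps, d + eps
            if merged and lo <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], hi)
            else:
                merged.append([lo, hi])
        for lo, hi in merged:
            events.append((lo, 0))  # 0 sorts before 1: intervals open before they close at ties
            events.append((hi, 1))
    events.sort()
    best = current = 0
    for _, kind in events:
        if kind == 0:
            current += 1
            best = max(best, current)
        else:
            current -= 1
    return best


# Toy example (assumed data): four functions, the last with two discontinuities.
toy = [[0.10], [0.12], [0.50], [0.11, 0.13]]
wide = max_violations(toy, eps=0.02)
narrow = max_violations(toy, eps=0.001)
```

Shrinking `eps` reduces the number of functions that can be simultaneously non-Lipschitz near a single point, which is the behaviour that dispersion quantifies.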

The following lemma provides a useful extension of Lemma 2.1 for our online learning results.

Lemma C.9.

Let $A$ be an $r\times s$ matrix with $R$-bounded max-norm, i.e. $\|A\|_{\infty,\infty}=\max_{i,j}|A_{ij}|\leq R$. Then each entry of the matrix $(A^TA+\lambda I_s)^{-1}$ is a rational polynomial $P_{ij}(\lambda)/Q(\lambda)$ for $i,j\in[s]$ with each $P_{ij}$ of degree at most $s-1$, $Q$ of degree $s$, and all the coefficients have absolute value at most $r^s(Rs)^{2s}$.

Proof.

Let $G=A^TA$ be the Gram matrix. Then

\[ |G_{ij}|=\Big|\sum_k A_{ki}A_{kj}\Big|\leq\sum_k|A_{ki}A_{kj}|\leq rR^2, \]

by the triangle inequality and using $\max_{i,j}|A_{ij}|\leq R$. The determinant $\det(A^TA+\lambda I_s)$ is a sum of $s!\leq s^s$ signed terms, each a product of $s$ elements of the form $G_{ij}$ or $G_{ii}+\lambda$. Thus, in each of the $s!$ terms, the coefficient of $\lambda^k$ is a sum of at most $\binom{s}{s-k}\leq s^k\leq s^s$ expressions of the form $\prod_{(i,j)\in S}G_{ij}$ with $|S|\leq s-k$. Now $|\prod_{(i,j)\in S}G_{ij}|\leq(rR^2)^{|S|}\leq(rR^2)^s$, and by the triangle inequality the coefficient of $\lambda^k$ is upper bounded by $(rR^2)^s\cdot s^k\cdot s^s$ for any $k$. This establishes the bound on the coefficients of $Q(\lambda)$. A similar argument implies the upper bound for each $P_{ij}(\lambda)$. ∎
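As a numerical sanity check of Lemma C.9, the Python sketch below (using NumPy, and not part of the original analysis) computes the coefficients of $Q(\lambda)=\det(A^TA+\lambda I_s)$ as the characteristic polynomial of $-A^TA$ and compares them with the bound $r^s(Rs)^{2s}$. The helper name `resolvent_denominator`, the matrix sizes, and the random test matrix are illustrative assumptions.

```python
import numpy as np


def resolvent_denominator(A):
    """Coefficients (highest degree first) of Q(lam) = det(A^T A + lam * I)."""
    G = A.T @ A  # Gram matrix, s x s
    # np.poly(M) returns the characteristic polynomial det(x I - M);
    # with M = -G this is det(x I + G) = Q(x).
    return np.poly(-G)


rng = np.random.default_rng(0)
r, s, R = 5, 3, 2.0                    # illustrative sizes (assumption)
A = rng.uniform(-R, R, size=(r, s))    # entries bounded by R in absolute value
Q = resolvent_denominator(A)
bound = r**s * (R * s) ** (2 * s)      # coefficient bound stated in Lemma C.9
gram_bound = r * R**2                  # entrywise bound on G used in the proof
```

The leading coefficient of `Q` is 1 and the remaining coefficients are sums of products of Gram entries, which is why the entrywise bound `gram_bound` controls them.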

We will also need the following result, which is a simple extension of Lemma 24 from [BS21].

Lemma C.10.

Suppose $X$ and $Y$ are real-valued random variables taking values in $[m,m+M]$ for some $m,M\in\mathbb{R}^+$ and suppose that their joint distribution is $\kappa$-bounded. Let $c$ be an absolute constant. Then,

  • (i)

    $Z=X+Y$ is drawn from a $K_1\kappa$-bounded distribution, where $K_1\leq M$.

  • (ii)

    $Z=XY$ is drawn from a $K_2\kappa$-bounded distribution, where $K_2\leq M/m$.

  • (iii)

    $Z=X-Y$ is drawn from a $K_1\kappa$-bounded distribution, where $K_1\leq M$.

  • (iv)

    $Z=X+c$ has a $\kappa$-bounded distribution, and $Z=cX$ has a $\frac{\kappa}{|c|}$-bounded distribution.

Proof.

Let $f_{X,Y}(x,y)$ denote the joint density of $X,Y$. Parts (i) and (ii) are immediate from Lemma 24 of [BS21], and (iii) is a simple extension. Indeed, the cumulative distribution function of $Z=X-Y$ is given by

\[ \begin{aligned} F_Z(z) &= \Pr(Z\leq z)=\Pr(X-Y\leq z)=\Pr(X\leq z+Y)\\ &= \int_m^{m+M}\int_m^{z+y}f_{X,Y}(x,y)\,dx\,dy. \end{aligned} \]

The density function for Z𝑍Zitalic_Z can be obtained using Leibniz’s rule as

\[ \begin{aligned} f_Z(z)=\frac{d}{dz}F_Z(z) &= \frac{d}{dz}\int_m^{m+M}\int_m^{z+y}f_{X,Y}(x,y)\,dx\,dy\\ &= \int_m^{m+M}\left(\frac{d}{dz}\int_m^{y}f_{X,Y}(x,y)\,dx+\frac{d}{dz}\int_0^{z}f_{X,Y}(t+y,y)\,dt\right)dy\\ &= \int_m^{m+M}f_{X,Y}(z+y,y)\,dy\\ &\leq \int_m^{m+M}\kappa\,dy\\ &= M\kappa. \end{aligned} \]

Finally, (iv) follows from simple change of variable manipulations (e.g. Theorem 22 of [BDP20]). ∎
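Part (iii) can be checked numerically on a simple example. The Python sketch below approximates $f_Z(z)=\int_m^{m+M}f_{X,Y}(z+y,y)\,dy$ with a midpoint rule and compares it with the bound $M\kappa$. The choice of independent uniform variables (so that $\kappa=1/M^2$), the grid size, and the names `density_of_difference` and `uniform_joint` are assumptions made for illustration.

```python
import numpy as np


def density_of_difference(z, joint_density, m, M, n=2000):
    """Midpoint-rule approximation of f_Z(z) = int_m^{m+M} f_{X,Y}(z + y, y) dy."""
    h = M / n
    y = m + (np.arange(n) + 0.5) * h   # midpoints of the y-grid on [m, m + M]
    return float(np.sum(joint_density(z + y, y)) * h)


m, M = 1.0, 2.0                        # illustrative support [m, m + M] (assumption)
kappa = 1.0 / M**2                     # density bound of the uniform joint law


def uniform_joint(x, y):
    """Joint density of two independent Uniform[m, m + M] variables (assumed example)."""
    inside = (x >= m) & (x <= m + M) & (y >= m) & (y <= m + M)
    return np.where(inside, kappa, 0.0)


peak = density_of_difference(0.0, uniform_joint, m, M)  # density of Z = X - Y at z = 0
```

In this example the density of $Z$ is triangular on $[-M,M]$ and attains the value $M\kappa$ at $z=0$, so the bound in (iii) cannot be improved in general.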

Theorem 3.3 (restated). Assume that the predicted variable and all feature values are bounded by an absolute constant $R$, i.e. $\max\{\|X^{(i)}\|_{\infty,\infty},\|y^{(i)}\|_{\infty},\|X_{\text{val}}^{(i)}\|_{\infty,\infty},\|y_{\text{val}}^{(i)}\|_{\infty}\}\leq R$. Suppose the predicted variables $y^{(i)}$ in the training set are drawn from a joint $\kappa$-bounded distribution. Let $l_1,\dots,l_T:(0,\lambda_{\max})^2\rightarrow\mathbb{R}_{\geq 0}$ denote an independent sequence of losses (e.g. fresh randomness is used to generate the validation set features in each round) as a function of the ElasticNet regularization parameter $\lambda=(\lambda_1,\lambda_2)$, $l_i(\lambda)=l_r(\hat{\beta}^{(X^{(i)},y^{(i)})}_{\lambda,f_{EN}},(X_{\text{val}}^{(i)},y_{\text{val}}^{(i)}))$. The sequence of functions is $\frac{1}{2}$-dispersed, and there is an online algorithm with $\tilde{O}(\sqrt{T})$ expected regret. The result also holds for loss functions adjusted by information criteria AIC and BIC.

Proof.

We start with the piecewise-decomposable characterization of the dual class function in Theorem 2.2. On any fixed problem instance $P\in\Pi_{m,n}$, as the parameter $\lambda$ is varied in the loss function $\ell_{\text{EN}}(\cdot,P)$ of ElasticNet trained with regularization parameter $\lambda=(\lambda_1,\lambda_2)$, we have the following piecewise structure. There are $k=p3^p$ boundary functions $g_1,\dots,g_k$ for which the transition boundaries are algebraic curves $r_i(\lambda_1,\lambda_2)$, where $r_i$ is a polynomial with degree 1 in $\lambda_1$ and at most $p$ in $\lambda_2$. Also, the piece function $f_{\mathbf{b}}$ for each sign pattern $\mathbf{b}\in\{0,1\}^k$ is a rational polynomial function $\frac{q_1^{\mathbf{b}}(\lambda_1,\lambda_2)}{q_2^{\mathbf{b}}(\lambda_2)}$, where $q_1^{\mathbf{b}},q_2^{\mathbf{b}}$ have degrees at most $2p$, and corresponds to a fixed signed equicorrelation set $(\mathcal{E},s)$. To show online learnability, we will examine this piecewise structure more closely, and in particular analyse how the structure varies when the predicted variable is drawn from a smooth distribution.

In order to show dispersion for the loss functions $\{l_i(\lambda)\}$, we will use the recipe of [BDP20] and bound the worst rate of discontinuities between any pair of points $\lambda=(\lambda_1,\lambda_2)$ and $\lambda'=(\lambda'_1,\lambda'_2)$ with $\|\lambda-\lambda'\|_2\leq\epsilon$ along the axis-aligned path $\lambda\rightarrow(\lambda'_1,\lambda_2)\rightarrow\lambda'$. First observe that the only possible points at which $l_i(\lambda)$ may be discontinuous are

  • (a)

    $(\lambda_1,\lambda_2)$ such that $r_i(\lambda_1,\lambda_2)=0$ corresponding to some boundary function $g_i$.

  • (b)

    $(\lambda_1,\lambda_2)$ such that $q_2^{\mathbf{b}}(\lambda_2)=0$ corresponding to some piece function $f_{\mathbf{b}}$.

Fortunately, discontinuities of type (b) do not occur for $\lambda_2>0$. From the ElasticNet characterization in Lemma C.1, and using Lemma 2.1, we know that $q_2(\lambda_2)=\prod_{j\in[|\mathcal{E}|]}(\Lambda_j+\lambda_2)$, where $(\Lambda_j)_{j\in[|\mathcal{E}|]}$ are the non-negative eigenvalues of the positive semi-definite matrix ${X^{(i)}_{\mathcal{E}}}^TX^{(i)}_{\mathcal{E}}$. It follows that $q_2^{\mathbf{b}}$ does not have positive zeros (for any sign vector $\mathbf{b}$).
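The absence of positive zeros of $q_2$ can be seen numerically. The Python sketch below evaluates $q_2(\lambda_2)=\prod_j(\Lambda_j+\lambda_2)$ from the eigenvalues of the Gram matrix and compares it with $\det(X_{\mathcal{E}}^TX_{\mathcal{E}}+\lambda_2I)$. The random matrix standing in for $X_{\mathcal{E}}$ and the helper name `q2_from_eigenvalues` are assumptions for illustration.

```python
import numpy as np


def q2_from_eigenvalues(X_E, lam2):
    """Evaluate q2(lam2) = prod_j (Lambda_j + lam2) for the Gram matrix of X_E."""
    eigs = np.linalg.eigvalsh(X_E.T @ X_E)  # real eigenvalues of a symmetric PSD matrix
    return float(np.prod(eigs + lam2)), eigs


rng = np.random.default_rng(1)
X_E = rng.normal(size=(6, 3))               # stand-in for the equicorrelation columns (assumption)
value, eigs = q2_from_eigenvalues(X_E, lam2=0.5)
```

Since every factor $\Lambda_j+\lambda_2$ is strictly positive when $\lambda_2>0$, the zeros of $q_2$ lie at $-\Lambda_j\leq 0$, outside the parameter domain.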

Therefore it suffices to locate boundaries of type (a). To this end, we have two subtypes corresponding to a variable entering or leaving the equicorrelation set.

Addition of $j\notin\mathcal{E}$. As observed in the proof of Theorem 2.2, a variate $j\notin\mathcal{E}$ can enter the equicorrelation set $\mathcal{E}$ only for $(\lambda_1,\lambda_2)$ satisfying

\[ \lambda_1=\frac{\bm{x}_j^TX_{\mathcal{E}}({X_{\mathcal{E}}}^TX_{\mathcal{E}}+\lambda_2I_{|\mathcal{E}|})^{-1}{X_{\mathcal{E}}}^Ty-\bm{x}_j^Ty}{\bm{x}_j^TX_{\mathcal{E}}({X_{\mathcal{E}}}^TX_{\mathcal{E}}+\lambda_2I_{|\mathcal{E}|})^{-1}s\pm 1}. \]

For fixed $\lambda_2$, the distribution of $\lambda_1$ at which the discontinuity occurs for insertion of $j$ is $K_1\kappa$-bounded (by Lemma C.10) for some constant $K_1$ that only depends on $R,m,p$ and $\lambda_{\max}$. This implies an upper bound of $K_1\kappa\epsilon$ on the expected number of discontinuities corresponding to $j$ along the segment $\lambda\rightarrow(\lambda'_1,\lambda_2)$ for any $j,\mathcal{E}$.
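For a fixed $\lambda_2$, the entry condition above can be evaluated directly. The Python sketch below computes the candidate $\lambda_1$ for a given $(\mathcal{E},s,j)$ and checks that the correlation of $\bm{x}_j$ with the residual then has magnitude $\lambda_1$. The function name `entry_lambda1`, the random data, and the closed form $\hat{\beta}_{\mathcal{E}}=(X_{\mathcal{E}}^TX_{\mathcal{E}}+\lambda_2I)^{-1}(X_{\mathcal{E}}^Ty-\lambda_1s)$ used for the check are assumptions, chosen to be consistent with the boundary equations in this proof.

```python
import numpy as np


def entry_lambda1(X, y, E, s, j, lam2, sign):
    """Candidate lambda_1 at which variable j (not in E) can enter, for fixed lam2.

    E: list of active column indices; s: their signs (+1/-1);
    sign: +1 or -1, selecting the branch of the '+/- 1' term in the denominator.
    """
    X_E = X[:, E]
    M_inv = np.linalg.inv(X_E.T @ X_E + lam2 * np.eye(len(E)))
    x_j = X[:, j]
    num = x_j @ X_E @ M_inv @ X_E.T @ y - x_j @ y
    den = x_j @ X_E @ M_inv @ s + sign
    return num / den


rng = np.random.default_rng(2)
X = rng.normal(size=(8, 4))           # illustrative data (assumption)
y = rng.normal(size=8)
E, s, j, lam2 = [0, 1], np.array([1.0, -1.0]), 3, 0.3
lam1 = entry_lambda1(X, y, E, s, j, lam2, sign=+1)

# Assumed closed form on the active set (hypothetical, for the check only).
X_E = X[:, E]
beta_E = np.linalg.solve(X_E.T @ X_E + lam2 * np.eye(len(E)), X_E.T @ y - lam1 * s)
corr_j = X[:, j] @ (y - X_E @ beta_E)  # correlation of x_j with the residual
```

For an arbitrary triple $(\mathcal{E},s,j)$ that does not arise from an actual ElasticNet solution, the returned value need not be positive or lie in $(0,\lambda_{\max})$; the sketch only illustrates the algebra behind the boundary curve.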

For constant $\lambda_1$, we can use Lemma 2.1 and a standard change of variable argument (e.g. Theorem 22 of [BDP20]) to conclude that the discontinuities lie at the roots of a random polynomial in $\lambda_2$ of degree $|\mathcal{E}|$, leading coefficient $1$, and bounded random coefficients with $K_2\kappa$-bounded density for some constant $K_2$ (that only depends on $R,m,p$ and $\lambda_{\max}$). By Theorem C.7, the expected number of discontinuities along the segment $(\lambda'_1,\lambda_2)\rightarrow\lambda'$ is upper bounded by $K_2K_p\kappa\epsilon$ ($K_p$ only depends on $p$). This implies that the expected number of Lipschitz violations between $\lambda$ and $\lambda'$ along the axis-aligned path is $\tilde{O}(\kappa\epsilon)$ and completes the first step of the recipe in this case (the $\tilde{O}$ notation suppresses terms in $R,m,p$ and $\lambda_{\max}$ as constants).

Removal of $j'\in\mathcal{E}$. The second case, when a variate $j'\in\mathcal{E}$ leaves the equicorrelation set $\mathcal{E}$ for $(\lambda_1,\lambda_2)$ satisfying

\[ \lambda_1(({X_{\mathcal{E}}}^TX_{\mathcal{E}}+\lambda_2I_{|\mathcal{E}|})^{-1}s)_{j'}=(({X_{\mathcal{E}}}^TX_{\mathcal{E}}+\lambda_2I_{|\mathcal{E}|})^{-1}{X_{\mathcal{E}}}^Ty)_{j'}, \]

also yields the same bound using the above arguments. Putting together, and noting that we have at most p3p𝑝superscript3𝑝p3^{p}italic_p 3 start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT distinct curves each with O~(κϵ)~𝑂𝜅italic-ϵ\tilde{O}(\kappa\epsilon)over~ start_ARG italic_O end_ARG ( italic_κ italic_ϵ ) expected number of intersections with the axis aligned path λλ𝜆superscript𝜆\lambda\rightarrow\lambda^{\prime}italic_λ → italic_λ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT, the total expected number of discontinuities is also O~(κϵ)~𝑂𝜅italic-ϵ\tilde{O}(\kappa\epsilon)over~ start_ARG italic_O end_ARG ( italic_κ italic_ϵ ). This completes the first step (S1) of the above recipe.

We use Theorem 9 of [BDP20] to complete the second step of the recipe, which employs a VC-dimension argument for $K'$ algebraic curves of bounded degree (here the degree is at most $p+1$) to conclude that the expected worst-case number of discontinuities along any axis-aligned path between any pair of points at most $\epsilon$ apart is at most $\tilde{O}(\epsilon T) + O(\sqrt{T \log K' T})$. Here $K' \leq p3^p$, as shown above. This implies that the sequence of loss functions is $\frac{1}{2}$-dispersed, and further that there is an algorithm (Algorithm 4 of [BDV18]) that achieves $\tilde{O}(\sqrt{T})$ expected regret.

Finally, note that loss functions with AIC and BIC have the same dual class piecewise structure, and therefore the above analysis applies. The only difference is that the values of the piece functions $f_{\mathbf{b}}$ are changed by a constant (in $\lambda$), $K_{m,p} \leq p \log m$. The piece boundaries are the same, and are therefore $\frac{1}{2}$-dispersed as above. The range of the loss functions is now $[0, K_{m,p}+1]$, so the same algorithm (Algorithm 4 of [BDV18]) again achieves $\tilde{O}(\sqrt{T})$ expected regret. ∎

Appendix D Lemmas and proof details for Section 4

We will first extend the structure for the ElasticNet regression loss functions shown in Theorem 2.2 to the classification setting. The main new challenge is that there are additional discontinuities due to the thresholding needed for binary classification, which intuitively makes the loss more jumpy and discontinuous as a function of the regularization parameters.

Lemma D.1.

Let $\mathcal{L}$ be a set of functions $\{l_{\lambda,\tau} : \Pi_{m,p} \rightarrow \mathbb{R}_{\geq 0} \mid \lambda \in \mathbb{R}^{+} \times \mathbb{R}_{\geq 0}, \tau \in \mathbb{R}\}$ that map a regression problem instance $P \in \Pi_{m,p}$ to the validation classification loss $\ell_{\text{EN}}^{c}(\lambda, P, \tau)$ of ElasticNet trained with regularization parameter $\lambda = (\lambda_1, \lambda_2)$ and threshold parameter $\tau$.
The dual class $\mathcal{L}^{*}$ is $(\mathcal{F}, \mathcal{G}, (m+p)3^p)$-piecewise decomposable, with $\mathcal{F} = \{f_c : \mathcal{L} \rightarrow \mathbb{R}\}$ consisting of constant functions $f_c : l_{\lambda,\tau} \mapsto c$, where $c \in \mathbb{R}_{\geq 0}$, and $\mathcal{G} = \{g_r : \mathcal{L} \rightarrow \{0,1\}\}$ consisting of semi-algebraic sets bounded by algebraic varieties $g_r : l_{\lambda,\tau} \mapsto \mathbb{I}\{r(\lambda_1, \lambda_2, \tau) < 0\}$, where $r$ is a polynomial of degree 1 in $\lambda_1$ and $\tau$, and at most $p$ in $\lambda_2$.

Proof.

By Lemma C.1, the EN coefficients $\hat{\beta}_{EN}$ are fixed given the signed equicorrelation set $\mathcal{E}, s$. As in Theorem 2.2, we have $\leq p3^p$ boundaries $\mathcal{G}_1$ corresponding to a change in the equicorrelation set, but the value of the loss also changes when a prediction vector coefficient $\mu_j = (X_{\text{val}})_j \hat{\beta}_{EN}$ crosses the threshold $\tau$. This is given by $(c_1 - c_2 \lambda_1)_j = \tau$, where

\[ c_1 = (X_{\text{val}})_{\mathcal{E}} (X_{\mathcal{E}}^T X_{\mathcal{E}} + \lambda_2 I_{|\mathcal{E}|})^{-1} X_{\mathcal{E}}^T y, \]

and

\[ c_2 = (X_{\text{val}})_{\mathcal{E}} (X_{\mathcal{E}}^T X_{\mathcal{E}} + \lambda_2 I_{|\mathcal{E}|})^{-1} s. \]

Therefore, the crossing $\mu_j = \tau$ corresponds to the zero set of $(c_1 - c_2 \lambda_1)_j - \tau$. By an application of Lemma 2.1, this is an algebraic variety with degree at most $|\mathcal{E}|$ in $\lambda_2$ and degree $1$ in $\lambda_1$ and $\tau$. There are at most $m3^p$ such boundary functions $\mathcal{G}_2$, corresponding to all possibilities of $j, \mathcal{E}, s$. For a point $(\lambda_1, \lambda_2, \tau)$ with a fixed sign pattern of boundary functions in $\mathcal{G}_1 \cup \mathcal{G}_2$, the EN coefficients are fixed and all the predictions on the validation set are also fixed. The classification loss

\[ l_c(\hat{\beta}_{\lambda,f}, (X,y), \tau) = \frac{1}{m} \sum_{i=1}^{m} \big( y_i - \mathsf{sgn}(\langle X_i, \hat{\beta}_{\lambda,f} \rangle - \tau) \big)^2 \]

is therefore constant in each piece. Applying Theorem 2.2 shows the claimed structure for the ElasticNet classification (dual class) loss function. ∎

The above piecewise decomposable structure is helpful in bounding the pseudodimension for the ElasticNet based classifier. For the special cases of Ridge and LASSO we obtain the pseudodimension bounds from first principles.

Theorem 4.1 (restated). Let $\mathcal{H}_{\text{Ridge}}^{c}$, $\mathcal{H}_{\text{LASSO}}^{c}$ and $\mathcal{H}_{\text{EN}}^{c}$ denote the sets of loss functions for classification problems with at most $m$ examples and $p$ features, with Ridge, LASSO and ElasticNet regularization respectively.

  • (i) $\textsc{Pdim}(\mathcal{H}_{\text{Ridge}}^{c}) = O(\log mp)$.

  • (ii) $\textsc{Pdim}(\mathcal{H}_{\text{LASSO}}^{c}) = O(p \log m)$. Further, in the overparameterized regime ($p \gg m$), we have that $\textsc{Pdim}(\mathcal{H}_{\text{LASSO}}^{c}) = O(m \log \frac{p}{m})$.

  • (iii) $\textsc{Pdim}(\mathcal{H}_{\text{EN}}^{c}) = O(p^2 + p \log m)$.

Proof.

For Ridge regression, the estimator $\hat{\beta}_{\lambda,f_2}$ on the dataset $(X^{(i)}, y^{(i)})$ is given by the following closed form

\[ \hat{\beta}_{\lambda,f_2} = ({X^{(i)}}^T X^{(i)} + \lambda I_{p_i})^{-1} {X^{(i)}}^T y^{(i)}, \]

where $I_{p_i}$ is the $p_i \times p_i$ identity matrix. By Lemma 2.1, each coefficient $(\hat{\beta}_{\lambda,f_2})_k$ of the estimator $\hat{\beta}_{\lambda,f_2}$ is a rational function of $\lambda$ of the form $P_k(\lambda)/Q(\lambda)$, where $P_k, Q$ are polynomials of degrees at most $p_i - 1$ and $p_i$ respectively.
Thus the prediction on any example $(X_{\text{val}}^{(i)})_j$ in the validation set $(X_{\text{val}}^{(i)}, y_{\text{val}}^{(i)})$ of any problem instance $P^{(i)}$ can change at most $p_i \leq p$ times as $\lambda$ is varied. Recall there are $m_i' \leq m$ examples in any validation set. This implies we have at most $mnp$ distinct values of the loss function over the $n$ problem instances. The pseudo-dimension $n$ therefore satisfies $2^n \leq mnp$, or $n = O(\log mp)$.
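The counting step for Ridge can be checked numerically. The following is a minimal sketch, not part of the original proof: the helper names `ridge_prediction` and `ridge_threshold_crossings` and the synthetic Gaussian data are assumptions made for illustration. It writes the prediction on one validation example as $\sum_i a_i/(d_i + \lambda)$ using the eigendecomposition of $X^T X$, clears denominators, and finds the positive roots of the resulting polynomial of degree at most $p$, which are the only values of $\lambda$ where the prediction can cross a threshold $\tau$.

```python
import numpy as np

def ridge_prediction(X, y, x_val, lam):
    """Ridge prediction x_val^T (X^T X + lam I)^{-1} X^T y."""
    p = X.shape[1]
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    return float(x_val @ beta)

def ridge_threshold_crossings(X, y, x_val, tau):
    """Positive values of lam at which the ridge prediction on x_val equals tau."""
    # X^T X = V diag(d) V^T, so the prediction equals sum_i a_i / (d_i + lam).
    d, V = np.linalg.eigh(X.T @ X)
    a = (x_val @ V) * (V.T @ (X.T @ y))
    # Q(lam) = prod_k (lam + d_k); np.poly builds a polynomial from its roots -d_k.
    Q = np.poly(-d)
    # Numerator of prediction - tau: sum_i a_i prod_{k != i} (lam + d_k) - tau * Q(lam).
    num = -tau * Q
    for i in range(len(d)):
        num = np.polyadd(num, a[i] * np.poly(-np.delete(d, i)))
    roots = np.roots(num)
    real = roots[np.abs(roots.imag) < 1e-8].real
    return np.sort(real[real > 0])

# Illustrative synthetic data (an assumption, not from the paper).
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = rng.normal(size=10)
x_val = rng.normal(size=3)
# Choose tau so that lam = 1 is a crossing by construction.
tau = ridge_prediction(X, y, x_val, 1.0)
crossings = ridge_threshold_crossings(X, y, x_val, tau)
```

Since the numerator has degree at most $p$, `crossings` can never contain more than $p$ values, which matches the per-example bound used in the proof.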

Prior work [EHJT04] shows that the optimal vector $\hat{\beta} \in \mathbb{R}^p$ evolves piecewise linearly with $\lambda$, i.e. $\exists \lambda^{(0)} = 0 < \lambda^{(1)} < \dots < \lambda^{(q)} = \infty$ and $\gamma_0, \gamma_1, \dots, \gamma_{q-1} \in \mathbb{R}^p$ such that

\[ \hat{\beta}_{\lambda,f_1} = \hat{\beta}_{\lambda^{(k)},f_1} + (\lambda - \lambda^{(k)}) \gamma_k \]

for $\lambda^{(k)} \leq \lambda \leq \lambda^{(k+1)}$. Each piece corresponds to the addition or removal of at least one of $p$ coordinates to the active set of covariates with maximum correlation. For any data point $x_j$, $1 \leq j \leq m$, and any piece $[\lambda^{(k)}, \lambda^{(k+1)})$, we have that $x_j \hat{\beta}$ is monotonic since $\hat{\beta}$ varies along a fixed vector $\gamma_k$, and therefore there is at most one value of $\lambda$ where the predicted label $\hat{y}$ changes. This gives an upper bound of $mq$ on the total number of discontinuities on any single problem instance $(X^{(i)}, y^{(i)}, X_{\text{val}}^{(i)}, y_{\text{val}}^{(i)})$, where $q$ is the number of pieces in the solution path.
By Lemma 6 of [Tib13], the number of pieces in the solution path satisfies $q \leq 3^p$. Also, for the overparameterized regime $p \gg m$, we have the property that there are at most $m-1$ variables in the active set for the entire sequence of solution paths (Section 7, [EHJT04]). Thus, we have that $q \leq m \binom{p}{m-1} \leq (\frac{ep}{m})^m$ in this case.

Over $n$ problem instances, the pseudo-dimension satisfies $2^n \leq mqn$, or $n = O(\log mq)$. Substituting the above inequalities for $q$ completes the proof.
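To see concretely how the inequality $2^n \leq mqn$ forces $n$ to be logarithmic in $mq$, here is a small sketch. The function name `largest_feasible_n` is hypothetical and the parameter values are illustrative only; it returns the largest $n$ that satisfies the inequality for given $m$ and $q$.

```python
def largest_feasible_n(m, q):
    """Largest n >= 1 with 2**n <= m*q*n (assumes m*q >= 2)."""
    # 2**n / n is non-decreasing for n >= 1, so the feasible n form an
    # initial segment and we can stop at the first violation.
    n = 1
    while 2 ** (n + 1) <= m * q * (n + 1):
        n += 1
    return n

# Ridge-style count: q = p pieces per validation example (m = 100, p = 10).
ridge_n = largest_feasible_n(100, 10)
# LASSO-style count: q <= 3**p pieces in the solution path (m = 100, p = 10).
lasso_n = largest_feasible_n(100, 3 ** 10)
```

With $q = p$ the answer grows like $\log(mp)$, while with $q = 3^p$ it grows like $p + \log m$, in line with parts (i) and (ii).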

The proof of part (iii) follows the same arguments as Theorem 3.1, using Lemma D.1 instead of Theorem 2.2. By Lemma D.1, the dual class $\mathcal{H}_{\text{EN}}^{*}$ of $\mathcal{H}_{\text{EN}}$ is $(\mathcal{F}, \mathcal{G}, (p+m)3^p)$-piecewise decomposable, with $\mathcal{F} = \{f_c : \mathcal{L} \rightarrow \mathbb{R}\}$ consisting of constant functions $f_c : l_{\lambda,\tau} \mapsto c$, where $c \in \mathbb{R}_{\geq 0}$, and $\mathcal{G} = \{g_r : \mathcal{L} \rightarrow \{0,1\}\}$ consisting of polynomial thresholds $g_r : l_{\lambda,\tau} \mapsto \mathbb{I}\{r(\lambda_1, \lambda_2, \tau) < 0\}$, where $r$ is a polynomial of degree 1 in $\lambda_1$ and $\tau$, and at most $p$ in $\lambda_2$.

Now it is easy to see that $\textsc{Pdim}(\mathcal{F}^{*}) = O(1)$ (in particular, a consequence of Lemma C.4), and by Lemma C.5 the VC dimension of the dual boundary class is $d_{\mathcal{G}^{*}} = O(p)$. A straightforward application of Theorem C.2 yields

\[ \textsc{Pdim}(\mathcal{H}) = O(p \log p + p \log((p+m)3^p)) = O(p^2 + p \log m). \]
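For completeness, the final simplification can be spelled out as follows. This is a short worked step added for the reader, not part of the original proof, and it assumes $p, m \geq 1$:

\begin{align*}
p \log\big((p+m)3^p\big) &= p \log(p+m) + p^2 \log 3 \\
&\leq p \log(2\max\{p, m\}) + p^2 \log 3 \\
&\leq p \log 2 + p \log p + p \log m + p^2 \log 3 .
\end{align*}

Since $\log p \leq p$ and $\log 2 \leq 1 \leq p$, every term other than $p \log m$ is at most a constant multiple of $p^2$, so $p \log p + p \log((p+m)3^p) = O(p^2 + p \log m)$.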

We will now restate and prove Theorem 4.2. This implies that under smoothness assumptions on the data distribution we can learn the data-dependent optimal regularization parameter in the online setting.

Theorem 4.2 (restated). Suppose Assumptions 1 and 3 hold. Let $l_1, \dots, l_T : (0,H]^d \times [-H,H] \rightarrow \mathbb{R}$ denote an independent sequence of losses as a function of the regularization parameter $\lambda$, $l_i(\lambda, \tau) = l_c(\hat{\beta}_{\lambda,f}, (X^{(i)}, y^{(i)}), \tau)$. The sequence of functions is $\frac{1}{2}$-dispersed, and there is an online algorithm with $\tilde{O}(\sqrt{T})$ expected regret, if $f$ is given by

  • (i) $f = f_1$ (LASSO),

  • (ii) $f = f_2$ (Ridge), or

  • (iii) $f = f_{EN}$ (ElasticNet).

Proof.

On any dataset $(X, y)$, the predictions are given by the coefficients of the prediction vector $\hat{\mu} = X(X^T X + \lambda I_p)^{-1} X^T y$. Note that by Lemma 2.1, $(X^T X + \lambda I_p)^{-1}$, and therefore $X_{\text{val}}(X^T X + \lambda I_p)^{-1} X^T y$, has each element of the form $P_j(\lambda)/Q(\lambda)$, with the degree of each $P_j$ at most $p-1$ and the degree of $Q$ at most $p$.
Further, for a fixed $\tau$, by using Lemma C.9 and a change of variables, we have that $\hat{\mu}_j = \tau$ is a polynomial equation in $\lambda$ of degree $p$ with bounded coefficients that have $K\kappa$-bounded density for some constant $K$ (that depends on $R$, $H$, $m$ and $p$, but not on $\kappa$) and leading coefficient 1. Further, for fixed $\lambda \in (0,H]$, by Lemma C.10, the discontinuities over $\tau$ again have $\tilde{O}(\kappa)$-bounded density. This completes step S1 of the recipe from [BDP20], and Theorem C.7 gives a bound on the expected number of discontinuities in any interval $I$ (over $\lambda$).

To complete step S2, note that the loss function $l_i$ on any instance has at most $pm$ discontinuities, since each coefficient $\hat{\mu}_j$ of the prediction vector can change sign at most $p$ times as $\lambda$ is varied. This implies the VC-dimension argument (Theorem C.8) applies and the expected maximum number of discontinuities in any interval of width $\epsilon$ is $O(\epsilon T) + O(\sqrt{T \log(mpT)})$, which is $\tilde{O}(\epsilon T)$ for $\epsilon \geq 1/\sqrt{T}$. Thus, using the recipe from [BDP20], we have shown that the sequence of loss functions is $\frac{1}{2}$-dispersed. This further implies that Algorithm 1, which implements the Continuous Exp-Weights algorithm of [BDV18] for setting the regularization parameter, achieves $\tilde{O}(\sqrt{T})$ expected regret ([BDV18], Theorem 1).

Since the data distribution is, in particular, assumed to be continuous, by Lemma 4 of [Tib13] we know that the LASSO solutions are unique with probability 1. Moreover, if $\mathcal{E} \subseteq [p]$ denotes the equicorrelation set of variables (i.e. covariates with the maximum absolute value of correlation), and $s \in \{-1, 1\}^{|\mathcal{E}|}$ the sign vector (i.e. the sign of the correlations of the covariates in $\mathcal{E}$), then the LASSO prediction vector $\hat{\mu} = X_{\text{val}}\hat{\beta}$ is a linear function of the regularization parameter $\lambda$ given by

$$\hat{\mu} = c_1 - c_2\lambda,$$

where $c_1 = (X_{\text{val}})_{\mathcal{E}} (X_{\mathcal{E}}^T X_{\mathcal{E}})^{-1} X_{\mathcal{E}}^T y$ and $c_2 = (X_{\text{val}})_{\mathcal{E}} (X_{\mathcal{E}}^T X_{\mathcal{E}})^{-1} s$. Thus for any fixed $\mathcal{E}, s$ (corresponding to a unique piece in the solution path for LARS-LASSO), we have at most one discontinuity corresponding to $\hat{\mu}_j = \tau$, and the location of this discontinuity has a $K\kappa$-bounded distribution (for a constant $K$ independent of $\kappa$) by an application of Lemma C.10. Thus, the probability that this discontinuity is located along some axis-aligned path $I$ of length $\epsilon$ is at most $K\kappa\epsilon$.
A union bound over $j \in [m]$ and over the $3^p$ choices of $\mathcal{E}, s$ (for example, Lemma 6 in [Tib13]) shows that the probability of a discontinuity along $I$ is at most $m 3^p K\kappa\epsilon$. This completes step S1 of the recipe above.
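To make the "at most one discontinuity per piece" step concrete, here is a small sketch that, on a single linear piece of the prediction $\hat{\mu}_j = (c_1)_j - (c_2)_j\lambda$, solves for the value of $\lambda$ at which the $j$-th prediction crosses the threshold $\tau$, and evaluates the union bound $m 3^p K\kappa\epsilon$. The function names, the interval endpoints `lam_lo` and `lam_hi` of the piece, and the numeric inputs are hypothetical and chosen only for illustration.

```python
def threshold_crossing(c1_j, c2_j, tau, lam_lo, lam_hi):
    """Return the lambda in [lam_lo, lam_hi] with c1_j - c2_j * lam == tau, or None.

    A linear function crosses a fixed threshold at most once, so each
    (piece, coordinate) pair contributes at most one discontinuity.
    """
    if c2_j == 0:
        return None  # prediction is constant on this piece; no isolated crossing
    lam = (c1_j - tau) / c2_j
    return lam if lam_lo <= lam <= lam_hi else None


def union_bound(m, p, K, kappa, eps):
    """Upper bound m * 3^p * K * kappa * eps on the probability of a discontinuity,
    capped at 1 since it bounds a probability."""
    return min(1.0, m * (3 ** p) * K * kappa * eps)


# Hypothetical piece with m = 2 validation points, valid for lambda in [0, 1].
c1 = [1.0, -0.5]
c2 = [2.0, 1.0]
crossings = [threshold_crossing(a, b, tau=0.0, lam_lo=0.0, lam_hi=1.0)
             for a, b in zip(c1, c2)]
bound = union_bound(m=2, p=3, K=1.0, kappa=1.0, eps=0.001)
```

Here the first coordinate crosses the threshold at $\lambda = 0.5$, while the second would cross at $\lambda = -0.5$, which lies outside the piece and therefore contributes no discontinuity.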

Now each loss function $l_i$ has at most $m 3^p$ discontinuities, and therefore by a VC-dimension argument (Theorem 9 of [BDP20]), the expected maximum number of discontinuities along any axis-aligned path of total length $\epsilon$ is $\tilde{O}(\epsilon T) + O(\sqrt{T(p + \log(mT))})$, which is $\tilde{O}(\epsilon T)$ for $\epsilon \geq 1/\sqrt{T}$. This completes step S2 of the recipe from [BDP20], and we have shown that the sequence of loss functions is $\frac{1}{2}$-dispersed. As in Theorem 4.2, this implies that Algorithm 1 achieves $\tilde{O}(\sqrt{T})$ expected regret ([BDV18], Theorem 1).
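The form of the second term can be checked by a short calculation. The sketch below assumes that the VC-dimension argument yields a deviation term of order $\sqrt{T \log(kT)}$ when each loss function has at most $k$ discontinuities, and that $\tilde{O}$ suppresses factors polynomial in $p$ and logarithmic in $m$ and $T$; both are assumptions made for this illustration.

```latex
% With k = m 3^p discontinuities per loss function:
\log\bigl(m 3^p T\bigr) = p \log 3 + \log(mT) = O\bigl(p + \log(mT)\bigr),
\qquad\text{so}\qquad
\sqrt{T \log\bigl(m 3^p T\bigr)} = O\Bigl(\sqrt{T\,(p + \log(mT))}\Bigr).

% For \epsilon \ge 1/\sqrt{T} we have \sqrt{T} \le \epsilon T, hence
\sqrt{T\,(p + \log(mT))} \;\le\; \epsilon T \sqrt{p + \log(mT)} \;=\; \tilde{O}(\epsilon T).
```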

While we use the worst-case bound on the number of solution paths here, algorithmically we can run LARS-LASSO on the given dataset, which is much faster in practice: its running time scales linearly with the actual number of solution paths $q$ (typically $q \ll 3^p$).

The proof uses the piecewise decomposable structure proved in Lemma D.1, and establishes dispersion using joint smoothness of $X_{\text{val}}^{(i)}$ instead of $y^{(i)}$ (as in the proof of Theorem 4.2 (i)). The recipe from [BDP20] can be used along a 3D axis-aligned path from $(\lambda_1, \lambda_2, \tau)$ to $(\lambda_1', \lambda_2', \tau')$. Lemma C.10 may be used to show bounded density of discontinuities along $\tau$ (keeping $\lambda_1, \lambda_2$ fixed). To complete step S2 of the recipe we can use Theorem 7 from [BS21]. The arguments are otherwise very similar to the proof of Theorem 3.3, and are omitted for brevity. ∎