Zero-norm regularized problems: equivalent surrogates, proximal MM method and statistical error bound

Zhang, Dongdong; Pan, Shaohua; Bi, Shujun; Sun, Defeng

doi:10.1007/s10589-023-00496-x

Published: 06 June 2023

Volume 86, pages 627–667, (2023)

Computational Optimization and Applications

Abstract

For the zero-norm regularized problem, we verify that the penalty problem of its equivalent MPEC reformulation is a global exact penalty, which implies a family of equivalent surrogates. For a subfamily of these surrogates, the critical point set is shown to coincide with the d-directional stationary point set, and when a critical point has no nonzero component that is too small, it is a strongly local optimal solution of both the surrogate problem and the zero-norm regularized problem. We also develop a proximal majorization-minimization (MM) method for solving the DC (difference of convex functions) surrogates, and provide its global and linear convergence analysis. For the limit of the generated sequence, a statistical error bound is established under a mild condition, which implies its good quality from a statistical perspective. Numerical comparisons with ADMM for solving the DC surrogate and APG for solving its partially smoothed form indicate that our proximal MM method armed with an inexact dual PPA plus the semismooth Newton method (PMMSN for short) is remarkably superior to ADMM and APG in terms of the quality of solutions and the CPU time.

Data availability

The data used to form the test problems in Subsection 5.4 are freely available in https://www.csie.ntu.edu.tw.

References

Attouch, H., Bolte, J.: On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Math. Program. 116, 5–16 (2009)
Attouch, H., Bolte, J., Redont, P., Soubeyran, A.: Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka-Łojasiewicz inequality. Math. Oper. Res. 35, 438–457 (2010)
Belloni, A., Chernozhukov, V.: Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika 4, 791–806 (2010)
Bi, S.J., Liu, X.L., Pan, S.H.: Exact penalty decomposition method for zero-norm minimization based on MPEC formulation. SIAM J. Sci. Comput. 36, A1451–A1477 (2014)
Bian, W., Chen, X.J.: A smoothing proximal gradient algorithm for nonsmooth convex regression with cardinality penalty. SIAM J. Numer. Anal. 58, 858–883 (2020)
Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146, 459–494 (2014)
Bot, R.I., Nguyen, D.K.: The proximal alternating direction method of multipliers in the nonconvex setting: convergence analysis and rates. Math. Oper. Res. 45, 1–31 (2018)
Bradley, P.S., Mangasarian, O.L.: Feature selection via concave minimization and support vector machines. In: Proceeding of ICML (1998)
Bruckstein, A.M., Donoho, D.L., Elad, M.: From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Rev. 51, 34–81 (2009)
Cao, S.S., Huo, X.M., Pang, J.S.: A unifying framework of high-dimensional sparse estimation with difference-of-convex (DC) regularizations, arXiv:1812.07130 (2018)
Chartrand, R.: Exact reconstruction of sparse signals via nonconvex minimization. IEEE Signal Process. Lett. 14, 707–710 (2007)
Chen, X.J., Xu, F.M., Ye, Y.Y.: Lower bound theory of nonzero entries in solutions of $\ell _2$-$\ell _p$ minimization. SIAM J. Sci. Comput. 32, 2832–2852 (2010)
Clarke, F.H.: Optimization and Nonsmooth Analysis. John Wiley and Sons, New York (1983)
Cui, Y., Chang, T.H., Hong, M., Pang, J.S.: A study of piecewise linear-quadratic programs. J. Optim. Theory Appl. 186, 523–553 (2020)
Cui, Y., Pang, J.S.: Modern Nonconvex Nondifferentiable Optimization. Society for Industrial and Applied Mathematics, Philadelphia (2022)
Cui, Y., Sun, D.F., Toh, K.C.: On the R-superlinear convergence of the KKT residuals generated by the augmented Lagrangian method for convex composite conic programming. Math. Program. 178, 381–415 (2019)
Donoho, D.L., Stark, B.F.: Uncertainty principles and signal recovery. SIAM J. Appl. Math. 49, 906–931 (1989)
Donoho, D.L.: Compressed sensing. IEEE Trans. Inf. Theory 52, 1289–1306 (2006)
Dontchev, A.L., Rockafellar, R.T.: Implicit Functions and Solution Mappings-a View from Variational Analysis. Springer Monographs in Mathematics, LLC, New York (2009)
Facchinei, F., Pang, J.S.: Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer, New York (2003)
Fan, J.Q., Li, R.Z.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96, 1348–1360 (2001)
Fan, J.Q., Xue, L.Z., Zou, H.: Strong oracle optimality of folded concave penalized estimation. Ann. Stat. 42, 819–849 (2014)
Feng, M.B., Mitchell, J.E., Pang, J.S., Shen, X., Wächter, A.: Complementarity formulations of $\ell _0$-norm optimization problems. Pac. J. Optim. 14, 273–305 (2018)
Gabay, D., Mercier, B.: A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Comput. Math. Appl. 2, 17–40 (1976)
Gotoh, J.Y., Takeda, A., Tono, K.: DC formulations and algorithms for sparse optimization problems. Math. Program. 169, 141–176 (2018)
Gu, Y.W., Fan, J., Kong, L.C., Ma, S.Q., Zou, H.: ADMM for high-dimensional sparse penalized quantile regression. Technometrics 60, 319–331 (2018)
Hiriart-Urruty, J.B., Strodiot, J.J., Nguyen, V.H.: Generalized Hessian matrix and second-order optimality conditions for problems with $C^{1,1}$ data. Appl. Math. Optim. 11, 43–56 (1984)
Huang, J., Hom, H., Jiao, Y., Liu, Y., Lu, X.: A constructive approach to $L_0$ penalized regression. J. Mach. Learn. Res. 19, 1–37 (2018)
Ioffe, A.D., Outrata, J.V.: On metric and calmness qualification conditions in subdifferential calculus. Set-Valued Anal. 16, 199–227 (2008)
Le, H.Y.: Generalized subdifferentials of the rank function. Optim. Lett. 7, 731–743 (2013)
Le Thi, H.A., Le, H.M., Pham Dinh, T.: Feature selection in machine learning: an exact penalty approach using a difference of convex function algorithm. Mach. Learn. 101, 163–186 (2015)
Le Thi, H.A., Pham Dinh, T.: DC programming and DCA: thirty years of developments. Mathematical Programming B, Special Issue dedicated to: DC Programming-Theory, Algorithms and Applications 169, 5-68 (2018)
Li, G.Y., Pong, T.K.: Calculus of the exponent of Kurdyka-Łojasiewicz inequality and its applications to linear convergence of first-order methods. Found. Comput. Math. 18, 1199–1232 (2018)
Liu, Y.L., Bi, S.J., Pan, S.H.: Equivalent Lipschitz surrogates for zero-norm and rank optimization problems. J. Global Optim. 72, 679–704 (2018)
Lu, Z.: Iterative hard thresholding methods for $\ell _0$ regularized convex cone programming. Math. Program. 147, 125–154 (2014)
Loh, P.L., Wainwright, M.J.: Regularized M-estimators with nonconvexity: statistical and algorithmic theory for local optima. J. Mach. Learn. Res. 16, 559–616 (2015)
Mangasarian, O.L.: Machine learning via polyhedral concave minimization. In: Fischer, H., Riedmueller, B., Schaeffler, S. (eds.) Applied Mathematics and Parallel Computing-Festschrift for Klaus Ritter, pp. 175–188. Physica-Verlag, Heidelberg (1996)
Mifflin, R.: Semismooth and semiconvex functions in constrained optimization. SIAM J. Control. Optim. 15, 959–972 (1977)
Nesterov, Y.: A method of solving a convex programming problem with convergence rate $O(1/k^2)$. Soviet Math. Dokl. 27, 372–376 (1983)
Ortega, J.M., Rheinboldt, W.C.: Iterative Solution of Nonlinear Equations in Several Variables. Academic Press, New York (1970)
Pan, S.H., Liu, Y.L.: Subregularity of subdifferential mappings relative to the critical set and KL property of exponent 1/2, arXiv:1812.00558v3
Pang, J.S., Razaviyayn, M., Alvarado, A.: Computing B-stationary points of nonsmooth DC programs. Math. Oper. Res. 42, 95–118 (2017)
Pham Dinh, T., Le Thi, H.A.: Convex analysis approach to DC programming: theory, algorithms and applications. Acta Math. Vietnamica 22, 289–355 (1997)
Qi, L.Q., Sun, J.: A nonsmooth version of Newton’s method. Math. Program. 58, 353–367 (1993)
Qian, Y.T., Pan, S.H., Liu, Y.L.: Calmness of partial perturbation to composite rank constraint systems and its applications, arXiv:2102.10373v2, October 8 (2021)
Raskutti, G., Wainwright, M.J.: Restricted eigenvalue properties for correlated Gaussian designs. J. Mach. Learn. Res. 11, 2241–2259 (2010)
Rinaldi, F., Schoen, F., Sciandrone, M.: Concave programming for minimizing the zero-norm over polyhedral sets. Comput. Optim. Appl. 46, 467–486 (2010)
Rockafellar, R.T.: Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Math. Oper. Res. 1, 97–116 (1976)
Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)
Rockafellar, R.T., Wets, R.J.-B.: Variational Analysis. Springer, Cham (1998)
Robinson, S.M.: Some continuity properties of polyhedral multifunctions. Math. Program. Study 14, 206–214 (1981)
Soubies, E., Blanc-Féraud, L., Aubert, G.: A unified view of exact continuous penalties for $\ell _2$-$\ell _0$ minimization. SIAM J. Optim. 8, 1067–1639 (2017)
Tang, P.P., Wang, C.J., Sun, D.F., Toh, K.C.: A sparse semismooth Newton based proximal majorization-minimization algorithm for nonconvex square-root-loss regression problems. J. Mach. Learn. Res. 21, 1–38 (2020)
Tibshirani, R.: Regression shrinkage and selection via the Lasso. J. Roy. Stat. Soc. B 58, 267–288 (1996)
Wang, L., Wu, Y.C., Li, R.Z.: Quantile regression for analyzing heterogeneity in ultra high dimension. J. Am. Stat. Assoc. 107, 214–222 (2012)
Wang, Y., Yin, W.T., Zeng, J.S.: Global convergence of ADMM in nonconvex nonsmooth optimization. J. Sci. Comput. 78, 29–63 (2019)
Weston, J., Elisseeff, A., Schölkopf, B., Tipping, M.: Use of the zero norm with linear models and kernel methods. J. Mach. Learn. Res. 3, 1439–1461 (2003)
Wen, B., Chen, X.J., Pong, T.K.: Linear convergence of proximal gradient algorithm with extrapolation for a class of nonconvex nonsmooth minimization problems. SIAM J. Optim. 27, 124–145 (2017)
Wright, J., Ma, Y.: Dense error correction via $\ell _1$-minimization. IEEE Trans. Inf. Theory 56, 3540–3560 (2010)
Wu, F., Bian, W.: Accelerated iterative hard thresholding algorithm for $\ell _0$ regularized regression problem. J. Global Optim. 76, 819–840 (2020)
Wu, F., Bian, W., Xue, X.P.: Smoothing fast iterative hard thresholding algorithm for $\ell _0$ regularized nonsmooth convex regression problem, arXiv:2104.13107v1
Ye, J.J., Zhu, D.L.: Optimality conditions for bilevel programming problems. Optimization 33, 9–27 (1995)
Ye, J.J., Zhu, D.L., Zhu, Q.J.: Exact penalization and necessary optimality conditions for generalized bilevel programming problems. SIAM J. Optim. 7, 481–507 (1997)
Zhang, C.H.: Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38, 894–942 (2010)
Zhang, X., Zhang, X.Q.: A new proximal iterative hard thresholding method with extrapolation for $\ell _0$ minimization. J. Sci. Comput. 79, 809–826 (2019)
Zhao, X.Y., Sun, D.F., Toh, K.C.: A Newton-CG augmented Lagrangian method for semidefinite programming. SIAM J. Optim. 20, 1737–1765 (2010)

Acknowledgements

The first two authors would like to express their sincere thanks to Prof. Kim-Chuan Toh from the National University of Singapore for helpful suggestions on the implementation of Algorithm A.1 when visiting SCUT, and to Prof. Liping Zhu from Renmin University of China for helpful discussions on Theorem 5.

Funding

The funding was provided by the National Natural Science Foundation of China under project No. 11971177 and the Hong Kong Research Grant Council under grant No. 15304019.

Author information

Authors and Affiliations

School of Mathematics, South China University of Technology, Guangzhou, China
Dongdong Zhang, Shaohua Pan & Shujun Bi
Department of Applied Mathematics, The Hong Kong Polytechnic University, Hong Kong, China
Defeng Sun

Corresponding author

Correspondence to Shujun Bi.

Ethics declarations

Conflict of interest

The authors declare that they have no conflicts of interest to this work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Proof of Proposition 3

The following two technical lemmas are needed for the proof of Proposition 3.

Lemma 6

Fix any $\nu >0$ and $\mu >0$. Let $h_{\nu }(x){:}{=}\nu \Vert x\Vert _0$ for $x\in {\mathbb {R}}^p$. If $\vartheta $ is regular and strictly continuous relative to $\textrm{dom}\vartheta $, then for any $x\in \textrm{dom}f$ and $\zeta \in {\mathbb {R}}^p$,

$$\begin{aligned} {\widehat{\partial }}\Theta _{\nu ,\mu }(x)&=\partial \Theta _{\nu ,\mu }(x)=\partial \!f_{\!\mu }(x)+\partial h_{\nu }(x), \end{aligned}$$

(A1)

$$\begin{aligned} {\widehat{d}}\Theta _{\nu ,\mu }(x)(\zeta )&=d\Theta _{\nu ,\mu }(x)(\zeta )=df_{\!\mu }(x)(\zeta )+dh_{\nu }(x)(\zeta ). \end{aligned}$$

(A2)

Proof

Fix any $x\in \textrm{dom}f$ and $\zeta \in {\mathbb {R}}^p$. Since $\vartheta $ is strictly continuous relative to $\textrm{dom}\vartheta $, by the expression of $f_{\!\mu }$ in (4), the function $f_{\!\mu }$ can be rewritten as ${\widetilde{f}}_{\!\mu }+\delta _{\textrm{dom}f}$ where ${\widetilde{f}}_{\!\mu }$ is a finite strictly continuous function on ${\mathbb {R}}^p$. Clearly, ${\widetilde{f}}_{\!\mu }$ is regular by the regularity of $\vartheta $, and $\delta _{\textrm{dom}f}$ is also regular by the polyhedrality of $\textrm{dom}\vartheta $. By invoking [50, Exercise 10.10] and the first inclusion of [50, Corollary 10.9], it is not hard to obtain that

$$\begin{aligned} \partial \!{\widetilde{f}}_{\!\mu }(x) +{\widehat{\partial }}(\delta _{\textrm{dom}f}\!+\!h_{\nu })(x) \subseteq {\widehat{\partial }}\Theta _{\nu ,\mu }(x) \subseteq \partial \Theta _{\nu ,\mu }(x)\subseteq \partial \!{\widetilde{f}}_{\!\mu }(x) +\partial (\delta _{\textrm{dom}f}\!+\!h_{\nu })(x). \end{aligned}$$

Since $\textrm{epi}\,h_{\nu }$ is a union of finitely many polyhedral sets and $\textrm{dom}f$ is polyhedral, from [29, Page 213] it follows that $\partial (\delta _{\textrm{dom}f}+h_{\nu })(x)\subseteq {\mathcal {N}}_{\textrm{dom}f}(x)+\partial h_{\nu }(x)$ and $\partial ^{\infty }(\delta _{\textrm{dom}f}+h_{\nu })(x)\subseteq {\mathcal {N}}_{\textrm{dom}f}(x)+\partial ^{\infty }h_{\nu }(x)$. The first inclusion, along with the first inclusion of [50, Corollary 10.9] and the regularity of $\delta _{\textrm{dom}f}$ and $h_{\nu }$, implies that

$$\begin{aligned} {\mathcal {N}}_{\textrm{dom}f}(x)+\partial h_{\nu }(x) \subseteq {\widehat{\partial }}(\delta _{\textrm{dom}f}+h_{\nu })(x) \subseteq \partial (\delta _{\textrm{dom}f}+h_{\nu })(x)\subseteq {\mathcal {N}}_{\textrm{dom}f}(x)+\partial h_{\nu }(x). \end{aligned}$$

The regularity of $h_{\nu }$ is implied by [30, Theorem 1]. The last two equations imply the first equality in (A1). By the strict continuity of ${\widetilde{f}}_{\!\mu }$ and [50, Corollary 10.9],

$$\begin{aligned} d\Theta _{\nu ,\mu }(x)(\zeta )&\ge d{\widetilde{f}}_{\!\mu }(x)(\zeta )\!+\!d\delta _{\textrm{dom}f}(x)(\zeta )\!+\!dh_{\nu }(x)(\zeta ) =df_{\!\mu }(x)(\zeta )+dh_{\nu }(x)(\zeta ),\nonumber \\ {\widehat{d}}\Theta _{\nu ,\mu }(x)(\zeta )&\le {\widehat{d}}{\widetilde{f}}_{\!\mu }(x)(\zeta )\!+\!{\widehat{d}}(\delta _{\textrm{dom}f}\!+\!h_{\nu })(x)(\zeta )\nonumber \\&\le {\widehat{d}}{\widetilde{f}}_{\!\mu }(x)(\zeta ) +{\widehat{d}}\delta _{\textrm{dom}f}(x)(\zeta )+{\widehat{d}}h_{\nu }(x)(\zeta )\nonumber \\&=d{\widetilde{f}}_{\!\mu }(x)(\zeta ) +d\delta _{\textrm{dom}f}(x)(\zeta )+dh_{\nu }(x)(\zeta )\nonumber \\&=df_{\!\mu }(x)(\zeta )+dh_{\nu }(x)(\zeta ) \end{aligned}$$

(A3)

where the second inequality in (A3) is due to $\partial ^{\infty }(\delta _{\textrm{dom}f}\!+\!h_{\nu })(x)\subseteq {\mathcal {N}}_{\textrm{dom}f}(x)+\partial ^{\infty }h_{\nu }(x)$ and [50, Exercise 8.23], and the first equality in (A3) is due to the regularity of ${\widetilde{f}}_{\!\mu },h_{\nu }$ and $\textrm{dom}f$. Note that ${\widehat{d}}\Theta _{\nu ,\mu }(x)(\zeta )\ge d\Theta _{\nu ,\mu }(x)(\zeta )$. From the last two inequalities, we obtain the equalities in (A2). The proof is completed. $\square $

Lemma 7

Pick any $\phi \in \!{\mathscr {L}}_{\sigma ,\gamma }$. The associated function $g_{\rho }$ for any $\rho >0$ is continuously differentiable on ${\mathbb {R}}^p$.

Proof

Recall that $\psi ^*$ is a finite nondecreasing convex function on ${\mathbb {R}}$. If in addition $\phi $ is strongly convex on [0, 1] with modulus $\sigma $, then by [49, Theorem 26.3] and [50, Proposition 12.60], $\psi ^*$ is smooth on ${\mathbb {R}}$ and $(\psi ^*)'$ is Lipschitz continuous with constant $1/\sigma $. Thus, by the expression of $g_{\rho }$, it suffices to argue that $h(t){:}{=}\rho ^{-1}\psi ^*(\rho \vert t\vert )$ for $t\in {\mathbb {R}}$ is continuously differentiable at $t=0$. Indeed, by the assumption on $\phi $, it is easy to verify that $\psi ^*(s)=0$ for all $s\in [0,\gamma ]$. Then $h(t)=0$ for all $\vert t\vert \le \gamma /\rho $, so $h$ is differentiable on a neighborhood of $t=0$ with $h'\equiv 0$ there; in particular $h'(0)=0$ and $h'$ is continuous at $t=0$. $\square $
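The argument above can be checked numerically on a concrete instance. The sketch below assumes a hypothetical quadratic $\phi (t)=\frac{\sigma }{2}t^2+\gamma t$ with $\sigma =1$, $\gamma =0.5$ (so that $\phi (1)=1$); this choice, and the function names `psi_conj`, `psi_conj_grad`, `h` and `h_grad`, are illustrative assumptions and are not taken from the paper. For this $\phi $ the conjugate $\psi ^*$ has a closed form, and one can observe that $h$ vanishes on $[-\gamma /\rho ,\gamma /\rho ]$ and that $(\psi ^*)'$ is Lipschitz with constant $1/\sigma $.

```python
import numpy as np

# Hypothetical example (not from the paper): phi(t) = 0.5*SIGMA*t**2 + GAMMA*t on [0, 1],
# with SIGMA/2 + GAMMA = 1 so that phi(1) = 1.
SIGMA, GAMMA = 1.0, 0.5


def psi_conj_grad(s):
    """Derivative of psi^*: the maximizer of s*t - phi(t) over t in [0, 1]."""
    s = np.asarray(s, dtype=float)
    return np.clip((s - GAMMA) / SIGMA, 0.0, 1.0)


def psi_conj(s):
    """psi^*(s) = max over t in [0, 1] of s*t - phi(t), evaluated at the maximizer."""
    s = np.asarray(s, dtype=float)
    t = psi_conj_grad(s)
    return s * t - (0.5 * SIGMA * t**2 + GAMMA * t)


def h(t, rho):
    """h(t) = psi^*(rho*|t|) / rho, the scalar building block of g_rho."""
    return psi_conj(rho * np.abs(t)) / rho


def h_grad(t, rho):
    """h'(t) = (psi^*)'(rho*|t|) * sign(t); equals 0 on [-GAMMA/rho, GAMMA/rho]."""
    return psi_conj_grad(rho * np.abs(t)) * np.sign(t)
```

For instance, with `rho = 2.0` the function `h` is identically zero on `[-0.25, 0.25]`, which is the flat region used in the proof.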

Proof of Proposition 3

(i) Since the range of $\partial \psi ^*$ is contained in $\textrm{dom}\partial \psi =[0,1]$, for any $x\in {\mathbb {R}}^p$ it holds that $\Vert x\Vert _1-g_{\rho }(x)\ge 0$. Together with the nonnegativity and coerciveness of $f_{\!\mu }$, it follows that $\Theta _{\!\rho ,\nu ,\mu }$ is nonnegative and coercive. Fix any $x\in \textrm{dom}f$. From Lemma 7 and [50, Exercise 8.8], ${\widehat{\partial }}\Theta _{\!\rho ,\nu ,\mu }(x)=\partial \Theta _{\!\rho ,\nu ,\mu }(x) =\partial (f_{\!\mu }\!+\!\rho \nu \Vert \cdot \Vert _1)(x)\!-\!\rho \nu \nabla g_{\rho }(x)$. Recall that $[\textrm{Im}(A)-b]\cap \textrm{dom}\vartheta \ne \emptyset $. By the convexity of $\vartheta $ and [49, Theorem 23.9],

$$\begin{aligned} \partial (f_{\!\mu }\!+\!\rho \nu \Vert \cdot \Vert _1)(x) =A^{{\mathbb {T}}}\partial \vartheta (Ax-b)+\mu x+\rho \nu \partial \Vert x\Vert _1. \end{aligned}$$

The characterization on the regular and limiting subdifferentials of $\Theta _{\!\rho ,\nu ,\mu }$ then holds.

(ii) By the definition of a d-stationary point of a DC program (see [42, Section 3]), a point $x\in \textrm{dom}f$ is a d-stationary point of (14) iff $\rho \nu \nabla \!g_{\rho }(x)\in \partial (f_{\!\mu }\!+\!\rho \nu \Vert \cdot \Vert _1)(x)$, which by part (i) is equivalent to saying that $x\in \textrm{dom}f$ is a limiting critical point of $\Theta _{\!\rho ,\nu ,\mu }$.

(iii) By [50, Theorem 13.24 (c)], it suffices to argue that $d^2\Theta _{\!\rho ,\nu ,\mu }({\overline{x}}\vert 0)(\zeta )>0$ for all $\zeta \ne 0$. Fix any $\zeta \in {\mathbb {R}}^p\backslash \{0\}$. Let $\varphi _{\!\rho ,\lambda }(x)\!{:}{=}\lambda [\Vert x\Vert _1\!-\!g_{\rho }(x)]$ with $\lambda =\rho \nu $ for $x\in {\mathbb {R}}^p$. Clearly, $\varphi _{\!\rho ,\lambda }$ is Lipschitz continuous and regular by the smoothness of $g_{\rho }$. Note that $\Theta _{\!\rho ,\nu ,\mu }\!=\!f_{\!\mu }+\varphi _{\!\rho ,\lambda }$. By invoking [50, Proposition 13.19], it follows that

$$\begin{aligned} d^2\Theta _{\!\rho ,\nu ,\mu }({\overline{x}}\vert 0)(\zeta ) \!\ge \!\sup _{u\in \partial \!f_{\!\mu }({\overline{x}}),v\in \partial \varphi _{\!\rho ,\lambda }({\overline{x}})} \!\Big \{d^2\!f_{\!\mu }({\overline{x}}\,\vert \,u)(\zeta )+ d^2\!\varphi _{\!\rho ,\lambda }({\overline{x}}\vert v)(\zeta ) \ \ \mathrm{s.t.}\ \ u+v=0\Big \}. \end{aligned}$$

(A4)

Recall that $f_{\!\mu }$ is strongly convex with modulus $\mu $. By Definition 3, we have that

$$\begin{aligned} d^2\!f_{\!\mu }({\overline{x}}\,\vert \,u)(\zeta )\ge \mu \Vert \zeta \Vert ^2>0 \quad \forall u\in \partial \!f_{\!\mu }({\overline{x}}). \end{aligned}$$

(A5)

Fix any $v\in \partial \varphi _{\!\rho ,\lambda }({\overline{x}})$. Since $\varphi _{\!\rho ,\lambda }$ is Lipschitz and directionally differentiable,

$$\begin{aligned} \langle v,\zeta \rangle \le d\varphi _{\!\rho ,\lambda }({\overline{x}})(\zeta ) =\varphi _{\!\rho ,\lambda }'({\overline{x}},\zeta ) =\lambda (\Vert \cdot \Vert _1)'({\overline{x}},\zeta )-\lambda \langle \nabla g_{\rho }({\overline{x}}),\zeta \rangle . \end{aligned}$$

By [50, Proposition 13.5], $d^2\varphi _{\!\rho ,\lambda }({\overline{x}}\vert v)(\zeta )=+\infty $ when $d\varphi _{\!\rho ,\lambda }({\overline{x}})(\zeta )>\langle v,\zeta \rangle $. This, together with (A4)-(A5), implies that $d^2\Theta _{\!\rho ,\nu ,\mu }({\overline{x}}\vert 0)(\zeta )>0$, so it suffices to consider the case $\varphi _{\!\rho ,\lambda }'({\overline{x}},\zeta )=\langle v,\zeta \rangle $. In this case, from Definition 3 it follows that

$$\begin{aligned} d^2\varphi _{\!\rho ,\lambda }({\overline{x}}\vert v)(\zeta )&=\liminf _{\begin{array}{c} {\tau \downarrow 0}\\ {\zeta '\rightarrow \zeta } \end{array}} \frac{\varphi _{\!\rho ,\lambda }({\overline{x}}+\tau \zeta ')\!-\!\varphi _{\!\rho ,\lambda }({\overline{x}}) \!-\!\tau \varphi _{\!\rho ,\lambda }'({\overline{x}},\zeta ')}{\frac{1}{2}\tau ^2}\nonumber \\&=\lambda \liminf _{\begin{array}{c} {\tau \downarrow 0}\\ {\zeta '\rightarrow \zeta } \end{array}} \frac{-g_{\rho }({\overline{x}}\!+\!\tau \zeta ')+g_{\rho }({\overline{x}}) +\tau \langle \nabla g_{\rho }({\overline{x}}),\zeta '\rangle }{\frac{1}{2}\tau ^2}, \end{aligned}$$

(A6)

where the second equality is because $\Vert {\overline{x}}+\tau \zeta '\Vert _1-\Vert {\overline{x}}\Vert _1-\tau (\Vert \cdot \Vert _1)'({\overline{x}},\zeta ')=0$ for any $\tau >0$ small enough. Let $h(t){:}{=}\rho ^{-1}\psi ^*(\rho \vert t\vert )$ for $t\in {\mathbb {R}}$. Clearly, $g_{\rho }(z)=\sum _{i=1}^ph(z_i)$ for $z\in {\mathbb {R}}^p$. When $i\notin \textrm{supp}({\overline{x}})$, from the proof of Lemma 7, for all $\tau >0$ small enough, we have $h({\overline{x}}_i\!+\!\tau \zeta _i')-h({\overline{x}}_i)-\tau h'({\overline{x}}_i)\zeta _i'=0$. When $i\in \textrm{supp}({\overline{x}})$, by noting that $\psi ^*(s)=s-\phi (1)$ for all $s\ge \phi _{+}'(1)$ and using the assumption $\vert {\overline{x}}\vert _\textrm{nz}\ge {\phi _{+}'(1)}/{\rho }$,

$$\begin{aligned} h({\overline{x}}_i\!+\!\tau \zeta _i')-h({\overline{x}}_i)-\tau h'({\overline{x}}_i)\zeta _i' =\vert {\overline{x}}_i\!+\!\tau \zeta _i'\vert -\vert {\overline{x}}_i\vert -\tau \textrm{sign}({\overline{x}}_i)\zeta _i'=0 \end{aligned}$$

for all sufficiently small $\tau >0$. This means that, for all $\tau >0$ small enough,

$$\begin{aligned} -g_{\rho }({\overline{x}}+\tau \zeta ')+g_{\rho }({\overline{x}}) +\tau \langle \nabla g_{\rho }({\overline{x}}),\zeta '\rangle =0. \end{aligned}$$

By combining this with (A6), we obtain from (A4)-(A5) that $d^2\Theta _{\!\rho ,\nu ,\mu }({\overline{x}}\vert 0)(\zeta )>0$.

(iv) By Lemma 6, $\widehat{\textrm{crit}}\,\Theta _{\nu ,\mu }=\textrm{crit}\,\Theta _{\nu ,\mu }$. We next argue that ${\overline{x}}\in \widehat{\textrm{crit}}\,\Theta _{\nu ,\mu }$. Since ${\overline{x}}\in \textrm{crit}\,\Theta _{\!\rho ,\nu ,\mu }$, from part (i) it follows that

$$\begin{aligned} 0\in A^{{\mathbb {T}}}\partial \vartheta (A{\overline{x}}\!-b)+\mu {\overline{x}} +\rho \nu \big [(1\!-\!(w_\rho ({\overline{x}}))_1)\partial \vert {\overline{x}}_1\vert \times \cdots \times (1\!-\!(w_\rho ({\overline{x}}))_p)\partial \vert {\overline{x}}_p\vert \big ] \end{aligned}$$

where $[w_{\rho }({\overline{x}})]_i=(\psi ^*)'(\rho \vert {\overline{x}}_i\vert )$ for $i=1,2,\ldots ,p$. By the definition of $\psi ^*$, it is easy to deduce that $\psi ^*(s)=s-\phi (1)$ for all $s\ge \phi _{+}'(1)$. Together with $\vert {\overline{x}}\vert _{\textrm{nz}}\ge \phi _{+}'(1)/\rho $, it holds that $[w_{\rho }({\overline{x}})]_i=(\psi ^*)'(\rho \vert {\overline{x}}_i\vert )=1$ for all $i\in \textrm{supp}({\overline{x}})$. From [30, Theorem 1], we know that ${\widehat{\partial }}\Vert {\overline{x}}\Vert _0=\{v\in {\mathbb {R}}^p\,\vert \,v_i=0\ \textrm{for}\ i\in \textrm{supp}({\overline{x}})\}$. This means that

$$\begin{aligned} \rho \nu \big [(1\!-\!(w_\rho ({\overline{x}}))_1)\partial \vert {\overline{x}}_1\vert \times \cdots \times (1\!-\!(w_\rho ({\overline{x}}))_p)\partial \vert {\overline{x}}_p\vert \big ] \subseteq \nu {\widehat{\partial }}\Vert {\overline{x}}\Vert _0. \end{aligned}$$

From the last two equations, $0\in A^{{\mathbb {T}}}\partial \vartheta (A{\overline{x}}\!-b)+\mu {\overline{x}} +{\widehat{\partial }}\Vert {\overline{x}}\Vert _0={\widehat{\partial }}\Theta _{\nu ,\mu }({\overline{x}})$, where the equality is by Lemma 6. This means that ${\overline{x}}\in \widehat{\textrm{crit}}\,\Theta _{\nu ,\mu }$. For the rest, it suffices to argue that every point in $\textrm{crit}\,\Theta _{\nu ,\mu }$ is a strongly local optimal solution of (1). Pick any ${\overline{x}}\in \textrm{crit}\,\Theta _{\nu ,\mu }$. We only need to argue that $d^2\Theta _{\nu ,\mu }({\overline{x}}\vert 0)(\zeta )>0$ for all $\zeta \ne 0$. Fix any $0\ne \zeta \in {\mathbb {R}}^p$. By combining Lemma 6 with [50, Proposition 13.19], it holds that

$$\begin{aligned} d^2\Theta _{\nu ,\mu }({\overline{x}}\vert 0)(\zeta ) \!\ge \!\sup _{u\in \partial \!f_{\mu }({\overline{x}}),v\in \partial h_{\nu }({\overline{x}})} \!\Big \{ d^2\!f_{\mu }({\overline{x}}\vert u)(\zeta )\!+\! d^2h_{\nu }({\overline{x}}\vert v)(\zeta ) \ \ \mathrm{s.t.}\ u+v=0\Big \} \end{aligned}$$

(A7)

where $h_{\nu }$ is the same as in Lemma 6. Fix any $v\in \partial h_{\nu }({\overline{x}})$. Let ${\overline{J}}\!{:}{=}\{1,\ldots ,p\}\backslash \textrm{supp}({\overline{x}})$. Then, $\langle v,\zeta \rangle =\langle v_{{\overline{J}}},\zeta _{{\overline{J}}}\rangle $. A simple calculation yields $ dh_{\nu }({\overline{x}})(\zeta )={\textstyle \sum _{i\in {\overline{J}}}}\,\delta _{\{0\}}(\zeta _i). $ This means that $dh_{\nu }({\overline{x}})(\zeta )\ge \langle v,\zeta \rangle $. When $dh_{\nu }({\overline{x}})(\zeta )>\langle v,\zeta \rangle $, by [50, Proposition 13.5] we have $d^2h_{\nu }({\overline{x}}\vert v)(\zeta )=+\infty $. This along with (A5) and (A7) means that $d^2\Theta _{\nu ,\mu }({\overline{x}}\vert 0)(\zeta )>0$, so it suffices to consider the case $dh_{\nu }({\overline{x}})(\zeta )=\langle v,\zeta \rangle $. For this case, from $dh_{\nu }({\overline{x}})(\zeta )={\textstyle \sum _{i\in {\overline{J}}}}\,\delta _{\{0\}}(\zeta _i)$, we have $\zeta _{{\overline{J}}}=0$. Consequently,

$$\begin{aligned} d^2h_{\nu }({\overline{x}}\vert v)(\zeta )&=\liminf _{\tau \downarrow 0,\zeta '\rightarrow \zeta } \frac{h_{\nu }({\overline{x}}+\!\tau \zeta ')\!-\!h_{\nu }({\overline{x}}) -\tau \langle v_{{\overline{J}}},\zeta _{{\overline{J}}}'\rangle }{\frac{1}{2}\tau ^2}\\&=\liminf _{\tau \downarrow 0,\zeta _{{\overline{J}}}'\rightarrow \zeta _{{\overline{J}}}} \frac{\sum _{i\in {\overline{J}}}[\nu \,\text {sign}(\tau \vert \zeta _i'\vert )\!-\!\tau v_i\zeta _i']}{\frac{1}{2}\tau ^2}\ge 0. \end{aligned}$$

This along with (A5) and (A7) implies that $d^2\Theta _{\nu ,\mu }({\overline{x}}\vert 0)(\zeta )>0$. $\square $
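Parts (iii) and (iv) rest on the observation that, once every nonzero entry of ${\overline{x}}$ satisfies $\vert {\overline{x}}_i\vert \ge \phi _{+}'(1)/\rho $, the weights $(\psi ^*)'(\rho \vert {\overline{x}}_i\vert )$ equal 1 on the support and the DC penalty $\rho \nu [\Vert x\Vert _1-g_{\rho }(x)]$ takes the value $\nu \phi (1)\Vert {\overline{x}}\Vert _0$. The following minimal sketch illustrates this under the assumption of a hypothetical quadratic $\phi (t)=\frac{\sigma }{2}t^2+\gamma t$ with $\sigma =1$, $\gamma =0.5$ and $\phi (1)=1$; the function names and test vectors are illustrative and not from the paper.

```python
import numpy as np

# Hypothetical phi (an assumption for illustration): phi(t) = 0.5*SIGMA*t**2 + GAMMA*t,
# normalized so that phi(1) = 1; then phi'_+(1) = SIGMA + GAMMA.
SIGMA, GAMMA = 1.0, 0.5
PHI_PRIME_AT_1 = SIGMA + GAMMA


def weights(x, rho):
    """w_rho(x)_i = (psi^*)'(rho*|x_i|), the maximizer of s*t - phi(t) over [0, 1]."""
    return np.clip((rho * np.abs(x) - GAMMA) / SIGMA, 0.0, 1.0)


def psi_conj(s):
    """Closed form of psi^*(s) for the quadratic phi above (s >= 0)."""
    t = np.clip((s - GAMMA) / SIGMA, 0.0, 1.0)
    return s * t - (0.5 * SIGMA * t**2 + GAMMA * t)


def dc_penalty(x, rho, nu):
    """rho*nu*(||x||_1 - g_rho(x)) with g_rho(x) = sum_i psi^*(rho*|x_i|)/rho."""
    s = rho * np.abs(x)
    return nu * np.sum(s - psi_conj(s))


def zero_norm_penalty(x, nu):
    """nu * ||x||_0."""
    return nu * np.count_nonzero(x)


x_bar = np.array([0.0, 2.0, -1.0, 0.0, 0.8])
nu = 0.1
rho_large = 2.0   # threshold phi'_+(1)/rho = 0.75 <= smallest nonzero |x_i| = 0.8
rho_small = 0.5   # threshold 3.0 exceeds every |x_i|: the surrogate under-counts
gap_large = dc_penalty(x_bar, rho_large, nu) - zero_norm_penalty(x_bar, nu)
gap_small = dc_penalty(x_bar, rho_small, nu) - zero_norm_penalty(x_bar, nu)
```

With `rho_large` the gap is zero up to rounding and the weights on the support are all 1; with `rho_small` the surrogate value lies strictly below $\nu \Vert {\overline{x}}\Vert _0$, which shows why the lower bound on the nonzero entries is needed.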

Appendix B: Proof of results in Sect. 4.2

In this section, let $x^*$ be the true vector in model (21), and for each $k\in {\mathbb {N}}$ write

$$\begin{aligned} y^k\!{:}{=}Ax^k\!-b,\ \Delta x^k\!{:}{=}x^k-x^*\ \textrm{and}\ \xi ^k\!{:}{=}B_{k-1}(x^{k-1}\!- x^{k})+\delta ^{k-1}-\mu x^*. \end{aligned}$$

(B8)

By Assumption 2 and [50, Theorem 10.49], for any ${\overline{t}}\in {\mathbb {R}}$ we have $\partial (\theta ^2)({\overline{t}})=2D^*\theta ({\overline{t}})(\theta ({\overline{t}}))$ where $D^*\theta ({\overline{t}})\!:{\mathbb {R}}\rightrightarrows {\mathbb {R}}$ is the coderivative of $\theta $ at ${\overline{t}}$. Together with [50, Proposition 9.24(b)], $D^*\theta ({\overline{t}})(\theta ({\overline{t}}))=\partial (\theta ({\overline{t}})\theta )({\overline{t}})$. Thus,

$$\begin{aligned} \partial (\theta ^2)({\overline{t}}) =\left\{ \begin{array}{cl} \{0\} &{} \textrm{if}\ \theta ({\overline{t}})=0;\\ 2\theta ({\overline{t}})\partial \theta ({\overline{t}})&{}\textrm{otherwise} \end{array}\right. \quad \mathrm{for\ any}\ {\overline{t}}\in {\mathbb {R}}. \end{aligned}$$

(B9)
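As a quick sanity check of (B9), suppose for illustration only that $\theta (t)=\vert t\vert $ (an assumption made here; whether this choice satisfies Assumption 2 is not verified). Then $\theta ^2(t)=t^2$ is differentiable, and both branches of (B9) reduce to its derivative:

$$\begin{aligned} \partial (\theta ^2)(0)&=\{0\}=\{2\cdot 0\},\\ \partial (\theta ^2)({\overline{t}})&=2\vert {\overline{t}}\vert \,\partial \vert \cdot \vert ({\overline{t}}) =2\vert {\overline{t}}\vert \,\{\textrm{sign}({\overline{t}})\}=\{2{\overline{t}}\}\quad \textrm{for}\ {\overline{t}}\ne 0. \end{aligned}$$

The first branch matters precisely because $\partial \theta (0)=[-1,1]$ is not a singleton, whereas the factor $\theta (0)=0$ collapses the product to $\{0\}$.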

By using (B9) and the above notation, we can establish the following lemma.

Lemma 8

Suppose that for a certain $k\ge 1$ there exists an index set $S^{k-1}\supseteq S^*$ satisfying $\min _{i\in (S^{k-1})^c}v_i^{k-1}\ge {1}/{2}$. Let ${\mathcal {I}}{:}{=}\{i\in \{1,\ldots ,n\}\ \vert \ \varpi _i\ne 0\}$. Then, when $\lambda \ge 16{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+8\Vert \xi ^k\Vert _\infty $, it holds that $\Vert \Delta x^{k}_{\!(S^{k-1})^c}\Vert _1\le 3\Vert \Delta x^{k}_{\!S^{k-1}}\Vert _1$.
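The conclusion of Lemma 8 is a cone condition on the error vector $\Delta x^k$: its $\ell _1$ mass off the index set is at most three times its mass on the set. A minimal sketch of how such a condition could be checked numerically is given below; the function name `satisfies_cone_condition` and the example vectors are hypothetical and serve only to make the inequality concrete.

```python
import numpy as np


def satisfies_cone_condition(delta, index_set, ratio=3.0):
    """Return True iff ||delta_{S^c}||_1 <= ratio * ||delta_S||_1 for S = index_set."""
    delta = np.asarray(delta, dtype=float)
    on_set = np.zeros(delta.shape[0], dtype=bool)
    on_set[list(index_set)] = True
    off_mass = np.abs(delta[~on_set]).sum()   # l1 norm outside S
    on_mass = np.abs(delta[on_set]).sum()     # l1 norm on S
    return bool(off_mass <= ratio * on_mass)


# Error concentrated on S = {0, 1}: the condition holds.
inside = satisfies_cone_condition([1.0, -0.5, 0.2, 0.1], {0, 1})
# Error concentrated off S = {0}: the condition fails.
outside = satisfies_cone_condition([0.1, 0.0, 1.0, 1.0], {0})
```

Conditions of this type are what allow restricted-eigenvalue style bounds to be applied to $\Delta x^k$ even though $A$ may have a nontrivial null space.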

Proof

From $x^*\in \textrm{dom}f$ and the definition of $x^k$ in Step 2, it is not difficult to obtain

$$\begin{aligned}&f(x^*)+\frac{\mu }{2}\Vert x^*\Vert ^2+\lambda \langle v^{k-1},\vert x^*\vert \rangle +\frac{1}{2}\Vert x^*\!-\!x^{k-1}\Vert _{B_{k-1}}^2\\&\ge f(x^k)+\frac{\mu }{2}\Vert x^{k}\Vert ^2+\lambda \langle v^{k-1},\vert x^k\vert \rangle +\frac{1}{2}\Vert x^k\!-\!x^{k-1}\Vert _{B_{k-1}}^2\\&\quad +\frac{1}{2}\langle x^*\!-\!x^k,(\mu I\!+\!B_{k-1})(x^*\!-\!x^k)\rangle +\langle \delta ^{k-1},x^*\!-\!x^{k}\rangle , \end{aligned}$$

where the strong convexity of the objective function of (19) is used. Rearranging the last inequality and using the definition of $\xi ^k$ in (B8), we obtain

$$\begin{aligned} f(x^{k})-f(x^*)+\mu \Vert \Delta x^k\Vert ^2 \le \lambda \langle v^{k-1},\vert x^*\vert -\vert x^{k}\vert \rangle +\langle \xi ^k, x^k\!-x^*\rangle . \end{aligned}$$

(B10)

For each $k\in {\mathbb {N}}$, let ${\mathcal {J}}_k\!{:}{=}\big \{i\notin {\mathcal {I}}\,\vert \, y_i^{k}\ne 0\big \}$. By the expression of $\vartheta $ and $\varpi =b-Ax^*$,

$$\begin{aligned}&\vartheta (A x^{k}\!-b)-\vartheta (A x^*\!-b) \nonumber \\&=\frac{1}{n}\bigg [\sum _{i\in {\mathcal {J}}_k}\frac{\theta ^2(y^{k}_i)-\theta ^2(\varpi _i)}{\theta (y^{k}_i)+\theta (\varpi _i)} +\sum _{i\in {\mathcal {I}}}\frac{\theta ^2(y^{k}_i)-\theta ^2(\varpi _i)}{\theta (y^{k}_i)+\theta (\varpi _i)}\bigg ]\nonumber \\&\ge \frac{1}{n}\bigg [\sum _{i\in {\mathcal {J}}_k}\frac{\theta ^2(y^{k}_i)-\theta ^2(\varpi _i)}{{\widetilde{\tau }}\Vert y^{k}\Vert _{\infty }} +\sum _{i\in {\mathcal {I}}}\frac{\theta ^2(y^{k}_i)-\theta ^2(\varpi _i)}{\theta (y^{k}_i)+\theta (\varpi _i)}\bigg ]. \end{aligned}$$

(B11)

where the inequality holds because $\theta (y_i)\le {\widetilde{\tau }}\Vert y\Vert _\infty $ for $i=1,\ldots ,n$, which is implied by $\theta (0)=0$ and (22). Fix any $\eta _i\in \partial (\theta ^2)(\varpi _i)$. Since $\theta ^2$ is strongly convex with modulus $\tau $, we have

$$\begin{aligned} \theta ^2(y^{k}_i)-\theta ^2(\varpi _i) \ge \eta _i(y_i^k-\varpi _i) +0.5\tau (y^{k}_i-\varpi _i)^2 \ \ \textrm{for}\ \ i=1,\ldots ,n. \end{aligned}$$

(B12)

Along with (B9), for each $i\in {\mathcal {J}}_k$, $\eta _i=0$ and $\theta ^2(y^{k}_i)-\theta ^2(\varpi _i)\ge \frac{\tau }{2}(y^{k}_i-\varpi _i)^2$, and consequently,

$$\begin{aligned} \sum _{i\in {\mathcal {J}}_k}\frac{\theta ^2(y^{k}_i)-\theta ^2(\varpi _i)}{{\widetilde{\tau }}\Vert y^{k}\Vert _{\infty }} \ge \frac{\tau }{2{\widetilde{\tau }}} \sum _{i\in {\mathcal {J}}_k}\frac{(y^{k}_i-\varpi _i)^2}{\Vert y^{k}\Vert _{\infty }}. \end{aligned}$$

For each $i\in {\mathcal {I}}$, write ${\widetilde{y}}_i^{k}\!{:}{=}\frac{\eta _i}{\theta (y_i^{k})+\theta (\varpi _i)}$. From (B9) and (22), it is not hard to obtain $\vert {\widetilde{y}}_i^{k}\vert \le 2{\widetilde{\tau }}$ for all $i\in {\mathcal {I}}$. Together with (B12), $\varpi =b-Ax^*$ and $\theta (y^{k}_i)\le {\widetilde{\tau }}\Vert y^k\Vert _\infty $,

$$\begin{aligned} \sum _{i\in {\mathcal {I}}}\frac{\theta ^2(y^{k}_i)-\theta ^2(\varpi _i)}{\theta (y^{k}_i)+\theta (\varpi _i)}&\ge \sum _{i\in {\mathcal {I}}}{\widetilde{y}}_i^k(y_i^k-\varpi _i)+\frac{\tau }{2} \sum _{i\in {\mathcal {I}}}\frac{(y^{k}_i-\varpi _i)^2}{\theta (y^{k}_i)+\theta (\varpi _i)}\\&\ge -2{\widetilde{\tau }}\Vert [A(x^k\!- x^*)]_{{\mathcal {I}}}\Vert _1+\frac{\tau }{2} \sum _{i\in {\mathcal {I}}}\frac{(y^{k}_i-\varpi _i)^2}{{\widetilde{\tau }}(\Vert y^{k}\Vert _{\infty }\!+\!\Vert \varpi \Vert _\infty )}. \end{aligned}$$

Substituting the last two inequalities into (B11) and using the definition of f yields

$$\begin{aligned} f(x^k)-f(x^*)&=\vartheta (A x^{k}\!-b)-\vartheta (A x^*\!-b)\\&\ge -\frac{2{\widetilde{\tau }}}{n}\Vert [A(x^k\!- x^*)]_{{\mathcal {I}}}\Vert _1 +\frac{\tau \Vert A( x^k\!- x^*)\Vert ^2}{2n{\widetilde{\tau }}(\Vert y^{k}\Vert _{\infty }\!+\!\Vert \varpi \Vert _\infty )}. \end{aligned}$$

Write $\Upsilon ^k{:}{=}\frac{\tau \Vert A(x^k\!- x^*)\Vert ^2}{2n{\widetilde{\tau }}(\Vert y^{k}\Vert _{\infty }+\Vert \varpi \Vert _\infty )}$. By combining this inequality and (B10), we get

$$\begin{aligned} \mu \Vert \Delta x^k\Vert ^2+\Upsilon ^k&\le \lambda \langle v^{k-1},\vert x^*\vert -\vert x^{k}\vert \rangle +2{\widetilde{\tau }}n^{-1}\big \Vert [A(x^k\!- x^*)]_{{\mathcal {I}}}\big \Vert _{1}+\langle \xi ^k, x^k\!- x^*\rangle \nonumber \\&\le \lambda \Big (\textstyle {\sum _{i\in S^*}}v_i^{k-1}\vert \Delta x_i^k\vert -\textstyle {\sum _{i\in (S^{k-1})^c}}v_i^{k-1}\vert \Delta x_i^k\vert \Big )\nonumber \\&+\big (2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+\Vert \xi ^k\Vert _{\infty }\big ) \big (\Vert \Delta x_{S^{k-1}}^k\Vert _{1}+\Vert \Delta x_{(S^{k-1})^{c}}^k\Vert _{1}\big ). \end{aligned}$$

(B13)

Since $S^{k-1}\supseteq S^*$ and $v_i^{k-1}\in [0.5,1]$ for $i\in (S^{k-1})^{c}$, from the last inequality we have

$$\begin{aligned} \mu \Vert \Delta x^k\Vert ^2+\Upsilon ^k&\le \textstyle {\sum _{i\in S^{k-1}}}\big (\lambda v_i^{k-1} +2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+\Vert \xi ^k\Vert _\infty \big ) \vert \Delta x_i^k\vert \\&\quad +\textstyle {\sum _{i\in (S^{k-1})^c}} \big (2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+\Vert \xi ^k\Vert _\infty -\lambda /2\big )\vert \Delta x_i^k\vert \\&\le \big (\lambda +2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+\Vert \xi ^k\Vert _\infty \big ) \big \Vert \Delta x_{\!S^{k-1}}^k\big \Vert _1\\&\quad +\big (2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+\Vert \xi ^k\Vert _\infty \!-0.5\lambda \big ) \big \Vert \Delta x_{\!(S^{k-1})^{c}}^k\big \Vert _1. \end{aligned}$$

From the nonnegativity of the left hand side and the given assumption on $\lambda $, we have

$$\begin{aligned} \big \Vert \Delta x_{(S^{k-1})^{c}}^k\big \Vert _1 \le \frac{\lambda +2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+\Vert \xi ^k\Vert _\infty }{0.5\lambda -2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1-\Vert \xi ^k\Vert _\infty } \big \Vert \Delta x_{S^{k-1}}^k\big \Vert _1 \le 3\big \Vert \Delta x_{S^{k-1}}^k\big \Vert _1. \end{aligned}$$

This implies that the desired result holds. The proof is completed. $\square $
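
The constant $3$ in the last display can be verified directly. Write $c_k{:}{=}2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+\Vert \xi ^k\Vert _\infty $, a shorthand introduced here only for this remark. The assumption on $\lambda $ in Lemma 8 reads $c_k\le \lambda /8$, and the ratio below is nondecreasing in $c_k$, so for $\lambda >0$,

$$\begin{aligned} \frac{\lambda +c_k}{0.5\lambda -c_k} \le \frac{\lambda +\lambda /8}{\lambda /2-\lambda /8} =\frac{9/8}{3/8}=3, \end{aligned}$$

with equality exactly when $\lambda =8c_k$.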

By invoking (B13) and Lemma 8, we can obtain the following conclusion.

Lemma 9

Suppose that $A^{{\mathbb {T}}}A/n$ satisfies the RE condition of parameter $\kappa >0$ on ${\mathcal {C}}(S^*)$, and that for some $k\ge 1$ there is an index set $S^{k-1}\supseteq S^*$ with $\vert S^{k-1}\vert \le 1.5s^*$ such that $\min _{i\in (S^{k-1})^c}v_i^{k-1}\ge \frac{1}{2}$. Let ${\mathcal {I}}{:}{=}\{i\ \vert \ \varpi _i\ne 0\}$. If $\lambda $ is chosen such that $16{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+8\Vert \xi ^k\Vert _\infty \le \lambda <\frac{2\mu {\widetilde{\tau }}\Vert \varpi \Vert _\infty +\tau \kappa -4{\widetilde{\tau }}\Vert A\Vert _{\infty } (2{\widetilde{\tau }}n^{-1}|\!\Vert A_{{\mathcal {I}}\cdot }|\!\Vert _1+\Vert \xi ^k\Vert _\infty )\vert S^{k-1}\vert }{4{\widetilde{\tau }}\Vert A\Vert _{\infty }\Vert v_{\!S^*}^{k-1}\Vert _{\infty }\vert S^{k-1}\vert }$, then

$$\begin{aligned} \big \Vert \Delta x^{k}\big \Vert \le \frac{2{\widetilde{\tau }}\Vert \varpi \Vert _{\infty }\big (\lambda \Vert v_{\!S^*}^{k-1}\Vert _{\infty } +2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+\Vert \xi ^k\Vert _{\infty }\big ) \sqrt{\vert S^{k-1}\vert }}{2\mu {\widetilde{\tau }}\Vert \varpi \Vert _{\infty }+\tau \kappa -4{\widetilde{\tau }}\Vert A\Vert _{\infty }\big (\lambda \Vert v_{\!S^*}^{k-1}\Vert _{\infty } +2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+\Vert \xi ^k\Vert _{\infty }\big )\vert S^{k-1}\vert }. \end{aligned}$$

Proof

Note that $\Vert y^k\Vert _\infty +\Vert \varpi \Vert _\infty =\Vert \varpi -\!A\Delta x^k\Vert _{\infty }+\Vert \varpi \Vert _\infty \le \Vert A\Delta x^k\Vert _{\infty }+2\Vert \varpi \Vert _\infty $. Then

$$\begin{aligned} \frac{\tau \Vert A(x^k-x^*)\Vert ^2}{2n{\widetilde{\tau }}(\Vert y^{k}\Vert _{\infty }+\Vert \varpi \Vert _\infty )} \ge \frac{\tau \Vert A\Delta x^k\Vert ^2}{2n{\widetilde{\tau }}(\Vert A\Delta x^k\Vert _{\infty }+2\Vert \varpi \Vert _\infty )} {:}{=}{\widetilde{\Upsilon }}^k. \end{aligned}$$

Together with inequality (B13) and $v_i^{k-1}\in [0.5,1]$ for $i\in (S^{k-1})^{c}$, it follows that

$$\begin{aligned} \mu \Vert \Delta x^k\Vert ^2+{\widetilde{\Upsilon }}^k&\le \lambda {\textstyle \sum _{i\in S^*}}v_i^{k-1}\vert \Delta x_i^k\vert -({\lambda }/{2}){\textstyle \sum _{i\in (S^{k-1})^c}}\vert \Delta x_i^k\vert \\&\quad +\big (2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+\Vert \xi ^k\Vert _{\infty }\big ) \big (\Vert \Delta x_{\!S^{k-1}}^k\Vert _{1}+\Vert \Delta x_{\!(S^{k-1})^{c}}^k\Vert _{1}\big ) \\&\le \big (\lambda \Vert v_{S^*}^{k-1}\Vert _{\infty }+2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1 +\Vert \xi ^k\Vert _{\infty }\big )\Vert \Delta x_{\!S^{k-1}}^k\Vert _{1} \end{aligned}$$

where the second inequality is due to $\lambda \ge 16{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+8\Vert \xi ^k\Vert _\infty $. By Lemma 8, $\Vert \Delta x^{k}_{(S^{k-1})^c}\Vert _1\le 3\Vert \Delta x^{k}_{\!S^{k-1}}\Vert _1$, which means that $\Delta x^{k}\in {\mathcal {C}}(S^*)$. From the assumption on $\frac{1}{n}A^{{\mathbb {T}}}A$, we have $\Vert A\Delta x^k\Vert ^2\ge 2n\kappa \Vert \Delta x^k\Vert ^2$. Then, it holds that

$$\begin{aligned}{} & {} \mu \Vert \Delta x^k\Vert ^2+\frac{\tau \kappa \Vert \Delta x^k\Vert ^2}{{\widetilde{\tau }}(\Vert A\Delta x^k\Vert _{\infty }\!+2\Vert \varpi \Vert _\infty )}\\{} & {} \le \Big (\lambda \Vert v_{S^*}^{k-1}\Vert _{\infty }+\frac{2{\widetilde{\tau }}}{n}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1 +\Vert \xi ^k\Vert _{\infty }\Big )\big \Vert \Delta x_{S^{k-1}}^k\big \Vert _1. \end{aligned}$$

Multiplying both sides of this inequality by ${\widetilde{\tau }}(\Vert A\Delta x^k\Vert _{\infty }+2\Vert \varpi \Vert _\infty )$ yields

$$\begin{aligned}&\big [\mu {\widetilde{\tau }}(\Vert A\Delta x^k\Vert _{\infty }+2\Vert \varpi \Vert _\infty )+\tau \kappa \big ]\Vert \Delta x^k\Vert ^2\\&\le {\widetilde{\tau }}\Vert A\Delta x^k\Vert _{\infty }\big (\lambda \Vert v_{S^*}^{k-1}\Vert _{\infty } +2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+\Vert \xi ^k\Vert _{\infty }\big ) \big \Vert \Delta x_{S^{k-1}}^k\big \Vert _1\\&\quad + 2{\widetilde{\tau }}\Vert \varpi \Vert _\infty \big (\lambda \Vert v_{S^*}^{k-1}\Vert _{\infty } +2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+\Vert \xi ^k\Vert _{\infty }\big ) \big \Vert \Delta x_{S^{k-1}}^k\big \Vert _1. \end{aligned}$$

Note that $\Vert A\Delta x^k\Vert _{\infty }\le \Vert A\Vert _{\infty }\Vert \Delta x^k\Vert _1$. Together with $\Vert \Delta x^{k}_{(S^{k-1})^c}\Vert _1\le 3\Vert \Delta x^{k}_{S^{k-1}}\Vert _1$, this gives $\Vert A\Delta x^k\Vert _{\infty }\le 4\Vert A\Vert _{\infty }\Vert \Delta x_{S^{k-1}}^k\Vert _1$. Since $\Vert \Delta x_{S^{k-1}}^k\Vert _1\le \sqrt{\vert S^{k-1}\vert }\Vert \Delta x_{S^{k-1}}^k\Vert $, the right hand side of the last inequality satisfies

$$\begin{aligned} \textrm{RHS}&\le 4{\widetilde{\tau }}\Vert A\Vert _{\infty }\big (\lambda \Vert v_{S^*}^{k-1}\Vert _{\infty } +2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+\Vert \xi ^k\Vert _{\infty }\big ) \vert S^{k-1}\vert \big \Vert \Delta x_{S^{k-1}}^k\big \Vert ^2\\&\quad +2{\widetilde{\tau }}\Vert \varpi \Vert _{\infty }\big (\lambda \Vert v_{S^*}^{k-1}\Vert _{\infty } +2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+\Vert \xi ^k\Vert _{\infty }\big ) \sqrt{\vert S^{k-1}\vert }\big \Vert \Delta x_{S^{k-1}}^k\big \Vert . \end{aligned}$$
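
The elementary norm inequalities used in this step can be sanity-checked numerically. The following sketch is illustrative only and is not part of the original argument: it assumes that $\Vert A\Vert _{\infty }$ denotes the largest absolute entry of $A$ (the definition is not restated in this appendix), and the matrix `A`, the vector `z` and the index set `S` are made-up data, with `z` standing in for $\Delta x^k$ and `S` for $S^{k-1}$.

```python
import math

def max_abs_entry(A):
    """Largest absolute entry of A (assumed meaning of ||A||_inf here)."""
    return max(abs(a) for row in A for a in row)

def matvec(A, z):
    """Plain matrix-vector product for a list-of-rows matrix."""
    return [sum(a * zj for a, zj in zip(row, z)) for row in A]

def norm1(z):
    return sum(abs(t) for t in z)

def norm2(z):
    return math.sqrt(sum(t * t for t in z))

def norminf(z):
    return max(abs(t) for t in z)

def in_cone(z, S, ratio=3.0):
    """Check the cone condition ||z_{S^c}||_1 <= ratio * ||z_S||_1."""
    S = set(S)
    on_support = sum(abs(z[i]) for i in S)
    off_support = sum(abs(t) for i, t in enumerate(z) if i not in S)
    return off_support <= ratio * on_support

# Hypothetical data, chosen only for illustration.
A = [[1.0, -2.0, 0.5],
     [0.0, 3.0, -1.0]]
z = [2.0, -1.0, 0.5]
S = [0, 1]
z_S = [z[i] for i in S]

Az_inf = norminf(matvec(A, z))

# ||A z||_inf <= ||A||_inf * ||z||_1
assert Az_inf <= max_abs_entry(A) * norm1(z)
# On the cone, ||z||_1 <= 4 ||z_S||_1, hence ||A z||_inf <= 4 ||A||_inf ||z_S||_1
assert in_cone(z, S)
assert Az_inf <= 4.0 * max_abs_entry(A) * norm1(z_S)
# ||z_S||_1 <= sqrt(|S|) * ||z_S||_2
assert norm1(z_S) <= math.sqrt(len(S)) * norm2(z_S)
```

Such a check does not prove anything; it only makes explicit which norms enter the factors $4\Vert A\Vert _{\infty }\vert S^{k-1}\vert $ and $\sqrt{\vert S^{k-1}\vert }$ in the bound on the right hand side.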

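From the last two inequalities, together with $\Vert \Delta x_{S^{k-1}}^k\Vert \le \Vert \Delta x^k\Vert $ and the nonnegativity of $\mu {\widetilde{\tau }}\Vert A\Delta x^k\Vert _{\infty }\Vert \Delta x^k\Vert ^2$, a suitable rearrangement yields that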

$$\begin{aligned}&\Big [2\mu {\widetilde{\tau }}\Vert \varpi \Vert _{\infty }+\tau \kappa -4{\widetilde{\tau }}\Vert A\Vert _{\infty }\big (\lambda \Vert v_{S^*}^{k-1}\Vert _{\infty } +2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+\Vert \xi ^k\Vert _{\infty }\big )\vert S^{k-1}\vert \Big ] \Vert \Delta x^k\Vert ^2\\&\le 2{\widetilde{\tau }}\Vert \varpi \Vert _{\infty }\big (\lambda \Vert v_{S^*}^{k-1}\Vert _{\infty } +2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+\Vert \xi ^k\Vert _{\infty }\big ) \sqrt{\vert S^{k-1}\vert }\big \Vert \Delta x_{S^{k-1}}^k\big \Vert , \end{aligned}$$

which along with $\lambda <\frac{2\mu {\widetilde{\tau }}\Vert \varpi \Vert _\infty +\tau \kappa -4{\widetilde{\tau }}\Vert A\Vert _{\infty } (2{\widetilde{\tau }}n^{-1}|\!\Vert A_{{\mathcal {I}}\cdot }|\!\Vert _1+\Vert \xi ^k\Vert _\infty )\vert S^{k-1}\vert }{4{\widetilde{\tau }}\Vert A\Vert _{\infty }\Vert v_{S^*}^{k-1}\Vert _{\infty }\vert S^{k-1}\vert }$ implies the desired result. The proof is then completed. $\square $
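
To see how the quantities in Lemma 9 interact, the bound can be evaluated for given numerical values. The sketch below is a hedged illustration rather than part of the paper: the function name `lemma9_bound` and all argument names are hypothetical, the inputs are assumed to be supplied by the user (nothing is estimated from data), and the sample values are made up.

```python
import math

def lemma9_bound(lam, mu, tau, tau_tilde, kappa, varpi_inf,
                 A_inf, A_I_norm1, n, xi_inf, v_S_inf, card_S):
    """Evaluate the right-hand side of the error bound in Lemma 9.

    Hypothetical argument names:
      A_I_norm1 stands for |||A_{I.}|||_1, A_inf for ||A||_inf,
      varpi_inf for ||varpi||_inf, xi_inf for ||xi^k||_inf,
      v_S_inf for ||v^{k-1}_{S*}||_inf, card_S for |S^{k-1}|.
    """
    # c = 2 * tau_tilde * n^{-1} * |||A_I.|||_1 + ||xi^k||_inf
    c = 2.0 * tau_tilde * A_I_norm1 / n + xi_inf
    # Lower bound on lambda required by the lemma.
    lam_low = 16.0 * tau_tilde * A_I_norm1 / n + 8.0 * xi_inf
    g = lam * v_S_inf + c
    denom = (2.0 * mu * tau_tilde * varpi_inf + tau * kappa
             - 4.0 * tau_tilde * A_inf * g * card_S)
    # denom > 0 is equivalent to the upper bound on lambda when v_S_inf > 0.
    if lam < lam_low or denom <= 0.0:
        raise ValueError("lambda is outside the range required by Lemma 9")
    return 2.0 * tau_tilde * varpi_inf * g * math.sqrt(card_S) / denom

# Made-up parameter values, for illustration only.
bound = lemma9_bound(lam=0.25, mu=1.0, tau=2.0, tau_tilde=1.0, kappa=10.0,
                     varpi_inf=0.5, A_inf=1.0, A_I_norm1=1.0, n=100,
                     xi_inf=0.01, v_S_inf=1.0, card_S=4)
print(bound)
```

The function raises an error when $\lambda $ violates either side of the range in Lemma 9, which mirrors the fact that the bracketed coefficient of $\Vert \Delta x^k\Vert ^2$ must be positive before dividing by it.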

B.1 Proof of Proposition 6

Let $\Delta x^{0}{:}{=}x^0-x^*$. From $x^*\in \textrm{dom}f$ and the strong convexity of the objective function of (20),

$$\begin{aligned}&f(x^*)+{\widetilde{\lambda }}\Vert x^*\Vert _1+\frac{{\widetilde{\gamma }}_{1,0}}{2}\Vert x^*\Vert ^2 +\frac{{\widetilde{\gamma }}_{2,0}}{2}\Vert Ax^*\Vert ^2\\&\ge f(x^0)+{\widetilde{\lambda }}\Vert x^0\Vert _1+\frac{{\widetilde{\gamma }}_{1,0}}{2}\Vert x^0\Vert ^2 +\frac{{\widetilde{\gamma }}_{2,0}}{2}\Vert Ax^0\Vert ^2 \\&\quad +\langle {\widetilde{\delta }}^0, x^*\!-\! x^0\rangle +\frac{1}{2}\langle (x^*\!-\!x^0), ({\widetilde{\gamma }}_{1,0}I\!+\!{\widetilde{\gamma }}_{2,0}A^{{\mathbb {T}}}A)(x^*\!-\!x^0)\rangle . \end{aligned}$$

From $\vartheta (z)=\frac{1}{n}\sum _{i=1}^n\theta (z_i)$ and Assumption 2, $f(x^*)-f(x^0)\le \frac{{\widetilde{\tau }}}{n}\Vert A(x^*\!-\!x^0)\Vert _1$. Notice that $ \Vert x^0\Vert ^2-\Vert x^*\Vert ^2 =\Vert x^0\!-\!x^*\Vert ^2+2\langle x^0-x^*,x^*\rangle $. Together with the last inequality and $\Vert {\widetilde{\delta }}^0\Vert _\infty \le {\widetilde{\epsilon }}_0$, it follows that

$$\begin{aligned}&\Vert \Delta x^0\Vert _{{\widetilde{\gamma }}_{1,0}I+{\widetilde{\gamma }}_{2,0}A^{{\mathbb {T}}}A}^2 \le {\widetilde{\lambda }}(\Vert x^*\Vert _1\!-\Vert x^{0}\Vert _1)+n^{-1}{{\widetilde{\tau }}}\Vert A(x^*\!-\!x^0)\Vert _1\\&\qquad \qquad \qquad \qquad \quad +\langle x^0\!-x^*, {\widetilde{\delta }}^0\!-{\widetilde{\gamma }}_{2,0}A^{{\mathbb {T}}}Ax^*\!-{\widetilde{\gamma }}_{1,0}x^*\rangle \\&\le \big ({\widetilde{\lambda }} +{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A\!|\!\Vert _1+{\widetilde{\gamma }}_{1,0} \Vert x^*\Vert _{\infty }+{\widetilde{\gamma }}_{2,0}\Vert A^{{\mathbb {T}}}Ax^*\Vert _{\infty }+{\widetilde{\epsilon }}_0\big ) \Vert \Delta x_{S^{*}}^0\Vert _1 \\&\quad +\big ({\widetilde{\tau }}n^{-1}\!|\!\Vert \!A\!|\!\Vert _1+{\widetilde{\gamma }}_{1,0} \Vert x^*\Vert _{\infty }+{\widetilde{\gamma }}_{2,0}\Vert A^{{\mathbb {T}}}Ax^*\Vert _{\infty }+{\widetilde{\epsilon }}_0-{\widetilde{\lambda }}\big ) \Vert \Delta x_{(S^{*})^{c}}^0\Vert _1. \end{aligned}$$

By the assumption on ${\widetilde{\lambda }}$ and the nonnegativity of $\Vert \Delta x^0\Vert _{{\widetilde{\gamma }}_{1,0}I+{\widetilde{\gamma }}_{2,0}A^{{\mathbb {T}}}A}^2$, we get $\Vert \Delta x_{(S^{*})^{c}}^0\Vert _1\le 3\Vert \Delta x_{S^{*}}^0\Vert _1$. Substituting this into the last inequality yields

$$\begin{aligned}&\Vert \Delta x^0\Vert _{{\widetilde{\gamma }}_{1,0}I+{\widetilde{\gamma }}_{2,0}A^{{\mathbb {T}}}A}^2\\&\le \big ({\widetilde{\lambda }} +{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A\!|\!\Vert _1+{\widetilde{\gamma }}_{1,0} \Vert x^*\Vert _{\infty }+{\widetilde{\gamma }}_{2,0}\Vert A^{{\mathbb {T}}}Ax^*\Vert _{\infty }+{\widetilde{\epsilon }}_0\big ) \big \Vert \Delta x_{S^{*}}^0\big \Vert _1\\&\le \frac{3{\widetilde{\lambda }}\sqrt{s^*}}{2}\big \Vert \Delta x^0\big \Vert \end{aligned}$$

which implies that the desired conclusion holds. The proof is completed. $\square $

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Zhang, D., Pan, S., Bi, S. et al. Zero-norm regularized problems: equivalent surrogates, proximal MM method and statistical error bound. Comput Optim Appl 86, 627–667 (2023). https://doi.org/10.1007/s10589-023-00496-x


Received: 23 April 2022
Accepted: 20 May 2023
Published: 06 June 2023
Issue Date: November 2023
DOI: https://doi.org/10.1007/s10589-023-00496-x
