A penalty barrier framework for nonconvex constrained optimization
Abstract
Focusing on minimization problems with structured objective function and smooth constraints, we present a flexible technique that combines the beneficial regularization effects of (exact) penalty and interior-point methods. Working in the fully nonconvex setting, a pure barrier approach requires careful steps when approaching the infeasible set, thus hindering convergence. We show how a tight integration with a penalty scheme overcomes such conservatism, does not require a strictly feasible starting point, and thus accommodates equality constraints. The crucial advancement that allows us to invoke generic (possibly accelerated) subsolvers is a marginalization step: amounting to a conjugacy operation, this step effectively merges (exact) penalty and barrier into a smooth, full domain functional object. When the penalty exactness takes effect, the generated subproblems do not suffer the ill-conditioning typical of penalty methods, nor do they exhibit the nonsmoothness of exact penalty terms. We provide a theoretical characterization of the algorithm and its asymptotic properties, deriving convergence results for fully nonconvex problems. Illustrative examples and numerical simulations demonstrate the wide range of problems our theory and algorithm are able to cover.
Keywords. Nonsmooth nonconvex optimization, exact penalty methods, interior point methods, proximal algorithms.
1 Introduction
We are interested in developing numerical methods for constrained optimization problems of the form
where is the decision variable and functions and are problem data. (Throughout, we stick to the convention of bold-facing vector variables and vector-valued functions, so that indicates the zero vector of suitable size and similarly is the vector with all entries equal to one.) The proposed algorithm for (1) can also be applied to tackle problems with equality constraints . For simplicity of presentation, we focus on the more general inequality constrained case and illustrate in Section 4.3 an efficient way of handling equality constraints in our framework, rather than describing them as two-sided inequalities. Henceforth we consider (1) under the following standing assumptions.
Notice that no differentiability requirements are imposed on the cost , nor convexity on any term in the formulation. The primary objective of this paper is to devise an abstract algorithmic framework in the generality of this setting. The methodology requires an oracle for solving, up to approximate local optimality, minimization instances of the sum of with a differentiable term. In our numerical experiments we will invoke off-the-shelf routines based on proximal gradient iterations, thereby restricting our attention to problem instances in which is structured as for a differentiable function and a function that enjoys an easily computable proximal map. Most nonsmooth functions widely used in practice comply with all these requirements. For instance, can include indicators of any nonempty and closed set, and thus enforce arbitrary closed constraints that are easy to project onto. We also emphasize that the requirement of continuity relative to the domain is virtually negligible, as in most cases it can be circumvented through suitable reformulations of the problem that leverage the flexibility of the constraints. As a particularly enticing such instance, we mention the reformulation of the so-called -norm penalty (number of nonzero entries) for as the linear program
and remark that, more generally, matrix rank can also be cast in a similar fashion [2, Lem. 3.1].
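Such oracle subproblems are typically handled by proximal-gradient iterations. As a minimal, self-contained sketch (names and the toy instance are ours, for illustration only), assuming the structured cost splits into a smooth quadratic plus an l1 term with an easily computable proximal map:

```python
import numpy as np

def prox_l1(v, t):
    # proximal map of t * ||.||_1: elementwise soft-thresholding
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_grad_step(x, grad_f, prox_g, step):
    # one forward-backward step: x+ = prox_{step*g}(x - step*grad_f(x))
    return prox_g(x - step * grad_f(x), step)

# toy instance: minimize 0.5*||x - y||^2 + ||x||_1
y = np.array([3.0, -0.5, 1.2])
grad_f = lambda x: x - y          # gradient of the smooth part
x = np.zeros_like(y)
for _ in range(50):
    x = prox_grad_step(x, grad_f, prox_l1, step=1.0)
# converges to the soft-thresholding of y, i.e. [2.0, 0.0, 0.2]
```

Any closed constraint with an easy projection fits the same template, with the projection playing the role of the proximal map.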
Motivations and related work
The class of problems (1) with structured cost has been recently studied in [4] and [12], respectively, for the fully convex and nonconvex setting, developing methods that bear strong convergence guarantees under some restrictive assumptions. Above all, building on a pure barrier approach, these methods demand a feasible set with nonempty interior, thus excluding problems with equality constraints. Although restricted to simple bounds, a similar interior-point technique is investigated in [18] and manifests analogous pros and cons. In contrast to these works, we intend to address equality constraints as well. An augmented Lagrangian scheme for constrained structured problems was developed in [10], which also allows the specification of constraints in a function-in-set format.
Constrained structured programs (1) are also closely related to the template of structured composite optimization
with . By introducing additional variables , composite problems can be rewritten in (equality) constrained form recovering the class of problems (1), with a one-to-one relationship between (local and global) solutions and stationary points [10, Lemma 3.1]. The recent literature on structured composite optimization includes [21], only for convex , and [16, 9] for fully nonconvex problems, and concentrates almost exclusively on the augmented Lagrangian framework. Relying essentially on a penalty approach, in contrast to a barrier, the algorithmic characterization in [10] involved weaker assumptions and yet retrieved standard convergence results in constrained nonconvex optimization. However, the dependency on dual estimates makes methods of this family sensitive to the initialization of Lagrange multipliers. Moreover, they require some safeguards to ensure convergence from arbitrary starting points [3, 10]. In contrast, thanks to their ‘primal’ nature and inherent regularizing effect, penalty-barrier techniques can conveniently cope with degenerate problems.
The idea of adopting and merging penalty and barrier approaches, in a variety of possible flavors and combinations, is certainly not new, tracing back at least to [14]. Among several recent concretizations of this avenue, we refer to Curtis’ work [7] for a comprehensive discussion and further references. Our motivation for developing this technique for constrained structured problems comes from previous experience while designing the interior point scheme IPprox [12]. The key observation therein is that, with a pure barrier approach, the arising subproblems have a smooth term without full domain. This nonstandard situation, together with a nonconvex and possibly extended-real-valued cost and nonlinear constraints , makes it difficult to adopt accelerated subsolvers. (For comparison, most interior point algorithms for classical nonlinear programming transform the original problem into one with equalities and simple bounds only, treating the latter with a barrier and dampening search directions with a so-called fraction-to-the-boundary rule to maintain strict feasibility, relative only to simple bounds [26].)
In the broad setting of (1) under Section 1, a blind application of penalty-barrier strategies in the spirit of [7] would bear no advantages, since IPprox’s issue of a restricted domain would persist, hindering again the practical performance. In this paper we propose and investigate in detail a simple technique to overcome this limitation. The crucial step consists in the marginalization of auxiliary variables: after applying some penalty and barrier modifications, the auxiliary variables are optimized pointwise, for any given decision variable . (This approach can be interpreted as an extreme version of the so-called magical steps [5, 3], or slack reset in [7], and was inspired by the proximal approaches in [13, 9].) Before proceeding with the technical content, we emphasize that the marginalization step not only reduces the subproblems’ size (recovering that of only the original decision variable ), but also, and especially, results in a smooth penalty term for the subproblems that always has full domain. The emergence of this penalty-barrier envelope enables the adoption of generic, possibly accelerated subsolvers, as well as tailored routines that exploit the problem’s original structure. This claim will be substantiated in Fig. 1 where we show that convexity and Lipschitz differentiability are preserved in the transformed problems.
2 Preliminaries
In this section we comment on useful notation and preliminary results before discussing optimality notions to characterize solutions of (1).
2.1 Notation and known facts
With and we denote the real and extended-real line, respectively, and with and the set of positive and negative real numbers, respectively. The positive and negative parts of a number are respectively denoted as and , so that . In case of a vector , then and are meant elementwise.
The notation indicates a set-valued mapping that maps any to a (possibly empty) subset of ℝ^m. Its (effective) domain and graph are the sets and . Algebraic operations with or among set-valued mappings are meant in a componentwise sense; for instance, the sum of is defined as for all .
With and we indicate the distance from and the projection onto a nonempty set , respectively, namely
With we denote the indicator function of , namely such that if and otherwise. For an extended-real-valued function , the (effective) domain, graph, and epigraph are given by , , and . We say that is proper if and lower semicontinuous (lsc) if for all or, equivalently, if is a closed subset of ℝ^{n+1}. Following [22, Def. 8.3], we denote by the regular subdifferential of , where
(1)
The (limiting, or Mordukhovich) subdifferential of is , where if and only if and there exists a sequence in such that . In particular, holds at any ; moreover, is a necessary condition for local minimality of at [22, Thm. 10.1]. The subdifferential of at satisfies for any continuously differentiable around [22, Ex. 8.8]. If is convex, then coincide with the convex subdifferential
For a convex set and a point one has that , where
denotes the normal cone of at .
We use the symbol to indicate the Jacobian of a differentiable mapping , namely for all . For a real-valued function , we instead use the gradient notation to indicate the column vector of its partial derivatives. Finally, we recall that the convex conjugate of a proper lsc convex function is the proper lsc convex function defined as , and that one then has if and only if .
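As a worked example (our own computation, for illustration) that will be relevant once conjugates of barrier functions appear later: for the logarithmic barrier $b(t) = -\log t$ on $(0,\infty)$, extended as $+\infty$ elsewhere,

```latex
b^*(y) \;=\; \sup_{t>0}\,\{\,ty + \log t\,\}
\;=\;
\begin{cases}
-1 - \log(-y), & y < 0,\\
+\infty, & y \ge 0,
\end{cases}
```

and the Fenchel equality $b(t) + b^*(y) = ty$ characterizing $y \in \partial b(t)$ holds precisely for $y = b'(t) = -1/t$: indeed $-\log t + (-1 - \log(1/t)) = -1 = t \cdot (-1/t)$.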
2.2 Stationarity concepts
This subsection summarizes standard local optimality measures which were adopted in the proximal interior point framework of [12], and which will be further developed in the following Section 3 into conditions tailored to the setting of this paper. The interested reader is referred to [12, §2] for a verbose introduction and to [3, §3] for a detailed treatise. We start with the usual notion of (approximate) stationarity for general minimization problems of an extended-real-valued function.
[stationarity] Relative to the problem for a function , a point is called

1. stationary if it satisfies ;
2. -stationary (with ) if it satisfies .
A standard optimality notion that reflects the constrained structure of (1) is given by the Karush-Kuhn-Tucker (KKT) conditions.
[2.2 optimality]Relative to problem (1), we say that is KKT-optimal if there exists such that
In such case, we say that is a KKT-optimal pair for (1).
Even for convex problems, unless suitable constraint and epigraphical qualifications are met, local solutions may fail to be 2.2-optimal. Necessary conditions in the generality of problem (1) are provided by the following asymptotic counterpart.
[2.2 optimality]Relative to problem (1), we say that is asymptotically KKT-optimal if there exist sequences and such that
For the sake of designing suitable algorithmic stopping criteria, we also define an approximate variant which provides a further weaker notion of optimality.
[-KKT optimality]Relative to problem (1), for we say that is an (approximate) -KKT point if there exists such that
As discussed in the commentary after [3, Thm. 3.1], 2.2-optimality of is tantamount to the existence of a sequence of -KKT points for some . More generally, for any the notions are related as
while when one has that -KKT reduces to 2.2. We conclude by listing the observations in [12, Lem. 6 and Rem. 7] that will be useful in the sequel.
Relative to the conditions 2.2 in Section 2.2:

1. Up to possibly perturbing the sequence of multipliers, the complementarity slackness can be equivalently expressed as .
2. If the sequence of multipliers contains a bounded subsequence, then is a 2.2 point, not merely asymptotically.
3 Subproblems generation
In this section we operate a two-step modification of problem (1), whose conceptual roadmap is as follows. We begin with a soft-constrained reformulation in which violation of the inequality is penalized with an -norm in the cost function; the use of a slack variable simplifies this formulation by promoting separability. Next, a barrier term is added to enforce strict satisfaction of the constraints in the -penalized reformulation. The minimization with respect to the slack variable in the resulting problem can be carried out explicitly, and gives rise to a new problem in which the constraint is softened with a smooth penalty. Increasing the -penalty and decreasing the barrier coefficients results in a homotopic transition between smooth reformulations and the original nonsmooth problem.
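In schematic form, writing $q$ for the cost and $c$ for the constraint mapping (generic placeholders of ours, as this is only a sketch of the construction), the roadmap reads:

```latex
% l1 relaxation and slack reformulation
\min_{x}\ q(x)\ \text{ s.t. }\ c(x)\le 0
\ \rightsquigarrow\
\min_{x}\ q(x)+\beta\bigl\|[c(x)]_+\bigr\|_1
\;=\;
\min_{x,\,s}\ q(x)+\beta\,\mathbf 1^{\top}[s]_+\ \text{ s.t. }\ c(x)\le s.
% barrier on the slack residual, then marginalization of s
\min_{x,\,s}\ q(x)+\beta\,\mathbf 1^{\top}[s]_+ +\mu\sum_i b\bigl(s_i-c_i(x)\bigr)
\;=\;
\min_{x}\ q(x)+\sum_i \varphi_{\mu,\beta}\bigl(c_i(x)\bigr),
\qquad
\varphi_{\mu,\beta}(v):=\min_{s\in\mathbb R}\bigl\{\beta\,[s]_+ +\mu\,b(s-v)\bigr\}.
```

Increasing $\beta$ and decreasing $\mu$ then drives the marginal penalty $\varphi_{\mu,\beta}$ toward the sharp penalty $\beta\,[v]_+$, which is the homotopic transition described above.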
3.1 -penalization
Given , we consider the following relaxation of (1):
By introducing a slack variable , (3.1) can equivalently be cast as
as one can easily verify that holds for any and . In other words, (3.1) amounts to (3.1) after a marginal minimization with respect to the slack variable . Accordingly, we may consider the following relaxed optimality notion for problem (1), which, as explained below, is tantamount to 2.2-optimality for problem (3.1).
[3.1 optimality]Given , we say that a point is 3.1-optimal for (1) if there exists such that
In such case, we call a 3.1-optimal pair for (1).
Since the cost function in (3.1) is separable in and , its subdifferential at any point is the Cartesian product of the partial subdifferentials, see [22, Prop. 10.5]. By further observing that
for all (and is empty otherwise), it is easy to see that is 3.1-optimal iff (with ) is 2.2-optimal for (3.1), and in which case the multipliers coincide. More importantly, the following result clarifies how 2.2- and 3.1-optimality for problem (1) are interrelated. The result is standard, but its simple proof is nevertheless provided out of self containedness.
The following hold:
- 1.
- 2.
Proof.
For clarity of exposition, we have schematically reported the 2.2 and 3.1 optimality conditions side by side in (1). Notice in particular that the dual feasibility and primal optimality coincide in both notions.
(1)
??) Primal feasibility holds by assumption, and since , the complementarity slackness in 3.1 reduces to , , yielding the corresponding complementarity slackness condition in Section 2.2.

??) The upper bound holds by assumption, and since , for every one has that and . ∎
3.2 IP-type barrier reformulation
To carry on with the second modification of the problem, in what follows we fix a barrier satisfying the following requirements.
For reasons that will be elaborated on later, convenient choices of barriers are and (both extended as on +), see Table 1 in Section 4.1. Once such is fixed, in the spirit of interior point methods, given a parameter we enforce strict satisfaction of the constraint in (3.1) by considering the following barrier version
Differently from the IP frameworks of [4, 12], we here enforce a barrier in the relaxed version (3.1), and not on the original problem (1). As such, it is only pairs that need to lie in the interior of the constraints, but is otherwise ‘unconstrained’: for any , any (elementwise) yields a pair that satisfies the strict constraint . In fact, we may again explicitly minimize with respect to the slack variable by observing that
(1)
holds for every , where and similarly conjugation and derivative are meant elementwise. Plugging the optimal into (3.2) results in
where, for any , is a separable function given by
(1a)

with

(1d)

being (globally) Lipschitz differentiable and -Lipschitz continuous with derivative

(1e)
A step-by-step derivation of all the identities above is given in A.2 in the appendix.
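The marginalization behind the identities above is easy to sanity-check numerically. The sketch below (a toy of ours, using the logarithmic barrier b(t) = -log t; the closed form is our own derivation for this example, not quoted from the text) computes the per-component marginal phi(v) = min_s { beta*[s]_+ + mu*b(s - v) } by ternary search on the convex inner objective, and confirms that phi is finite for every v, i.e. it has full domain:

```python
import math

def phi_numeric(v, beta, mu, width=1e6, iters=200):
    # marginal value min_{s > v} beta*max(s,0) + mu*(-log(s - v)),
    # computed by ternary search (the inner objective is convex in s)
    def obj(s):
        return beta * max(s, 0.0) + mu * (-math.log(s - v))
    lo, hi = v + 1e-12, v + width
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if obj(m1) < obj(m2):
            hi = m2
        else:
            lo = m1
    return obj(0.5 * (lo + hi))

def phi_closed(v, beta, mu):
    # closed form for the log barrier (derived by first-order conditions
    # for this sketch): minimizer s = v + mu/beta if positive, else s = 0
    if v > -mu / beta:
        return beta * v + mu * (1.0 + math.log(beta / mu))
    return -mu * math.log(-v)

beta, mu = 4.0, 0.1
for v in (-3.0, -0.5, -0.01, 0.0, 0.7):
    assert abs(phi_numeric(v, beta, mu) - phi_closed(v, beta, mu)) < 1e-6
# exactness in the limit: phi_{mu,beta}(v) -> beta * [v]_+ as mu -> 0
assert abs(phi_closed(2.0, beta, 1e-9) - beta * 2.0) < 1e-6
assert abs(phi_closed(-1.0, beta, 1e-9)) < 1e-6
```

Note that phi is finite on both sides of the constraint boundary: the barrier branch is active for sufficiently feasible v, the linear branch elsewhere.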
Problem (3.2) is ‘unconstrained’, in the sense that no explicit ambient constraints are provided, yet stationarity notions relative to it bear a close resemblance with KKT-type optimality conditions.
Proof.
Since is differentiable, we just need to confirm that its gradient equals the sum in the display. We have
where the first identity owes to the definition of . Using (1e) yields the claimed expression. ∎
As is apparent from (1d), coincides with up to when its slope is , and after that point it reduces to its tangent line. As such, coincides with a McShane Lipschitz (and globally Lipschitz differentiable) extension [20] of a portion of the barrier , cf. Fig. 1(a). As Fig. 1(b) instead shows, by introducing a scaling factor the linear part has slope , independently of , and the sharp penalty coincides with the limiting case as is driven to 0. These details are formalized next.
[Limiting behavior of ] The following hold:

1. pointwise as .
2. pointwise as .
3. For any , pointwise as .
Proof.
??) Follows from the expression (1d) by observing that is tangent to the graph of (at ), and is thus globally majorized by because of convexity.

??) From the relation as in (19c) and the conjugacy calculus rule of [1, Prop. 13.23(ii)] it follows that

To show that is pointwise decreasing as , it thus suffices to show that is pointwise increasing, owing to the relation [1, Prop. 13.16(ii)]. On one has that independently of , as it follows from Item 2. Moreover, . Finally, for one has that

where the second identity follows from Item 3. This confirms monotonicity on as .

??) This follows from assertion ??, since , and as . ∎
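For the logarithmic barrier $b(t) = -\log t$, the smooth penalty and its derivative can be written out explicitly; the following piecewise expressions are our own computation for illustration, under the convention $\varphi_{\mu,\beta}(v) = \min_{s}\{\beta\,[s]_+ + \mu\, b(s-v)\}$:

```latex
\varphi_{\mu,\beta}(v) =
\begin{cases}
-\mu\,\log(-v), & v \le -\mu/\beta,\\[2pt]
\beta v + \mu\bigl(1 + \log(\beta/\mu)\bigr), & v > -\mu/\beta,
\end{cases}
\qquad
\varphi_{\mu,\beta}'(v) =
\begin{cases}
-\mu/v, & v \le -\mu/\beta,\\[2pt]
\beta, & v > -\mu/\beta.
\end{cases}
```

The two branches match in value and slope at $v = -\mu/\beta$, so $\varphi_{\mu,\beta}$ is $C^1$ with $\beta$-bounded and $(\beta^2/\mu)$-Lipschitz derivative, has full domain, and converges pointwise to the sharp penalty $\beta\,[v]_+$ as $\mu \to 0$, in line with the limits established in this section.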
In support of our claims above, we emphasize that the smooth penalty is convex whenever all the components are, and demonstrate how Lipschitz differentiability of is usually preserved after a composition with .
[Properties of ] Let be fixed and let comply with Section 3.2.

1. If is convex, then is convex.
2. If has Lipschitz-continuous gradient, then so does , provided that is Lipschitz continuous on closed subsets of (as is the case when is lower bounded [17, Lem. 2.3]).
Proof.
The first claim about convexity is straightforward, since would amount to the composition of the convex and increasing function with the convex function . We next prove the statement about Lipschitz differentiability, to this end assuming that is Lipschitz on ℝ^n with modulus , while is Lipschitz on with modulus , where . Recall that is -Lipschitz continuous, coincides with on , and is then linear with slope on , cf. (1e). Fix , and without loss of generality assume that . We have
It remains to account for the second term in the last sum. If , then coincides with in all occurrences, and the term can be upper bounded as , where is a Lipschitz modulus for ′ on . If , then and, by continuity, there exists such that , so that , resulting in the same bound . Lastly, if then the last term is zero. In all cases we conclude that
proving the claim. ∎
The requirements on other than Lipschitz differentiability in Item 2 are virtually negligible, since lower boundedness can be artificially imposed by replacing with, say, where . Inspired by the penalty function adopted in [23], is lower bounded and such that iff ; in addition, having , its adoption does not affect qualifications of active constraints. A Hessian inspection reveals that Lipschitz differentiability is preserved for ‘reasonable’ , e.g. whenever is bounded on the set , as it happens for quadratic functions. This is in stark contrast with methods such as ALM in which Lipschitz differentiability is typically lost in the composition with quadratic penalties.
4 Algorithmic framework
As shown in the previous section, the cost function in the smoothed problem (3.2) converges pointwise to the original hard-constrained cost of (1) as and . Following the penalty method rationale, this motivates solving (up to approximate local optimality) instances of (3.2) for progressively smaller values of and larger values of . This is the leading idea of the algorithmic framework of Algorithm 1 presented in this section, which also implements suitable update rules for the coefficients, ensuring that the output satisfies suitable optimality conditions for the original problem (1). In fact, a careful design of the update rule for the penalization parameter in (3.2) prevents this coefficient from diverging under favorable conditions on the problem. The reason behind the involvement of the conjugate ∗ in the update criterion at Step 1.10 will be revealed in Sections 4.1 and 4.2 through a systematic study of the properties of the barrier in the generality of Section 1 as well as when specialized to the convex case.
Algorithm 1 is not tied to any particular solver for addressing each instance of (3.2). Whenever amounts to the sum of a differentiable and a proximable function (in the sense that its proximal mapping is easily computable), such structure is retained by the cost function in (3.2), indicating that proximal-gradient based methods are suitable candidates. This was also the case in the purely interior-point based IPprox of [12], which considers a plain proximal gradient with a backtracking routine for selecting the stepsizes. Differently from the subproblems of IPprox, in which the differentiable term is extended-real-valued, the differentiable term in (3.2) is smooth on the whole ℝ^n. This enables the use of more sophisticated proximal-gradient-type algorithms such as PANOC+ [11] that make use of higher-order information to considerably enhance convergence speed. This claim will be substantiated with numerical evidence in Section 5; in this section, we instead focus on properties of the outer Algorithm 1 that are independent of the inner solver.
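As a concrete, heavily simplified illustration of this outer/inner structure (a toy of ours, not the actual Algorithm 1, whose update tests and safeguards we omit), consider minimizing x^2 subject to 1 - x <= 0 with the marginalized log-barrier penalty: each smooth subproblem is solved by bisection on its derivative, while the barrier parameter mu is driven to zero with beta fixed above the optimal multiplier (here lambda* = 2):

```python
import math

def phi_prime(v, beta, mu):
    # derivative of the marginalized log-barrier penalty (toy closed form
    # derived for this sketch): slope beta on the linear branch,
    # -mu/v on the barrier branch
    return beta if v > -mu / beta else -mu / v

def solve_subproblem(beta, mu, lo=1.0 + 1e-12, hi=10.0, iters=200):
    # subproblem: minimize F(x) = x^2 + phi(1 - x); F is convex, so we
    # locate the root of F'(x) = 2x - phi'(1 - x) by bisection
    def Fp(x):
        return 2.0 * x - phi_prime(1.0 - x, beta, mu)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if Fp(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

beta, mu = 4.0, 1.0              # beta > lambda* = 2 (exact-penalty threshold)
for _ in range(10):              # outer loop: shrink the barrier parameter
    x = solve_subproblem(beta, mu)
    mu *= 0.1
assert abs(x - 1.0) < 1e-6       # constrained minimizer of x^2 over x >= 1
```

The point of the marginalization shows here: each subproblem is smooth with full domain, so any unconstrained routine (bisection above, but equally a quasi-Newton or accelerated proximal-gradient method) can be plugged in as the inner solver.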
[properties of the iterates]Suppose that Sections 1 and 3.2 hold, and consider the iterates generated by Algorithm 1. At every iteration the following hold:
-
1.
.
-
2.
( and) .
-
3.
If ( and) , then .
-
4.
For , either or (possibly both); in particular, letting and it holds that .
Proof.
Recall that holds by construction for every , since , from which assertion ?? follows. Regarding assertion ??, by Section 3.2 this is precisely the required stationarity for . Assertion ?? is obvious by observing that whenever the update is enforced.
We finally turn to assertion ??, and suppose that . Then, for all such that (or, equivalently, ), one has
where the first inequality follows from the definition of and the second one owes to monotonicity of ′. Thus, for all at least one among and is not larger than , proving that . ∎
[stationarity of feasible limit points]Let Sections 1 and 3.2 hold, and consider the iterates generated by Algorithm 1. If the algorithm runs indefinitely, then as and any accumulation point of that satisfies is 2.2-optimal for (1).
Proof.
The monotonic vanishing of follows from Items 4 and 4. Suppose that with , and let be such that (if such an does not exist, then there is nothing to show). According to Sections 2.2 and 1, it suffices to show that ; in turn, by definition of and continuity of it suffices to show that . If , then continuity of implies that for all large enough, hence, by virtue of Item 4, for all such . Since is monotone, in this case as . Suppose instead that , and, to arrive at a contradiction, that is asymptotically constant. This implies that eventually always holds, which is a contradiction since
where the first inequality follows by definition of , cf. Step 1.6, the second one for large since as , and the last one because . ∎
The update rule for the penalty parameter does not demand (approximate) feasibility, but depends on a relaxed condition at Step 1.10. By (17), the second term vanishes as , so the penalty parameter is eventually increased as needed to achieve -feasibility. Relaxing this condition by a quantity involving the conjugate ∗ mitigates the growth of . Simultaneously, under suitable choices of the barrier , it ensures that this parameter remains unchanged only if the constraint violation stays within a controlled range, as will be ultimately demonstrated in Section 4.1.
Suppose that Sections 1 and 3.2 hold, and consider the iterates generated by Algorithm 1 with . Then, and exactly one of the following scenarios occurs:

1.
2. or it runs indefinitely with for all large enough, and .

In the latter case, if is closed, then for any accumulation point of one has that is KKT-stationary for the feasibility problem

(2)

in the sense that .
Proof.
Since for all , and is linearly reduced whenever , we conclude that (either the algorithm terminates or) eventually always holds.
If the algorithm returns a pair , then the compliance with the termination criteria ensures that meets all conditions in Section 2.2, and hence it is -KKT-stationary for (1).
Suppose instead that the algorithm does not terminate. Clearly, holds for large enough, so that the only unmet termination criterion is eventually . Therefore, holds for every large enough. It follows from assertion ?? and Item 4 that eventually drops below , implying that the condition for increasing at Step 1.10 reduces to . Having shown that this is eventually always the case, always holds for large, , and is eventually never updated, cf. Step 1.10.
To conclude, suppose that is closed. By Section 3.2, for every we have that there exists with such that
Let be the limit of a subsequence and, up to extracting a further subsequence, let be the limit of . By the definition of and the continuity of ,
Equivalently,
(3)
Since is closed and for all , one has that . Moreover, it follows from Assumption 1 that as , hence that
(4)
where the inclusion follows from [22, Thm. 8.9]. Since
see [22, Ex. 10.26], it follows from (3) and continuity of that . Combining with (4) concludes the proof. ∎
The abuse of terminology to express KKT-stationarity in terms of subdifferentials passes through the same construct relating (3.1) and (3.1), in which a slack variable is tacitly introduced to reformulate the norm; see the discussion in Section 3.1. More importantly, the involvement in (2) of the epigraph of , as opposed to its domain, is a necessary technicality that cannot be avoided in the generality of Section 1, as we illustrate next.
[ vs ]Stationarity for (2) is, in general, weaker than that for the more natural minimal infeasibility violation problem
(5)
To see how this notion may be violated, consider and , so that (1) reads
The point is stationary for any subproblem (3.2) with arbitrary , and therefore constitutes a feasible choice in Algorithm 1. However, the limit of the corresponding constant sequence is not stationary for the minimization of over . Nevertheless,
confirming that is stationary for the epigraphical problem (2).
We next formally illustrate why stationarity for (5) always implies that for (2), and identify the culprit of a possible discrepancy in uncontrolled growths around from within . To this end, we remind that a function is said to be calm at a point relative to a set if
and that this condition is weaker than strict continuity.
Let be proper and lsc. Then, for any one has
and
When is convex, both inclusions hold as equality. More generally, when is calm (in particular, if it is strictly continuous) at relative to , then the first inclusion holds as equality, and so does the second one when such property holds not only at , but also at all points in close to it.
Proof.
The relations in the convex case are shown in [22, Thm. 8.9 and Prop. 8.12]; in what follows, we consider an arbitrary proper and lsc function . Let and let . Then, there exists such that holds for every , hence
By the arbitrariness of the sequence, we conclude that . The same inclusion must then hold for the limiting normal cones, leading to
(6)
where the identity follows from [22, Thm. 8.9].
Suppose now that there exists such that for close to , and suppose that . Let , and note that . Then, there exists such that
where the second inequality holds for large enough. Arguing again by the arbitrariness of the sequence, we conclude that . Finally, when is calm relative to its domain at all points close to , then the identity holds for all such points, and a limiting argument then yields that holds for the limiting normal cones. Therefore, the inclusion in (6) holds as equality, which concludes the proof. ∎
4.1 Barrier’s properties
According to its update rule in Algorithm 1, before a desired feasibility violation has been reached, means that , where . As shown in Item 4, regardless of whether is updated or not, grows linearly over the iterations; therefore, having implies in particular that either the constraint violation is within a desired tolerance , or that it is controlled by , where the inequality follows from monotonicity of , cf. Item 4. This means that a desired decrease in feasibility violation can be enforced through suitable choices of the barrier . This will be particularly significant in the convex case, for it can be shown that is eventually never updated under reasonable assumptions.
Let Sections 1 and 3.2 hold, and consider the iterates generated by Algorithm 1. Suppose that there exists such that the barrier satisfies for every (resp. for every close enough to 0), where . Then,
(7)
holds for every (resp. for every large enough).
Proof.
To simplify the presentation, without loss of generality let us set . We have already argued that implies , where for all . It thus suffices to show that . To this end, notice that for every one has
hence, since for ,
which amounts to the condition in the statement. Under such condition, then, holds for every , leading to as claimed. ∎
Though it would be tempting to seek barriers for which as in the proof vanishes at any desired rate, it can be easily verified that no choice of or can result in converging any faster than linearly. In fact,
where the inequality follows from monotonicity of , cf. Item 2. This shows that a linear decrease by a factor is the best achievable rate, and that this can only happen in the limit. Section 4.1 nevertheless identifies a property that allows us to judge the fitness of a barrier within the framework of Algorithm 1. As we will see in Section 4.2, this will be particularly evident in the convex case, for it can be guaranteed that, under assumptions, eventually does remain always constant, so that employing a barrier that complies with this requirement is a guarantee that eventually the infeasibility of the iterates generated by Algorithm 1 vanishes at R-linear rate. This motivates the following definition.
[behavior profiles of ]We say that a barrier complying with Section 3.2 is asymptotically well behaved if
If this condition can be strengthened to
then we say that is well behaved (not merely asymptotically). We call the functions the behavior profile and the asymptotic behavior profile of , respectively.
In penalty-type methods, the update of a penalty parameter is typically decided based on the violation of the corresponding constraints. Under the assumption that the barrier is (asymptotically) well behaved, Section 4.1 demonstrates that in Algorithm 1 (eventually) the condition furnishes a guarantee of linear decrease of the infeasibility. Insisting on continuity of and at in Section 4.1 is a minor technicality ensuring that, regardless of the value of and , for any (asymptotically) well behaved barrier there always exists such that holds for every (close enough to zero) as required in Section 4.1. The result can thus be restated as follows.
Additionally to Sections 1 and 3.2, suppose that the barrier is (asymptotically) well behaved. Then, there exists such that the iterates of Algorithm 1 satisfy (7) for all (large enough).
When it comes to comparing different barriers, lower values of are clearly preferable. Notice that both and are scaling invariant:
Moreover, since
(owing to monotonicity of and the fact that consequently for ), barriers attaining can be considered asymptotically optimal. Table 1 shows that logarithmic barriers can attain such lower bound.
4.2 The convex case
In this section we investigate the behavior of Algorithm 1 when applied to convex problems. In particular, we detail an asymptotic analysis in which the termination tolerances are set to zero, so that the algorithm (may) run indefinitely. We demonstrate that under standard assumptions the iterates subsequentially converge to (global) solutions, and that the penalty parameter is eventually never updated.
Additionally to Sections 1 and 3.2, suppose that and , , are convex functions, and that there exists an optimal 2.2-pair for (1). Then, the following hold for the iterates generated by Algorithm 1 with :
-
1.
Any accumulation point of the sequence is a solution of (1).
-
2.
If, additionally, remains bounded (as is the case when is bounded), is eventually never updated.
-
3.
Further assuming that the barrier is asymptotically well behaved, so that there exists such that for every close enough to 0, the feasibility violation eventually vanishes with rate .
Proof.
It follows from Item 2 that solves (3.1) for all . For every , there exists with such that . If is asymptotically constant, then eventually always holds, and, since the right-hand side vanishes as , any limit point of satisfies . Otherwise, and, for large such that , since and solves (3.1) one has
since for any , see Item 2,
since and is convex by virtue of Item 1,
since and is increasing for any , cf. (1e), and since ,
(8)
where the last identity uses (1d). Therefore, dividing by and denoting ,
holds for all large. Along any convergent subsequence, it is clear that , hence that the corresponding limit point satisfies , and is thus optimal in view of Algorithm 1 and convexity of the problem.
If the entire sequence is contained in for some , then
If, contrary to the claim, , then appealing to Item 2 the right-hand side of the above inequality converges to as , and in particular is eventually smaller than one. That is, eventually always holds and thus is never updated, a contradiction.
Finally, the claim about the rate of follows from Section 4.1. ∎
4.3 Equalities and bilateral constraints
In previous sections we described an algorithm for the inequality constrained problem (1), but our approach can be applied to problems with bilateral and equality constraints as well. There are several possibilities for incorporating such constraints in the format (1) by means of reformulations. Below we discuss only a few options and refer to the related discussion in [7, §4.1.4] for more. Here we focus on two key aspects: on the one hand, for the sake of efficiency, we intend to exploit the additional problem structure available when bilateral constraints are explicitly specified by the user. On the other hand, for the sake of robustness, we need to tolerate and handle bad formulations as well. The latter point is particularly significant for large-scale, possibly automatically generated, optimization models, for which it might be impractical to parse the constraints specification with the goal of uncovering, e.g., hidden equalities. In practice, nonlinear problems (1) may contain (approximate) constraint redundancies leading to singularities in the Jacobian: thanks to the penalty-barrier regularization, we expect Algorithm 1 to cope with these kinds of degeneracy.
Two inequalities
An equality constraint can be split into the pair of inequalities and , making the technique developed for (1) directly applicable. However, following the approach outlined in Section 3, this can result in a flat portion around the feasible set, and so in the absence of an effective penalization term. This circumstance is demonstrated by the fact that
(9)
which displays a flat region around with radius vanishing as .
This splitting into a pair of inequality constraints also affects the theoretical motivations of Section 3. In fact, applying this technique, constraint qualifications fail to hold for in (1). Therefore, as these typically guarantee boundedness of the set of optimal Lagrange multipliers, the penalty exactness may not apply. These considerations underline that it is important to handle equality constraints carefully.
We now retrace the developments of Section 3 focusing on problems of the form
(10)
Formulating the penalty problem—in the form (3.1)—with two slack variables as
results in a penalty barrier problem analogous to (3.2) after marginalization with respect to , which reads
Including the sum , it is easy to see that this approach is equivalent to specifying two (independent) inequalities. Thus, considering two slack variables does not overcome the issue of flat regions discussed above for (hidden) equalities. Instead, applying the technique above with only one slack variable appears to be a more reasonable approach for handling equality constraints, as we are about to show.
Combined marginalization
For problems with bilateral constraints as (10) we may formulate the associated penalty problem, in analogy with (3.1), with a single slack variable as
Then, as for (3.2), we introduce a barrier to replace the inequality constraints, obtaining
Following this strategy, the marginalization subproblem corresponding to a constraint yields a definition for the counterpart of in (1) with bounds—see also (1):
(11)
and, in analogy to (3.2), leads to the penalty-barrier subproblem
where . A closed-form expression for the marginalization subproblem (11) can be obtained for specific barrier choices. For the inverse barrier function (extended as +∞ on the nonnegative reals), the optimal value for the auxiliary variable is given by
Similarly, the oracle for the log-like barrier function (extended as +∞ on the nonnegative reals) reads
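When a barrier admits no closed-form oracle, the scalar marginalization can still be carried out numerically. The sketch below is a loose illustration only: it assumes a quadratic penalty term (ρ/2)(c − s)² coupled with the inverse barrier b(s) = −1/s — the paper's actual penalty term and constants are not reproduced here — and finds the minimizing slack by bisection on the stationarity condition.

```python
def marginalized_pb(c, mu=1.0, rho=1.0):
    """Numerically marginalize the slack s in a penalty-barrier term of the
    (hypothetical) form  psi(c) = min_{s<0}  mu*b(s) + (rho/2)*(c - s)**2,
    with the inverse barrier b(s) = -1/s.  The objective is strictly convex
    in s, so the unique minimizer is the root of the stationarity condition
    g(s) = mu/s**2 + rho*(s - c) = 0, and g is strictly increasing on s < 0."""
    lo, hi = -1e8, -1e-12          # bracket: g(lo) < 0 < g(hi)
    for _ in range(200):           # plain bisection on the increasing g
        mid = 0.5 * (lo + hi)
        g = mu / mid**2 + rho * (mid - c)
        if g < 0:
            lo = mid
        else:
            hi = mid
    s = 0.5 * (lo + hi)
    return mu * (-1.0 / s) + 0.5 * rho * (c - s) ** 2
```

For this assumed form, as mu → 0 the marginalized term approaches the smooth quadratic penalty (ρ/2) max(c, 0)², and it is finite for every c: the barrier has been merged into a full-domain smooth object, as in the marginalization step of Section 3.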
It is nevertheless easy to verify, for an arbitrary barrier complying with Section 3.2, the sharp penalization of the interval attained in the limit.
[Limiting behavior of ]For any with one has that pointwise as .
Proof.
To begin with, notice that as in (11) is bounded as
where the first inequality owes to the fact that , and the second one to the fact that (hence that is increasing). Again deferring to A.2 for the details, such lower and upper bounds coincide with unilateral smooth penalties as in Section 3.2, namely
(12)
Dividing by and letting , Item 2 yields that both the lower and upper bounds converge to , demonstrating the claim. ∎
The penalty-barrier function associated to an equality constraint is depicted in Fig. 2 and contrasted with the two-inequality term . An analogous comparison between and associated to the more general bilateral constraint is displayed in Fig. 3 for the case . The two scaled penalty-barrier terms and have similar behaviors, according to Figs. 2(b) and 3(b), and lead to a well-justified procedure within Algorithm 1. In particular, both penalty-barrier terms converge to some sharp penalty function as (for fixed ). However, exhibits a flat region around the midpoint , whereas is strictly convex everywhere; this stark contrast is especially apparent for and at the origin, where only the former attains its unique minimum. Regardless, since the width of the flat portion of vanishes as , Algorithm 1 can terminate for any even when invoked with hidden equality or bilateral constraints. The minor modifications needed to account for explicit equalities are given next.
[Algorithm 1 with equality constraints]Two additional steps in Algorithm 1 allow it to handle problems of the form
for some smooth mapping . First, an equality multiplier is introduced; its update mirrors that of , but involves the derivative of as opposed to that of , see the comment in Step 1.4, and is thus not restricted in sign (as expected of equality multipliers). The infeasibility measure is captured in a straightforward manner. Altogether, the additions to Steps 1.4 and 1.5 are as follows:
All the convergence results remain valid for this variant; boundedness of in Section 4.2 is also recovered as long as at Step 1.10 is replaced by , namely in such a way that equalities are counted as two inequalities. The reason behind this can be easily understood from the upper bound in (12), which can be used in (8) by turning the equality therein into an inequality.
Based on the discussion above, bilateral constraints should be explicitly specified, whenever possible, in order to employ the tailored penalty-barrier relaxation. Nevertheless, it is an important feature of our scheme that it can handle badly formulated models, for instance, with hidden equalities. These favorable qualities will be showcased in Section 5.5, where we present a numerical comparison of the two options.
5 Numerical experiments
This section is dedicated to experimental results and comparisons with other numerical approaches for constrained structured optimization. We will refer to our implementation of Algorithm 1 as Alg1. The performance and behavior of Alg1 is illustrated in different variants, considering two barrier functions, namely and (both extended as +∞ on the nonnegative reals), denominated inverse and log-like, respectively, and two inner solvers, NMPG [8] and PANOC+ [11, 24]. The numerical comparison will highlight the influence of the barrier function on the performance of Alg1, supporting the quality assessment of Section 4.1.
The two subsolvers follow a proximal-gradient scheme and can handle merely local smoothness (as opposed to global Lipschitz continuity of the gradient of the smooth term). NMPG combines a spectral stepsize with a nonmonotone globalization strategy. PANOC+ can exploit acceleration directions (e.g., of quasi-Newton type) while ensuring convergence with a backtracking linesearch, see also [25, §5.1].
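Both subsolvers build on the basic proximal-gradient iteration; a minimal fixed-stepsize sketch (without the spectral stepsizes, nonmonotone globalization, or acceleration that NMPG and PANOC+ add on top) might read:

```python
def prox_grad(grad_f, prox_g, x0, L, tol=1e-8, max_iter=10_000):
    """Fixed-stepsize proximal-gradient iteration
        x+ = prox_{g/L}( x - grad_f(x)/L ),
    the basic scheme underlying subsolvers such as NMPG and PANOC+.
    grad_f maps a point (list) to its gradient; prox_g(v, t) evaluates
    the proximal mapping of the nonsmooth term with stepsize t."""
    x = list(x0)
    for _ in range(max_iter):
        step = [xi - gi / L for xi, gi in zip(x, grad_f(x))]
        x_new = prox_g(step, 1.0 / L)
        # stop when the (scaled) fixed-point residual is below tolerance
        if max(abs(a - b) for a, b in zip(x, x_new)) * L <= tol:
            return x_new
        x = x_new
    return x
```

For instance, minimizing f(x) = ½(x − 3)² over [0, 1] (with g the indicator of the box, whose prox is the projection) drives the iterate to the constrained minimizer x = 1.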
The performance of Alg1 is compared against those of IPprox [12, Alg. 1] and ALPS [9, Alg. 4.1], based on [10]. IPprox builds upon a pure interior point scheme and solves the barrier subproblems with a tailored adaptive proximal-gradient algorithm. ALPS belongs to the family of augmented Lagrangian algorithms and does not require a custom subsolver—suitable subsolvers for Alg1 can be applied within ALPS and vice versa.
Patterning the simulations of [12, §5.2], we examine the nonnegative PCA problem in Section 5.2 to evaluate Alg1 in several variants and compare it against IPprox. Then, Section 5.3 focuses on a low-rank matrix completion task, a fully nonconvex problem with bilateral constraints, contrasting Alg1 and ALPS. Finally, the exact penalty behavior and the ability to handle hidden equalities are illustrated and discussed in Sections 5.4 and 5.5, respectively.
The source code of our implementation has been made available for reproducibility of the numerical results presented in this paper; it can be found on Zenodo at doi: 10.5281/zenodo.11098283.
5.1 Implementation details
We describe here details pertinent to our implementation Alg1 of Algorithm 1, such as the initialization and update of algorithmic parameters. These numerical features tend to improve the practical performance without compromising the convergence guarantees. IPprox is available from [12] and adopted as is, whereas ALPS is a slight modification of the code from [9] to be comparable with Alg1, as detailed below.
Alg1 accepts problems formulated as in (10), with bilateral bounds defined by extended-real-valued vectors and . In a preprocessing phase (within Alg1), these vectors are parsed to instantiate the penalty-barrier functions to treat one-sided, two-sided and equality constraints.
Default parameters for Alg1 are and as in IPprox, as in ALPS, and . The initial tolerance for Alg1 (and ALPS) is chosen adaptively, based on the user-provided starting point and penalty-barrier parameters. Matching the mechanism implemented in IPprox, we set , where is a user-specified parameter (default ) and is an estimate of the initial stationarity measure, as evaluated by the inner solver invoked at . For simplicity, no infeasibility detection mechanism nor artificial bounds on penalty and barrier parameters have been included.
We run ALPS with the same settings as in [10, 9] apart from the following features to match Alg1: the initial penalty parameter is fixed () and not adaptive, the tolerance reduction factor is set to instead of , and the initial inner tolerance is selected adaptively and not fixed to . We always initialize ALPS with dual estimate . The two subsolvers are considered with their default tuning: PANOC+ with L-BFGS directions (memory 5) and monotone linesearch strategy as in [11], NMPG with spectral stepsize and nonmonotone globalization with average-based merit function as in [8].
For the set of problems and the set of solvers, let denote the user-defined metric for the computational effort required by solver to solve instance (lower is better). We monitor the (total) number of gradient evaluations, so that the computational overhead incurred by backtracking is fairly accounted for, and the number of (outer) iterations. Then, to graphically summarize our numerical results and compare different solvers, we display so-called data profiles. A data profile is the graph of the cumulative distribution function of the evaluation metric, namely . As such, each data profile reports the fraction of problems solved by a solver within a given budget of the evaluation metric, and is therefore independent of the other solvers.
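In this notation, a data profile for one solver can be computed in a few lines; the sketch below assumes unsolved instances are recorded with infinite cost:

```python
def data_profile(metric, budgets):
    """Data profile for one solver: fraction of problems solved within each
    budget of the evaluation metric.  `metric` maps each problem instance to
    the cost incurred by the solver (float('inf') for unsolved instances)."""
    n = len(metric)
    return [sum(1 for cost in metric.values() if cost <= t) / n
            for t in budgets]
```

Because each profile involves a single solver's costs, it can be read off independently of which other solvers appear in the comparison.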
5.2 Nonnegative PCA
Principal component analysis (PCA) aims at estimating the direction of maximal variability of a high-dimensional dataset. Imposing nonnegativity of entries as prior knowledge, we address PCA restricted to the positive orthant:
(13)
This task falls within the scope of (1), with , , and , and has been considered in [12] for validating IPprox and tuning its hyperparameters.
Setup
We generate synthetic problem data as in [12, §5.2]. For a problem size , let , where is a random symmetric noise matrix, is the true (random) principal direction, and is the signal-to-noise ratio. We consider some dimensions and, for each dimension, the set of problems parametrized by and , which control the noise and sparsity level, respectively. There are 5 choices for , 4 for , and, for each set of parameters, 5 instances are generated with different problem data and starting point . Overall, each solver-settings pair is invoked on 100 different instances for each dimension .
A strictly feasible starting point is generated by sampling a uniform distribution over and projecting onto . This property is necessary for IPprox but not for Alg1. We will also test Alg1 with arbitrary initialization, in which case is generated by sampling a uniform distribution over and then projecting onto .
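For concreteness, a hypothetical instance generator in the spirit of the spiked model described above might look as follows; the exact recipe of [12, §5.2] (noise scaling, sparsity pattern) may differ, so this is illustrative only:

```python
import math
import random

def nnpca_instance(n, theta=3.0, density=0.5, seed=0):
    """Hypothetical spiked-model generator: C = theta * v v^T + W, with W a
    random symmetric noise matrix and v a sparse nonnegative unit vector
    (standing in for the true principal direction)."""
    rng = random.Random(seed)
    v = [max(rng.gauss(0.0, 1.0), 0.0) if rng.random() < density else 0.0
         for _ in range(n)]
    nrm = math.sqrt(sum(t * t for t in v)) or 1.0
    v = [t / nrm for t in v]
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            W[i][j] = W[j][i] = rng.gauss(0.0, 1.0) / math.sqrt(n)
    C = [[theta * v[i] * v[j] + W[i][j] for j in range(n)] for i in range(n)]
    return C, v

def strictly_feasible_start(n, seed=0):
    """Strictly positive unit vector: sample uniformly from (0, 1)^n and
    normalize onto the unit sphere."""
    rng = random.Random(seed)
    x = [rng.random() + 1e-9 for _ in range(n)]
    nrm = math.sqrt(sum(t * t for t in x))
    return [t / nrm for t in x]
```

The starting point is strictly feasible for (13) by construction: all entries are positive and the vector lies on the unit sphere.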
Barriers and subsolvers
Algorithm 1 is controlled by, and its performance depends on, several algorithmic hyperparameters, such as the (sequences of) barrier and penalty parameters, the choice of barrier function , and the subsolver adopted at Step 1.3. We now focus on the effect of the last two elements, for different levels of accuracy requirements, testing all combinations of barriers and subsolvers considering problem dimensions , for a total of 500 calls to each solver. For this set of experiments we set a time limit of 100 seconds on each call.
The results are graphically summarized in Fig. 4. As IPprox performed almost identically with the two barrier functions, only the inverse variant (originally considered in [12]) is displayed for clarity. Moreover, because of the excessive run time to perform all simulations for IPprox with high accuracy, we exclude it altogether for the high accuracy tests and consider instead starting points that are not necessarily (strictly) feasible.
For low and medium accuracy, all instances are solved by Alg1 and IPprox up to the desired primal-dual tolerances. With high accuracy, only the variant of Alg1 with NMPG and log-like is not able to solve all instances within the time limit. Across all accuracy levels, Alg1 PANOC+ inverse performs consistently better than the other variants of Alg1, all of which outperform IPprox. In particular, the overall effort (number of gradient evaluations) required by PANOC+ is lower than that of NMPG (for a fixed barrier), and with the inverse barrier it is lower than with the log-like one (for a fixed subsolver). With increasing accuracy it becomes more efficient to adopt PANOC+ than NMPG, while IPprox performs progressively worse. The slow tail convergence typical of first-order schemes badly affects the scalability of IPprox, whereas the quasi-Newton scheme within PANOC+ seems to beat the simpler spectral approximation in NMPG.
In terms of (outer) iterations, the results appear, unsurprisingly, independent of the subsolver. Moreover, the log-like barrier invariably demands the solution of fewer subproblems than the inverse one, in agreement with the discussion in Section 4.1 on the well-behavedness of barriers. However, we emphasize that the overall computational effort (measured in terms of gradient evaluations) also depends on the subsolver's efficiency in solving the subproblems, as demonstrated by Fig. 4.
Problem size and accuracy
To investigate scalability and influence of accuracy requirements, we consider instances of (13) with dimensions and tolerances , without time limit. For each of these tolerance parameters, we test Alg1 with PANOC+ and the log-like barrier. For this test, we generate 5 instances (as described above) for each set of parameters, leading to a total of 500 problem instances to be solved with increasing accuracy.
All instances are solved up to the desired primal-dual tolerances. The influence of problem size and tolerance is depicted in Fig. 5, which displays for each pair the number of gradient evaluations with a jitter plot (for a better visualization of the distribution of numerical values over categories). The empirical cumulative distribution function with the associated median value are also indicated. This chart visualizes how problem size and accuracy requirement affect the solution process, and reveals the stark effect of both and . For low accuracy, Alg1 scales relatively well with the problem size, whereas large-scale problems become prohibitive for high accuracy.
This behavior is typical of first-order methods, owing to their slow tail convergence, and we take it as a motivation for investigating the interaction between subproblems and subsolvers in future work. Nevertheless, these experiments (and those forthcoming) demonstrate Alg1's capability to handle thousands of variables and constraints in a fully nonconvex optimization landscape, witnessing a tremendous improvement over IPprox, not only in practical performance but also in ease of use.
5.3 Low-rank matrix completion
Given an incomplete matrix of (uncertain) ratings , a common task is to find a complete ratings matrix that is a parsimonious representation of , in the sense of low-rank, and such that for the entries available [19]. Let and denote the number of users and items, respectively, and let the rating by the th user for the th item range on a scale defined by constants and . Let represent the set of observed ratings, and the cardinality of . The ratings matrix could be very large and often most of the entries are unobserved, since a given user will only rate a small subset of items. Low-rankness of can be enforced by construction, with the Ansatz , as in dictionary learning. In practice, for some prescribed embedding dimension , we seek a user embedding matrix and an item embedding matrix . Each row of is a -dimensional vector representing user , while each row of is a -dimensional vector representing item . We address the joint completion and factorization of the ratings matrix , encoded in the following form:
(14)
While aiming at , the model in (14) sets the rating range as a hard constraint for all predictions; a tighter constraint is imposed to observed ratings. Following [25, §6.2], we explicitly constrain the norm of the dictionary atoms , without loss of generality, to reduce the number of equivalent (up to scaling) solutions; this norm specification is included as an indicator in the nonsmooth objective term . Furthermore, we encourage sparsity of the coefficient representation with the penalty, which counts the nonzero elements, scaled with a regularization parameter . Overall, this problem has decision variables and bilateral constraints. All terms (, , and ) are nonconvex, as well as the (unbounded) feasible set.
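To fix ideas, the smooth data-fit term of a factorization model like (14) can be sketched as below; constraint handling, the regularization term, and the norm specification on the dictionary atoms are omitted:

```python
def completion_loss(U, V, ratings):
    """Mean squared error over observed ratings for the factorization
    X = U V^T: U has one r-dimensional row per user, V one per item, and
    `ratings` is a list of observed triples (i, j, r_ij)."""
    r = len(U[0])  # embedding dimension
    sq_errors = [(sum(U[i][k] * V[j][k] for k in range(r)) - r_ij) ** 2
                 for (i, j, r_ij) in ratings]
    return sum(sq_errors) / len(sq_errors)
```

Only observed entries contribute, so the cost of one evaluation scales with the number of ratings rather than with the (much larger) full matrix.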
It appears nontrivial to find a strictly feasible point for (14), in the sense of [12, Def. 2], which is required for initializing IPprox, thus highlighting a major advantage of Alg1.
Setup
We consider the MovieLens 100k dataset, available at https://grouplens.org/datasets/movielens/100k/, which contains ratings for unique movies (the dataset contains some repetitions in movie ratings and we have ignored them); these recommendations were made by users on a discrete rating scale from to . If we construct a matrix of movie ratings by the users, then it is a sparse unstructured matrix with only of the total entries available.
We compare Alg1 and ALPS and test their scalability with instances of increasing size. We fix the number of atoms to and consider the instances of (14) corresponding to subsets of users (always starting from the first one). For these problem instances the sizes range and . We set the regularization parameter and invoke each solver with the primal-dual tolerances and without time limit. For each problem instance, we randomly generated starting points, for a total of calls to each solver variant.
Results
A summary of the numerical results is depicted in Fig. 6. For the sake of clarity we display only the variant of ALPS with NMPG, since it performed better than with PANOC+. The variants of Alg1 with the inverse barrier are also not included, as their profiles were intermediate between those of Alg1 with the log-like barrier. All solver variants were always able to find a solution up to the primal-dual tolerance. Although not all variants of Alg1 consistently outperform ALPS, Alg1 NMPG log-like finishes ahead in most calls; the advantage, however, becomes insignificant for larger instances. Alg1 PANOC+ log-like appears comparable to ALPS for smaller instances, but then exhibits a performance much more sensitive to the starting point (its shaded area in Fig. 6 is significantly larger for larger instances). This behavior may stem from the more complicated mechanisms (search direction and globalization with line search) employed in PANOC+. These observations are in agreement with those of Section 5.2 and highlight the potential benefits of subsolvers tailored to the subproblems' structure.
5.4 Exact penalty behavior
This subsection is dedicated to the penalty-barrier behavior of Alg1. When addressing the nonconvex problem (14), we observed that the sequences of penalty parameters remain constant (equal to the initial value ) for all instances and solver variants. This is enabled by the relaxed condition at Step 1.10 of Algorithm 1, which does not require a sufficient improvement at every iteration, but instead monitors globally how the constraint violation vanishes. Correspondingly, only the barrier parameter is decreased in order to reduce the complementarity slackness , see Item 3. When active, this exact penalty quality prevents the barrier from yielding too much ill-conditioning, exploiting Item 3.
Even in the fully nonconvex problem of the previous Section 5.3, the sequence of penalty parameters generated by Alg1 remains always bounded. Although this indicates that the assumptions behind Item 2 could be relaxed, the penalty exactness does not always take effect. We illustrate now an example problem where Alg1 exhibits , hence it does not boil down to an exact penalty method. For this purpose it suffices to consider the two-dimensional convex problem
(15)
whose (unique) solution is the only feasible point which is however not 2.2-optimal (there exists no suitable multiplier ).
We intend to solve problem (15) with tolerances starting from 100 random initializations, generated according to with large standard deviation .
Results
All solver variants find a primal-dual solution to (15), up to the tolerances, for all starting points. The unbounded behavior of the penalty parameters appears evident in Fig. 7. Thus, seems necessary to drive the constraint violation to zero, while the barrier parameter forces the complementarity slackness .
The numerical performance of Alg1 and ALPS are summarized in Fig. 8, where ALPS is displayed only in the NMPG variant (which performed better than with PANOC+). Considering both the overall effort (number of gradient evaluations) and the number of subproblems needed by Alg1, the log-like barrier yields better results than the inverse (for fixed subsolver) and NMPG is more efficient than PANOC+ (for fixed barrier). Moreover, Alg1 performs consistently better than ALPS, despite the lack of penalty exactness.
5.5 Handling equalities
Even though bilateral constraints can be handled explicitly, as examined in Section 4.3, it is important that Alg1 can cope with hidden equalities too. These may appear as the result of automatic model constructions, and are often difficult to identify by inspection. Here we compare the behavior of Alg1 on a problem specification with explicit equalities against the same problem with each equality constraint described by two inequalities. Consider possibly nonconvex quadratic programming (QP) problems of the form
(16)
with matrices , and vectors , as problem data. Problem (16) can be cast as (10) with , , and . We are interested in comparing the performance of Alg1 (in different variants) with the two problem formulations described in Section 4.3 to deal with equalities: either by splitting into two inequalities (leading to defined in (9)) or by performing a combined marginalization (resulting in given by (11)). Hence, for each solver’s variant and problem instance, we contrast these two formulations, symbolized by Alg1± and Alg1, respectively.
Setup
Problem instances are generated as follows: we let where the elements of are normally distributed, , with only 10% being nonzero. The linear part of the cost is also normally distributed, i.e., . Simple bounds are generated according to a uniform distribution, i.e., and . We set the elements of as with only 10% being nonzero. To ensure that the problem is feasible, we draw an element (as , ) and set . An initial guess is randomly generated for each problem instance, as , and shared across all solvers and formulations.
We consider problems with and , set the tolerances , and construct 10 instances for each size, for a total of 100 calls to each solver for each formulation.
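A generator following the recipe above might be sketched as follows; the densities and distributions are illustrative assumptions rather than the exact experimental setup:

```python
import random

def qp_instance(n, m, seed=0):
    """Random (possibly nonconvex) QP data: sparse symmetric Q, normal
    linear cost q, uniform box [l, u] straddling the origin, sparse A, and
    b = A @ xbar for some xbar inside the box, so that feasibility is
    guaranteed by construction."""
    rng = random.Random(seed)
    M = [[rng.gauss(0.0, 1.0) if rng.random() < 0.1 else 0.0
          for _ in range(n)] for _ in range(n)]
    Q = [[M[i][j] + M[j][i] for j in range(n)] for i in range(n)]  # symmetrize
    q = [rng.gauss(0.0, 1.0) for _ in range(n)]
    l = [-rng.random() for _ in range(n)]
    u = [rng.random() for _ in range(n)]
    A = [[rng.gauss(0.0, 1.0) if rng.random() < 0.1 else 0.0
          for _ in range(n)] for _ in range(m)]
    xbar = [l[i] + rng.random() * (u[i] - l[i]) for i in range(n)]
    b = [sum(A[k][i] * xbar[i] for i in range(n)) for k in range(m)]
    return Q, q, l, u, A, b
```

Symmetrizing a sparse random matrix keeps Q indefinite in general, so the resulting QP is typically nonconvex, matching the setting of (16).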
Results
Numerical results are visualized by means of pairwise (extended) performance profiles. Let and denote the evaluation metric of solver on a certain instance with the two formulations. Then, for each solver , the corresponding pairwise performance profile displays the cumulative distribution of its performance ratio , namely
Thus, the profile for solver indicates the fraction of problems for which solver invoked by Alg1 requires at most times the computational effort needed by the same solver invoked by Alg1±. As depicted in Fig. 9, all pairwise performance profiles cross the unit ratio with at least 83% of the problems solved, meaning that all solver variants benefit from the tailored handling of equality constraints as in Section 4.3. All variants are nevertheless robust to degenerate formulations, confirming that our algorithmic framework can endure redundant constraints and hidden equalities.
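The pairwise profile just described amounts to the empirical distribution of per-instance cost ratios; a minimal sketch:

```python
def pairwise_profile(cost_a, cost_b, taus):
    """Pairwise (extended) performance profile: for each threshold tau,
    the fraction of instances on which variant A needed at most tau times
    the effort of variant B on the same instance.  cost_a and cost_b are
    aligned per-instance cost lists."""
    ratios = [a / b for a, b in zip(cost_a, cost_b)]
    n = len(ratios)
    return [sum(1 for r in ratios if r <= tau) / n for tau in taus]
```

Reading off the profile value at ratio 1 gives the fraction of instances on which variant A was at least as cheap as variant B.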
6 Final remarks and open questions
We proposed an optimization framework for the numerical solution of constrained structured problems in the fully nonconvex setting. We went beyond a simple combination of (exact) penalty and barrier approaches by taking a marginalization step, which not only allows us to reduce the problem size but also enables the adoption of generic subsolvers. In particular, by extending the domain of the subproblems' smooth objective term, the proposed methodology overcomes the need for safeguards within the subsolver and the difficulty of accelerating it, a major drawback of IPprox [12]. Under mild assumptions, our theoretical analysis established convergence results on par with those typical for nonconvex nonsmooth optimization. Most notably, all feasible accumulation points are asymptotically KKT optimal. We tested our approach numerically on problems arising in data science, studying scalability and the effect of accuracy requirements. Furthermore, illustrative examples confirmed the robust behavior on badly formulated problems and degenerate cases.
The methodology in this paper could be applied to, and compared with, a combination of barrier and augmented Lagrangian approaches. By generating a smoother penalty-barrier term, this strategy could benefit from the more effective performance of subsolvers. However, this development comes with the additional challenge of designing suitable updates for the Lagrange multiplier. Future research may also focus on specializing the proposed framework to classical nonlinear programming, taking advantage of the special structure and linear algebra. Finally, mechanisms for rapid infeasibility detection and guaranteed existence of subproblems’ solutions should be investigated.
Appendix A Auxiliary results
This appendix contains some auxiliary results and proofs of statements referred to in the main body.
Lemma A.1 (Properties of the barrier ).
Any function as in Section 3.2 satisfies the following:
-
1.
and .
-
2.
The conjugate ∗ is continuously differentiable on the interior of its domain with , and satisfies and .
-
3.
for any .
-
4.
The function , where , strictly increases from to 0.
Proof.
-
1) Trivial because of strict monotonicity on (since ).
-
2) Since , if one has that . For one directly has that , see [1, Prop. 13.10(i)], and in particular ; in addition, since with equality holding by virtue of [1, Thm. 16.29], we conclude that . For the same reason, one has that on , which proves that ∗ is strictly decreasing. Finally, since , we conclude that .
-
3) This is a standard result of Fenchel conjugacy, see e.g. [1, Prop. 16.10], here specialized to the fact that .
-
4) Strict monotonic increase follows by observing that for . Moreover,
Lastly, (17) where the first equality uses L’Hôpital’s rule. ∎
We next prove in detail the derivations of (1) and (1). Noticing that the minimization in (1) separates into that of for (after a division by a factor ), we provide the elementwise version of the claim.
Lemma A.2.
For and , let be defined as . Then,
(18a)
and
(18b)
where
(19c)
is (globally) Lipschitz differentiable and -Lipschitz continuous with derivative
(19d)
Proof.
We first consider (18a).
The properties of as in Section 3.2 ensure that is strictly convex and coercive, and that therefore it attains a unique (global) minimizer. Moreover, is differentiable on ; as such, the minimizer is the unique zero of if it exists, and 0 otherwise. Solving for gives
Therefore, the minimizer is given by
which yields the claimed expression in (18a).
Next, according to this formula, if then the minimizer of is , hence
where (i.e., such that ). Otherwise, if then the minimizer of is 0, which gives
By observing that iff we infer that
which proves (18b).
We next show the alternative expression involving the convex conjugacy for as in (19c). To this end, for a function and a number , let and denote the translation by and the reflection, respectively, and recall that and , see [1, Props. 13.23(iii)–(iv)]. We will also use the fact that holds for any pair of proper, lsc, convex functions defined on the same space, , where denotes the infimal convolution, see [1, Prop. 13.24]. Then,
and since for ,
which yields (19c), and the formula for the derivative as in (19d) is then immediately obtained.
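For the reader’s convenience, the three conjugacy rules invoked in this derivation can be recorded as follows (a recap in the notation of [1]; the sum rule holds under the qualification conditions of [1, Prop. 13.24]):

```latex
% Conjugacy rules for f, g proper, lsc, convex, and b a fixed vector:
%   translation:  (tau_b f)(x) := f(x - b)
%   reflection:   f^vee(x)     := f(-x)
\begin{align*}
  (\tau_b f)^* &= f^* + \langle \cdot, b \rangle
    && \text{[1, Prop.~13.23(iii)]} \\
  (f^\vee)^*   &= (f^*)^\vee
    && \text{[1, Prop.~13.23(iv)]} \\
  (f + g)^*    &= f^* \mathbin{\square} g^*,
  \quad
  (f^* \mathbin{\square} g^*)(y) := \inf_{u} \bigl\{ f^*(u) + g^*(y - u) \bigr\}
    && \text{[1, Prop.~13.24]}
\end{align*}
```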
To conclude, observe that is essentially differentiable (in the sense that as , 0 being the only point in the boundary of ), locally strongly convex and locally Lipschitz differentiable (having ), all these conditions also holding for the conjugate ∗ by virtue of [15, Cor. 4.4]. Therefore, is (globally) strongly convex, proving the claimed global Lipschitz-smoothness of its conjugate ∗. ∎
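The concluding step rests on the classical duality between strong convexity and Lipschitz-smoothness of the conjugate, complemented by the local version in [15, Cor. 4.4]. In the form used here (a standard statement from convex analysis):

```latex
% Duality between strong convexity and Lipschitz-smooth conjugates:
% if h is proper, lsc, and sigma-strongly convex, then h^* is
% differentiable everywhere with (1/sigma)-Lipschitz gradient.
\[
  h \ \text{proper, lsc, and } \sigma\text{-strongly convex}
  \quad \Longrightarrow \quad
  \| \nabla h^*(y) - \nabla h^*(y') \| \le \tfrac{1}{\sigma} \, \| y - y' \|
  \quad \text{for all } y, y'.
\]
```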
References
- [1] Heinz H. Bauschke and Patrick L. Combettes. Convex analysis and monotone operator theory in Hilbert spaces. CMS Books in Mathematics. Springer, 2017.
- [2] Shujun Bi and Shaohua Pan. Multistage convex relaxation approach to rank regularized minimization problems based on equivalent mathematical program with a generalized complementarity constraint. SIAM Journal on Control and Optimization, 55(4):2493–2518, 2017.
- [3] Ernesto G. Birgin and José Mario Martínez. Practical Augmented Lagrangian Methods for Constrained Optimization. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2014.
- [4] Emilie Chouzenoux, Marie-Caroline Corbineau, and Jean-Christophe Pesquet. A proximal interior point algorithm with applications to image processing. Journal of Mathematical Imaging and Vision, 62(6):919–940, 2020.
- [5] Andrew R. Conn, Nicholas I. M. Gould, and Philippe L. Toint. Trust Region Methods. Society for Industrial and Applied Mathematics, 2000.
- [6] Robert M. Corless, Gaston H. Gonnet, David E. G. Hare, David J. Jeffrey, and Donald E. Knuth. On the Lambert W function. Advances in Computational Mathematics, 5:329–359, 1996.
- [7] Frank E. Curtis. A penalty-interior-point algorithm for nonlinear constrained optimization. Mathematical Programming Computation, 4(2):181–209, 2012.
- [8] Alberto De Marchi. Proximal gradient methods beyond monotony. Journal of Nonsmooth Analysis and Optimization, 4, 2023.
- [9] Alberto De Marchi. Implicit augmented Lagrangian and generalized optimization. Journal of Applied and Numerical Optimization, 6(2):291–320, 2024.
- [10] Alberto De Marchi, Xiaoxi Jia, Christian Kanzow, and Patrick Mehlitz. Constrained composite optimization and augmented Lagrangian methods. Mathematical Programming, 201(1):863–896, 2023.
- [11] Alberto De Marchi and Andreas Themelis. Proximal gradient algorithms under local Lipschitz gradient continuity. Journal of Optimization Theory and Applications, 194(3):771–794, 2022.
- [12] Alberto De Marchi and Andreas Themelis. An interior proximal gradient method for nonconvex optimization. Open Journal of Mathematical Optimization, 2024. To appear.
- [13] Neil K. Dhingra, Sei Zhen Khong, and Mihailo R. Jovanović. The proximal augmented Lagrangian method for nonsmooth composite optimization. IEEE Transactions on Automatic Control, 64(7):2861–2868, 2019.
- [14] Anthony V. Fiacco and Garth P. McCormick. The sequential unconstrained minimization technique for nonlinear programming, a primal-dual method. Management Science, 10(2):360–366, 1964.
- [15] Rafal Goebel and R. Tyrrell Rockafellar. Local strong convexity and local Lipschitz continuity of the gradient of convex functions. Journal of Convex Analysis, 15(2):263–270, 2008.
- [16] Nadav Hallak and Marc Teboulle. An adaptive Lagrangian-based scheme for nonconvex composite optimization. Mathematics of Operations Research, 48(4):2337–2352, 2023.
- [17] Ben Hermans, Andreas Themelis, and Panagiotis Patrinos. QPALM: A proximal augmented Lagrangian method for nonconvex quadratic programs. Mathematical Programming Computation, 14:497–541, 2022.
- [18] Geoffroy Leconte and Dominique Orban. An interior-point trust-region method for nonsmooth regularized bound-constrained optimization, 2024.
- [19] Jakub Mareček, Peter Richtárik, and Martin Takáč. Matrix completion under interval uncertainty. European Journal of Operational Research, 256(1):35–43, 2017.
- [20] Edward J. McShane. Extension of range of functions. Bulletin of the American Mathematical Society, 40(12):837–842, 1934.
- [21] R. Tyrrell Rockafellar. Convergence of augmented Lagrangian methods in extensions beyond nonlinear programming. Mathematical Programming, 199(1):375–420, 2022.
- [22] R. Tyrrell Rockafellar and Roger J. B. Wets. Variational Analysis, volume 317. Springer, 1998.
- [23] Ajay S. Sathya, Pantelis Sopasakis, Ruben Van Parys, Andreas Themelis, Goele Pipeleers, and Panos Patrinos. Embedded nonlinear model predictive control for obstacle avoidance using PANOC. In 2018 European Control Conference (ECC), pages 1523–1528, 2018.
- [24] Lorenzo Stella, Andreas Themelis, Pantelis Sopasakis, and Panagiotis Patrinos. A simple and efficient algorithm for nonlinear model predictive control. In 2017 IEEE 56th Annual Conference on Decision and Control (CDC), pages 1939–1944. IEEE, 2017.
- [25] Andreas Themelis, Lorenzo Stella, and Panagiotis Patrinos. Forward-backward envelope for the sum of two nonconvex functions: Further properties and nonmonotone linesearch algorithms. SIAM Journal on Optimization, 28(3):2274–2303, 2018.
- [26] Andreas Wächter and Lorenz T. Biegler. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming, 106(1):25–57, 2006.