A penalty barrier framework for nonconvex constrained optimization
Abstract
Focusing on minimization problems with structured objective function and smooth constraints, we present a flexible technique that combines the beneficial regularization effects of (exact) penalty and interior-point methods. Working in the fully nonconvex setting, a pure barrier approach requires careful steps when approaching the infeasible set, thus hindering convergence. We show how a tight integration with a penalty scheme overcomes such conservatism, does not require a strictly feasible starting point, and thus accommodates equality constraints. The crucial advancement that allows us to invoke generic (possibly accelerated) subsolvers is a marginalization step: amounting to a conjugacy operation, this step effectively merges (exact) penalty and barrier into a smooth, full domain functional object. When the penalty exactness takes effect, the generated subproblems do not suffer the ill-conditioning typical of penalty methods, nor do they exhibit the nonsmoothness of exact penalty terms. We provide a theoretical characterization of the algorithm and its asymptotic properties, deriving convergence results for fully nonconvex problems. Illustrative examples and numerical simulations demonstrate the wide range of problems our theory and algorithm are able to cover.
Keywords. Nonsmooth nonconvex optimization, exact penalty methods, interior point methods, proximal algorithms.
1 Introduction
We are interested in developing numerical methods for constrained optimization problems of the form
where is the decision variable and functions and are problem data. (Throughout, we stick to the convention of bold-facing vector variables and vector-valued functions, so that indicates the zero vector of suitable size and similarly is the vector with all entries equal to one.) The proposed algorithm for (1) can also be applied to tackle problems with equality constraints . For simplicity of presentation, we focus on the more general inequality constrained case and illustrate in Section 4.3 an efficient way of handling equality constraints in our framework, rather than describing them as two-sided inequalities. Henceforth we consider (1) under the following standing assumptions.
Notice that no differentiability requirements are imposed on the cost , nor convexity on any term in the formulation. The primary objective of this paper is to devise an abstract algorithmic framework in the generality of this setting. The methodology requires an oracle for solving, up to approximate local optimality, minimization instances of the sum of with a differentiable term. In our numerical experiments we will invoke off-the-shelf routines based on proximal gradient iterations, thereby restricting our attention to problem instances in which is structured as for a differentiable function and a function that enjoys an easily computable proximal map. Most nonsmooth functions widely used in practice comply with all these requirements. For instance, can include indicators of any nonempty and closed set, and thus enforce arbitrary closed constraints that are easy to project onto. We also emphasize that the requirement of continuity relative to the domain is virtually negligible, as in most cases it can be circumvented through suitable reformulations of the problem that leverage the flexibility of the constraints. As a particularly enticing such instance, we mention the reformulation of the so-called -norm penalty (number of nonzero entries) for as the linear program
and remark that, more generally, matrix rank can also be cast in a similar fashion [2, Lem. 3.1].
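Such oracle subproblems are typically handled by proximal-gradient iterations. As a minimal, self-contained sketch (names and the toy instance are ours, for illustration only), assuming the structured cost splits into a smooth quadratic plus an l1 term with an easily computable proximal map:

```python
import numpy as np

def prox_l1(v, t):
    # proximal map of t * ||.||_1: elementwise soft-thresholding
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_grad_step(x, grad_f, prox_g, step):
    # one forward-backward step: x+ = prox_{step*g}(x - step*grad_f(x))
    return prox_g(x - step * grad_f(x), step)

# toy instance: minimize 0.5*||x - y||^2 + ||x||_1
y = np.array([3.0, -0.5, 1.2])
grad_f = lambda x: x - y          # gradient of the smooth part
x = np.zeros_like(y)
for _ in range(50):
    x = prox_grad_step(x, grad_f, prox_l1, step=1.0)
# converges to the soft-thresholding of y, i.e. [2.0, 0.0, 0.2]
```

Any closed constraint with an easy projection fits the same template, with the projection playing the role of the proximal map.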
Motivations and related work
The class of problems (1) with structured cost has been recently studied in [4] and [12], respectively, for the fully convex and nonconvex setting, developing methods that bear strong convergence guarantees under some restrictive assumptions. Above all, building on a pure barrier approach, these methods demand a feasible set with nonempty interior, thus excluding problems with equality constraints. Although restricted to simple bounds, a similar interior-point technique is investigated in [18] and manifests analogous pros and cons. In contrast to these works, we intend to address equality constraints as well. An augmented Lagrangian scheme for constrained structured problems was developed in [10], which also allows the specification of constraints in a function-in-set format.
Constrained structured programs (1) are also closely related to the template of structured composite optimization
with . By introducing additional variables , composite problems can be rewritten in (equality) constrained form recovering the class of problems (1), with a one-to-one relationship between (local and global) solutions and stationary points [10, Lemma 3.1]. The recent literature on structured composite optimization includes [21], only for convex , and [16, 9] for fully nonconvex problems, and concentrates almost exclusively on the augmented Lagrangian framework. Relying essentially on a penalty approach, in contrast to a barrier, the algorithmic characterization in [10] involved weaker assumptions and yet retrieved standard convergence results in constrained nonconvex optimization. However, the dependency on dual estimates makes methods of this family sensitive to the initialization of Lagrange multipliers. Moreover, they require some safeguards to ensure convergence from arbitrary starting points [3, 10]. In contrast, thanks to their ‘primal’ nature and inherent regularizing effect, penalty-barrier techniques can conveniently cope with degenerate problems.
The idea of adopting and merging penalty and barrier approaches, in a variety of possible flavors and combinations, is certainly not new, tracing back at least to [14]. Among several recent concretizations of this avenue, we refer to Curtis’ work [7] for a comprehensive discussion and further references. Our motivation for developing this technique for constrained structured problems comes from previous experience while designing the interior point scheme IPprox [12]. The key observation therein is that, with a pure barrier approach, the arising subproblems have a smooth term without full domain. This nonstandard situation, together with a nonconvex and possibly extended-real-valued cost and nonlinear constraints , makes it difficult to adopt accelerated subsolvers. (For comparison, most interior point algorithms for classical nonlinear programming transform the original problem into one with equalities and simple bounds only, treating the latter with a barrier and dampening search directions with a so-called fraction-to-the-boundary rule to maintain strict feasibility, relative only to simple bounds [26].)
In the broad setting of (1) under Section 1, a blind application of penalty-barrier strategies in the spirit of [7] would bear no advantages, since IPprox’s issue of a restricted domain would persist, hindering again the practical performance. In this paper we propose and investigate in detail a simple technique to overcome this limitation. The crucial step consists in the marginalization of auxiliary variables: after applying some penalty and barrier modifications, the auxiliary variables are optimized pointwise, for any given decision variable . (This approach can be interpreted as an extreme version of the so-called magical steps [5, 3], or slack reset in [7], and was inspired by the proximal approaches in [13, 9].) Before proceeding with the technical content, we emphasize that the marginalization step not only reduces the subproblems’ size (recovering that of only the original decision variable ), but also, and especially, results in a smooth penalty term for the subproblems that always has full domain. The emergence of this penalty-barrier envelope enables the adoption of generic, possibly accelerated subsolvers, as well as tailored routines that exploit the problem’s original structure. This claim will be substantiated in Fig. 1 where we show that convexity and Lipschitz differentiability are preserved in the transformed problems.
2 Preliminaries
In this section we comment on useful notation and preliminary results before discussing optimality notions to characterize solutions of (1).
2.1 Notation and known facts
With and we denote the real and extended-real line, respectively, and with and the set of positive and negative real numbers, respectively. The positive and negative parts of a number are respectively denoted as and , so that . In case of a vector , then and are meant elementwise.
The notation indicates a set-valued mapping that maps any to a (possibly empty) subset of ℝ^m. Its (effective) domain and graph are the sets and . Algebraic operations with or among set-valued mappings are meant in a componentwise sense; for instance, the sum of is defined as for all .
With and we indicate the distance from and the projection onto a nonempty set , respectively, namely
With we denote the indicator function of , namely such that if and otherwise. For an extended-real-valued function , the (effective) domain, graph, and epigraph are given by , , and . We say that is proper if and lower semicontinuous (lsc) if for all or, equivalently, if is a closed subset of ℝ^{n+1}. Following [22, Def. 8.3], we denote by the regular subdifferential of , where
(1)
The (limiting, or Mordukhovich) subdifferential of is , where if and only if and there exists a sequence in such that . In particular, holds at any ; moreover, is a necessary condition for local minimality of at [22, Thm. 10.1]. The subdifferential of at satisfies for any continuously differentiable around [22, Ex. 8.8]. If is convex, then coincide with the convex subdifferential
For a convex set and a point one has that , where
denotes the normal cone of at .
We use the symbol to indicate the Jacobian of a differentiable mapping , namely for all . For a real-valued function , we instead use the gradient notation to indicate the column vector of its partial derivatives. Finally, we recall that the convex conjugate of a proper lsc convex function is the proper lsc convex function defined as , and that one then has if and only if .
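As a worked example (our own computation, for illustration) that will be relevant once conjugates of barrier functions appear later: for the logarithmic barrier $b(t) = -\log t$ on $(0,\infty)$, extended as $+\infty$ elsewhere,

```latex
b^*(y) \;=\; \sup_{t>0}\,\{\,ty + \log t\,\}
\;=\;
\begin{cases}
-1 - \log(-y), & y < 0,\\
+\infty, & y \ge 0,
\end{cases}
```

and the Fenchel equality $b(t) + b^*(y) = ty$ characterizing $y \in \partial b(t)$ holds precisely for $y = b'(t) = -1/t$: indeed $-\log t + (-1 - \log(1/t)) = -1 = t \cdot (-1/t)$.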
2.2 Stationarity concepts
This subsection summarizes standard local optimality measures which were adopted in the proximal interior point framework of [12], and which will be further developed in the following Section 3 into conditions tailored to the setting of this paper. The interested reader is referred to [12, §2] for a verbose introduction and to [3, §3] for a detailed treatise. We start with the usual notion of (approximate) stationarity for general minimization problems of an extended-real-valued function.
[stationarity] Relative to the problem for a function , a point is called

1. stationary if it satisfies ;
2. -stationary (with ) if it satisfies .
A standard optimality notion that reflects the constrained structure of (1) is given by the Karush-Kuhn-Tucker (KKT) conditions.
[2.2 optimality]Relative to problem (1), we say that is KKT-optimal if there exists such that
In such case, we say that is a KKT-optimal pair for (1).
Even for convex problems, unless suitable constraint and epigraphical qualifications are met, local solutions may fail to be 2.2-optimal. Necessary conditions in the generality of problem (1) are provided by the following asymptotic counterpart.
[2.2 optimality]Relative to problem (1), we say that is asymptotically KKT-optimal if there exist sequences and such that
For the sake of designing suitable algorithmic stopping criteria, we also define an approximate variant which provides a further weaker notion of optimality.
[-KKT optimality]Relative to problem (1), for we say that is an (approximate) -KKT point if there exists such that
As discussed in the commentary after [3, Thm. 3.1], 2.2-optimality of is tantamount to the existence of a sequence of -KKT points for some . More generally, for any the notions are related as
while when one has that -KKT reduces to 2.2. We conclude by listing the observations in [12, Lem. 6 and Rem. 7] that will be useful in the sequel.
Relative to the conditions 2.2 in Section 2.2:

1. Up to possibly perturbing the sequence of multipliers, the complementarity slackness can be equivalently expressed as .
2. If the sequence of multipliers contains a bounded subsequence, then is a 2.2 point, not merely asymptotically.
3 Subproblems generation
In this section we operate a two-step modification of problem (1), whose conceptual roadmap is as follows. We begin with a soft-constrained reformulation in which violation of the inequality is penalized with an -norm in the cost function; the use of a slack variable simplifies this formulation by promoting separability. Next, a barrier term is added to enforce strict satisfaction of the constraints in the -penalized reformulation. The minimization with respect to the slack variable in the resulting problem can be carried out explicitly, and gives rise to a new problem in which the constraint is softened with a smooth penalty. Increasing the -penalty and decreasing the barrier coefficients results in a homotopic transition between smooth reformulations and the original nonsmooth problem.
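In schematic form, writing $q$ for the cost and $c$ for the constraint mapping (generic placeholders of ours, as this is only a sketch of the construction), the roadmap reads:

```latex
% l1 relaxation and slack reformulation
\min_{x}\ q(x)\ \text{ s.t. }\ c(x)\le 0
\ \rightsquigarrow\
\min_{x}\ q(x)+\beta\bigl\|[c(x)]_+\bigr\|_1
\;=\;
\min_{x,\,s}\ q(x)+\beta\,\mathbf 1^{\top}[s]_+\ \text{ s.t. }\ c(x)\le s.
% barrier on the slack residual, then marginalization of s
\min_{x,\,s}\ q(x)+\beta\,\mathbf 1^{\top}[s]_+ +\mu\sum_i b\bigl(s_i-c_i(x)\bigr)
\;=\;
\min_{x}\ q(x)+\sum_i \varphi_{\mu,\beta}\bigl(c_i(x)\bigr),
\qquad
\varphi_{\mu,\beta}(v):=\min_{s\in\mathbb R}\bigl\{\beta\,[s]_+ +\mu\,b(s-v)\bigr\}.
```

Increasing $\beta$ and decreasing $\mu$ then drives the marginal penalty $\varphi_{\mu,\beta}$ toward the sharp penalty $\beta\,[v]_+$, which is the homotopic transition described above.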
3.1 -penalization
Given , we consider the following relaxation of (1):
By introducing a slack variable , (3.1) can equivalently be cast as
as one can easily verify that holds for any and . In other words, (3.1) amounts to (3.1) after a marginal minimization with respect to the slack variable . Accordingly, we may consider the following relaxed optimality notion for problem (1), which, as explained below, is tantamount to 2.2-optimality for problem (3.1).
[3.1 optimality]Given , we say that a point is 3.1-optimal for (1) if there exists such that
In such case, we call a 3.1-optimal pair for (1).
Since the cost function in (3.1) is separable in and , its subdifferential at any point is the Cartesian product of the partial subdifferentials, see [22, Prop. 10.5]. By further observing that
for all (and is empty otherwise), it is easy to see that is 3.1-optimal iff (with ) is 2.2-optimal for (3.1), and in which case the multipliers coincide. More importantly, the following result clarifies how 2.2- and 3.1-optimality for problem (1) are interrelated. The result is standard, but its simple proof is nevertheless provided out of self containedness.
The following hold:
- 1.
- 2.
Proof.
For clarity of exposition, we have schematically reported the 2.2 and 3.1 optimality conditions side by side in (1). Notice in particular that the dual feasibility and primal optimality coincide in both notions.
(1)
??) Primal feasibility holds by assumption, and since , the complementarity slackness in 3.1 reduces to , , yielding the corresponding complementarity slackness condition in Section 2.2.

??) The upper bound holds by assumption, and since , for every one has that and . ∎
3.2 IP-type barrier reformulation
To carry on with the second modification of the problem, in what follows we fix a barrier satisfying the following requirements.
For reasons that will be elaborated on later, convenient choices of barriers are and (both extended as on +), see Table 1 in Section 4.1. Once such is fixed, in the spirit of interior point methods, given a parameter we enforce strict satisfaction of the constraint in (3.1) by considering the following barrier version
Differently from the IP frameworks of [4, 12], we here enforce a barrier in the relaxed version (3.1), and not on the original problem (1). As such, it is only pairs that need to lie in the interior of the constraints, but is otherwise ‘unconstrained’: for any , any (elementwise) yields a pair that satisfies the strict constraint . In fact, we may again explicitly minimize with respect to the slack variable by observing that
(1)
holds for every , where and similarly conjugation and derivative are meant elementwise. Plugging the optimal into (3.2) results in
where, for any , is a separable function given by
(1a)

with

(1d)

being (globally) Lipschitz differentiable and -Lipschitz continuous with derivative

(1e)
A step-by-step derivation of all the identities above is given in A.2 in the appendix.
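The marginalization behind the identities above is easy to sanity-check numerically. The sketch below (a toy of ours, using the logarithmic barrier b(t) = -log t; the closed form is our own derivation for this example, not quoted from the text) computes the per-component marginal phi(v) = min_s { beta*[s]_+ + mu*b(s - v) } by ternary search on the convex inner objective, and confirms that phi is finite for every v, i.e. it has full domain:

```python
import math

def phi_numeric(v, beta, mu, width=1e6, iters=200):
    # marginal value min_{s > v} beta*max(s,0) + mu*(-log(s - v)),
    # computed by ternary search (the inner objective is convex in s)
    def obj(s):
        return beta * max(s, 0.0) + mu * (-math.log(s - v))
    lo, hi = v + 1e-12, v + width
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if obj(m1) < obj(m2):
            hi = m2
        else:
            lo = m1
    return obj(0.5 * (lo + hi))

def phi_closed(v, beta, mu):
    # closed form for the log barrier (derived by first-order conditions
    # for this sketch): minimizer s = v + mu/beta if positive, else s = 0
    if v > -mu / beta:
        return beta * v + mu * (1.0 + math.log(beta / mu))
    return -mu * math.log(-v)

beta, mu = 4.0, 0.1
for v in (-3.0, -0.5, -0.01, 0.0, 0.7):
    assert abs(phi_numeric(v, beta, mu) - phi_closed(v, beta, mu)) < 1e-6
# exactness in the limit: phi_{mu,beta}(v) -> beta * [v]_+ as mu -> 0
assert abs(phi_closed(2.0, beta, 1e-9) - beta * 2.0) < 1e-6
assert abs(phi_closed(-1.0, beta, 1e-9)) < 1e-6
```

Note that phi is finite on both sides of the constraint boundary: the barrier branch is active for sufficiently feasible v, the linear branch elsewhere.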
Problem (3.2) is ‘unconstrained’, in the sense that no explicit ambient constraints are provided, yet stationarity notions relative to it bear a close resemblance with KKT-type optimality conditions.
Proof.
Since is differentiable, we just need to confirm that its gradient equals the sum in the display. We have
where the first identity owes to the definition of . Using (1e) yields the claimed expression. ∎
As is apparent from (1d), coincides with up to when its slope is , and after that point it reduces to its tangent line. As such, coincides with a McShane Lipschitz (and globally Lipschitz differentiable) extension [20] of a portion of the barrier , cf. Fig. 1(a). As Fig. 1(b) instead shows, by introducing a scaling factor the linear part has slope , independently of , and the sharp penalty coincides with the limiting case as is driven to 0. These details are formalized next.
[Limiting behavior of ] The following hold:

1. pointwise as .
2. pointwise as .
3. For any , pointwise as .
Proof.
??) Follows from the expression (1d) by observing that is tangent to the graph of (at ), and is thus globally majorized by because of convexity.

??) From the relation as in (19c) and the conjugacy calculus rule of [1, Prop. 13.23(ii)] it follows that

To show that is pointwise decreasing as , it thus suffices to show that is pointwise increasing, owing to the relation [1, Prop. 13.16(ii)]. On one has that independently of , as it follows from Item 2. Moreover, . Finally, for one has that

where the second identity follows from Item 3. This confirms monotonicity on as .

??) This follows from assertion ??, since , and as . ∎
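For the logarithmic barrier $b(t) = -\log t$, the smooth penalty and its derivative can be written out explicitly; the following piecewise expressions are our own computation for illustration, under the convention $\varphi_{\mu,\beta}(v) = \min_{s}\{\beta\,[s]_+ + \mu\, b(s-v)\}$:

```latex
\varphi_{\mu,\beta}(v) =
\begin{cases}
-\mu\,\log(-v), & v \le -\mu/\beta,\\[2pt]
\beta v + \mu\bigl(1 + \log(\beta/\mu)\bigr), & v > -\mu/\beta,
\end{cases}
\qquad
\varphi_{\mu,\beta}'(v) =
\begin{cases}
-\mu/v, & v \le -\mu/\beta,\\[2pt]
\beta, & v > -\mu/\beta.
\end{cases}
```

The two branches match in value and slope at $v = -\mu/\beta$, so $\varphi_{\mu,\beta}$ is $C^1$ with $\beta$-bounded and $(\beta^2/\mu)$-Lipschitz derivative, has full domain, and converges pointwise to the sharp penalty $\beta\,[v]_+$ as $\mu \to 0$, in line with the limits established in this section.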
In support of our claims above, we emphasize that the smooth penalty is convex whenever all the components are, and demonstrate how Lipschitz differentiability of is usually preserved after a composition with .
[Properties of ] Let be fixed and let comply with Section 3.2.

1. If is convex, then is convex.
2. If has Lipschitz-continuous gradient, then so does , provided that is Lipschitz continuous on closed subsets of (as is the case when is lower bounded [17, Lem. 2.3]).
Proof.
The first claim about convexity is straightforward, since would amount to the composition of the convex and increasing function with the convex function . We next prove the statement about Lipschitz differentiability, to this end assuming that is Lipschitz on ℝ^n with modulus , while is Lipschitz on with modulus , where . Recall that is -Lipschitz continuous, coincides with on , and is then linear with slope on , cf. (1e). Fix , and without loss of generality assume that . We have
It remains to account for the second term in the last sum. If , then coincides with in all occurrences, and the term can be upper bounded as , where is a Lipschitz modulus for ′ on . If , then and, by continuity, there exists such that , so that , resulting in the same bound . Lastly, if then the last term is zero. In all cases we conclude that
proving the claim. ∎
The requirements on other than Lipschitz differentiability in Item 2 are virtually negligible, since lower boundedness can be artificially imposed by replacing with, say, where . Inspired by the penalty function adopted in [23], is lower bounded and such that iff ; in addition, having , its adoption does not affect qualifications of active constraints. A Hessian inspection reveals that Lipschitz differentiability is preserved for ‘reasonable’ , e.g. whenever is bounded on the set , as it happens for quadratic functions. This is in stark contrast with methods such as ALM in which Lipschitz differentiability is typically lost in the composition with quadratic penalties.
4 Algorithmic framework
As shown in the previous section, the cost function in the smoothed problem (3.2) converges pointwise to the original hard-constrained cost of (1) as and . Following the penalty method rationale, this motivates solving (up to approximate local optimality) instances of (3.2) for progressively smaller values of and larger values of . This is the leading idea of the algorithmic framework of Algorithm 1 presented in this section, which also implements suitable update rules for the coefficients, ensuring that the output satisfies suitable optimality conditions for the original problem (1). In fact, a careful design of the update rule for the penalization parameter in (3.2) prevents this coefficient from diverging under favorable conditions on the problem. The reason behind the involvement of the conjugate ∗ in the update criterion at Step 1.10 will be revealed in Sections 4.1 and 4.2 through a systematic study of the properties of the barrier in the generality of Section 1 as well as when specialized to the convex case.
Algorithm 1 is not tied to any particular solver for addressing each instance of (3.2). Whenever amounts to the sum of a differentiable and a proximable function (in the sense that its proximal mapping is easily computable), such structure is retained by the cost function in (3.2), indicating that proximal-gradient based methods are suitable candidates. This was also the case in the purely interior-point based IPprox of [12], which considers a plain proximal gradient with a backtracking routine for selecting the stepsizes. Differently from the subproblems of IPprox, in which the differentiable term is extended-real-valued, the differentiable term in (3.2) is smooth on the whole ℝ^n. This enables the use of more sophisticated proximal-gradient-type algorithms such as PANOC+ [11] that make use of higher-order information to considerably enhance convergence speed. This claim will be substantiated with numerical evidence in Section 5; in this section, we instead focus on properties of the outer Algorithm 1 that are independent of the inner solver.
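As a concrete, heavily simplified illustration of this outer/inner structure (a toy of ours, not the actual Algorithm 1, whose update tests and safeguards we omit), consider minimizing x^2 subject to 1 - x <= 0 with the marginalized log-barrier penalty: each smooth subproblem is solved by bisection on its derivative, while the barrier parameter mu is driven to zero with beta fixed above the optimal multiplier (here lambda* = 2):

```python
import math

def phi_prime(v, beta, mu):
    # derivative of the marginalized log-barrier penalty (toy closed form
    # derived for this sketch): slope beta on the linear branch,
    # -mu/v on the barrier branch
    return beta if v > -mu / beta else -mu / v

def solve_subproblem(beta, mu, lo=1.0 + 1e-12, hi=10.0, iters=200):
    # subproblem: minimize F(x) = x^2 + phi(1 - x); F is convex, so we
    # locate the root of F'(x) = 2x - phi'(1 - x) by bisection
    def Fp(x):
        return 2.0 * x - phi_prime(1.0 - x, beta, mu)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if Fp(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

beta, mu = 4.0, 1.0              # beta > lambda* = 2 (exact-penalty threshold)
for _ in range(10):              # outer loop: shrink the barrier parameter
    x = solve_subproblem(beta, mu)
    mu *= 0.1
assert abs(x - 1.0) < 1e-6       # constrained minimizer of x^2 over x >= 1
```

The point of the marginalization shows here: each subproblem is smooth with full domain, so any unconstrained routine (bisection above, but equally a quasi-Newton or accelerated proximal-gradient method) can be plugged in as the inner solver.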
[properties of the iterates]Suppose that Sections 1 and 3.2 hold, and consider the iterates generated by Algorithm 1. At every iteration the following hold:
-
1.
.
-
2.
( and) .
-
3.
If ( and) , then .
-
4.
For , either or (possibly both); in particular, letting and it holds that .
Proof.
Recall that holds by construction for every , since , from which assertion ?? follows. Regarding assertion ??, by Section 3.2 this is precisely the required stationarity for . Assertion ?? is obvious by observing that whenever the update is enforced.
We finally turn to assertion ??, and suppose that . Then, for all such that (or, equivalently, ), one has
where the first inequality follows from the definition of and the second one owes to monotonicity of ′. Thus, for all at least one among and is not larger than , proving that . ∎
[stationarity of feasible limit points]Let Sections 1 and 3.2 hold, and consider the iterates generated by Algorithm 1. If the algorithm runs indefinitely, then as and any accumulation point of that satisfies is 2.2-optimal for (1).
Proof.
The monotonic vanishing of follows from Items 4 and 4. Suppose that with , and let be such that (if such an does not exist, then there is nothing to show). According to Sections 2.2 and 1, it suffices to show that ; in turn, by definition of and continuity of it suffices to show that . If , then continuity of implies that for all large enough, hence, by virtue of Item 4, for all such . Since is monotone, in this case as . Suppose instead that , and, to arrive at a contradiction, that is asymptotically constant. This implies that eventually always holds, which is a contradiction since
where the first inequality follows by definition of , cf. Step 1.6, the second one for large since as , and the last one because . ∎
The update rule for the penalty parameter does not demand (approximate) feasibility, but depends on a relaxed condition at Step 1.10. By (17), the second term vanishes as , so the penalty parameter is eventually increased as needed to achieve -feasibility. Relaxing this condition by a quantity involving the conjugate ∗ mitigates the growth of . Simultaneously, under suitable choices of the barrier , it ensures that this parameter remains unchanged only if the constraint violation stays within a controlled range, as will be ultimately demonstrated in Section 4.1.
Suppose that Sections 1 and 3.2 hold, and consider the iterates generated by Algorithm 1 with . Then, and exactly one of the following scenarios occurs:

1.
2. or it runs indefinitely with for all large enough, and .

In the latter case, if is closed, then for any accumulation point of one has that is KKT-stationary for the feasibility problem

(2)

in the sense that .
Proof.
Since for all , and is linearly reduced whenever , we conclude that (either the algorithm terminates or) eventually always holds.
If the algorithm returns a pair , then the compliance with the termination criteria ensures that meets all conditions in Section 2.2, and hence it is -KKT-stationary for (1).
Suppose instead that the algorithm does not terminate. Clearly, holds for large enough, so that the only unmet termination criterion is eventually . Therefore, holds for every large enough. It follows from assertion ?? and Item 4 that eventually drops below , implying that the condition for increasing at Step 1.10 reduces to . Having shown that this is eventually always the case, always holds for large, , and is eventually never updated, cf. Step 1.10.
To conclude, suppose that is closed. By Section 3.2, for every we have that there exists with such that
Let be the limit of a subsequence and, up to extracting a further subsequence, let be the limit of . By the definition of and the continuity of ,
Equivalently,
(3)
Since is closed and for all , one has that . Moreover, it follows from Assumption 1 that as , hence that
(4)
where the inclusion follows from [22, Thm. 8.9]. Since
see [22, Ex. 10.26], it follows from (3) and continuity of that . Combining with (4) concludes the proof. ∎
The abuse of terminology to express KKT-stationarity in terms of subdifferentials passes through the same construct relating (3.1) and (3.1), in which a slack variable is tacitly introduced to reformulate the norm; see the discussion in Section 3.1. More importantly, the involvement in (2) of the epigraph of , as opposed to its domain, is a necessary technicality that cannot be avoided in the generality of Section 1, as we illustrate next.
[ vs ]Stationarity for (2) is, in general, weaker than that for the more natural minimal infeasibility violation problem
(5)
To see how this notion may be violated, consider and , so that (1) reads
The point is stationary for any subproblem (3.2) with arbitrary , and therefore constitutes a feasible choice in Algorithm 1. However, the limit of the corresponding constant sequence is not stationary for the minimization of over . Nevertheless,
confirming that is stationary for the epigraphical problem (2).
We next formally illustrate why stationarity for (5) always implies that for (2), and identify the culprit of a possible discrepancy in uncontrolled growths around from within . To this end, we remind that a function is said to be calm at a point relative to a set if
and that this condition is weaker than strict continuity.
Let be proper and lsc. Then, for any one has
and
When is convex, both inclusions hold as equality. More generally, when is calm (in particular, if it is strictly continuous) at relative to , then the first inclusion holds as equality, and so does the second one when such property holds not only at , but also at all points in close to it.
Proof.
The relations in the convex case are shown in [22, Thm. 8.9 and Prop. 8.12]; in what follows, we consider an arbitrary proper and lsc function . Let and let . Then, there exists such that holds for every , hence
By the arbitrariness of the sequence, we conclude that . The same inclusion must then hold for the limiting normal cones, leading to
(6)
where the identity follows from [22, Thm. 8.9].
Suppose now that there exists such that for close to , and suppose that . Let , and note that . Then, there exists such that
where the second inequality holds for large enough. Arguing again by the arbitrariness of the sequence, we conclude that . Finally, when is calm relative to its domain at all points close to , then the identity holds for all such points, and a limiting argument then yields that holds for the limiting normal cones. Therefore, the inclusion in (6) holds as equality, which concludes the proof. ∎
4.1 Barrier’s properties
According to its update rule in Algorithm 1, before a desired feasibility violation has been reached, means that , where . As shown in Item 4, regardless of whether is updated or not, grows linearly over the iterations; therefore, having implies in particular that either the constraint violation is within a desired tolerance , or that it is controlled by , where the inequality follows from monotonicity of , cf. Item 4. This means that a desired decrease in feasibility violation can be enforced through suitable choices of the barrier . This will be particularly significant in the convex case, for it can be shown that is eventually never updated under reasonable assumptions.
Let Sections 1 and 3.2 hold, and consider the iterates generated by Algorithm 1. Suppose that there exists such that the barrier satisfies for every (resp. for every close enough to 0), where . Then,
(7)
holds for every (resp. for every large enough).
Proof.
To simplify the presentation, without loss of generality let us set . We have already argued that implies , where for all . It thus suffices to show that . To this end, notice that for every one has
hence, since for ,
which amounts to the condition in the statement. Under such condition, then, holds for every , leading to as claimed. ∎
Though it would be tempting to seek barriers for which as in the proof vanishes at any desired rate, it can be easily verified that no choice of or can result in converging any faster than linearly. In fact,
where the inequality follows from monotonicity of , cf. Item 2. This shows that a linear decrease by a factor is the best achievable rate, and that this can only happen in the limit. Section 4.1 nevertheless identifies a property that allows us to judge the fitness of a barrier within the framework of Algorithm 1. As we will see in Section 4.2, this will be particularly evident in the convex case, for it can be guaranteed that, under assumptions, eventually does remain always constant, so that employing a barrier that complies with this requirement is a guarantee that eventually the infeasibility of the iterates generated by Algorithm 1 vanishes at R-linear rate. This motivates the following definition.
[behavior profiles of ]We say that a barrier complying with Section 3.2 is asymptotically well behaved if
If this condition can be strengthened to
then we say that is well behaved (not merely asymptotically). We call the functions the behavior profile and the asymptotic behavior profile of , respectively.
In penalty-type methods, the update of a penalty parameter is typically decided based on the violation of the corresponding constraints. Under the assumption that the barrier is (asymptotically) well behaved, Section 4.1 demonstrates that in Algorithm 1 (eventually) the condition furnishes a guarantee of linear decrease of the infeasibility. Insisting on continuity of and at in Section 4.1 is a minor technicality ensuring that, regardless of the value of and , for any (asymptotically) well behaved barrier there always exists such that holds for every (close enough to zero) as required in Section 4.1. The result can thus be restated as follows.
Additionally to Sections 1 and 3.2, suppose that the barrier is (asymptotically) well behaved. Then, there exists such that the iterates of Algorithm 1 satisfy (7) for all (large enough).
When it comes to comparing different barriers, lower values of are clearly preferable. Notice that both and are scaling invariant:
Moreover, since
(owing to monotonicity of and the fact that consequently for ), barriers attaining can be considered asymptotically optimal. Table 1 shows that logarithmic barriers can attain such lower bound.
4.2 The convex case
In this section we investigate the behavior of Algorithm 1 when applied to convex problems. In particular, we detail an asymptotic analysis in which the termination tolerances are set to zero, so that the algorithm (may) run indefinitely. We demonstrate that under standard assumptions the iterates subsequentially converge to (global) solutions, and that the penalty parameter is eventually never updated.
Additionally to Sections 1 and 3.2, suppose that and , , are convex functions, and that there exists an optimal 2.2-pair for (1). Then, the following hold for the iterates generated by Algorithm 1 with :
-
1.
Any accumulation point of the sequence is a solution of (1).
-
2.
If, additionally, remains bounded (as is the case when is bounded), is eventually never updated.
-
3.
Further assuming that the barrier is asymptotically well behaved, so that there exists such that for every close enough to 0, the feasibility violation eventually vanishes with rate .
Proof.
It follows from Item 2 that solves (3.1) for all . For every , there exists with such that . If is asymptotically constant, then eventually always holds, and, since the right-hand side vanishes as , any limit point of satisfies . Otherwise, and, for large such that , since and solves (3.1) one has
since for any , see Item 2,
since and is convex by virtue of Item 1,
since and is increasing for any , cf. (1e), and since ,
(8)
where the last identity uses (1d). Therefore, dividing by and denoting ,
holds for all large. Along any convergent subsequence, it is clear that , hence that the corresponding limit point satisfies , and is thus optimal in view of Algorithm 1 and convexity of the problem.
If the entire sequence is contained in for some , then
If, contrary to the claim, , then appealing to Item 2 the right-hand side of the above inequality converges to as , and in particular is eventually smaller than one. That is, eventually always holds and thus is never updated, a contradiction.
Finally, the claim about the rate of follows from Section 4.1. ∎
4.3 Equalities and bilateral constraints
In previous sections we described an algorithm for the inequality constrained problem (1), but our approach can be applied to problems with bilateral and equality constraints as well. There are several possibilities for incorporating such constraints in the format (1) by means of reformulations. Below we discuss only a few options and refer to the related discussion in [7, §4.1.4] for more. Here we focus on two key aspects: on the one hand, for the sake of efficiency, we intend to exploit the additional problem structure available when bilateral constraints are explicitly specified by the user. On the other hand, for the sake of robustness, we need to tolerate and handle bad formulations as well. The latter point is particularly significant for large-scale, possibly automatically generated, optimization models, for which it might be impractical to parse the constraints specification with the goal of uncovering, e.g., hidden equalities. In practice, nonlinear problems (1) may contain (approximate) constraint redundancies leading to singularities in the Jacobian: thanks to the penalty-barrier regularization, we expect Algorithm 1 to cope with these kinds of degeneracy.
Two inequalities
An equality constraint can be split into the pair of inequalities and , making the technique developed for (1) directly applicable. However, following the approach outlined in Section 3, this can result in a flat portion around the feasible set, and so in the absence of an effective penalization term. This circumstance is demonstrated by the fact that
(9)
which displays a flat region around with radius vanishing as .
This splitting into a pair of inequality constraints also affects the theoretical motivations of Section 3. In fact, applying this technique, constraint qualifications fail to hold for in (1). Therefore, as these typically guarantee boundedness of the set of optimal Lagrange multipliers, the penalty exactness may not apply. These considerations underline that it is important to handle equality constraints carefully.
We now retrace the developments of Section 3 focusing on problems of the form
(10)
Formulating the penalty problem—in the form (3.1)—with two slack variables as
results in a penalty barrier problem analogous to (3.2) after marginalization with respect to , which reads
Including the sum , it is easy to see that this approach is equivalent to specifying two (independent) inequalities. Thus, considering two slack variables does not overcome the issue of flat regions discussed above for (hidden) equalities. Instead, applying the technique above with only one slack variable appears to be a more reasonable approach for handling equality constraints, as we are about to show.
Combined marginalization
For problems with bilateral constraints as (10) we may formulate the associated penalty problem, in analogy with (3.1), with a single slack variable as
Then, as for (3.2), we introduce a barrier to replace the inequality constraints, obtaining
Following this strategy, the marginalization subproblem corresponding to a constraint yields a definition for the counterpart of in (1) with bounds—see also (1):
(11)
and, in analogy to (3.2), leads to the penalty-barrier subproblem
where . A closed-form expression for the marginalization subproblem (11) can be obtained for specific barrier choices. For the inverse barrier function (extended as +∞ on the nonnegative reals), the optimal value for the auxiliary variable is given by
Similarly, the oracle for the log-like barrier function (extended as +∞ on the nonnegative reals) reads
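When a barrier admits no closed-form oracle, the scalar marginalization can still be carried out numerically. The sketch below is a loose illustration only: it assumes a quadratic penalty term (ρ/2)(c − s)² coupled with the inverse barrier b(s) = −1/s — the paper's actual penalty term and constants are not reproduced here — and finds the minimizing slack by bisection on the stationarity condition.

```python
def marginalized_pb(c, mu=1.0, rho=1.0):
    """Numerically marginalize the slack s in a penalty-barrier term of the
    (hypothetical) form  psi(c) = min_{s<0}  mu*b(s) + (rho/2)*(c - s)**2,
    with the inverse barrier b(s) = -1/s.  The objective is strictly convex
    in s, so the unique minimizer is the root of the stationarity condition
    g(s) = mu/s**2 + rho*(s - c) = 0, and g is strictly increasing on s < 0."""
    lo, hi = -1e8, -1e-12          # bracket: g(lo) < 0 < g(hi)
    for _ in range(200):           # plain bisection on the increasing g
        mid = 0.5 * (lo + hi)
        g = mu / mid**2 + rho * (mid - c)
        if g < 0:
            lo = mid
        else:
            hi = mid
    s = 0.5 * (lo + hi)
    return mu * (-1.0 / s) + 0.5 * rho * (c - s) ** 2
```

For this assumed form, as mu → 0 the marginalized term approaches the smooth quadratic penalty (ρ/2) max(c, 0)², and it is finite for every c: the barrier has been merged into a full-domain smooth object, as in the marginalization step of Section 3.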
It is nevertheless easy to verify, for an arbitrary barrier complying with Section 3.2, the sharp penalization of the interval attained in the limit.
[Limiting behavior of ]For any with one has that pointwise as .
Proof.
To begin with, notice that as in (11) is bounded as
where the first inequality owes to the fact that , and the second one to the fact that (hence that is increasing). Again deferring to A.2 for the details, such lower and upper bounds coincide with unilateral smooth penalties as in Section 3.2, namely
(12)
Dividing by and letting , Item 2 yields that both the lower and upper bounds converge to , demonstrating the claim. ∎
The penalty-barrier function associated to an equality constraint is depicted in Fig. 2 and contrasted with the two-inequality term . An analogous comparison between and associated to the more general bilateral constraint is displayed in Fig. 3 for the case . The two scaled penalty-barrier terms and have similar behaviors, according to Figs. 2(b) and 3(b), and lead to a well-justified procedure within Algorithm 1. In particular, both penalty-barrier terms converge to some sharp penalty function as (for fixed ). However, exhibits a flat region around the midpoint , whereas is strictly convex everywhere; this stark contrast is especially apparent for and at the origin, where only the former attains its unique minimum. Regardless, since the width of the flat portion of vanishes as , Algorithm 1 can terminate for any even when invoked with hidden equality or bilateral constraints. The minor modifications needed to account for explicit equalities are given next.
[Algorithm 1 with equality constraints]Two additional steps in Algorithm 1 allow it to handle problems of the form
for some smooth mapping . First, an equality multiplier is introduced; its update mirrors that of , but involves the derivative of as opposed to that of , see the comment in Step 1.4, and is thus not restricted in sign (as expected of equality multipliers). The infeasibility measure is captured in a straightforward manner. Altogether, the additions to Steps 1.4 and 1.5 are as follows:
All the convergence results remain valid for this variant; boundedness of in Section 4.2 is also recovered as long as at Step 1.10 is replaced by , namely in such a way that equalities are counted as two inequalities. The reason behind this can be easily understood from the upper bound in (12), which can be used in (8) by turning the equality therein into an inequality.
Based on the discussion above, bilateral constraints should be explicitly specified, whenever possible, in order to employ the tailored penalty-barrier relaxation. Nevertheless, it is an important feature of our scheme that it can handle badly formulated models, for instance, with hidden equalities. These favorable qualities will be showcased in Section 5.5, where we present a numerical comparison of the two options.
5 Numerical experiments
This section is dedicated to experimental results and comparisons with other numerical approaches for constrained structured optimization. We will refer to our implementation of Algorithm 1 as Alg1. The performance and behavior of Alg1 is illustrated in different variants, considering two barrier functions, namely and (both extended as +∞ on the nonnegative reals), denominated inverse and log-like, respectively, and two inner solvers, NMPG [8] and PANOC+ [11, 24]. The numerical comparison will highlight the influence of the barrier function on the performance of Alg1, supporting the quality assessment of Section 4.1.
The two subsolvers follow a proximal-gradient scheme and can handle merely local smoothness (as opposed to global Lipschitz continuity of the gradient of the smooth term). NMPG combines a spectral stepsize with a nonmonotone globalization strategy. PANOC+ can exploit acceleration directions (e.g., of quasi-Newton type) while ensuring convergence with a backtracking linesearch, see also [25, §5.1].
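Both subsolvers build on the basic proximal-gradient iteration; a minimal fixed-stepsize sketch (without the spectral stepsizes, nonmonotone globalization, or acceleration that NMPG and PANOC+ add on top) might read:

```python
def prox_grad(grad_f, prox_g, x0, L, tol=1e-8, max_iter=10_000):
    """Fixed-stepsize proximal-gradient iteration
        x+ = prox_{g/L}( x - grad_f(x)/L ),
    the basic scheme underlying subsolvers such as NMPG and PANOC+.
    grad_f maps a point (list) to its gradient; prox_g(v, t) evaluates
    the proximal mapping of the nonsmooth term with stepsize t."""
    x = list(x0)
    for _ in range(max_iter):
        step = [xi - gi / L for xi, gi in zip(x, grad_f(x))]
        x_new = prox_g(step, 1.0 / L)
        # stop when the (scaled) fixed-point residual is below tolerance
        if max(abs(a - b) for a, b in zip(x, x_new)) * L <= tol:
            return x_new
        x = x_new
    return x
```

For instance, minimizing f(x) = ½(x − 3)² over [0, 1] (with g the indicator of the box, whose prox is the projection) drives the iterate to the constrained minimizer x = 1.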
The performance of Alg1 is compared against those of IPprox [12, Alg. 1] and ALPS [9, Alg. 4.1], based on [10]. IPprox builds upon a pure interior point scheme and solves the barrier subproblems with a tailored adaptive proximal-gradient algorithm. ALPS belongs to the family of augmented Lagrangian algorithms and does not require a custom subsolver—suitable subsolvers for Alg1 can be applied within ALPS and vice versa.
Patterning the simulations of [12, §5.2], we examine the nonnegative PCA problem in Section 5.2 to evaluate Alg1 in several variants and compare it against IPprox. Then, Section 5.3 focuses on a low-rank matrix completion task, a fully nonconvex problem with bilateral constraints, contrasting Alg1 and ALPS. Finally, the exact penalty behavior and the ability to handle hidden equalities are illustrated and discussed in Sections 5.4 and 5.5, respectively.
The source code of our implementation has been made available for reproducibility of the numerical results presented in this paper; it can be found on Zenodo at doi: 10.5281/zenodo.11098283.
5.1 Implementation details
We describe here details pertinent to our implementation Alg1 of Algorithm 1, such as the initialization and update of algorithmic parameters. These numerical features tend to improve the practical performance without compromising the convergence guarantees. IPprox is available from [12] and adopted as is, whereas ALPS is a slight modification of the code from [9] to be comparable with Alg1, as detailed below.
Alg1 accepts problems formulated as in (10), with bilateral bounds defined by extended-real-valued vectors and . In a preprocessing phase (within Alg1), these vectors are parsed to instantiate the penalty-barrier functions to treat one-sided, two-sided and equality constraints.
Default parameters for Alg1 are and as in IPprox, as in ALPS, and . The initial tolerance for Alg1 (and ALPS) is chosen adaptively, based on the user-provided starting point and penalty-barrier parameters. Matching the mechanism implemented in IPprox, we set , where is a user-specified parameter (default ) and is an estimate of the initial stationarity measure, as evaluated by the inner solver invoked at . For simplicity, no infeasibility detection mechanism nor artificial bounds on penalty and barrier parameters have been included.
We run ALPS with the same settings as in [10, 9] apart from the following features to match Alg1: the initial penalty parameter is fixed () and not adaptive, the tolerance reduction factor is set to instead of , and the initial inner tolerance is selected adaptively and not fixed to . We always initialize ALPS with dual estimate . The two subsolvers are considered with their default tuning: PANOC+ with L-BFGS directions (memory 5) and monotone linesearch strategy as in [11], NMPG with spectral stepsize and nonmonotone globalization with average-based merit function as in [8].
For the set of problems and the set of solvers, let denote the user-defined metric for the computational effort required by solver to solve instance (lower is better). We monitor the (total) number of gradient evaluations, so that the computational overhead incurred by backtracking is fairly accounted for, and the number of (outer) iterations. Then, to graphically summarize our numerical results and compare different solvers, we display so-called data profiles. A data profile is the graph of the cumulative distribution function of the evaluation metric, namely . As such, each data profile reports the fraction of problems solved by a solver within a given budget of the evaluation metric, and is therefore independent of the other solvers.
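In this notation, a data profile for one solver can be computed in a few lines; the sketch below assumes unsolved instances are recorded with infinite cost:

```python
def data_profile(metric, budgets):
    """Data profile for one solver: fraction of problems solved within each
    budget of the evaluation metric.  `metric` maps each problem instance to
    the cost incurred by the solver (float('inf') for unsolved instances)."""
    n = len(metric)
    return [sum(1 for cost in metric.values() if cost <= t) / n
            for t in budgets]
```

Because each profile involves a single solver's costs, it can be read off independently of which other solvers appear in the comparison.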
5.2 Nonnegative PCA
Principal component analysis (PCA) aims at estimating the direction of maximal variability of a high-dimensional dataset. Imposing nonnegativity of entries as prior knowledge, we address PCA restricted to the positive orthant:
(13)
This task falls within the scope of (1), with , , and , and has been considered in [12] for validating IPprox and tuning its hyperparameters.
Setup
We generate synthetic problem data as in [12, §5.2]. For a problem size , let , where is a random symmetric noise matrix, is the true (random) principal direction, and is the signal-to-noise ratio. We consider some dimensions and, for each dimension, the set of problems parametrized by and , which control the noise and sparsity level, respectively. There are 5 choices for , 4 for , and, for each set of parameters, 5 instances are generated with different problem data and starting point . Overall, each solver-settings pair is invoked on 100 different instances for each dimension .
A strictly feasible starting point is generated by sampling a uniform distribution over and projecting onto . This property is necessary for IPprox but not for Alg1. We will also test Alg1 with arbitrary initialization, in which case is generated by sampling a uniform distribution over and then projecting onto .
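For concreteness, a hypothetical instance generator in the spirit of the spiked model described above might look as follows; the exact recipe of [12, §5.2] (noise scaling, sparsity pattern) may differ, so this is illustrative only:

```python
import math
import random

def nnpca_instance(n, theta=3.0, density=0.5, seed=0):
    """Hypothetical spiked-model generator: C = theta * v v^T + W, with W a
    random symmetric noise matrix and v a sparse nonnegative unit vector
    (standing in for the true principal direction)."""
    rng = random.Random(seed)
    v = [max(rng.gauss(0.0, 1.0), 0.0) if rng.random() < density else 0.0
         for _ in range(n)]
    nrm = math.sqrt(sum(t * t for t in v)) or 1.0
    v = [t / nrm for t in v]
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            W[i][j] = W[j][i] = rng.gauss(0.0, 1.0) / math.sqrt(n)
    C = [[theta * v[i] * v[j] + W[i][j] for j in range(n)] for i in range(n)]
    return C, v

def strictly_feasible_start(n, seed=0):
    """Strictly positive unit vector: sample uniformly from (0, 1)^n and
    normalize onto the unit sphere."""
    rng = random.Random(seed)
    x = [rng.random() + 1e-9 for _ in range(n)]
    nrm = math.sqrt(sum(t * t for t in x))
    return [t / nrm for t in x]
```

The starting point is strictly feasible for (13) by construction: all entries are positive and the vector lies on the unit sphere.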
Barriers and subsolvers
Algorithm 1 is controlled by, and its performance depends on, several algorithmic hyperparameters, such as the (sequences of) barrier and penalty parameters, the choice of barrier function , and the subsolver adopted at Step 1.3. We now focus on the effect of the last two elements, for different levels of accuracy requirements, testing all combinations of barriers and subsolvers considering problem dimensions , for a total of 500 calls to each solver. For this set of experiments we set a time limit of 100 seconds on each call.
The results are graphically summarized in Fig. 4. As IPprox performed almost identically with the two barrier functions, only the inverse variant (originally considered in [12]) is displayed for clarity. Moreover, because of the excessive run time to perform all simulations for IPprox with high accuracy, we exclude it altogether for the high accuracy tests and consider instead starting points that are not necessarily (strictly) feasible.
For low and medium accuracy, all instances are solved by Alg1 and IPprox up to the desired primal-dual tolerances. With high accuracy, only the variant of Alg1 with NMPG and log-like is not able to solve all instances within the time limit. Across all accuracy levels, Alg1 PANOC+ inverse performs consistently better than the other variants of Alg1, all of which outperform IPprox. In particular, the overall effort (number of gradient evaluations) required by PANOC+ is lower than that of NMPG (for a fixed barrier), and with the inverse barrier it is lower than with the log-like one (for a fixed subsolver). With increasing accuracy it becomes more efficient to adopt PANOC+ than NMPG, while IPprox performs progressively worse. The slow tail convergence typical of first-order schemes badly affects the scalability of IPprox, whereas the quasi-Newton scheme within PANOC+ seems to beat the simpler spectral approximation in NMPG.
In terms of (outer) iterations, the results appear, unsurprisingly, independent of the subsolver. Moreover, the log-like barrier invariably demands the solution of fewer subproblems than the inverse one, in agreement with the discussion in Section 4.1 on the well-behavedness of barriers. However, we emphasize that the overall computational effort (measured in terms of gradient evaluations) also depends on the subsolver's efficiency in solving the subproblems, as demonstrated by Fig. 4.
Problem size and accuracy
To investigate scalability and influence of accuracy requirements, we consider instances of (13) with dimensions and tolerances , without time limit. For each of these tolerance parameters, we test Alg1 with PANOC+ and the log-like barrier. For this test, we generate 5 instances (as described above) for each set of parameters, leading to a total of 500 problem instances to be solved with increasing accuracy.
All instances are solved up to the desired primal-dual tolerances. The influence of problem size and tolerance is depicted in Fig. 5, which displays for each pair the number of gradient evaluations with a jitter plot (for a better visualization of the distribution of numerical values over categories). The empirical cumulative distribution function with the associated median value are also indicated. This chart visualizes how problem size and accuracy requirement affect the solution process, and reveals the stark effect of both and . For low accuracy, Alg1 scales relatively well with the problem size, whereas large-scale problems become prohibitive for high accuracy.
This behavior is typical of first-order methods, owing to their slow tail convergence, and we take it as a motivation for investigating the interaction between subproblems and subsolvers in future work. Nevertheless, these experiments (and those forthcoming) demonstrate Alg1's capability to handle thousands of variables and constraints in a fully nonconvex optimization landscape, witnessing a tremendous improvement over IPprox, not only in practical performance but also in ease of use.
5.3 Low-rank matrix completion
Given an incomplete matrix of (uncertain) ratings , a common task is to find a complete ratings matrix that is a parsimonious representation of , in the sense of low-rank, and such that for the entries available [19]. Let and denote the number of users and items, respectively, and let the rating by the th user for the th item range on a scale defined by constants and . Let represent the set of observed ratings, and the cardinality of . The ratings matrix could be very large and often most of the entries are unobserved, since a given user will only rate a small subset of items. Low-rankness of can be enforced by construction, with the Ansatz , as in dictionary learning. In practice, for some prescribed embedding dimension , we seek a user embedding matrix and an item embedding matrix . Each row of is a -dimensional vector representing user , while each row of is a -dimensional vector representing item . We address the joint completion and factorization of the ratings matrix , encoded in the following form:
(14)
While aiming at , the model in (14) sets the rating range as a hard constraint for all predictions; a tighter constraint is imposed to observed ratings. Following [25, §6.2], we explicitly constrain the norm of the dictionary atoms , without loss of generality, to reduce the number of equivalent (up to scaling) solutions; this norm specification is included as an indicator in the nonsmooth objective term . Furthermore, we encourage sparsity of the coefficient representation with the penalty, which counts the nonzero elements, scaled with a regularization parameter . Overall, this problem has decision variables and bilateral constraints. All terms (, , and ) are nonconvex, as well as the (unbounded) feasible set.
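To fix ideas, the smooth data-fit term of a factorization model like (14) can be sketched as below; constraint handling, the regularization term, and the norm specification on the dictionary atoms are omitted:

```python
def completion_loss(U, V, ratings):
    """Mean squared error over observed ratings for the factorization
    X = U V^T: U has one r-dimensional row per user, V one per item, and
    `ratings` is a list of observed triples (i, j, r_ij)."""
    r = len(U[0])  # embedding dimension
    sq_errors = [(sum(U[i][k] * V[j][k] for k in range(r)) - r_ij) ** 2
                 for (i, j, r_ij) in ratings]
    return sum(sq_errors) / len(sq_errors)
```

Only observed entries contribute, so the cost of one evaluation scales with the number of ratings rather than with the (much larger) full matrix.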
It appears nontrivial to find a strictly feasible point for (14), in the sense of [12, Def. 2], which is required for initializing IPprox, thus highlighting a major advantage of Alg1.
Setup
We consider the MovieLens 100k dataset, available at https://grouplens.org/datasets/movielens/100k/, which contains ratings for unique movies (the dataset contains some repetitions in movie ratings and we have ignored them); these recommendations were made by users on a discrete rating scale from to . If we construct a matrix of movie ratings by the users, then it is a sparse unstructured matrix with only of the total entries available.
We compare Alg1 and ALPS and test their scalability with instances of increasing size. We fix the number of atoms to and consider the instances of (14) corresponding to subsets of users (always starting from the first one). For these problem instances the sizes range and . We set the regularization parameter and invoke each solver with the primal-dual tolerances and without time limit. For each problem instance, we randomly generated starting points, for a total of calls to each solver variant.
Results
A summary of the numerical results is depicted in Fig. 6. For the sake of clarity we display only the variant of ALPS with NMPG, since it performed better than with PANOC+. The variants of Alg1 with the inverse barrier are also not included, as their profiles were intermediate between those of Alg1 with the log-like barrier. All solver variants were always able to find a solution up to the primal-dual tolerance. Although not all variants of Alg1 consistently outperform ALPS, Alg1 NMPG log-like finishes ahead in most calls; the advantage, however, becomes insignificant for larger instances. Alg1 PANOC+ log-like appears comparable to ALPS for smaller instances, but then exhibits a performance much more sensitive to the starting point (its shaded area in Fig. 6 is significantly larger for larger instances). This behavior may stem from the more complicated mechanisms (search direction and globalization with line search) employed in PANOC+. These observations are in agreement with those of Section 5.2 and highlight the potential benefits of subsolvers tailored to the subproblems' structure.
5.4 Exact penalty behavior
This subsection is dedicated to the penalty-barrier behavior of Alg1. When addressing the nonconvex problem (14), we observed that the sequences of penalty parameters remain constant (equal to the initial value ) for all instances and solver variants. This is enabled by the relaxed condition at Step 1.10 of Algorithm 1, which does not require a sufficient improvement at every iteration, but instead monitors globally how the constraint violation vanishes. Correspondingly, only the barrier parameter is decreased in order to reduce the complementarity slackness , see Item 3. When active, this exact penalty quality prevents the barrier from yielding too much ill-conditioning, exploiting Item 3.
Even in the fully nonconvex problem of the previous Section 5.3, the sequence of penalty parameters generated by Alg1 remains always bounded. Although this indicates that the assumptions behind Item 2 could be relaxed, the penalty exactness does not always take effect. We illustrate now an example problem where Alg1 exhibits , hence it does not boil down to an exact penalty method. For this purpose it suffices to consider the two-dimensional convex problem
(15)
whose (unique) solution is the only feasible point which is however not 2.2-optimal (there exists no suitable multiplier ).
We intend to solve problem (15) with tolerances starting from 100 random initializations, generated according to with large standard deviation .
Results
All solver variants find a primal-dual solution to (15), up to the tolerances, for all starting points. The unbounded behavior of the penalty parameters appears evident in Fig. 7. Thus, seems necessary to drive the constraint violation to zero, while the barrier parameter forces the complementarity slackness .
The numerical performance of Alg1 and ALPS are summarized in Fig. 8, where ALPS is displayed only in the NMPG variant (which performed better than with PANOC+). Considering both the overall effort (number of gradient evaluations) and the number of subproblems needed by Alg1, the log-like barrier yields better results than the inverse (for fixed subsolver) and NMPG is more efficient than PANOC+ (for fixed barrier). Moreover, Alg1 performs consistently better than ALPS, despite the lack of penalty exactness.
5.5 Handling equalities
Even though bilateral constraints can be handled explicitly, as examined in Section 4.3, it is important that Alg1 can cope with hidden equalities too. These may appear as the result of automatic model constructions, and are often difficult to identify by inspection. Here we compare the behavior of Alg1 on a problem specification with explicit equalities against the same problem with each equality constraint described by two inequalities. Consider possibly nonconvex quadratic programming (QP) problems of the form
(16)
with matrices , and vectors , as problem data. Problem (16) can be cast as (10) with , , and . We are interested in comparing the performance of Alg1 (in different variants) with the two problem formulations described in Section 4.3 to deal with equalities: either by splitting into two inequalities (leading to defined in (9)) or by performing a combined marginalization (resulting in given by (11)). Hence, for each solver’s variant and problem instance, we contrast these two formulations, symbolized by Alg1± and Alg1, respectively.
Setup
Problem instances are generated as follows: we let where the elements of are normally distributed, , with only 10% being nonzero. The linear part of the cost is also normally distributed, i.e., . Simple bounds are generated according to a uniform distribution, i.e., and . We set the elements of as with only 10% being nonzero. To ensure that the problem is feasible, we draw an element (as , ) and set . An initial guess is randomly generated for each problem instance, as , and shared across all solvers and formulations.
We consider problems with and , set the tolerances , and construct 10 instances for each size, for a total of 100 calls to each solver for each formulation.
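A generator following the recipe above might be sketched as follows; the densities and distributions are illustrative assumptions rather than the exact experimental setup:

```python
import random

def qp_instance(n, m, seed=0):
    """Random (possibly nonconvex) QP data: sparse symmetric Q, normal
    linear cost q, uniform box [l, u] straddling the origin, sparse A, and
    b = A @ xbar for some xbar inside the box, so that feasibility is
    guaranteed by construction."""
    rng = random.Random(seed)
    M = [[rng.gauss(0.0, 1.0) if rng.random() < 0.1 else 0.0
          for _ in range(n)] for _ in range(n)]
    Q = [[M[i][j] + M[j][i] for j in range(n)] for i in range(n)]  # symmetrize
    q = [rng.gauss(0.0, 1.0) for _ in range(n)]
    l = [-rng.random() for _ in range(n)]
    u = [rng.random() for _ in range(n)]
    A = [[rng.gauss(0.0, 1.0) if rng.random() < 0.1 else 0.0
          for _ in range(n)] for _ in range(m)]
    xbar = [l[i] + rng.random() * (u[i] - l[i]) for i in range(n)]
    b = [sum(A[k][i] * xbar[i] for i in range(n)) for k in range(m)]
    return Q, q, l, u, A, b
```

Symmetrizing a sparse random matrix keeps Q indefinite in general, so the resulting QP is typically nonconvex, matching the setting of (16).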
Results
Numerical results are visualized by means of pairwise (extended) performance profiles. Let and denote the evaluation metric of solver on a certain instance with the two formulations. Then, for each solver , the corresponding pairwise performance profile displays the cumulative distribution of its performance ratio , namely
Thus, the profile for solver indicates the fraction of problems for which solver invoked by Alg1 requires at most times the computational effort needed by the same solver invoked by Alg1±. As depicted in Fig. 9, all pairwise performance profiles cross the unit ratio with at least 83% of the problems solved, meaning that all solver variants benefit from the tailored handling of equality constraints as in Section 4.3. All variants are nevertheless robust to degenerate formulations, confirming that our algorithmic framework can endure redundant constraints and hidden equalities.
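The pairwise profile just described amounts to the empirical distribution of per-instance cost ratios; a minimal sketch:

```python
def pairwise_profile(cost_a, cost_b, taus):
    """Pairwise (extended) performance profile: for each threshold tau,
    the fraction of instances on which variant A needed at most tau times
    the effort of variant B on the same instance.  cost_a and cost_b are
    aligned per-instance cost lists."""
    ratios = [a / b for a, b in zip(cost_a, cost_b)]
    n = len(ratios)
    return [sum(1 for r in ratios if r <= tau) / n for tau in taus]
```

Reading off the profile value at ratio 1 gives the fraction of instances on which variant A was at least as cheap as variant B.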
6 Final remarks and open questions
We proposed an optimization framework for the numerical solution of constrained structured problems in the fully nonconvex setting. We went beyond a simple combination of (exact) penalty and barrier approaches by taking a marginalization step, which not only allows us to reduce the problem size but also enables the adoption of generic subsolvers. In particular, by extending the domain of the subproblems' smooth objective term, the proposed methodology overcomes the need for safeguards within the subsolver and the difficulty of accelerating it, a major drawback of IPprox [12]. Under mild assumptions, our theoretical analysis established convergence results on par with those typical for nonconvex nonsmooth optimization. Most notably, all feasible accumulation points are asymptotically KKT optimal. We tested our approach numerically on problems arising in data science, studying scalability and the effect of accuracy requirements. Furthermore, illustrative examples confirmed the robust behavior on badly formulated problems and degenerate cases.
The methodology in this paper could be applied to, and compared with, a combination of barrier and augmented Lagrangian approaches. By generating a smoother penalty-barrier term, this strategy could benefit from the more effective performance of subsolvers. However, this development comes with the additional challenge of designing suitable updates for the Lagrange multiplier. Future research may also focus on specializing the proposed framework to classical nonlinear programming, taking advantage of the special structure and linear algebra. Finally, mechanisms for rapid infeasibility detection and guaranteed existence of subproblems’ solutions should be investigated.
Appendix A Auxiliary results
This appendix contains some auxiliary results and proofs of statements referred to in the main body.
Lemma A.1 (Properties of the barrier ).
Any function as in Section 3.2 satisfies the following:
-
1.
and .
-
2.
The conjugate ∗ is continuously differentiable on the interior of its domain with , and satisfies and .
-
3.
for any .
-
4.
The function , where , strictly increases from to 0.
Proof.
-
1) Trivial because of strict monotonicity on (since ).
-
2) Since , if one has that . For one directly has that , see [1, Prop. 13.10(i)], and in particular ; in addition, since with equality holding by virtue of [1, Thm. 16.29], we conclude that . For the same reason, one has that on , which proves that ∗ is strictly decreasing. Finally, since , we conclude that .
-
3) This is a standard result of Fenchel conjugacy, see e.g. [1, Prop. 16.10], here specialized to the fact that .
-
4) Strict monotonic increase follows by observing that for . Moreover,
Lastly, (17) where the first equality uses L’Hôpital’s rule. ∎
We next prove in detail the derivations of (1) and (1). Noticing that the minimization in (1) separates into that of for (after a division by a factor ), we provide the elementwise version of the claim.
Lemma A.2.
For and , let be defined as . Then,
(18a)
and
(18b)
where
(19c)
is (globally) Lipschitz differentiable and -Lipschitz continuous with derivative
(19d)
Proof.
We first consider (18a).
The properties of as in Section 3.2 ensure that is strictly convex and coercive, and that therefore it attains a unique (global) minimizer. Moreover, is differentiable on ; as such, the minimizer is the unique zero of if it exists, and 0 otherwise. Solving for gives
Therefore, the minimizer is given by
which yields the claimed expression in (18a).
Next, according to this formula, if then the minimizer of is , hence
where (i.e., such that ). Otherwise, if then the minimizer of is 0, which gives
By observing that iff we infer that
which proves (18b).
We next show the alternative expression involving the convex conjugacy for as in (19c). To this end, for a function and a number , let and denote the translation by and the reflection, respectively, and recall that and , see [1, Props. 13.23(iii)–(iv)]. We will also use the fact that holds for any pair of proper, lsc, convex functions defined on the same space, , where denotes the infimal convolution, see [1, Prop. 13.24]. Then,
and since for ,
which yields (19c), and the formula for the derivative as in (19d) is then immediately obtained.
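For the reader’s convenience, the three conjugacy rules invoked in this derivation can be recorded as follows (a recap in the notation of [1]; the sum rule holds under the qualification conditions of [1, Prop. 13.24]):

```latex
% Conjugacy rules for f, g proper, lsc, convex, and b a fixed vector:
%   translation:  (tau_b f)(x) := f(x - b)
%   reflection:   f^vee(x)     := f(-x)
\begin{align*}
  (\tau_b f)^* &= f^* + \langle \cdot, b \rangle
    && \text{[1, Prop.~13.23(iii)]} \\
  (f^\vee)^*   &= (f^*)^\vee
    && \text{[1, Prop.~13.23(iv)]} \\
  (f + g)^*    &= f^* \mathbin{\square} g^*,
  \quad
  (f^* \mathbin{\square} g^*)(y) := \inf_{u} \bigl\{ f^*(u) + g^*(y - u) \bigr\}
    && \text{[1, Prop.~13.24]}
\end{align*}
```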
To conclude, observe that is essentially differentiable (in the sense that as , 0 being the only point in the boundary of ), locally strongly convex and locally Lipschitz differentiable (having ), all these conditions also holding for the conjugate ∗ by virtue of [15, Cor. 4.4]. Therefore, is (globally) strongly convex, proving the claimed global Lipschitz-smoothness of its conjugate ∗. ∎
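The concluding step rests on the classical duality between strong convexity and Lipschitz-smoothness of the conjugate, complemented by the local version in [15, Cor. 4.4]. In the form used here (a standard statement from convex analysis):

```latex
% Duality between strong convexity and Lipschitz-smooth conjugates:
% if h is proper, lsc, and sigma-strongly convex, then h^* is
% differentiable everywhere with (1/sigma)-Lipschitz gradient.
\[
  h \ \text{proper, lsc, and } \sigma\text{-strongly convex}
  \quad \Longrightarrow \quad
  \| \nabla h^*(y) - \nabla h^*(y') \| \le \tfrac{1}{\sigma} \, \| y - y' \|
  \quad \text{for all } y, y'.
\]
```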
References
- [1] Heinz H. Bauschke and Patrick L. Combettes. Convex analysis and monotone operator theory in Hilbert spaces. CMS Books in Mathematics. Springer, 2017.
- [2] Shujun Bi and Shaohua Pan. Multistage convex relaxation approach to rank regularized minimization problems based on equivalent mathematical program with a generalized complementarity constraint. SIAM Journal on Control and Optimization, 55(4):2493–2518, 2017.
- [3] Ernesto G. Birgin and José Mario Martínez. Practical Augmented Lagrangian Methods for Constrained Optimization. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2014.
- [4] Emilie Chouzenoux, Marie-Caroline Corbineau, and Jean-Christophe Pesquet. A proximal interior point algorithm with applications to image processing. Journal of Mathematical Imaging and Vision, 62(6):919–940, 2020.
- [5] Andrew R. Conn, Nicholas I. M. Gould, and Philippe L. Toint. Trust Region Methods. Society for Industrial and Applied Mathematics, 2000.
- [6] Robert M. Corless, Gaston H. Gonnet, David E. G. Hare, David J. Jeffrey, and Donald E. Knuth. On the Lambert W function. Advances in Computational Mathematics, 5:329–359, 1996.
- [7] Frank E. Curtis. A penalty-interior-point algorithm for nonlinear constrained optimization. Mathematical Programming Computation, 4(2):181–209, 2012.
- [8] Alberto De Marchi. Proximal gradient methods beyond monotony. Journal of Nonsmooth Analysis and Optimization, 4, 2023.
- [9] Alberto De Marchi. Implicit augmented Lagrangian and generalized optimization. Journal of Applied and Numerical Optimization, 6(2):291–320, 2024.
- [10] Alberto De Marchi, Xiaoxi Jia, Christian Kanzow, and Patrick Mehlitz. Constrained composite optimization and augmented Lagrangian methods. Mathematical Programming, 201(1):863–896, 2023.
- [11] Alberto De Marchi and Andreas Themelis. Proximal gradient algorithms under local Lipschitz gradient continuity. Journal of Optimization Theory and Applications, 194(3):771–794, 2022.
- [12] Alberto De Marchi and Andreas Themelis. An interior proximal gradient method for nonconvex optimization. Open Journal of Mathematical Optimization, 2024. To appear.
- [13] Neil K. Dhingra, Sei Zhen Khong, and Mihailo R. Jovanović. The proximal augmented Lagrangian method for nonsmooth composite optimization. IEEE Transactions on Automatic Control, 64(7):2861–2868, 2019.
- [14] Anthony V. Fiacco and Garth P. McCormick. The sequential unconstrained minimization technique for nonlinear programming, a primal-dual method. Management Science, 10(2):360–366, 1964.
- [15] Rafal Goebel and R. Tyrrell Rockafellar. Local strong convexity and local Lipschitz continuity of the gradient of convex functions. Journal of Convex Analysis, 15(2):263–270, 2008.
- [16] Nadav Hallak and Marc Teboulle. An adaptive Lagrangian-based scheme for nonconvex composite optimization. Mathematics of Operations Research, 48(4):2337–2352, 2023.
- [17] Ben Hermans, Andreas Themelis, and Panagiotis Patrinos. QPALM: A proximal augmented Lagrangian method for nonconvex quadratic programs. Mathematical Programming Computation, 14:497–541, 2022.
- [18] Geoffroy Leconte and Dominique Orban. An interior-point trust-region method for nonsmooth regularized bound-constrained optimization, 2024.
- [19] Jakub Mareček, Peter Richtárik, and Martin Takáč. Matrix completion under interval uncertainty. European Journal of Operational Research, 256(1):35–43, 2017.
- [20] Edward J. McShane. Extension of range of functions. Bulletin of the American Mathematical Society, 40(12):837–842, 1934.
- [21] R. Tyrrell Rockafellar. Convergence of augmented Lagrangian methods in extensions beyond nonlinear programming. Mathematical Programming, 199(1):375–420, 2022.
- [22] R. Tyrrell Rockafellar and Roger J. B. Wets. Variational Analysis, volume 317. Springer, 1998.
- [23] Ajay S. Sathya, Pantelis Sopasakis, Ruben Van Parys, Andreas Themelis, Goele Pipeleers, and Panos Patrinos. Embedded nonlinear model predictive control for obstacle avoidance using PANOC. In 2018 European Control Conference (ECC), pages 1523–1528, 2018.
- [24] Lorenzo Stella, Andreas Themelis, Pantelis Sopasakis, and Panagiotis Patrinos. A simple and efficient algorithm for nonlinear model predictive control. In 2017 IEEE 56th Annual Conference on Decision and Control (CDC), pages 1939–1944. IEEE, 2017.
- [25] Andreas Themelis, Lorenzo Stella, and Panagiotis Patrinos. Forward-backward envelope for the sum of two nonconvex functions: Further properties and nonmonotone linesearch algorithms. SIAM Journal on Optimization, 28(3):2274–2303, 2018.
- [26] Andreas Wächter and Lorenz T. Biegler. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming, 106(1):25–57, 2006.