Noname manuscript No. (will be inserted by the editor)

Optimal subgradient algorithms with application to large-scale linear inverse problems

Masoud Ahookhosh
Faculty of Mathematics, University of Vienna, Oskar-Morgenstern-Platz 1, 1090 Vienna, Austria
E-mail: masoud.ahookhosh@univie.ac.at

Received: date / Accepted: date

Abstract This study addresses some algorithms for solving structured unconstrained convex optimization problems using first-order information where the underlying function includes high-dimensional data. The primary aim is to develop an implementable algorithmic framework for solving problems with multi-term composite objective functions involving linear mappings using the optimal subgradient algorithm, OSGA, proposed by Neumaier in [49]. To this end, we propose some prox-functions for which the corresponding subproblem of OSGA is solved in a closed form. Considering various inverse problems arising in signal and image processing, machine learning and statistics, we report extensive numerical results and comparisons with several state-of-the-art solvers showing the favourable performance of our algorithm. We also compare with the most widely used optimal first-order methods for some smooth and nonsmooth convex problems. Surprisingly, when some Nesterov-type optimal methods originally proposed for smooth problems are adapted to solve nonsmooth problems by simply passing a subgradient instead of the gradient, the results of these subgradient-based algorithms are competitive and quite promising for solving nonsmooth problems. Finally, the OSGA software package is available.

Keywords Structured convex optimization · Linear inverse problems · Nonsmooth optimization · Sparse optimization · First-order black-box oracle · Optimal complexity · Subgradient methods · High-dimensional data · Compressed sensing · Image restoration

Mathematics Subject Classification (2000) 90C25 · 90C60 · 49M37 · 65K05

1 Introduction

In many applications, e.g., those arising in signal and image processing, machine learning, compressed sensing, geophysics and statistics, key features cannot be studied by straightforward investigations, but must be indirectly inferred from some observable quantities. Because of this characteristic, such problems are usually referred to as inverse problems. If a linear relationship between the features of interest and the observed data can be prescribed, this leads to linear inverse problems. If $y \in \mathbb{R}^m$ is an indirect observation of an original object $x \in \mathbb{R}^n$ and $A : \mathbb{R}^n \to \mathbb{R}^m$ is a linear operator, then the linear inverse problem is defined by

$y = Ax + \nu$,   (1)

where $\nu \in \mathbb{R}^m$ represents an additive noise about which little is known apart from some qualitative knowledge.

Motivation & history. In practice, the system (1) is typically underdetermined, rank-deficient or ill-conditioned. The primary difficulty with linear inverse problems is that the recovered object is extremely sensitive to perturbations of $y$ due to small or zero singular values of $A$, meaning that ill-conditioned systems behave much like singular ones. Indeed, even when the inverse $A^{-1}$ (for square problems) or the pseudo-inverse $A^{\dagger} = (A^* A)^{-1} A^*$ (for full-rank overdetermined systems) exists, an analysis of the singular value decomposition shows that $\tilde{x} = A^{-1}y$ or $\tilde{x} = A^{\dagger}y$ is an inaccurate and meaningless approximation of $x$, see [50]. Furthermore, when the noise vector $\nu$ is not known, one cannot solve (1) directly. From the underdetermined or rank-deficient nature of inverse problems, we know that if a solution exists, then infinitely many solutions exist.
Hence some additional information is required to determine a satisfactory solution of (1). One is usually interested in determining a solution by minimizing the residual $\|Ax - y\|_2$, leading to the well-known least-squares problem

$\min_{x \in \mathbb{R}^n} \ \tfrac{1}{2}\|Ax - y\|_2^2$,   (2)

where $\|\cdot\|_2$ denotes the $\ell_2$-norm. In view of the ill-conditioned nature of problem (1), the solution of (2) is usually unsatisfactory. Hence Tikhonov in [53] proposed the following penalized minimization problem

$\min_{x \in \mathbb{R}^n} \ \tfrac{1}{2}\|Ax - y\|_2^2 + \tfrac{\lambda}{2}\|x\|_2^2$,   (3)

where $\lambda$ is the regularization parameter that controls the trade-off between the least-squares data-fitting term and the regularization term. The reformulated problem is smooth and convex, and selecting a suitable regularization parameter leads to a well-posed problem.

In many applications, we seek the sparsest solution of (1) among all the solutions, provided that $y$ is acquired from a highly sparse observation. To do so, the following problem is proposed

$\min \ \|x\|_0 \quad \text{s.t.} \quad \|Ax - y\| \le \epsilon$,

where $\|x\|_0$ is the number of nonzero elements of $x$ and $\epsilon$ is a nonnegative constant. It is well known that this objective function is nonconvex; its convex relaxation is therefore much more attractive. As investigated in [18], this is achievable by considering the problem

$\min \ \|x\|_1 \quad \text{s.t.} \quad \|Ax - y\| \le \epsilon$,

or its unconstrained version

$\min_{x \in \mathbb{R}^n} \ \tfrac{1}{2}\|Ax - y\|_2^2 + \lambda\|x\|_1$,   (4)

which is referred to as basis pursuit denoising or lasso. Due to the non-differentiability of the $\ell_1$-norm, this problem is nonsmooth, and solving it is more difficult than solving (3). In general, linear inverse problems can be modelled as a special case of the following affine composite minimization problem

$\min_{x \in \mathbb{R}^n} \ f(Ax) + \varphi(Wx)$.   (5)

Usually, $f$ is a smooth convex loss function defined by a black-box oracle and $\varphi$ is a regularizer or regularization function, which is usually nonsmooth and possibly nonconvex. This means that the general underlying regularizer can be smooth or nonsmooth and convex or nonconvex. In this paper, we consider structured convex optimization problems of the form (5) in which the regularizers can be smooth or nonsmooth convex terms, so nonconvex regularizers are not the subject of this study.

Over the past few decades, because many applications of inverse problems have emerged in applied sciences and engineering, solving structured composite optimization problems of the form (5) has attracted much research interest. To exploit different features of objective functions, a variety of techniques for solving this class of problems have been introduced, where the proposed optimization schemes employ particular structures of the objective functions. It is also known that solving nonsmooth problems is usually more difficult than solving smooth problems, even for unconstrained cases. In the meantime, black-box convex optimization is theoretically and computationally regarded as a mature area of optimization and is frequently employed to solve problems with a large number of variables appearing in various applications. Since problems of the form (5) mostly involve high-dimensional data, optimization schemes should avoid costly operations and large amounts of memory. Therefore, the first-order black-box oracle is intensively used to construct appropriate schemes for handling large-scale problems, where just function and subgradient evaluations are needed.
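To illustrate how cheap such an oracle is for problems like (4), the following MATLAB sketch evaluates the lasso objective and one of its subgradients in a single pass; the function name and interface are chosen for illustration only and are not taken from any package referenced here.

    % First-order black-box oracle for the lasso problem (4):
    %   Psi(x) = 0.5*||A*x - y||_2^2 + lambda*||x||_1.
    % Returns the function value and one subgradient at x.
    function [fx, gx] = lasso_oracle(A, y, lambda, x)
        r  = A*x - y;                        % residual, reused below
        fx = 0.5*(r'*r) + lambda*norm(x, 1);
        % sign(x) is a valid subgradient of ||x||_1 (it picks 0 at zero entries),
        % so A'*r + lambda*sign(x) is a subgradient of Psi at x.
        gx = A'*r + lambda*sign(x);
    end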
Popular first-order schemes include gradient-based algorithms [25, 45, 46, 48], subgradient-based schemes [8, 12, 39, 42, 47], bundle-type procedures [35], smoothing techniques [9, 43] and proximal gradient methods [9, 11, 52].

In the context of convex optimization, the analytical complexity refers to the number of calls of an oracle, for example iterations or function and subgradient evaluations, needed to reach an $\varepsilon$-solution of the optimum; for more information see [39, 42]. In 1983, Nemirovski & Yudin in [39] proved that the lower complexity bound of first-order methods for finding an $\varepsilon$-solution of the optimum for smooth convex functions with Lipschitz continuous gradient is $O(1/\sqrt{\varepsilon})$, while for Lipschitz continuous nonsmooth convex functions it is $O(1/\varepsilon^2)$. An interesting feature of these bounds is that they are independent of the problem's dimension. For a given class of problems, an algorithm is called optimal if it achieves the corresponding worst-case complexity. In the case of nonsmooth convex problems, the well-known subgradient descent and mirror descent methods are optimal, see [39]. The seminal optimal first-order method for smooth convex problems with Lipschitz continuous gradient was introduced by Nesterov [40] in 1983, and since then he has developed the idea of optimal methods for various classes of problems in [41, 42, 46, 48].

It is clear that the gap between the complexities of smooth and nonsmooth convex problems is considerably large. This fact motivated Nesterov to develop schemes with better performance for nonsmooth convex problems, the so-called smoothing techniques, whose worst-case complexity is of order $O(1/\varepsilon)$ and can be improved using the idea of optimal gradient methods, see [43, 44, 45]. The theoretical analysis and surprising computational results of optimal methods are very attractive, especially for solving large-scale problems. These appealing features, along with the increasing interest in solving large-scale problems, have attracted many authors to develop the idea of optimal first-order methods, for example Beck & Teboulle [8], Becker et al. [11], Devolder et al. [26], Gonzaga et al. [31, 32], Lan [35], Meng & Chen [37], Nemirovski & Nesterov [38] and Tseng [54]. In most of the studies mentioned, the first difficulty in implementation is that the algorithms need knowledge of global parameters such as the Lipschitz constant of the objective function for nonsmooth problems and the Lipschitz constant of its gradient in smooth cases. Nesterov in [48] proposed an adaptive procedure to approximate the Lipschitz constant, but it still needs a suitable initial guess. Recently, Lan in [35], by employing bundle techniques, avoided the need for these global parameters. More recently, Neumaier in [49] introduced a novel optimal algorithm based on incorporating a linear relaxation of the objective function and a prox-function into a fractional subproblem, yielding a framework that can be employed for both smooth and nonsmooth convex optimization at the same time while avoiding the need for Lipschitz constants. The resulting method, called OSGA, only needs a first-order black-box oracle. In the case that no strong convexity is assumed, it shares the uniformity, freedom from global parameters and complexity properties of Lan's approach, but has a far simpler structure and derivation.
Thanks to achieving the optimal complexity of first-order methods, it can be regarded as a fully adaptive alternative to Nesterov-type optimal methods. Furthermore, the low memory requirements of this algorithm make it appropriate for solving large-scale practical problems.

Contribution. The main focus of this paper is to extend and develop the basic idea of the optimal subgradient algorithm for minimizing general composite functions to deal with structured convex optimization problems arising in applications. There exist many applications of linear inverse problems with a complicated objective function consisting of various linear operators and regularizers, which are not supported by algorithms and solvers using the proximal mapping in their structures. For example, consider the scaled lasso problem, defined the same as (4) but with the regularization term $\|x\|_1$ replaced by $\|Wx\|_1$ for a general linear operator $W$. In this case, the following proximal mapping should be solved as an auxiliary subproblem

$\mathrm{prox}_{\lambda}(y) = \operatorname*{argmin}_{x \in \mathbb{R}^n} \ \tfrac{1}{2}\|x - y\|_2^2 + \lambda\|Wx\|_1$.

Indeed, this proximity problem cannot be solved explicitly by iterative shrinkage-thresholding except in the case that $W$ is orthonormal, i.e., it cannot be handled by the usual proximal-based algorithms. Thanks to the modest requirements of OSGA, which needs just function and subgradient evaluations, most such complicated problems become tractable at reasonable cost. Furthermore, some appropriate prox-functions are proposed for unconstrained problems of the form (5) such that the corresponding subproblem of OSGA can be solved explicitly. We also adapt some Nesterov-type optimal methods originally proposed for smooth problems to solve nonsmooth problems by passing a subgradient instead of the gradient. The surprising results indicate that the adapted algorithms perform competitively with, or even better than, optimal methods specifically proposed for solving nonsmooth problems. Finally, the paper is accompanied by a software release.

Contents. The rest of the paper is organized as follows. In the next section, we establish the basic idea of the optimal subgradient algorithm for general affine composite functions. Section 3 is devoted to implementing OSGA and comparing it with some popular first-order methods and state-of-the-art solvers on practical problems arising in applications. The OSGA software package is described in Section 4. Finally, we conclude our results in Section 5.

2 Optimal subgradient framework

In this section, we give a short sketch of the optimal subgradient framework introduced by Neumaier in [49], adapted to composite functions in the presence of linear operators, which is appropriate for solving the considered applications of linear inverse problems. This paper considers composite functions of the form (5), or the more general form discussed in the sequel; constrained problems can still be handled by means of an indicator function. Let us consider the following convex constrained problem

$\min_{x \in C} \ f(Ax)$,   (6)

where $f : C \to \mathbb{R}$ is a convex function defined on a nonempty convex subset $C$ of a finite-dimensional real vector space $X$ with a bilinear pairing $\langle h, z \rangle$, where $h$ belongs to the dual space $X^*$ formed by all linear functionals on $X$. The indicator function of the set $C$ is defined by

$I_C(x) = \begin{cases} 0 & x \in C, \\ \infty & \text{otherwise}. \end{cases}$   (7)

Since $C$ is a convex set, it is straightforward to show that $I_C$ is a convex function.
Thus, by setting $\varphi(x) = I_C(x)$, we can reformulate problem (6) in the equivalent unconstrained form (5), which is the problem of interest in the current paper. In some applications, one has no easy access to the matrices appearing in (5); instead, the mappings are assembled in subroutines without ever forming explicit matrices, so it is preferable to employ linear operators in (5) rather than their matrix representations.

We generally consider functions of the form

$\Psi(x) = \sum_{i=1}^{n_1} f_i(A_i x) + \sum_{j=1}^{n_2} \varphi_j(W_j x)$,   (8)

where $f_i : U_i \to \mathbb{R}$ for $i = 1, 2, \dots, n_1$ are convex smooth functions, $\varphi_j : V_j \to \mathbb{R}$ for $j = 1, 2, \dots, n_2$ are smooth or nonsmooth convex functions, and $A_i : X \to U_i$ for $i = 1, 2, \dots, n_1$ and $W_j : X \to V_j$ for $j = 1, 2, \dots, n_2$ are linear operators. Here, $\Psi$ is called a multi-term affine composite function. Since each affine term $A_i x$ or $W_j x$ and each function $f_i$ or $\varphi_j$ for $i = 1, \dots, n_1$ and $j = 1, \dots, n_2$ is convex, and the domain of $\Psi$, defined by

$\mathrm{dom}\,\Psi = \left( \bigcap_{i=1}^{n_1} \mathrm{dom}\, f_i(A_i\,\cdot) \right) \cap \left( \bigcap_{j=1}^{n_2} \mathrm{dom}\, \varphi_j(W_j\,\cdot) \right)$,

is a convex set, the function $\Psi$ is convex on its domain. Considering (8), we are generally interested in solving the following composite convex optimization problem

$\Psi^* := \min_{x \in X} \ \Psi(x)$.   (9)

Under the stated properties of $\Psi$, the problem (9) has a global optimizer denoted by $x^*$, and $\Psi^* = \Psi(x^*)$ is its global optimum. Functions of the form $\Psi$ appear frequently in recent work employing hybrid regularizations or mixed penalty functions for solving problems in fields such as signal and image processing, machine learning and statistics. In the sequel, we will see that many well-studied structured optimization problems are special cases of (8). It is also clear that (9) is a generalization of (5). For example, the following scaled elastic net problem

$\Psi(x) = \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda_1\|W_1 x\|_1 + \tfrac{\lambda_2}{2}\|W_2 x\|_2^2$   (10)

can be considered in the form (8) by setting $\Psi(x) = f_1(A_1 x) + \varphi_1(W_1 x) + \varphi_2(W_2 x)$, where $f_1(A_1 x) = \tfrac{1}{2}\|A_1 x - b\|_2^2$, $\varphi_1(W_1 x) = \lambda_1\|W_1 x\|_1$ and $\varphi_2(W_2 x) = \tfrac{\lambda_2}{2}\|W_2 x\|_2^2$.

Basic idea of optimal subgradient scheme. This section summarizes the basic idea of OSGA [49], adapted to the considered unconstrained problem (9). Let us consider an arbitrary convex function $\Psi : X \to \mathbb{R}$ of the form (8). A vector $g_\Psi(x) \in X^*$ is called a subgradient of $\Psi$ at a given point $x \in \mathrm{dom}\,\Psi$ if for any $z \in \mathrm{dom}\,\Psi$ we have

$\Psi(z) \ge \Psi(x) + \langle g_\Psi(x), z - x \rangle$.

The set of all subgradients of the function $\Psi$ at $x$ is called the subdifferential of $\Psi$ at $x$, denoted by $\partial\Psi(x)$. The subdifferential of a convex function $\Psi$ at any point $x$ in its domain is nonempty, and it reduces to a single element, the gradient, if $\Psi$ is differentiable at $x$. We now consider a minimization problem with objective of the form (8) and suppose that the first-order black-box oracle

$O(x) = (\Psi(x), g_\Psi(x))$

is available, where $g_\Psi(x) \in \partial\Psi(x)$ at $x \in X$. The primary objective of the optimal subgradient algorithm (OSGA) is to construct a sequence of iterates such that the error $\Psi(x_b) - \Psi^*$ decreases monotonically until it falls below an accuracy parameter $\varepsilon$, i.e.,

$0 \le \Psi_b - \Psi^* \le \varepsilon$,

where $\Psi_b := \Psi(x_b)$ is the function value at the best known point $x_b$. To derive a framework for OSGA, let us define the following affine global underestimator, which is a linear relaxation of $\Psi$:

$\Psi(z) \ge \gamma + \langle h, z \rangle$   (11)

for all $z \in \mathrm{dom}\,\Psi$, with $\gamma \in \mathbb{R}$ and $h \in X^*$; such a relaxation is clearly available thanks to the first-order condition for the convex function $\Psi$.
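To make the construction of such a pair $(\gamma, h)$ concrete, note that a single call of the oracle already yields a valid linear relaxation; the following worked instance (for the case without strong convexity, i.e., $\mu = 0$ in the algorithm below) is only meant as an illustration of (11), not as the updating rule OSGA actually uses:

$\Psi(z) \;\ge\; \Psi(x_0) + \langle g_\Psi(x_0), z - x_0 \rangle
        \;=\; \underbrace{\Psi(x_0) - \langle g_\Psi(x_0), x_0 \rangle}_{\gamma}
              \;+\; \big\langle \underbrace{g_\Psi(x_0)}_{h},\, z \big\rangle
        \qquad \text{for all } z \in \mathrm{dom}\,\Psi.$

During the iterations, OSGA refines such a pair by taking convex combinations of the relaxations collected at the generated points.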
We also consider a continuously differentiable prox-function $Q : X \to \mathbb{R}$, i.e., a strongly convex function with convexity parameter $\sigma$ satisfying

$Q(\lambda x + (1 - \lambda)z) + \tfrac{\sigma}{2}\lambda(1 - \lambda)\|x - z\|^2 \le \lambda Q(x) + (1 - \lambda)Q(z)$   (12)

for all $x, z \in X$ and $\lambda \in (0, 1)$, where $\|\cdot\|$ is an arbitrary norm on the vector space $X$. This is equivalent to

$Q(z) \ge Q(x) + \langle g_Q(x), z - x \rangle + \tfrac{\sigma}{2}\|z - x\|^2$   (13)

for all $x, z \in X$, where $g_Q(x)$ denotes the gradient of $Q$ at $x$. Furthermore, we suppose that

$Q_0 := \inf_{z \in X} Q(z) > 0$.   (14)

The point $z_0 \in X$ minimizing the problem (14) is called the center of $Q$. At the end of this section, we will propose some prox-functions satisfying these conditions. Afterwards, the following minimization subproblem is defined:

$E(\gamma, h) := -\inf_{z \in \mathrm{dom}\,\Psi} \ \dfrac{\gamma + \langle h, z \rangle}{Q(z)}$.   (15)

By the definition of prox-functions, it is clear that the denominator $Q(z)$ is positive and bounded away from zero on the feasible region $\mathrm{dom}\,\Psi$. The solution of the subproblem (15) is denoted by $u := U(\gamma, h)$, and it is supposed that $e := E(\gamma, h)$ and $u$ can be computed efficiently. In particular, for the unconstrained problem (8), we will show that this subproblem can be solved explicitly and cheaply for some suitable prox-functions.

Theoretical analysis. We now establish a bound on the error $\Psi(x_b) - \Psi^*$ for the proposed scheme. The definition of $e = E(\gamma, h)$ in (15), for arbitrary $\gamma_b \in \mathbb{R}$ and $h \in X^*$, simply implies that

$\gamma_b + \langle h, z \rangle \ge -E(\gamma_b, h)\,Q(z)$ for all $z \in X$.

From (11), setting $\gamma_b = \gamma - \Psi(x_b)$, we have

$\Psi(z) - \Psi(x_b) \ge \gamma_b + \langle h, z \rangle$   (16)

for all $z \in X$. By this fact, assuming $E(\gamma_b, h) \le \eta$ and setting $z = x^*$ in (16), we conclude the following upper bound on the function value error:

$0 \le \Psi(x_b) - \Psi^* \le \eta\,Q(x^*)$.   (17)

It is clear that if an upper bound for $Q(x^*)$ is known or assumed, this bound translates into a computable error estimate for the minimal function value. But even in the absence of such an upper bound, the optimization problem (8) can be solved with a target accuracy

$0 \le \Psi(x_b) - \Psi^* \le \varepsilon\,Q(x^*)$   (18)

if one manages to decrease the error factor $\eta$ from its initial value until $\eta \le \varepsilon$, for some target accuracy tolerance $\varepsilon$. This bound clearly means that the rate of decrease of the error $\Psi(x_b) - \Psi^*$ is at least the same as the convergence rate of the sequence $\eta$. Indeed, this fact is the main motivation of OSGA. Neumaier in [49] considers a problem of the form (6) and derives the following complexity bounds for OSGA, matching the optimal complexity bounds of first-order methods for smooth and nonsmooth problems, see [39, 42].

Theorem 1 Suppose that $f$ is a convex function of the form (6) and let $\{x_k\}$ be generated by OSGA. Then the complexity bounds for smooth and nonsmooth problems are as follows:
(i) (Nonsmooth complexity bound) If the points generated by OSGA stay in a bounded region of the interior of $C$, or $f$ is Lipschitz continuous in $C$, then the total number of iterations needed is $O(1/\varepsilon^2)$. Thus the asymptotic worst-case complexity is $O(1/\varepsilon^2)$.
(ii) (Smooth complexity bound) If $f$ has Lipschitz continuous gradient, the total number of iterations needed for the algorithm is $O(1/\sqrt{\varepsilon})$.

Algorithmic framework. Before presenting the OSGA algorithm, we first look at the first-order oracle of the problem (8), which is important in the presence of linear mappings, and then we outline the adapted algorithm. On the one hand, practical problems commonly involve high-dimensional data, i.e., function and subgradient evaluations are quite expensive.
On the other hand, in many applications most of the computational cost of function and subgradient evaluations comes from applying direct and adjoint linear operators, originating from the presence of the affine terms in (8). These facts suggest that we should apply linear operators and their adjoints as few times as possible in our implementations. For the sake of simplicity in the rest of the paper, we write $\Psi_x$ and $g_x$ instead of $\Psi(x)$ and $g_\Psi(x)$, respectively. Considering the structure of (8), we see that the affine terms $A_i x$ for $i = 1, 2, \dots, n_1$ and $W_j x$ for $j = 1, 2, \dots, n_2$ appear in both function and subgradient evaluations. As a result, in the oracle we set $v_x^i = A_i x$ for $i = 1, 2, \dots, n_1$ and $w_x^j = W_j x$ for $j = 1, 2, \dots, n_2$. Therefore, the function value and a subgradient of $\Psi$ at $x$ are computed by the following scheme:

Algorithm 1: NFO-FG (nonsmooth first-order oracle)
Input: global parameters: $A_i$ for $i = 1, \dots, n_1$; $W_j$ for $j = 1, \dots, n_2$; local parameters: $x$;
Output: $\Psi_x$; $g_\Psi(x)$;
begin
  $v_x^i \leftarrow A_i x$ for $i = 1, \dots, n_1$;
  $w_x^j \leftarrow W_j x$ for $j = 1, \dots, n_2$;
  $\Psi_x \leftarrow \sum_{i=1}^{n_1} f_i(v_x^i) + \sum_{j=1}^{n_2} \varphi_j(w_x^j)$;
  $g_\Psi(x) \leftarrow \sum_{i=1}^{n_1} A_i^* \partial f_i(v_x^i) + \sum_{j=1}^{n_2} W_j^* \partial \varphi_j(w_x^j)$;
end

The structure of NFO-FG immediately implies that any call of the oracle $O(x) = (\Psi_x, g_\Psi(x))$ requires one call of each of the $n_1 + n_2$ direct and adjoint linear operators. It is also clear that, by this scheme, one avoids applying the expensive linear operators twice when computing function values and subgradients. In cases where algorithms need only a function value or only a subgradient, we consider two special cases of NFO-FG that return just a function value or just a subgradient, respectively called NFO-F and NFO-G. We also emphasize that, in cases where the total computational cost of the oracle is dominated by applying linear mappings, the complexity of an algorithm can be measured by counting the number of direct and adjoint linear operators applied to achieve a prescribed accuracy $\varepsilon$.

Considering what has been established in this section and the constructions introduced in Section 2 of [49] concerning linear relaxations and the case of strongly convex functions, we outline a version of OSGA for the multi-term affine composite function (8) as follows:

Algorithm 2: OSGA for multi-term affine composite functions
Input: global tuning parameters: $\delta$; $\alpha_{\max} \in\,]0, 1[$; $0 < \kappa' \le \kappa$; local parameters: $x_0$; $\mu \ge 0$; $\Psi_{\mathrm{target}}$;
Output: $x_b$; $\Psi_{x_b}$;
begin
  choose an initial best point $x_b$; compute $\Psi_{x_b}$ and $g_\Psi(x_b)$ using NFO-FG;
  if $\Psi_{x_b} \le \Psi_{\mathrm{target}}$ then stop;
  else
    $h \leftarrow g_{x_b} - \mu g_Q(x_b)$; $\gamma \leftarrow \Psi_{x_b} - \mu Q(x_b) - \langle h, x_b \rangle$;
    $\gamma_b \leftarrow \gamma - \Psi_{x_b}$; $u \leftarrow U(\gamma_b, h)$; $\eta \leftarrow E(\gamma_b, h) - \mu$;
  end
  $\alpha \leftarrow \alpha_{\max}$;
  while stopping criteria do not hold do
    $x \leftarrow x_b + \alpha(u - x_b)$; compute $\Psi_x$ and $g_\Psi(x)$ using NFO-FG;
    $g \leftarrow g_x - \mu g_Q(x)$; $\bar h \leftarrow h + \alpha(g - h)$;
    $\bar\gamma \leftarrow \gamma + \alpha(\Psi_x - \mu Q(x) - \langle g, x \rangle - \gamma)$;
    $x_b' \leftarrow \operatorname{argmin}_{z \in \{x_b, x\}} \Psi(z)$; $\Psi_{x_b'} \leftarrow \min\{\Psi_{x_b}, \Psi_x\}$;
    $\bar\gamma_b' \leftarrow \bar\gamma - \Psi_{x_b'}$; $u' \leftarrow U(\bar\gamma_b', \bar h)$; $x' \leftarrow x_b + \alpha(u' - x_b)$;
    compute $\Psi_{x'}$ using NFO-F;
    choose $\bar x_b$ in such a way that $\Psi_{\bar x_b} \le \min\{\Psi_{x_b'}, \Psi_{x'}\}$;
    $\bar\gamma_b \leftarrow \bar\gamma - \Psi_{\bar x_b}$; $\bar u \leftarrow U(\bar\gamma_b, \bar h)$; $\bar\eta \leftarrow E(\bar\gamma_b, \bar h) - \mu$;
    $x_b \leftarrow \bar x_b$; $\Psi_{x_b} \leftarrow \Psi_{\bar x_b}$;
    if $\Psi_{x_b} \le \Psi_{\mathrm{target}}$ then stop;
    else update the parameters $\alpha$, $h$, $\gamma$, $\eta$ and $u$ using PUS;
    end
  end
end

Paying attention to the structure of problem (8) and the OSGA framework, we only require the following simple operations to implement the algorithm:
• $x + \beta y$, where $x, y \in U_i$ for $i = 1, \dots, n_1$ and $\beta \in \mathbb{R}$;
• $r + \beta s$, where $r, s \in V_j$ for $j = 1, \dots, n_2$ and $\beta \in \mathbb{R}$;
• $A_i x$ for $i = 1, \dots, n_1$ and $W_j x$ for $j = 1, \dots, n_2$, where $x \in X$;
• $A_i^* y$ with $y \in U_i$ for $i = 1, \dots, n_1$, and $W_j^* y$ with $y \in V_j$ for $j = 1, \dots, n_2$.

As a result of (18), a natural stopping criterion for the algorithm is the condition $\eta \le \varepsilon$, but our experiments show that satisfying this condition can be slow and take many iterations. More practical stopping criteria are therefore needed; for example, a bound on the number of iterations, function evaluations, subgradient evaluations or the running time, or a combination of them, can be used. As discussed in [49], OSGA uses the following scheme for updating the parameters $\alpha$, $h$, $\gamma$, $\eta$ and $u$:

Algorithm 3: PUS (parameters updating scheme)
Input: global tuning parameters: $\delta$; $\alpha_{\max} \in\,]0, 1[$; $0 < \kappa' \le \kappa$; local parameters: $\alpha$; $\eta$; $\bar h$; $\bar\gamma$; $\bar\eta$; $\bar u$;
Output: $\alpha$; $h$; $\gamma$; $\eta$; $u$;
begin
  $R \leftarrow (\eta - \bar\eta)/(\delta\alpha\eta)$;
  if $R < 1$ then $\bar\alpha \leftarrow \alpha e^{-\kappa}$;
  else $\bar\alpha \leftarrow \min(\alpha e^{\kappa'(R-1)}, \alpha_{\max})$;
  end
  $\alpha \leftarrow \bar\alpha$;
  if $\bar\eta < \eta$ then $h \leftarrow \bar h$; $\gamma \leftarrow \bar\gamma$; $\eta \leftarrow \bar\eta$; $u \leftarrow \bar u$;
  end
end

We emphasize here that the composite version of the optimal subgradient algorithm inherits the main characteristics of the original OSGA framework. Thus, we expect the theoretical analysis to be the same as that reported for OSGA in [49], which means that the current version remains optimal.

Prox-functions and solving the auxiliary subproblem. In the rest of this section, we first propose some prox-functions and then derive a closed form for the corresponding subproblem (15). As mentioned, the prox-function $Q(z)$ should be a strongly convex function that is analytically known. In particular, it attains its unique global minimum, which is positive by the definition of $Q$. Suppose $z_0$ is the global minimizer of $Q$. By the definition of the center of $Q$ and the first-order optimality condition for (14), we have $g_Q(z_0) = 0$. This fact together with (13) implies

$Q(z) \ge Q_0 + \tfrac{\sigma}{2}\|z - z_0\|^2$,   (19)

which simply means that $Q(z) > 0$ for all $z \in X$. In addition, it is desirable that prox-functions be separable. Taking this and (19) into account, appropriate choices of prox-functions for the unconstrained problem (8) are

$Q_1(z) := Q_0 + \tfrac{\sigma}{2}\|z - z_0\|_2^2$   (20)

and

$Q_2(z) := Q_0 + \tfrac{\sigma}{2}\sum_{i=1}^n w_i (z_i - (z_0)_i)^2$,   (21)

where $Q_0 > 0$ and $w_i \ge 1$ for $i = 1, 2, \dots, n$. It is not hard to show that both (20) and (21) are strongly convex functions satisfying (19). To introduce a more general class of prox-functions, let us define the quadratic norm on the vector space $X$ by

$\|z\|_D := \sqrt{\langle Dz, z \rangle}$

by means of a preconditioner $D$, where $D$ is symmetric and positive definite. The associated dual norm on $X^*$ is then given by

$\|h\|_D^* := \|D^{-1}h\|_D = \sqrt{\langle h, D^{-1}h \rangle}$,

where $D^{-1}$ denotes the inverse of $D$. Given a symmetric and positive definite preconditioner $D$, it is natural to consider the quadratic function

$Q(z) := Q_0 + \tfrac{\sigma}{2}\|z - z_0\|_D^2$,   (22)

where $Q_0$ is a positive number and $z_0 \in X$ is the center of $Q$. We now prove that $Q$ is a prox-function and satisfies (19). The fact that $g_Q(x) = \sigma D(x - z_0)$ leads to

$Q(x) + \langle g_Q(x), z - x \rangle + \tfrac{\sigma}{2}\|z - x\|_D^2
 = Q(x) + \langle g_Q(x), z - x \rangle + \tfrac{\sigma}{2}\langle D(z - x), z - x \rangle$
$ = Q_0 + \tfrac{\sigma}{2}\langle D(x - z_0), x - z_0 \rangle + \langle g_Q(x), z - x \rangle + \tfrac{\sigma}{2}\langle D(z - x), z - x \rangle$
$ = Q_0 + \tfrac{\sigma}{2}\langle D(x - z_0), z - z_0 \rangle + \tfrac{\sigma}{2}\langle D(z - z_0), z - x \rangle$
$ = Q_0 + \tfrac{\sigma}{2}\langle D(x - z_0), z - z_0 \rangle + \tfrac{\sigma}{2}\langle D(z - x), z - z_0 \rangle$
$ = Q_0 + \tfrac{\sigma}{2}\langle D(z - z_0), z - z_0 \rangle$
$ = Q(z).$
This clearly means that $Q$ is strongly convex with convexity parameter $\sigma$, i.e., (22) is a prox-function. Moreover, $z_0$ is the center of $Q$ and $g_Q(z_0) = 0$. This fact together with (13) and (14) implies that (19) holds.

At this point, we emphasize that solving the subproblem (15) efficiently depends strongly on the selection of the prox-function. While a suitable prox-function allows a very efficient way of computing $u$, other selections may significantly slow down the solution of the auxiliary subproblem. Here, it is assumed that, by an appropriate selection of the prox-function, the subproblem (15) can be solved explicitly. In the following result, we give the solution of (15) for the prox-function (22).

Proposition 2.1 Suppose $Q$ is determined by (22) and $Q_0 > 0$. Then $Q$ is a prox-function with center $z_0$ and satisfies (19). Moreover, the subproblem (15) corresponding to this $Q$ is explicitly solved by

$u = z_0 - E(\gamma, h)^{-1}\sigma^{-1}D^{-1}h$   (23)

with

$E(\gamma, h) = \dfrac{-\beta_1 + \sqrt{\beta_1^2 + 4Q_0\beta_2}}{2Q_0} = \dfrac{2\beta_2}{\beta_1 + \sqrt{\beta_1^2 + 4Q_0\beta_2}}$,   (24)

where $\beta_1 = \gamma + \langle h, z_0 \rangle$ and $\beta_2 = \left(\tfrac{1}{2}\sigma^{-2} - \sigma^{-1}\right)\|h\|_*^2$.

Proof We already proved that $Q$ is a prox-function and that (19) holds. The rest of the proof is similar to that discussed in [49] for $\sigma = 1$. □

It is obvious that (20) and (21) are special cases of (22), so the solution (23)–(24) derived for (22) can simply be adapted when (20) or (21) is used as the prox-function. Furthermore, notice that the error bound (17) is proportional to $Q(x^*)$, implying that a good choice of $x_0$ makes the term $Q(x^*) = Q_0 + \tfrac{1}{2}\|x^* - x_0\|_D^2$ small. Hence, selecting a starting point $x_0$ as close to the optimizer $x^*$ as possible has a positive effect on the complexity bound. This also suggests that a reasonable choice for $Q_0$ is $Q_0 \approx \tfrac{1}{2}\|x^* - x_0\|^2$. This fact will be taken into account in the numerical experiments.

We conclude this section by considering the case where the variable domain is the set of all $m \times n$ matrices, $X = \mathbb{R}^{m \times n}$. In this case, we introduce a suitable prox-function such that the subproblem (15) is solved in a closed form. Details are described in the following proposition.

Proposition 2.2 Suppose $Q_0 > 0$ and $Q$ is determined by

$Q(Z) := Q_0 + \tfrac{\sigma}{2}\|Z - Z_0\|_F^2$,   (25)

where $\|\cdot\|_F$ is the Frobenius norm. Then $Q$ is a prox-function with center $Z_0$ and satisfies (19). Moreover, the subproblem (15) corresponding to this $Q$ is explicitly solved by

$U = Z_0 - E(\gamma, H)^{-1}\sigma^{-1}H$   (26)

with

$E(\gamma, H) = \dfrac{-\xi_1 + \sqrt{\xi_1^2 + 4Q_0\xi_2}}{2Q_0} = \dfrac{2\xi_2}{\xi_1 + \sqrt{\xi_1^2 + 4Q_0\xi_2}}$,   (27)

where $\xi_1 = \gamma + \mathrm{Tr}(H^T Z_0)$ and $\xi_2 = \left(\tfrac{1}{2}\sigma^{-2} - \sigma^{-1}\right)\|H\|_F^2$.

Proof By the properties of the Frobenius norm, it is clear that $Z = Z_0$ is the minimizer of (14). Also, $Q$ is continuously differentiable with $g_Q(X) = \sigma(X - Z_0)$. Thus, we have

$Q(X) + \langle g_Q(X), Z - X \rangle + \tfrac{\sigma}{2}\|Z - X\|_F^2
 = Q_0 + \tfrac{\sigma}{2}\langle X - Z_0, X - Z_0 \rangle + \sigma\langle X - Z_0, Z - X \rangle + \tfrac{\sigma}{2}\langle Z - X, Z - X \rangle$
$ = Q_0 + \tfrac{\sigma}{2}\langle X - Z_0, Z - Z_0 \rangle + \tfrac{\sigma}{2}\langle Z - Z_0, Z - X \rangle$
$ = Q_0 + \tfrac{\sigma}{2}\langle Z - Z_0, Z - Z_0 \rangle$
$ = Q(Z).$

This means that $Q$ is strongly convex with convexity parameter $\sigma$, i.e., (25) is a prox-function. Also, by (13) and (14), the condition (19) clearly holds. Now, we consider the subproblem (15) and derive its minimizer. Let us define the function $E_{\gamma,H} : X \to \mathbb{R}$ by

$E_{\gamma,H}(Z) := -\dfrac{\gamma + \langle H, Z \rangle}{Q(Z)}$,

which is continuously differentiable and attains its supremum at $Z = U$. Therefore, at the point $Z = U$, we obtain

$E(\gamma, H)\,Q(U) = -\gamma - \langle H, U \rangle$.   (28)
It follows from $g_Q(Z) = \sigma(Z - Z_0)$, (28) and the first-order necessary optimality condition that $E(\gamma, H)\sigma(U - Z_0) + H = 0$, or equivalently $U = Z_0 - E(\gamma, H)^{-1}\sigma^{-1}H$. Setting $e = E(\gamma, H)$ and substituting this into (28) straightforwardly leads to

$e\,Q(U) + \gamma + \langle H, U \rangle
 = e\left(Q_0 + \tfrac{1}{2}\|e^{-1}\sigma^{-1}H\|_F^2\right) + \gamma + \langle H, Z_0 - e^{-1}\sigma^{-1}H \rangle$
$ = e\left(Q_0 + \tfrac{1}{2}e^{-2}\sigma^{-2}\|H\|_F^2\right) + \gamma + \mathrm{Tr}(H^T Z_0) - e^{-1}\sigma^{-1}\|H\|_F^2$
$ = Q_0\,e + \left(\gamma + \mathrm{Tr}(H^T Z_0)\right) + e^{-1}\left(\tfrac{1}{2}\sigma^{-2} - \sigma^{-1}\right)\|H\|_F^2 = 0,$

which, after multiplying by $e$, gives the quadratic equation

$Q_0\,e^2 + \xi_1 e + \xi_2 = 0$,

where $\xi_1 = \gamma + \mathrm{Tr}(H^T Z_0)$ and $\xi_2 = \left(\tfrac{1}{2}\sigma^{-2} - \sigma^{-1}\right)\|H\|_F^2$. Solving this quadratic equation and selecting the larger solution leads to

$E(\gamma, H) = \dfrac{-\xi_1 + \sqrt{\xi_1^2 + 4Q_0\xi_2}}{2Q_0} = \dfrac{2\xi_2}{\xi_1 + \sqrt{\xi_1^2 + 4Q_0\xi_2}}$.

This completes the proof. □

3 Numerical experiments

This section reports extensive numerical experiments with the proposed algorithm on some well-known inverse problems arising in applications, compared with some popular algorithms and software packages. We assess the convergence speed of OSGA and of some widely used first-order methods for solving such practical problems. The experiments are divided into two parts. We first consider some imaging problems and give the results of experiments with OSGA and some state-of-the-art solvers. In the second part, we adapt some smooth optimization algorithms to nonsmooth problems and report encouraging results and comparisons. All codes are written in MATLAB, and default values of the parameters are used for the algorithms and packages. For OSGA, we used the prox-function (20) with $Q_0 = \tfrac{1}{2}\|x_0\|^2 + \epsilon$, where $\epsilon$ is the machine precision in MATLAB. We also set the following parameters for OSGA: $\delta = 0.9$; $\alpha_{\max} = 0.7$; $\kappa = \kappa' = 0.5$; $\Psi_{\mathrm{target}} = -\infty$. All implementations are executed on a Toshiba Satellite Pro L750-176 laptop with an Intel Core i7-2670QM processor and 8 GB RAM.

3.1 Image restoration

Image reconstruction, also called image restoration, is one of the classical linear inverse problems, dating back to the 1960s; see for example [7, 14]. The goal is to reconstruct images from some observations. Let $y \in U$ be a noisy indirect observation of an image $x \in X$ with an unknown noise $\nu \in U$. Suppose that $A : X \to U$ is a linear operator. To recover the unknown vector $x$, we use the linear inverse model (1). As discussed, according to the ill-conditioned or singular nature of the problem, some regularization is needed. Recall (3), (4) or the more general model (8) for deriving a suitable solution of the corresponding inverse problem, and refer to [33, 34, 50] for more regularization options. The strategy employed for solving such penalized optimization problems depends on the features of the objective function and the regularization terms, such as differentiability or nondifferentiability and convexity or nonconvexity. The quality of the reconstructed image highly depends on these features. A typical model for image reconstruction consists of a smooth objective function penalized by smooth or nonsmooth regularization penalties. These penalties are selected based on the expected features of the recovered image. Suppose that the unknown image $x \in X$ is represented by $x = W\tau$ in some basis domain, e.g., wavelet, curvelet, or time-frequency domains, where the operator $W : U \to V$ can be an orthogonal basis or any dictionary. There are two main strategies for restoring the image $x$ from an indirect observation $y$, namely synthesis and analysis.
To recover a clean image, the synthesis approach reconstructs the image by solving the minimization problem

$\min_{\tau \in U} \ \tfrac{1}{2}\|AW\tau - y\|_2^2 + \lambda\varphi(\tau)$,   (29)

where $\varphi$ is a convex regularizer. If $\tau^*$ is an optimizer of (29), then an approximation of the clean image $x$ is reconstructed via $x^* = W\tau^*$. Alternatively, the analysis strategy recovers the image by solving

$\min_{x \in X} \ \tfrac{1}{2}\|A(x) - y\|_2^2 + \lambda\varphi(x)$.   (30)

The similarity of problems (29) and (30) suggests that they are strongly related. The two approaches are equivalent under some assumptions and differ under others; for more details see [29]. Although various regularizers such as $\ell_p$-norms are popular in image restoration, arguably the best-known and most frequently employed regularizer in analysis approaches to image restoration is total variation (TV), which will be discussed later in this section. In the sequel, we report our numerical experiments based on the analysis strategy.

The pioneering work on total variation as a regularizer for image denoising and restoration was proposed by Rudin, Osher, and Fatemi in [51]. Total variation regularization is able to restore discontinuities of images and recover the edges. This makes TV appealing and widely used in applications. TV was originally defined in an infinite-dimensional Hilbert space; for digital image processing, however, it is preferable to consider it in a finite-dimensional space such as $\mathbb{R}^n$ of pixel values on a two-dimensional lattice. In practice, discrete versions of TV are therefore favoured. Two standard choices of discrete TV-based regularizers, namely isotropic total variation and anisotropic total variation, are popular in signal and image processing, defined by

$\|X\|_{ITV} = \sum_{i=1}^{m-1}\sum_{j=1}^{n-1} \sqrt{(X_{i+1,j} - X_{i,j})^2 + (X_{i,j+1} - X_{i,j})^2} + \sum_{i=1}^{m-1} |X_{i+1,n} - X_{i,n}| + \sum_{j=1}^{n-1} |X_{m,j+1} - X_{m,j}|$

and

$\|X\|_{ATV} = \sum_{i=1}^{m-1}\sum_{j=1}^{n-1} \left\{ |X_{i+1,j} - X_{i,j}| + |X_{i,j+1} - X_{i,j}| \right\} + \sum_{i=1}^{m-1} |X_{i+1,n} - X_{i,n}| + \sum_{j=1}^{n-1} |X_{m,j+1} - X_{m,j}|$

for $X \in \mathbb{R}^{m \times n}$, respectively. The image restoration problem with a discrete TV-based regularization can be formulated as

$\min_{X \in \mathbb{R}^{m \times n}} \ \tfrac{1}{2}\|A(X) - Y\|_F^2 + \lambda\,\varphi(X)$,   (31)

where $A : \mathbb{R}^{m \times n} \to \mathbb{R}^{r \times m}$ is a linear operator, $\varphi = \|\cdot\|_{ITV}$ or $\varphi = \|\cdot\|_{ATV}$, and $\|X\|_F = \sqrt{\mathrm{Tr}(XX^T)}$ is the Frobenius norm, in which $\mathrm{Tr}$ stands for the trace of a matrix. In what follows, we consider some prominent classes of imaging problems, more specifically denoising, inpainting and deblurring, and conduct our tests on numerous images.

3.1.1 Denoising with total variation

Image denoising is one of the most fundamental image restoration problems, aiming to remove noise from images, where the noise comes from sources such as sensor imperfection, poor illumination, or communication errors. While this task itself is a prominent imaging problem, it is also important as a component of other processes, such as deblurring with the proximal mapping. The main objective is to denoise images while preserving the edges. To this end, various models and ideas have been proposed. Among the available regularizers, total variation yields a model that preserves discontinuities (edges). The particular case of (31) in which $A$ is the identity operator is called denoising and is very popular with total variation regularizers. Although the original total variation introduced in [51] is an infinite-dimensional continuous constrained problem, we here consider the following unconstrained discrete finite-dimensional version:

$X_{Y,\lambda} = \operatorname*{argmin}_{X \in \mathbb{R}^{m \times n}} \ \tfrac{1}{2}\|X - Y\|_F^2 + \lambda\,\varphi(X)$,   (32)

where $\varphi = \|\cdot\|_{ITV}$ or $\varphi = \|\cdot\|_{ATV}$.
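For concreteness, the two discrete TV semi-norms above can be evaluated with a few array operations; the following MATLAB sketch (the function name is illustrative and not taken from the released package) computes both values for an m × n image X, handling the boundary sums exactly as in the definitions.

    % Discrete isotropic and anisotropic TV semi-norms of an m-by-n image X.
    function [tv_iso, tv_aniso] = discrete_tv(X)
        dx = X(2:end, :) - X(1:end-1, :);   % vertical differences,   (m-1)-by-n
        dy = X(:, 2:end) - X(:, 1:end-1);   % horizontal differences, m-by-(n-1)
        % Interior part: both differences are available on an (m-1)-by-(n-1) grid.
        di = dx(:, 1:end-1);
        dj = dy(1:end-1, :);
        tv_iso   = sum(sum(sqrt(di.^2 + dj.^2))) ...
                 + sum(abs(dx(:, end))) + sum(abs(dy(end, :)));
        tv_aniso = sum(sum(abs(di) + abs(dj))) ...
                 + sum(abs(dx(:, end))) + sum(abs(dy(end, :)));
    end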
It is clear that both $\|\cdot\|_{ITV}$ and $\|\cdot\|_{ATV}$ are nonsmooth semi-norms, so the proximal problem (32) cannot be solved explicitly for a given $Y$. Rather than solving problem (32) exactly, it is usually solved approximately by iterative processes similar to the one proposed by Chambolle in [23]. We note that $\|\cdot\|_{ITV}$ and $\|\cdot\|_{ATV}$ are nonsmooth and convex, and their subdifferentials are available. Therefore, OSGA can be employed for solving (32).

We consider the isotropic version of problem (32), where $Y$ denotes a noisy image and $X_{Y,\lambda}$ is the denoised approximation of the original $X$ corresponding to a regularization parameter $\lambda$. We run IST, TwIST [15] and FISTA [9, 10] in order to compare their performance with OSGA's results. For IST, TwIST and FISTA we use the original codes; only additional stopping criteria were added for our comparisons. We set $\lambda = 0.05$ for all algorithms. As IST, TwIST and FISTA need to solve their subproblems iteratively, we consider three different cases by limiting the number of internal iterations to chit = 5, 10 and 20. Let us consider the recovery of the 1024 × 1024 Pirate image by IST, TwIST, FISTA and OSGA to inspect the results visually and in more detail. The noisy image is generated using the MATLAB internal function awgn with SNR = 15, and the algorithms are stopped after 50 iterations. The results are summarized in Figures 1 and 2. To see the details of this experiment, we compare the results of the implementations with respect to several measures in Figure 1. The performance is measured by some relative errors for points and function values, and also by the so-called improvement in signal-to-noise ratio (ISNR). Here, ISNR and PSNR are defined by

$\mathrm{ISNR} = 20\log_{10}\left(\dfrac{\|Y - X_0\|_F}{\|X - X_0\|_F}\right) \quad \text{and} \quad \mathrm{PSNR} = 20\log_{10}\left(\dfrac{\sqrt{mn}}{\|X - X_0\|_F}\right)$,   (33)

where $X_0$ denotes the $m \times n$ clean image, $Y$ is the observed image and the pixel values are in $[0, 1]$. Generally, these ratios measure the quality of the restored image $X$ relative to the blurred or noisy observation $Y$. From Figure 1, it is observed that OSGA outperforms IST, TwIST and FISTA with respect to all considered measures. Figure 2 shows that the denoised image looks good for all considered algorithms, where the best function value and the second-best PSNR are attained by OSGA, requiring 25.46 seconds until it is stopped. However, it can be seen that by increasing the number of Chambolle's iterations, the running time increases dramatically for IST, TwIST and FISTA, while a better function value or PSNR (compared with OSGA) is not attained.

3.1.2 Inpainting with isotropic total variation

Image inpainting is a set of techniques to fill lost or damaged parts of an image such that the modifications are undetectable for viewers who are not aware of the original image. The basic idea originates from the reconstruction of degraded or damaged paintings, conventionally carried out by expert artists, which is without doubt very time-consuming and risky. Thus, the goal is to produce a modified image in which the inpainted region is unified into the image and looks normal to ordinary observers. The pioneering work on digital image inpainting was proposed by Bertalmío et al. in [13], based on nonlinear PDEs. Since then, many applications of inpainting in image interpolation, photo restoration, zooming, super-resolution, image compression and image transmission have emerged. In 2006, Chan in [24] proposed an efficient method to recover piecewise constant or smooth images by combining total variation regularizers and wavelet representations.
Fig. 1: A comparison among IST, TwIST, FISTA and OSGA for denoising the 1024 × 1024 Pirate image when they are stopped after 50 iterations. Each column stands for comparisons with a different number of Chambolle's iterations (chit = 5, 10, 20). The first row shows the relative error rel. 1 := $\|x_k - x^*\|_2/\|x^*\|_2$ of points versus iterations, the second row shows the relative error rel. 2 := $(f_k - f^*)/(f_0 - f^*)$ of function values versus time, the third row illustrates rel. 2 versus iterations, and the fourth row stands for ISNR (33) versus iterations.

Fig. 2: Denoising of the 1024 × 1024 Pirate image by IST, TwIST, FISTA and OSGA, where F denotes the best function value, PSNR is defined by (33) and T stands for the CPU time of the algorithms after 50 iterations: (a) original image; (b) noisy image; (c) OSGA: F = 3721.52, PSNR = 30.12, T = 25.46; (d) IST (chit = 5): F = 3791.57, PSNR = 30.05, T = 22.76; (e) IST (chit = 10): F = 3750.28, PSNR = 30.16, T = 43.05; (f) IST (chit = 20): F = 3741.17, PSNR = 30.08, T = 71.42; (g) TwIST (chit = 5): F = 3791.57, PSNR = 30.05, T = 47.02; (h) TwIST (chit = 10): F = 3729.11, PSNR = 30.06, T = 124.25; (i) TwIST (chit = 20): F = 3735.04, PSNR = 30.00, T = 224.44; (j) FISTA (chit = 5): F = 3791.57, PSNR = 30.05, T = 20.48; (k) FISTA (chit = 10): F = 3750.28, PSNR = 30.16, T = 42.04; (l) FISTA (chit = 20): F = 3741.17, PSNR = 30.08, T = 65.88.

Motivated by this work, we consider inpainting with isotropic total variation to reconstruct images with missing data. We consider the 512 × 512 Head CT image with 40% of its samples missing. As in the previous section, IST, TwIST, FISTA and OSGA are employed to recover the image. The parameter $\lambda$ is set to 0.09. Since the running times per iteration of these algorithms are quite different, we stop them after 50 seconds.
To have a fair comparison, three versions of each of IST, TwIST and FISTA, corresponding to 5, 10 and 20 iterations of Chambolle's algorithm for solving the proximal subproblems, are considered. The results of these implementations are illustrated in Figures 3 and 4. As for denoising, Figure 3 reports comparisons with respect to some relative errors for points and function values and also ISNR. These results show that IST is not competitive with the others, and that OSGA is superior to IST, TwIST and FISTA with respect to all considered measures. Since the data is not scaled in this implementation, PSNR is computed by

$\mathrm{PSNR} = 20\log_{10}\left(\dfrac{255\sqrt{mn}}{\|X - X_0\|_F}\right)$,

where $X_0$ is the clean image. The definition clearly shows that a larger PSNR value means a smaller $\|X - X_0\|_F$, suggesting a better quality of the restored image; conversely, a smaller PSNR value indicates a larger $\|X - X_0\|_F$, implying worse image quality. Figure 4 shows that IST cannot recover the missing data effectively compared with the other algorithms. Moreover, the results show that OSGA achieves the best function value 187564.02 and the best PSNR 33.13 among the solvers considered.

3.1.3 Deblurring with isotropic total variation

Image blur is one of the most common problems in photography and can often ruin a photograph. In digital photography, motion blur is caused by camera shake, which is difficult to avoid in many situations. Deblurring has been a fundamental research problem in digital imaging and can be considered as an inverse problem of the form (1). As for other inverse problems, it is an inherently ill-conditioned problem, leading to an optimization problem of the form (5) or (8). Deblurring is generally much harder than denoising because most of the algorithms that have been developed for this problem need to solve a subproblem of the form (32) in each iteration. Here, the deblurring problem of the form (31) with the isotropic total variation regularizer is considered.

As discussed for denoising and inpainting, IST cannot recover images efficiently and is not competitive with the other algorithms considered. Hence, we instead take advantage of the more sophisticated package SpaRSA [55] in our implementations, so the comparison for deblurring involves TwIST, SpaRSA, FISTA and OSGA. Here, the regularization parameter is set to $\lambda = 0.05$. We now consider deblurring of the 512 × 512 blurred/noisy Elaine image. Let $Y$ be a blurred/noisy version of this image generated by a 9 × 9 uniform blur and adding Gaussian noise with SNR = 40 dB. We recover the image via (31) using the algorithms with 25 seconds of running time. The results are shown in Figures 5 and 6. In Figure 5, the algorithms are compared with respect to relative errors for points and function values and also ISNR. Subfigures (a)–(b) of Figure 5 indicate that OSGA obtains a better level of accuracy with respect to both of these measures, while Figure 5 (d) suggests that OSGA is superior to the others with respect to ISNR. To inspect the reconstructions visually, we refer to Figure 6, which reports the final function value and PSNR of each algorithm. It can be seen that OSGA obtains the best function value 79252.60 and the best PSNR 32.23 among the considered algorithms.

We finally consider 72 widely used test images for further comparisons, where the blurred/noisy images are generated by a 9 × 9 uniform blur and adding Gaussian noise with SNR = 40 dB. The images are recovered by TwIST, SpaRSA, FISTA and OSGA, where the algorithms are stopped after 100 iterations.
The results of the reconstructions are summarized in Table 1, where we report the best function value (F), PSNR and CPU time (T), suggesting that OSGA outperforms the others. The results are also illustrated in Figure 7, where the comparisons are based on the performance profiles of Dolan and Moré [27]. Figure 7 shows that OSGA wins with 84% and 93% scores for function values and PSNR, respectively, among all the others. It also indicates that all algorithms recover the images successfully.

Fig. 3: A comparison among IST, TwIST, FISTA and OSGA for inpainting the 512 × 512 Head CT image when 40% of its data is missing. The algorithms are stopped after 50 seconds of running time. Each column stands for comparisons with a different number of Chambolle's iterations (chit = 5, 10, 20). The first row denotes the relative error rel. 1 = $\|x_k - x^*\|_2/\|x^*\|_2$ of points versus iterations, the second row shows the relative error rel. 2 = $(f_k - f^*)/(f_0 - f^*)$ of function values versus time, the third row illustrates rel. 2 versus iterations, and the fourth row stands for ISNR (33) versus iterations.

Fig. 4: Inpainting of the 512 × 512 Head CT image by IST, TwIST, FISTA and OSGA when they are stopped after 50 seconds of running time: (a) original image; (b) blurred/noisy image; (c) OSGA: F = 187564.02, PSNR = 33.13; (d) IST (chit = 5): F = 710603.30, PSNR = 16.15; (e) IST (chit = 10): F = 1046712.18, PSNR = 13.61; (f) IST (chit = 20): F = 1284434.37, PSNR = 12.08; (g) TwIST (chit = 5): F = 189347.03, PSNR = 29.84; (h) TwIST (chit = 10): F = 189533.03, PSNR = 29.94; (i) TwIST (chit = 20): F = 192946.47, PSNR = 29.26; (j) FISTA (chit = 5): F = 188676.25, PSNR = 30.37; (k) FISTA (chit = 10): F = 189145.34, PSNR = 30.36; (l) FISTA (chit = 20): F = 198692.14, PSNR = 30.24.
Fig. 5: A comparison among TwIST, SpaRSA, FISTA and OSGA for deblurring the 512 × 512 Elaine image with a 9 × 9 uniform blur and Gaussian noise with SNR = 40 dB. TwIST, SpaRSA and FISTA exploit chit = 5, and all algorithms are stopped after 100 iterations. In the figure: (a) stands for the relative error rel. 1 = $\|x_k - x^*\|_2/\|x^*\|_2$ of points versus iterations, (b) shows the relative error rel. 2 = $(f_k - f^*)/(f_0 - f^*)$ of function values versus time, (c) demonstrates rel. 2 versus iterations, and (d) depicts ISNR (33) versus iterations.

Fig. 6: Deblurring of the 512 × 512 Elaine image with a 9 × 9 uniform blur and Gaussian noise with SNR = 40 dB by TwIST, SpaRSA, FISTA and OSGA. The algorithms are stopped after 100 iterations, and the number of Chambolle iterations for TwIST, SpaRSA and FISTA is set to chit = 5: (a) original image; (b) blurred/noisy image; (c) TwIST: f = 80733.24, PSNR = 31.99, T = 20.07; (d) SpaRSA: f = 79698.36, PSNR = 32.12, T = 12.16; (e) FISTA: f = 80795.59, PSNR = 31.92, T = 11.84; (f) OSGA: f = 79252.60, PSNR = 32.23, T = 14.82.

Fig. 7: Deblurring of 72 test images with isotropic total variation using TwIST, SpaRSA, FISTA and OSGA in 25 seconds of running time: (a) performance profile for function values; (b) performance profile for PSNR.
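The PSNR and ISNR values reported in the figures and in Table 1 follow (33) and the unscaled PSNR formula of Section 3.1.2; a minimal MATLAB sketch (the function name and the peak argument are illustrative assumptions, not part of the released package) is:

    % Quality measures for a restored image X, clean image X0 and observed
    % image Y, as used in Section 3.1.
    function [psnr_val, isnr_val] = quality_measures(X, X0, Y, peak)
        % peak = 1 for images with pixel values in [0,1] (eq. (33)),
        % peak = 255 for unscaled 8-bit images (Section 3.1.2).
        [m, n]   = size(X0);
        err      = norm(X - X0, 'fro');
        psnr_val = 20*log10(peak*sqrt(m*n)/err);
        isnr_val = 20*log10(norm(Y - X0, 'fro')/err);
    end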
Table 1. Deblurring with isotropic TV (F, T & PSNR). For each of the 72 test images (with sizes ranging from 256 × 256 to 1024 × 1024; e.g., Aerial, Airplane, Cameraman, Lena, Elaine, Pirate and Washington DC) and for each of TwIST, SpaRSA, FISTA and OSGA, the table reports the best function value (F), the PSNR and the CPU time (T), and concludes with an Average row.
3.2 Comparisons with first-order methods

As discussed in the first section, there is a substantial gap between the optimal complexity bounds of first-order methods for smooth and for nonsmooth convex problems: to reach an ε-solution, the bound is O(1/√ε) for smooth problems and O(1/ε²) for nonsmooth ones. Inspired by this fact, we consider problems of both types in the form (8) and compare the best-known first-order algorithms with OSGA. Although some first-order methods use smoothing techniques to solve nonsmooth problems [16, 17, 43], we only consider those that solve the original problem directly. In other words, we only consider algorithms exploiting the nonsmooth first-order black-box oracle or employing proximity operators along with the smooth first-order black-box oracle. Hence, in our implementations, the following algorithms are considered:
• PGA: Proximal gradient algorithm [52];
• NSDSG: Nonsummable diminishing subgradient algorithm [19];
• FISTA: Beck and Teboulle's fast proximal gradient algorithm [9];
• NESCO: Nesterov's composite optimal algorithm [46];
• NESUN: Nesterov's universal gradient algorithm [48];
• NES83: Nesterov's 1983 optimal algorithm [40];
• NESCS: Nesterov's constant step optimal algorithm [42];
• NES05: Nesterov's 2005 optimal algorithm [43].

The algorithms NES83, NESCS and NES05 were originally proposed for smooth convex problems: they use the smooth first-order black-box oracle and attain the optimal complexity of order O(1/√ε). Although attaining the optimal complexity bound for smooth problems is computationally very attractive, the considered problem (8) is commonly nonsmooth. Recently, Lewis and Overton [36] investigated the behaviour of the BFGS method on nonsmooth nonconvex problems and reported interesting results. Furthermore, Nesterov [46] shows that subgradients of composite functions preserve all the important properties of gradients of smooth convex functions. These facts motivate us to investigate computationally the behaviour of optimal smooth first-order methods in the presence of the nonsmooth first-order black-box oracle, where the oracle passes a subgradient of the composite objective function instead of the gradient of its smooth part.
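Since this oracle is the only interface the adapted methods need, it is worth making it concrete. The following MATLAB sketch shows one possible nonsmooth first-order oracle for the l22 − l1 model (4); the scaling Ψ(x) = ½‖Ax − y‖² + λ‖x‖₁, the column-vector convention and the function names nfo_f and nfo_g are assumptions made for illustration and are not taken from the OSGA package.

% Sketch of a nonsmooth first-order oracle for Psi(x) = 0.5*||A*x - y||^2 + lambda*||x||_1
% (the 1/2 scaling is an assumption). Store as nfo_f.m and nfo_g.m, or as local functions.
function f = nfo_f(x, A, y, lambda)
    % function value of the composite objective
    r = A*x - y;
    f = 0.5*(r'*r) + lambda*norm(x, 1);
end

function g = nfo_g(x, A, y, lambda)
    % a subgradient: gradient of the smooth part plus lambda*sign(x),
    % which is a valid subgradient of the l1 term (zero where x_i = 0)
    g = A'*(A*x - y) + lambda*sign(x);
end

Any method that queries only function values and (sub)gradients can then be run unchanged on the nonsmooth model through this pair of routines.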
In particular, we here consider Nesterov's 1983 algorithm [40] and adapt it to solve (8) by simply supplying the subgradient of the objective as follows:

Algorithm 4: NES83 (Nesterov's 1983 algorithm for multi-term affine composite functions)
Input: select z such that z ≠ y_0 and g_{y_0} ≠ g_z; local parameters: y_0; ε > 0;
Output: x_k; Ψ_{x_k};
begin
  a_0 ← 0; x_{−1} ← y_0;
  compute g_{y_0} = g_Ψ(y_0) using NFO-G;
  compute g_z = g_Ψ(z) using NFO-G;
  α_{−1} ← ‖y_0 − z‖ / ‖g_{y_0} − g_z‖;
  while stopping criteria do not hold do
    α̂_k ← α_{k−1};
    x̂_k ← y_k − α̂_k g_{y_k};
    compute Ψ_{x̂_k} using NFO-F;
    while Ψ_{x̂_k} < Ψ_{y_k} − ½ α̂_k ‖g_{y_k}‖² do
      α̂_k ← ρ α̂_k;
      x̂_k ← y_k − α̂_k g_{y_k};
      compute Ψ_{x̂_k} using NFO-F;
    end
    x_k ← x̂_k; Ψ_{x_k} ← Ψ_{x̂_k}; α_k ← α̂_k;
    a_{k+1} ← (1 + √(4a_k² + 1)) / 2;
    y_{k+1} ← x_k + (a_k − 1)(x_k − x_{k−1}) / a_{k+1};
    compute g_{y_{k+1}} = g_Ψ(y_{k+1}) using NFO-G;
  end
end

In the same way as for NES83, the algorithms NESCS and NES05 are adapted from Nesterov's constant step algorithm [42] and Nesterov's 2005 algorithm [43], respectively. Using this technique, NES83, NESCS and NES05 are able to solve nonsmooth problems as well; however, there is currently no theory supporting their convergence. Among the algorithms considered, PGA, NSDSG, FISTA, NESCO and NESUN were originally introduced for nonsmooth problems; NSDSG employs only the nonsmooth first-order black-box oracle, while the others use the smooth first-order black-box oracle together with the proximal mapping. In the sense of complexity theory, NSDSG is just optimal for nonsmooth problems with the order O(1/ε²), PGA is not optimal with complexity of order O(1/ε), and FISTA, NESCO and NESUN are optimal with the order O(1/√ε). These complexity bounds clearly imply that NSDSG and PGA are not theoretically comparable with the others; however, we still consider them in our comparisons to see how much more computationally attractive the optimal first-order methods are than the classical ones, especially for solving large-scale problems in applications.

To get a clear assessment of OSGA, we compare it with the algorithms using only the nonsmooth first-order oracle (NSDSG, NES83, NESCS, NES05) and with those employing the smooth first-order oracle together with the proximal mapping (PGA, FISTA, NESCO, NESUN) in separate comparisons. Most of these algorithms require the global Lipschitz constant L to determine a step size: while NESCS, NES05, PGA and FISTA use the constant L directly, NESCO and NESUN start from a lower approximation of L and adapt it by backtracking line searches; for more details see [46] and [48]. In our implementations of NESCO and NESUN, similar to [11], we determine the initial estimate L_0 by

L_0 = ‖g(x_0) − g(z_0)‖_* / ‖x_0 − z_0‖,

where x_0 and z_0 are two arbitrary points such that x_0 ≠ z_0 and g(x_0) ≠ g(z_0). In [40], Nesterov used an equivalent scheme to determine an initial step size, which can be seen in NES83. For NSDSG, the step size is calculated by α_0/√k, where k is the current iteration number and α_0 is a positive constant that should be chosen as large as possible while keeping the algorithm convergent, see [19]. All the other parameters of the algorithms are set to those reported in the corresponding papers.
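For concreteness, the following MATLAB sketch is one possible transcription of Algorithm 4 built on top of the nfo_f/nfo_g oracle sketched above; the time budget maxtime, the step adjustment factor rho and the inner-iteration cap maxin are illustrative additions, and the step-size test is kept exactly as stated in Algorithm 4.

% Sketch of the adapted NES83 iteration of Algorithm 4 (column vectors assumed).
function [xk, Psik] = nes83_sketch(A, y, lambda, y0, z, rho, maxtime, maxin)
    yk = y0; xkm1 = y0; ak = 0;
    gy = nfo_g(yk, A, y, lambda);
    gz = nfo_g(z,  A, y, lambda);
    alpha = norm(y0 - z) / norm(gy - gz);            % alpha_{-1}
    t0 = tic;
    while toc(t0) < maxtime                          % stopping criterion: time budget
        ahat   = alpha;
        xhat   = yk - ahat*gy;
        Psihat = nfo_f(xhat, A, y, lambda);
        Psiy   = nfo_f(yk,   A, y, lambda);
        inner  = 0;
        % adjust the step size while the test of Algorithm 4 holds
        % (maxin caps the inner loop and is an added safeguard)
        while Psihat < Psiy - 0.5*ahat*norm(gy)^2 && inner < maxin
            ahat   = rho*ahat;
            xhat   = yk - ahat*gy;
            Psihat = nfo_f(xhat, A, y, lambda);
            inner  = inner + 1;
        end
        xk = xhat; Psik = Psihat; alpha = ahat;
        akp1 = (1 + sqrt(4*ak^2 + 1))/2;
        ykp1 = xk + (ak - 1)*(xk - xkm1)/akp1;
        xkm1 = xk; yk = ykp1; ak = akp1;
        gy = nfo_g(yk, A, y, lambda);
    end
end

NESCS and NES05 are adapted in the same spirit by replacing the gradient calls of their published schemes with nfo_g.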
We here consider solving an underdetermined system Ax = y, where A is an m × n random matrix and y is a random m-vector. This problem appears frequently in applications, and the goal is to determine x by optimization techniques. Considering the ill-conditioned nature of this problem, the most popular optimization models are (3) and (4), where (3) is clearly smooth and (4) is nonsmooth. In each case, we consider both dense and sparse data. More precisely, the dense data is constructed by setting A = rand(m, n), y = rand(1, m) and the initial point x0 = rand(1, n), and the sparse data is created by A = sprand(m, n, 0.05), y = sprand(1, m, 0.05) and the initial point x0 = sprand(1, n, 0.05); a MATLAB sketch of this setup is given after Figure 8. The problems use m = 5000 and n = 10000 and employ λ = 1, similar to [46]. For the dense problems, the algorithms are stopped after 60 seconds of running time, and for the sparse problems they are stopped after 45 seconds. Let us define

L̂ = max_{1≤i≤n} ‖a_i‖²,

where a_i, i = 1, 2, . . . , n, are the columns of A. For the dense and sparse problems, NESCS, NES05, PGA and FISTA use L = 10^4 L̂ and L = 10^2 L̂, respectively. Moreover, NSDSG employs α_0 = 10^-7 and α_0 = 10^-4 for the dense and sparse problems, respectively.

To assess the considered algorithms, we now solve (3) and (4) for both dense and sparse data, and the results are illustrated in Figures 8 and 9, respectively. In Figure 8, subfigures (a) and (b) illustrate the results for dense data, while subfigures (c) and (d) demonstrate the results for sparse data. From this figure, it can be seen that NSDSG and PGA are not competitive with the others, confirming the theoretical results about their complexity. More specifically, Figure 8 (a) and (c) imply that NES83 and NES05 are competitive and produce better results than NESCS. In this case, OSGA achieves the best results, especially for sparse data. Figure 8 (b) and (d) suggest that NESCO, NESUN and FISTA are comparable, but OSGA still generates better results. Figure 9 depicts the results of reconstructing an approximate solution of the nonsmooth problem (4) using the considered algorithms. The same interpretation for Figure 9 indicates that most of the optimal methods produce better results than the classical ones and are competitive with each other; however, OSGA commonly obtains the best results among them. Summarizing the results of both figures, we see that the adapted algorithms NES83, NESCS and NES05 are comparable with, or even better than, the other optimal algorithms; this holds especially for NES83.

Fig. 8: A comparison among the considered algorithms for solving the l22 − l22 problem (3). All subfigures plot function values versus iterations: subfigures (a) and (c) compare NSDSG, NES83, NESCS, NES05 and OSGA, while subfigures (b) and (d) compare PGA, FISTA, NESCO, NESUN and OSGA. Subfigures (a) and (b) stand for the dense data, where the algorithms stopped after 60 seconds; subfigures (c) and (d) show the results of the sparse data, where the algorithms stopped after 45 seconds.
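For reference, the following MATLAB sketch reproduces the data setup and parameter choices used for Figures 8 and 9; the shapes follow the text, and the seeding with rng is an added assumption for reproducibility.

% Sketch of the dense and sparse test data for the experiments of Figures 8 and 9.
rng(0);                                   % seed (assumption)
m = 5000; n = 10000; lambda = 1;
% dense data
A  = rand(m, n);
y  = rand(1, m);
x0 = rand(1, n);
% sparse data with density 0.05
As  = sprand(m, n, 0.05);
ys  = sprand(1, m, 0.05);
x0s = sprand(1, n, 0.05);
% Lhat = max_i ||a_i||^2 over the columns a_i of A
Lhat   = full(max(sum(A.^2, 1)));
L      = 1e4 * Lhat;                      % NESCS/NES05/PGA/FISTA, dense case (1e2*Lhat for sparse)
alpha0 = 1e-7;                            % NSDSG, dense case (1e-4 for sparse)

Depending on the solver interface, the vectors above may need to be transposed to columns.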
Fig. 9: A comparison among the considered algorithms for solving the l22 − l1 problem (4). All subfigures plot function values versus iterations: subfigures (a) and (c) compare NSDSG, NES83, NESCS, NES05 and OSGA, while subfigures (b) and (d) compare PGA, FISTA, NESCO, NESUN and OSGA. Subfigures (a) and (b) stand for the dense data, where the algorithms stopped after 60 seconds; subfigures (c) and (d) show the results of the sparse data, where the algorithms stopped after 45 seconds.

3.3 Sparse optimization

In recent decades, exploiting sparse solutions by using the structure of problems has become a common task in various applications of applied mathematics and particularly optimization, with applications mostly arising in the fields of signal and image processing, statistics, machine learning and data mining. In most cases, the problem involves high-dimensional data and only a small number of measurements is accessible, and the core of these problems is an optimization problem. Thanks to the sparsity of solutions and the structure of the problems, these optimization problems can be solved in reasonable time even for extremely high-dimensional data sets. Basis pursuit, lasso, wavelet-based deconvolution and compressed sensing are some examples, where the last one has received a lot of attention in recent years.

Among the fields involving sparse optimization, compressed sensing, also called compressive sensing or compressive sampling, is a novel sensing/sampling framework for acquiring and recovering objects like a sparse image or signal in the most efficient way possible by employing an incoherent projection basis. On the one hand, conventional processes of image/signal acquisition from frequency data follow the Nyquist-Shannon sampling theorem, which states that the number of samples required to reconstruct an image matches the number of pixels in the image, and that the number required to recover a signal without error is dictated by its bandwidth. On the other hand, it is known that most of the data we acquire can be thrown away with almost no perceptual loss. The question is then: why should all the data be acquired if most of it will be thrown away? Can we instead acquire only the part of the data that is useful in the final reconstruction? In response to these questions, a novel sensing/sampling approach that goes against the common wisdom in data acquisition was introduced, namely compressed sensing; see, for example, [21, 22, 28] and references therein. Compressed sensing assumes that the considered object, an image or signal, has a sparse representation in some basis or dictionary, and that this representation dictionary is incoherent with the sensing basis, see [20, 28]. The idea behind this framework centres on how many measurements are necessary to reconstruct an image/signal and on the related nonlinear reconstruction techniques needed to recover it.
This means that the object can be recovered reasonably well from highly undersampled data, in spite of violating the traditional Nyquist-Shannon sampling constraints. In other words, once an object is known to be sparse in a specific dictionary, one should find a set of measurements, or sensing bases, that is incoherent with this dictionary. This leads to an underdetermined system of linear equations of the form (1), which is usually ill-conditioned. Theoretical results in [21, 22, 28] give the minimum number of measurements required to reconstruct the original object for specific pairs of measurement matrices and optimization problems. Thus, the main challenge is how to reformulate this inverse problem as an appropriate minimization problem involving regularization terms and to solve it by a suitable optimization technique.

We now consider the inverse problem (1) in matrix form with A ∈ R^{m×n}, x ∈ R^n and y ∈ R^m, where the observation is deficient, m < n, as is common in compressed sensing. Therefore, the object x cannot be recovered from the observation y, even for the noiseless system Ax = y, unless additional assumptions such as sparsity of x are imposed. We here consider a sparse signal x ∈ R^n that should be recovered from the observation y ∈ R^m, where A is obtained by first filling it with independent samples of the standard Gaussian distribution and then orthonormalizing its rows, see [30]. In our experiments, we consider m = 5000 and n = 10000, where the original signal x consists of 300 randomly placed ±1 spikes and y is generated by (1) with σ² = 10^-6. We reconstruct x using (4), where the regularization parameter is set to λ = 0.1‖Ay‖∞ or λ = 0.001‖Ay‖∞; a MATLAB sketch of this setup is given after Figure 10. The algorithms are stopped after 20 seconds of running time, and the results are illustrated in Figure 10.

Fig. 10: Recovering a noisy sparse signal using the l22 − l1 problem (4), where the algorithms (NSDSG, NES83, NESCS, NES05, PGA, FISTA, NESCO, NESUN and OSGA) stopped after 20 seconds. Subfigures (a) and (c) plot function values versus iterations, and subfigures (b) and (d) plot MSE versus iterations. Subfigures (a) and (b) show the results for the regularization parameter λ = 0.1‖Ay‖∞, while subfigures (c) and (d) indicate the results for the regularization parameter λ = 0.001‖Ay‖∞.
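The signal and measurement setup described above can be sketched in MATLAB as follows; the use of randn and orth, the random seed, and the reading of ‖Ay‖∞ as ‖Aᵀy‖∞ (so that the product is well defined for A ∈ R^{m×n} and y ∈ R^m) are assumptions made for illustration.

% Sketch of the compressed sensing test problem: a 300-spike signal measured by a
% Gaussian matrix with orthonormalized rows and observed with noise of variance 1e-6.
rng(0);                                    % seed (assumption)
m = 5000; n = 10000; nspikes = 300;
A = randn(m, n);
A = orth(A')';                             % orthonormalize the rows of A
x_true = zeros(n, 1);
idx = randperm(n, nspikes);
x_true(idx) = sign(randn(nspikes, 1));     % randomly placed +-1 spikes
sigma = 1e-3;                              % sigma^2 = 1e-6
y = A*x_true + sigma*randn(m, 1);          % observation model (1)
lambda_large = 0.1   * norm(A'*y, inf);    % larger regularization parameter
lambda_small = 0.001 * norm(A'*y, inf);    % smaller regularization parameter

Because the rows of A are orthonormal in this sketch, the direct baseline x = Aᵀ(AAᵀ)⁻¹y used in Figure 11 reduces to Aᵀy.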
Fig. 11: Recovering a noisy sparse signal using the l22 − l1 problem (4). From top to bottom, the panels show the original signal (n = 10000, number of nonzeros = 300), the noisy signal, the direct solution using the pseudo-inverse, and the reconstructions by NES83 (MSE = 0.000729386), NESCS (MSE = 0.000947418), NES05 (MSE = 0.000870573), FISTA (MSE = 0.000849244) and OSGA (MSE = 0.000796183). The algorithms stopped after 20 seconds, and the quality measure MSE is defined by (34).

Figure 10 illustrates the results of the sparse signal recovery by NSDSG, NES83, NESCS, NES05, PGA, FISTA, NESCO, NESUN and OSGA, where the results are compared in terms of function values and MSE. Here, MSE is a usual measure of comparison in signal reconstruction, defined by

MSE(x) = (1/n) ‖x − x_0‖²,    (34)

where x_0 is the original signal. The results of Figure 10 indicate that NSDSG, PGA, NESCO and NESUN fail to recover the signal accurately; in particular, NESCO and NESUN could not leave their backtracking line searches within a reasonable number of inner iterations, which dramatically decreases the step sizes. In Figure 10 (a) and (b), we see that NES83, NESCS, NES05, FISTA and OSGA can reconstruct the signal accurately; however, NES83 and OSGA are much faster than the others regarding both function values and MSE. To see the sensitivity of the algorithms to the regularization parameter, Figure 10 (c) and (d) demonstrate the results of recovery for the smaller regularization parameter λ = 0.001‖Ay‖∞. It is clear that NESCS, NES05 and FISTA are more sensitive than NES83 and OSGA to the regularization parameter, as they could not obtain an accurate approximation of the original signal when the regularization parameter was decreased.

To show the results of our test visually, we illustrate in Figure 11 the recovered signals of the experiment reported in Figure 10 (a) and (b). Since NSDSG, PGA, NESCO and NESUN could not recover the signal accurately, we do not consider them in Figure 11. We also add the direct solution x = Aᵀ(AAᵀ)⁻¹y to our comparison, which shows that this solution is very inaccurate. The results suggest that NES83, NESCS, NES05, FISTA and OSGA recover the signal with visually acceptable quality. Summarizing the results of Figures 10 and 11, it can be seen that NES83 and OSGA need only about 15 iterations to achieve acceptable results, while NESCS, NES05 and FISTA require about 100 iterations to reach the same accuracy.

4 OSGA software package

The first version of the OSGA software package for solving unconstrained convex optimization problems is publicly available at

http://homepage.univie.ac.at/masoud.ahookhosh/uploads/OSGA_v1.1.tar.gz

It is implemented in the MATLAB programming language. Some examples are available to show how the user can apply OSGA, and the interface to each subprogram in the package is fully documented in the corresponding file. Moreover, the OSGA user's manual [1] describes the design of the package and how the user can solve his/her own problems.
5 Concluding remarks

Simple gradient and subgradient procedures are historically the first numerical schemes proposed for solving smooth and nonsmooth convex optimization problems, respectively. Theoretical and computational results indicate that they are too slow to solve large-scale practical problems. However, these algorithms have a simple structure and need only a small amount of memory in implementations. These features explain why first-order algorithms have received much attention during the last few decades; in particular, both theoretical and computational aspects of optimal first-order methods are very interesting.

In this paper, we study and develop an optimal subgradient algorithm, OSGA, for the class of multi-term affine composite functions. The algorithm is flexible enough to handle a broad range of structured convex optimization problems and has a simple structure. It also does not need global parameters such as Lipschitz constants, except for the strong convexity parameter, which can be set to zero if it is unknown. We consider examples of linear inverse problems and report extensive numerical results comparing OSGA with some state-of-the-art algorithms and software packages. The numerical results suggest that OSGA is computationally competitive with, or even better than, the considered state-of-the-art solvers designed specifically for these special problems. It can often reach a fast rate of convergence even beyond the theoretical rate, suggesting that OSGA is a promising algorithm for solving a wide range of large-scale convex optimization problems.

We also adapt some optimal first-order methods originally proposed for smooth convex optimization so that they can deal with nonsmooth problems as well. To this end, we simply pass a subgradient of the nonsmooth composite objective function to these algorithms in place of the gradient; in other words, we replace their original oracle by the nonsmooth black-box oracle. The numerical results show that the adapted algorithms are promising and can produce results competitive with, or even better than, those of the other considered algorithms specifically proposed for nonsmooth problems. However, they are not supported by any theoretical results at the moment, so this might be an interesting topic for future research.

The current paper only considers unconstrained convex optimization problems; a constrained version for some simple constraints is developed in [2, 3, 5]. Furthermore, for problems involving costly direct and adjoint operators, a version of OSGA using a multi-dimensional subspace search is established in [5, 6]. Finally, many more applications, such as sparse optimization with the l1/l2 or l1/l∞ norm, could be considered, but this is beyond the scope of the current work and remains for future research.

Acknowledgement. We would like to thank Arnold Neumaier for a careful reading of the manuscript and his insightful comments and feedback. We are also grateful to the authors of the TwIST, SpaRSA and FISTA software packages for generously making their codes available on the web.

References

1. Ahookhosh, M.: User's manual for OSGA (Optimal SubGradient Algorithm), (2014) http://homepage.univie.ac.at/masoud.ahookhosh/uploads/User's_manual_for_OSGA.pdf
2. Ahookhosh, M., Neumaier, A.: An optimal subgradient method for large-scale bound-constrained convex optimization, Manuscript, University of Vienna, (2014)
3. Ahookhosh, M., Neumaier, A.: Optimal subgradient methods for structured convex constrained optimization I: theoretical results, Manuscript, University of Vienna, (2014)
4. Ahookhosh, M., Neumaier, A.: Optimal subgradient methods for structured convex constrained optimization II: numerical results, Manuscript, University of Vienna, (2014)
5. Ahookhosh, M., Neumaier, A.: High-dimensional convex optimization via optimal affine subgradient algorithms, in ROKS workshop, 83–84 (2013)
6. Ahookhosh, M., Neumaier, A.: Fast subgradient algorithms for costly convex optimization problems, Manuscript, University of Vienna, (2014)
7. Andrews, H., Hunt, B.: Digital Image Restoration, Prentice-Hall, Englewood Cliffs, NJ (1977)
8. Beck, A., Teboulle, M.: Mirror descent and nonlinear projected subgradient methods for convex optimization, Operations Research Letters 31, 167–175 (2003)
9. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences 2, 183–202 (2009)
10. Beck, A., Teboulle, M.: Fast gradient-based algorithms for constrained total variation image denoising and deblurring, IEEE Transactions on Image Processing 18(11), 2419–2434 (2009)
11. Becker, S.R., Candès, E.J., Grant, M.C.: Templates for convex cone problems with applications to sparse signal recovery, Mathematical Programming Computation 3, 165–218 (2011)
12. Ben-Tal, A., Nemirovski, A.: Non-Euclidean restricted memory level method for large-scale convex optimization, Mathematical Programming 102, 407–456 (2005)
13. Bertalmio, M., Sapiro, G., Caselles, V., Ballester, C.: Image inpainting, in Proceedings of SIGGRAPH, 417–424 (2000)
14. Bertero, M., Boccacci, P.: Introduction to Inverse Problems in Imaging, IOP, Bristol, U.K. (1998)
15. Bioucas-Dias, J., Figueiredo, M.: A new TwIST: two-step iterative shrinkage/thresholding algorithms for image restoration, IEEE Transactions on Image Processing 16(12), 2992–3004 (2007)
16. Boţ, R.I., Hendrich, C.: A double smoothing technique for solving unconstrained nondifferentiable convex optimization problems, Computational Optimization and Applications 54(2), 239–262 (2013)
17. Boţ, R.I., Hendrich, C.: On the acceleration of the double smoothing technique for unconstrained convex optimization problems, Optimization, 1–24, iFirst (2012)
18. Bruckstein, A.M., Donoho, D.L., Elad, M.: From sparse solutions of systems of equations to sparse modeling of signals and images, SIAM Review 51(1), 34–81 (2009)
19. Boyd, S., Xiao, L., Mutapcic, A.: Subgradient methods, (2003) http://www.stanford.edu/class/ee392o/subgrad_method.pdf
20. Candès, E.: Compressive sampling, in Proceedings of the International Congress of Mathematicians, Vol. 3, Madrid, Spain, 1433–1452 (2006)
21. Candès, E., Tao, T.: Near-optimal signal recovery from random projections: universal encoding strategies?, IEEE Transactions on Information Theory 52(12), 5406–5425 (2006)
22. Candès, E., Romberg, J., Tao, T.: Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information, IEEE Transactions on Information Theory 52(2), 489–509 (2006)
23. Chambolle, A.: An algorithm for total variation minimization and applications, Journal of Mathematical Imaging and Vision 20(1–2), 89–97 (2004)
24. Chan, T.F., Shen, J., Zhou, H.M.: Total variation wavelet inpainting, Journal of Mathematical Imaging and Vision 25, 107–125 (2006)
25. Chen, X., Zhou, W.: Smoothing nonlinear conjugate gradient method for image restoration using nonsmooth nonconvex minimization, SIAM Journal on Imaging Sciences 3(4), 765–790 (2010)
26. Devolder, O., Glineur, F., Nesterov, Y.: First-order methods of smooth convex optimization with inexact oracle, Mathematical Programming (2013) DOI 10.1007/s10107-013-0677-5
27. Dolan, E., Moré, J.J.: Benchmarking optimization software with performance profiles, Mathematical Programming 91, 201–213 (2002)
28. Donoho, D.L.: Compressed sensing, IEEE Transactions on Information Theory 52(4), 1289–1306 (2006)
29. Elad, M., Milanfar, P., Rubinstein, R.: Analysis versus synthesis in signal priors, Inverse Problems 23, 947–968 (2007)
30. Figueiredo, M.A.T., Nowak, R.D., Wright, S.J.: Gradient projection for sparse reconstruction: application to compressed sensing and other inverse problems, IEEE Journal of Selected Topics in Signal Processing 1(4), 586–597 (2007)
31. Gonzaga, C.C., Karas, E.W.: Fine tuning Nesterov's steepest descent algorithm for differentiable convex programming, Mathematical Programming 138, 141–166 (2013)
32. Gonzaga, C.C., Karas, E.W., Rossetto, D.R.: An optimal algorithm for constrained differentiable convex optimization, Optimization-online preprint 3053 (2011) http://www.optimization-online.org/DB_FILE/2011/06/3053.pdf
33. Kaufman, L., Neumaier, A.: PET regularization by envelope guided conjugate gradients, IEEE Transactions on Medical Imaging 15, 385–389 (1996)
34. Kaufman, L., Neumaier, A.: Regularization of ill-posed problems by envelope guided conjugate gradients, Journal of Computational and Graphical Statistics 6(4), 451–463 (1997)
35. Lan, G.: Bundle-level type methods uniformly optimal for smooth and nonsmooth convex optimization, Mathematical Programming (2013) DOI 10.1007/s10107-013-0737-x
36. Lewis, A.S., Overton, M.L.: Nonsmooth optimization via quasi-Newton methods, Mathematical Programming 141, 135–163 (2013)
37. Meng, X., Chen, H.: Accelerating Nesterov's method for strongly convex functions with Lipschitz gradient, (2011) http://arxiv.org/abs/1109.6058
38. Nemirovsky, A.S., Nesterov, Y.: Optimal methods for smooth convex minimization, Zh. Vichisl. Mat. Fiz. (in Russian) 25(3), 356–369 (1985)
39. Nemirovsky, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization, Wiley, New York (1983)
40. Nesterov, Y.: A method of solving a convex programming problem with convergence rate O(1/k²), Doklady AN SSSR (in Russian) 269, 543–547 (1983); English translation: Soviet Math. Dokl. 27, 372–376 (1983)
41. Nesterov, Y.: On an approach to the construction of optimal methods of minimization of smooth convex functions, Èkonom. i Mat. Metody (in Russian) 24, 509–517 (1988)
42. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, Kluwer, Dordrecht (2004)
43. Nesterov, Y.: Smooth minimization of non-smooth functions, Mathematical Programming 103, 127–152 (2005)
44. Nesterov, Y.: Smoothing technique and its applications in semidefinite optimization, Mathematical Programming 110, 245–259 (2007)
45. Nesterov, Y.: Excessive gap technique in nonsmooth convex minimization, SIAM Journal on Optimization 16(1), 235–249 (2005)
46. Nesterov, Y.: Gradient methods for minimizing composite objective function, Mathematical Programming 140, 125–161 (2013)
47. Nesterov, Y.: Primal-dual subgradient methods for convex problems, Mathematical Programming 120, 221–259 (2009)
48. Nesterov, Y.: Universal gradient methods for convex optimization problems, Technical Report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL) (2013)
49. Neumaier, A.: OSGA: a fast subgradient algorithm with optimal complexity, Manuscript, University of Vienna (2014) http://arxiv.org/abs/1402.1125
50. Neumaier, A.: Solving ill-conditioned and singular linear systems: a tutorial on regularization, SIAM Review 40(3), 636–666 (1998)
51. Osher, S., Rudin, L., Fatemi, E.: Nonlinear total variation based noise removal algorithms, Physica D 60, 259–268 (1992)
52. Parikh, N., Boyd, S.: Proximal algorithms, Foundations and Trends in Optimization 1(3), 123–231 (2013)
53. Tikhonov, A.N.: Solution of incorrectly formulated problems and the regularization method, Soviet Mathematics Doklady 4, 1035–1038 (1963)
54. Tseng, P.: On accelerated proximal gradient methods for convex-concave optimization, Manuscript (2008) http://pages.cs.wisc.edu/~brecht/cs726docs/Tseng.APG.pdf
55. Wright, S.J., Nowak, R.D., Figueiredo, M.A.T.: Sparse reconstruction by separable approximation, IEEE Transactions on Signal Processing 57(7), 2479–2493 (2009)