Abstract
In this paper, we propose a perturbed proximal primal–dual algorithm (PProx-PDA) for an important class of linearly constrained optimization problems, whose objective is the sum of smooth (possibly nonconvex) and convex (possibly nonsmooth) functions. This family of problems can be used to model many statistical and engineering applications, such as high-dimensional subspace estimation and distributed machine learning. The proposed method is of the Uzawa type, in which a primal gradient descent step is performed followed by an (approximate) dual gradient ascent step. One distinctive feature of the proposed algorithm is that the primal and dual steps are both perturbed appropriately using past iterates, so that a number of asymptotic convergence and rate-of-convergence results (to first-order stationary solutions) can be obtained. Finally, we conduct extensive numerical experiments to validate the effectiveness of the proposed algorithm.
References
Allen-Zhu, Z., Hazan, E.: Variance reduction for faster non-convex optimization. In: Proceedings of the 33rd International Conference on Machine Learning, ICML, pp. 699–707 (2016)
Ames, B., Hong, M.: Alternating directions method of multipliers for l1-penalized zero variance discriminant analysis and principal component analysis. Comput. Optim. Appl. 64(3), 725–754 (2016)
Andreani, R., Haeser, G., Martínez, J.M.: On sequential optimality conditions for smooth constrained optimization. Optimization 60(5), 627–641 (2011)
Antoniadis, A., Gijbels, I., Nikolova, M.: Penalized likelihood regression for generalized linear models with non-quadratic penalties. Ann. Inst. Stat. Math. 63(3), 585–615 (2009)
Arrow, K.J., Hurwicz, L., Uzawa, H.: Studies in Linear and Non-linear Programming. Stanford University Press, Palo Alto (1958)
Asteris, M., Papailiopoulos, D., Dimakis, A.: Nonnegative sparse PCA with provable guarantees. In: Proceedings of the 31st International Conference on Machine Learning (ICML), vol. 32, pp. 1728–1736 (2014)
Aybat, N.S., Hamedani, E.Y.: A primal–dual method for conic constrained distributed optimization problems. In: Advances in Neural Information Processing Systems (NIPS), pp. 5049–5057 (2016)
Bertsekas, D.P.: Constrained Optimization and Lagrange Multiplier Method. Academic Press, Cambridge (1982)
Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-Dynamic Programming. Athena Scientific, Belmont (1996)
Bertsekas, D.P., Tsitsiklis, J.N.: Parallel and Distributed Computation: Numerical Methods, 2nd edn. Athena Scientific, Belmont (1997)
Bianchi, P., Jakubowicz, J.: Convergence of a multi-agent projected stochastic gradient algorithm for non-convex optimization. IEEE Trans. Autom. Control 58(2), 391–405 (2013)
Birgin, E., Martínez, J.: Practical Augmented Lagrangian Methods for Constrained Optimization. Society for Industrial and Applied Mathematics, Philadelphia (2014)
Björnson, E., Jorswieck, E.: Optimal resource allocation in coordinated multi-cell systems. Found. Trends Commun. Inf. Theory 9, 113–381 (2013)
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)
Burachik, R.S., Kaya, C.Y., Mammadov, M.: An inexact modified subgradient algorithm for nonconvex optimization. Comput. Optim. Appl. 45(1), 1–24 (2008)
Chung, F.R.K.: Spectral Graph Theory. The American Mathematical Society, Providence (1997)
Cressie, N.: Statistics for Spatial Data. Wiley, Hoboken (2015)
Curtis, F.E., Gould, N.I.M., Jiang, H., Robinson, D.P.: Adaptive augmented Lagrangian methods: algorithms and practical numerical experience. Optim. Methods Softw. 31(1), 157–186 (2016)
D’Aspremont, A., Ghaoui, L.E., Jordan, M.I., Lanckriet, G.R.G.: A direct formulation for sparse PCA using semidefinite programming. SIAM Rev. 49(3), 434–448 (2007)
Deng, W., Yin, W.: On the global and linear convergence of the generalized alternating direction method of multipliers. J. Sci. Comput. 66(3), 889–916 (2016)
Dutta, J., Deb, K., Tulshyan, R., Arora, R.: Approximate KKT points and a proximity measure for termination. J. Glob. Optim. 56(4), 1463–1499 (2013)
Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96(456), 1348–1360 (2001)
Fernández, D., Solodov, M.V.: Local convergence of exact and inexact augmented Lagrangian methods under the second-order sufficient optimality condition. SIAM J. Optim. 22(2), 384–407 (2012)
Fleiss, J.L., Levin, B., Paik, M.C.: Statistical Methods for Rates and Proportions. Wiley, Hoboken (2003)
Forero, P.A., Cano, A., Giannakis, G.B.: Distributed clustering using wireless sensor networks. IEEE J. Sel. Top. Signal Proces. 5(4), 707–724 (2011)
Gabay, D., Mercier, B.: A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Comput. Math. Appl. 2, 17–40 (1976)
Giannakis, G.B., Ling, Q., Mateos, G., Schizas, I.D., Zhu, H.: Decentralized learning for wireless communications and networking. In: Splitting Methods in Communication and Imaging. Springer, New York (2015)
Glowinski, R., Marroco, A.: Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité, d'une classe de problèmes de Dirichlet non linéaires. Revue Française d'Automatique, Informatique et Recherche Opérationnelle 9, 41–76 (1975)
Gu, Q., Wang, Z., Liu, H.: Sparse PCA with oracle property. In: Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS), pp. 1529–1537 (2014)
Haeser, G., Melo, V.: On sequential optimality conditions for smooth constrained optimization. Preprint (2013)
Hajinezhad, D., Chang, T.H., Wang, X., Shi, Q., Hong, M.: Nonnegative matrix factorization using ADMM: algorithm and convergence analysis. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4742–4746 (2016)
Hajinezhad, D., Hong, M.: Nonconvex alternating direction method of multipliers for distributed sparse principal component analysis. In: IEEE Global Conference on Signal and Information Processing (GlobalSIP). IEEE (2015)
Hajinezhad, D., Hong, M., Garcia, A.: Zeroth order nonconvex multi-agent optimization over networks. arXiv preprint arXiv:1710.09997 (2017)
Hajinezhad, D., Hong, M., Zhao, T., Wang, Z.: NESTT: A nonconvex primal–dual splitting method for distributed and stochastic optimization. In: Advances in Neural Information Processing Systems (NIPS), pp. 3215–3223 (2016)
Hajinezhad, D., Shi, Q.: Alternating direction method of multipliers for a class of nonconvex bilinear optimization: convergence analysis and applications. J. Glob. Optim. 70, 1–28 (2018)
Hamdi, A., Mishra, S.K.: Decomposition Methods Based on Augmented Lagrangians: A Survey, pp. 175–203. Springer, New York (2011)
Hestenes, M.R.: Multiplier and gradient methods. J. Optim. Appl. 4, 303–320 (1969)
Hong, M., Hajinezhad, D., Zhao, M.M.: Prox-PDA: the proximal primal-dual algorithm for fast distributed nonconvex optimization and learning over networks. In: Proceedings of the 34th International Conference on Machine Learning (ICML), (70), pp. 1529–1538 (2017)
Hong, M., Luo, Z.Q.: On the linear convergence of the alternating direction method of multipliers. Math. Program. 162(1), 165–199 (2017)
Hong, M., Luo, Z.Q., Razaviyayn, M.: Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM J. Optim. 26(1), 337–364 (2016)
Houska, B., Frasch, J., Diehl, M.: An augmented Lagrangian based algorithm for distributed nonconvex optimization. SIAM J. Optim. 26(2), 1101–1127 (2016)
Koppel, A., Sadler, B.M., Ribeiro, A.: Proximity without consensus in online multi-agent optimization. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3726–3730 (2016)
Koshal, J., Nedić, A., Shanbhag, Y.V.: Multiuser optimization: distributed algorithms and error analysis. SIAM J. Optim. 21(3), 1046–1081 (2011)
Lan, G., Monteiro, R.D.C.: Iteration-complexity of first-order augmented Lagrangian methods for convex programming. Math. Program. 155(1), 511–547 (2015)
Li, G., Pong, T.K.: Global convergence of splitting methods for nonconvex composite optimization. SIAM J. Optim. 25(4), 2434–2460 (2015)
Liao, W., Hong, M., Farmanbar, H., Luo, Z.: Semi-asynchronous routing for large scale hierarchical networks. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2894–2898 (2015)
Liavas, A.P., Sidiropoulos, N.D.: Parallel algorithms for constrained tensor factorization via alternating direction method of multipliers. IEEE Trans. Signal Process. 63(20), 5450–5463 (2015)
Liu, Y.F., Liu, X., Ma, S.: On the non-ergodic convergence rate of an inexact augmented Lagrangian framework for composite convex programming. arXiv preprint arXiv:1603.05738 (2016)
Lobel, I., Ozdaglar, A.: Distributed subgradient methods for convex optimization over random networks. IEEE Trans. Autom. Control 56(6), 1291–1306 (2011)
Lorenzo, P.D., Scutari, G.: NEXT: in-network nonconvex optimization. IEEE Trans. Signal Inf. Process Over Netw. 2(2), 120–136 (2016)
Lu, Z., Zhang, Y.: Sparse approximation via penalty decomposition methods. SIAM J. Optim. 23(4), 2448–2478 (2013)
Mateos, G., Bazerque, J.A., Giannakis, G.B.: Distributed sparse linear regression. IEEE Trans. Signal Process. 58(10), 5262–5276 (2010)
Gonçalves, M.L.N., Melo, J.G., Monteiro, R.D.C.: Convergence rate bounds for a proximal ADMM with over-relaxation stepsize parameter for solving nonconvex linearly constrained problems. arXiv preprint arXiv:1702.01850 (2017)
Nedić, A., Olshevsky, A.: Distributed optimization over time-varying directed graphs. IEEE Trans. Autom. Control 60(3), 601–615 (2015)
Nedić, A., Ozdaglar, A.: Distributed subgradient methods for multi-agent optimization. IEEE Trans. Autom. Control 54(1), 48–61 (2009)
Nedić, A., Ozdaglar, A., Parrilo, P.A.: Constrained consensus and optimization in multi-agent networks. IEEE Trans. Autom. Control 55(4), 922–938 (2010)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Springer, Berlin (2004)
Nocedal, J., Wright, S.J.: Numerical Optimization. Springer, Berlin (1999)
Powell, M.J.D.: An efficient method for nonlinear constraints in minimization problems. In: Optimization, pp. 283–298. Academic Press (1969)
Razaviyayn, M., Hong, M., Luo, Z.Q.: A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM J. Optim. 23(2), 1126–1153 (2013)
Rockafellar, R.T.: Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Math. Oper. Res. 1(2), 97–116 (1976)
Ruszczyński, A.: Nonlinear Optimization. Princeton University Press, Princeton (2011)
Schizas, I., Ribeiro, A., Giannakis, G.: Consensus in ad hoc WSNs with noisy links—part I: distributed estimation of deterministic signals. IEEE Trans. Signal Process. 56(1), 350–364 (2008)
Scutari, G., Facchinei, F., Song, P., Palomar, D.P., Pang, J.S.: Decomposition by partial linearization: parallel optimization of multi-agent systems. IEEE Trans. Signal Process. 63(3), 641–656 (2014)
Shi, W., Ling, Q., Wu, G., Yin, W.: EXTRA: an exact first-order algorithm for decentralized consensus optimization. SIAM J. Optim. 25(2), 944–966 (2015)
Sun, Y., Scutari, G., Palomar, D.: Distributed nonconvex multiagent optimization over time-varying networks. In: 50th Asilomar Conference on Signals, Systems and Computers, pp. 788–794 (2016)
Tsitsiklis, J.: Problems in decentralized decision making and computation. Ph.D. thesis, Massachusetts Institute of Technology (1984)
Vu, V.Q., Cho, J., Lei, J., Rohe, K.: Fantope projection and selection: a near-optimal convex relaxation of sparse PCA. In: Advances in Neural Information Processing Systems (NIPS), pp. 2670–2678 (2013)
Wang, Y., Yin, W., Zeng, J.: Global convergence of ADMM in nonconvex nonsmooth optimization. J. Sci. Comput. 78(1), 29–63 (2019)
Wright, S.J.: Implementing proximal point methods for linear programming. J. Optim. Theory Appl. 65(3), 531–554 (1990)
Wen, Z., Yang, C., Liu, X., Marchesini, S.: Alternating direction methods for classical and ptychographic phase retrieval. Inverse Probl. 28(11), 1–18 (2012)
Yildiz, M.E., Scaglione, A.: Coding with side information for rate-constrained consensus. IEEE Trans. Signal Process. 56(8), 3753–3764 (2008)
Zhang, C.H.: Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38(2), 894–942 (2010)
Zhang, Y.: Convergence of a class of stationary iterative methods for saddle point problems. Preprint (2010)
Zhu, H., Cano, A., Giannakis, G.: Distributed consensus-based demodulation: algorithms and error analysis. IEEE Trans. Wirel. Commun. 9(6), 2044–2054 (2010)
Acknowledgements
The authors would like to thank Dr. Quanquan Gu, who provided us with the codes of [29]. The authors would also like to thank Dr. Gesualdo Scutari for helpful discussions about the numerical results.
This work was completed when Davood Hajinezhad was a Ph.D. student at Iowa State University. Mingyi Hong is supported by NSF Grant CMMI-1727757 and AFOSR Grant 15RT0767.
Appendices
Appendix A
In this section, we justify Assumption [B4], which imposes the boundedness of the sequence of dual variables. Throughout this section we will assume that Assumptions A and [B1]–[B3] hold. First, we prove that when \(\Vert \lambda ^{r+1}\Vert \rightarrow \infty \), we have \(\lim \inf _{r\rightarrow \infty } \frac{\beta ^{r+1}\Vert x^{r+1}-x^r\Vert }{\Vert \lambda ^{r+1}\Vert }= 0\). Using Assumption [B3] we have the following identity
Assume the contrary, that there exists \(c_1>0\) such that
Then from (74), it is easy to show that when r is large enough, the potential function P is decreasing.
As in Lemma 3, it is relatively easy to show that the potential function is lower and upper bounded (the proof is included in Lemmas 10–11 in the online version). The lower boundedness of the potential function, together with the fact that it is decreasing, implies that (75) holds true, which further implies that \(\frac{1}{\beta ^{r+1}}\Vert \lambda ^{r+1}\Vert ^2\rightarrow 0\) according to (83). Examining the definition of the potential function in (73) and using the choice of c in (70), we conclude that the term \(\frac{\beta ^{r+1}\rho ^{r+1}}{2}\Vert x^{r+1}-x^r\Vert ^2\) in the potential function is bounded. Therefore, there exists \(D_1\) such that
It follows that \(c_1\Vert \lambda ^{r+1}\Vert ^2\) is also upper bounded. This contradicts our assumption that \(\Vert \lambda ^{r}\Vert \rightarrow \infty \).
Next, we make use of a constraint qualification to argue the boundedness of the dual variables. The technique used in the proof is relatively standard; see the recent works [21, 51]. Assume that the so-called Robinson's condition is satisfied for problem (1) at \({\hat{x}}\) [62, Chap. 3]. This means \(\{A d_x\mid d_x\in {\mathcal {T}}_{X}(\hat{x})\}={\mathbb {R}}^M,\) where \(d_x\) denotes a tangent direction to the convex set X, and \({\mathcal {T}}_{X}(\hat{x})\) is the tangent cone to the feasible set X at the point \(\hat{x}\). Utilizing this assumption, we will prove that the dual variables are bounded.
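As a minimal illustration, consider the simplest special case in which \(\hat{x}\) lies in the interior of X: then \({\mathcal {T}}_{X}(\hat{x})={\mathbb {R}}^N\), and Robinson's condition holds if and only if A has full row rank. The short sketch below checks exactly this; the matrix A is purely illustrative and is not taken from the paper.

```python
# Minimal sketch (illustrative A): with x-hat in the interior of X we have
# T_X(x-hat) = R^N, so Robinson's condition {A d_x : d_x in T_X(x-hat)} = R^M
# holds if and only if A has full row rank.
import numpy as np

A = np.array([[1.0, -1.0,  0.0],
              [0.0,  1.0, -1.0]])   # A in R^{M x N}, M = 2, N = 3
M = A.shape[0]

robinson_interior_case = (np.linalg.matrix_rank(A) == M)
print("Robinson's condition (interior-point case):", robinson_interior_case)
```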
Lemma 6
Suppose Robinson's condition holds for problem (1). Then the sequence of dual variables \(\{\lambda ^r\}\) generated by (67b) is bounded.
Proof
We argue by contradiction. Suppose that the dual variable sequence is not bounded, i.e.,
From the optimality condition of \(x^{r+1}\) we have for all \(x\in X\)
Note that \(\lim \inf _{r\rightarrow \infty } \frac{\beta ^{r+1}\Vert x^{r+1}-x^r\Vert }{\Vert \lambda ^{r+1}\Vert } =0\), so the following holds:
Let us define a new bounded sequence as \(\mu ^r = \lambda ^r/\Vert \lambda ^r\Vert \), \(r=1,2, \ldots \). Let \((x^*, \mu ^*)\) be an accumulation point of \(\{x^{r+1}, \mu ^{r+1}\}\), and assume that Robinson's condition holds at \(x^*\). Dividing both sides of the above inequality by \(\Vert \lambda ^{r+1}\Vert \), we obtain, for all \(x\in X\),
Taking the limit, passing to a subsequence if necessary, and utilizing the assumption that \(\Vert \lambda ^{r+1}\Vert \rightarrow \infty \) together with the compactness of X, we obtain
Utilizing Robinson's condition, we know that there exist \(x\in X\) and a scaling constant \(c>0\) such that \(cA(x-x^*) = - \mu ^*\), which combined with the above relation yields \(-\frac{1}{c}\Vert \mu ^*\Vert ^2\ge 0\). Therefore we must have \(\mu ^* = 0\). However, this contradicts the fact that \(\Vert \mu ^*\Vert =1\). Therefore, we conclude that \(\{\lambda ^r\}\) is a bounded sequence.\(\square \)
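For clarity, the last step can be summarized compactly, assuming the limiting relation above takes the variational-inequality form \(\langle A^{\top }\mu ^*, x - x^*\rangle \ge 0\) for all \(x\in X\) (consistent with the optimality condition of \(x^{r+1}\)):
\[
0 \;\le \; \big \langle A^{\top }\mu ^*,\, x - x^*\big \rangle \;=\; \tfrac{1}{c}\,\big \langle \mu ^*,\, cA(x-x^*)\big \rangle \;=\; -\tfrac{1}{c}\,\Vert \mu ^*\Vert ^2 \;\le \; 0, \qquad \text{hence } \mu ^* = 0.
\]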
Appendix B
We show how the sufficient conditions developed in “Appendix A” can be applied to the problems discussed in Sect. 1.2. We will focus on the partial consensus problem (10).
To proceed, we note that Robinson's condition reduces to the well-known Mangasarian–Fromovitz constraint qualification (MFCQ) if we set \(X={\mathbb {R}}^{N}\) and write the inequality constraints explicitly as \(g(x)\le 0\) [62, Lemma 3.16]. To state the MFCQ, consider the following system
where \(p_i:{\mathbb {R}}^N\rightarrow {\mathbb {R}}\) and \(g_j:{\mathbb {R}}^N\rightarrow {\mathbb {R}}\) are all continuously differentiable functions. For a given feasible solution \({\hat{y}}\) let us use \({\mathcal {A}}({\hat{y}})\) to denote the indices for active inequality constraints, that is
Let us define
Then the MFCQ holds for system (86) at the point \({\hat{y}}\) if the following hold: 1) the rows of the Jacobian matrix of p at \({\hat{y}}\), denoted by \(\nabla p({\hat{y}})\), are linearly independent; 2) there exists a vector \(d_y\in {\mathbb {R}}^N\) satisfying the conditions displayed below.
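These conditions take the standard MFCQ form: \(d_y\) must lie in the null space of the equality-constraint Jacobian while being a strict descent direction for every active inequality constraint, that is,
\[
\nabla p({\hat{y}})\, d_y = \mathbf{0}, \qquad \big \langle \nabla g_j({\hat{y}}),\, d_y \big \rangle < 0 \quad \text{for all } j \in {\mathcal {A}}({\hat{y}}).
\]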
See [62, Lemma 3.17] for more details. In the following, we show that MFCQ holds true for problem (10) at any point (x, z) that satisfies \(z\in Z\). Comparing the constraint set of this problem with system (86), we have the following specifications. The optimization variable is \(y=[x;z]\), where \(x\in {\mathbb {R}}^N\) stacks all \(x_i\in {\mathbb {R}}\) from the N nodes (here we assume \(x_i\in {\mathbb {R}}\) only for ease of presentation). Also, \(z\in {\mathbb {R}}^E\) stacks all \(z_{e}\in {\mathbb {R}}\) for \(e\in {\mathcal {E}}\). The equality constraint is written as \(p(y)=[A,-I]y =0\), where \(A\in {\mathbb {R}}^{E\times N}\) and \(I\) is the \(E\times E\) identity matrix. Finally, for the inequality constraints we have \(g_{e}(y)= |z_{e}| - \xi \), and the active set is given by \({\mathcal {A}}(y):={\mathcal {A}}^+(y)\cup {\mathcal {A}}^-(y)\), where
Without loss of generality we assume \(\xi =1\). To show that MFCQ holds, consider a solution \({\hat{y}}:=({\hat{x}},{\hat{z}})\). First observe that the Jacobian of the equality constraint is \(\nabla p({\hat{y}})= [A,-I]\), which has full row rank. In order to verify the second condition, we need to find a vector \(d_y:=[d_x;d_z]\in {\mathbb {R}}^{N+E}\) such that
where \([d_z]_e\) denotes the eth component of the vector \(d_z\). Let us denote the all-one vector and the all-zero vector by \(\mathbf{1}\) and \(\mathbf{0}\), respectively. To proceed, let us consider two different cases:
Case 1 For the vector \({\hat{z}}\in {\mathbb {R}}^E\) we have \({\hat{z}}\ne \mathbf{1}\) and \({\hat{z}}\ne -\mathbf{1}\). Let us take
First we show that \(d_z\in \text {col}(A)\). Note that for our problem, when the graph is connected, the null space of A (which is the incidence matrix of the graph) is spanned by the vector \({\mathbf {1}}\) [16]. Using this fact and \(\mathbf{1}^Td_z = {\hat{z}}^T\mathbf{1} - \mathbf{1}^T{\hat{z}}=0\), the equation \(Ad_x=d_z\) is solvable for \(d_x\). Second, for \(e\in {\mathcal {A}}^+({\hat{y}})\) we have \({\hat{z}}_e=1\). Therefore, \([d_z]_e=\left[ \frac{1}{E}({\hat{z}}^T\mathbf{1})\mathbf{1} - {\hat{z}}\right] _e<0\), because \(\frac{1}{E}{\hat{z}}^T\mathbf{1}<1\) from the fact that every entry of \({\hat{z}}\) is at most 1 and \({\hat{z}}\ne \mathbf{1}\). Condition (96b) is verified. Using a similar argument we can verify condition (96c).
Case 2 Suppose \({\hat{z}}=\mathbf{1}\) (resp. \({\hat{z}}=-\mathbf{1}\)). Since \({\hat{z}}\in \text {null}(A)\), let us set \(d_x=\mathbf{0}\) and \(d_z = -{\hat{z}}\) (resp. \(d_z = {\hat{z}}\)). First, we have \(Ad_x=d_z\). Second, for \(e\in {\mathcal {A}}^+({\hat{y}})\) we have \([d_z]_e<0\); similarly, \([d_z]_e>0\) for \(e\in {\mathcal {A}}^-({\hat{y}})\). All conditions (96a)–(96c) are verified. The above argument shows that MFCQ holds true for the sequence \(\{(x^r,z^r)\}\) generated by the PProx-PDA algorithm, since the algorithm guarantees that \(z^r\in Z\) at every iteration.
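The Case 1 construction above can also be checked numerically. The following is a minimal sketch; the path graph, the choice \(\xi =1\), and the particular \({\hat{z}}\) are illustrative assumptions and are not taken from the experiments in the paper.

```python
# Minimal numerical check of the Case 1 construction (toy instance).
import numpy as np

# Oriented incidence matrix A in R^{E x N} of the path graph 1-2-3-4:
# row e for edge (i, j) has +1 in column i and -1 in column j.
edges = [(0, 1), (1, 2), (2, 3)]
N, E = 4, len(edges)
A = np.zeros((E, N))
for e, (i, j) in enumerate(edges):
    A[e, i], A[e, j] = 1.0, -1.0

# Jacobian of the equality constraint p(y) = [A, -I] y has full row rank E.
J = np.hstack([A, -np.eye(E)])
assert np.linalg.matrix_rank(J) == E

# Case 1: z-hat in [-1, 1]^E with z-hat != 1 and z-hat != -1 (xi = 1),
# and one active constraint z-hat_e = +1.
z_hat = np.array([1.0, 0.3, -0.5])
ones = np.ones(E)
d_z = (z_hat @ ones) / E * ones - z_hat   # d_z = (1/E)(z-hat^T 1) 1 - z-hat

# 1^T d_z = 0, and (for this graph) A d_x = d_z is solvable for d_x.
assert abs(ones @ d_z) < 1e-12
d_x, *_ = np.linalg.lstsq(A, d_z, rcond=None)
assert np.linalg.norm(A @ d_x - d_z) < 1e-10

# Sign conditions on the active sets: [d_z]_e < 0 where z-hat_e = +1,
# and [d_z]_e > 0 where z-hat_e = -1.
assert all(d_z[e] < 0 for e in range(E) if z_hat[e] == 1.0)
assert all(d_z[e] > 0 for e in range(E) if z_hat[e] == -1.0)
print("Case 1 construction verified on this toy instance.")
```

On this toy instance the least-squares solve recovers \(d_x\) exactly because the graph is a tree, so A has full row rank and \(\text {col}(A)={\mathbb {R}}^E\).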