DOI: 10.5555/3618408.3619156

Complexity of block coordinate descent with proximal regularization and applications to Wasserstein CP-dictionary learning

Published: 23 July 2023

Abstract

We consider block coordinate descent methods of Gauss-Seidel type with proximal regularization (BCD-PR), a classical method for minimizing general nonconvex objectives under constraints that has a wide range of practical applications. We establish a worst-case complexity bound for this algorithm: for a general nonconvex smooth objective with block-wise constraints, the classical BCD-PR algorithm converges to an ε-stationary point within Õ(ε⁻¹) iterations. Under a mild condition, this result still holds even if each step of the algorithm is executed inexactly. As an application, we propose a provable and efficient algorithm for 'Wasserstein CP-dictionary learning', which seeks a set of elementary probability distributions that can well approximate a given set of d-dimensional joint probability distributions. Our algorithm is a version of BCD-PR that operates in the dual space, where the primal problem is regularized both entropically and proximally.
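The BCD-PR scheme described above can be sketched in a few lines: blocks are updated cyclically (Gauss-Seidel order), and each block update minimizes the objective in that block plus a proximal penalty anchoring it to the previous iterate. The following is a minimal NumPy illustration, not the paper's implementation; it uses nonnegative matrix factorization as a stand-in constrained nonconvex objective, and the function name, parameters, and inner projected-gradient solver are all assumptions chosen for the sketch (the inexact inner solve mirrors the inexact-execution setting the abstract mentions).

```python
import numpy as np

def bcd_pr(X, r=2, lam=1.0, n_outer=50, n_inner=25, seed=0):
    """Two-block BCD with proximal regularization (BCD-PR sketch) for
    nonnegative matrix factorization:  min_{W,H >= 0} ||X - W H||_F^2.

    Each block update approximately solves
        argmin_{B >= 0}  f(..., B, ...) + (lam/2) * ||B - B_prev||_F^2
    via a few projected gradient steps (an inexact inner solve).
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, r))
    H = rng.random((r, n))
    for _ in range(n_outer):
        # Block 1: update W with H fixed (Gauss-Seidel order).
        W_prev = W.copy()
        # Lipschitz constant of the proximally regularized block gradient.
        L = 2 * np.linalg.norm(H @ H.T, 2) + lam
        for _ in range(n_inner):
            grad = 2 * (W @ H - X) @ H.T + lam * (W - W_prev)
            W = np.maximum(W - grad / L, 0.0)  # project onto W >= 0
        # Block 2: update H using the freshly updated W.
        H_prev = H.copy()
        L = 2 * np.linalg.norm(W.T @ W, 2) + lam
        for _ in range(n_inner):
            grad = 2 * W.T @ (W @ H - X) + lam * (H - H_prev)
            H = np.maximum(H - grad / L, 0.0)  # project onto H >= 0
    return W, H
```

The proximal term `lam * (B - B_prev)` is what distinguishes BCD-PR from plain block coordinate descent: it keeps each block subproblem well-conditioned even when the joint objective is nonconvex, which is the mechanism behind the Õ(ε⁻¹) stationarity guarantee.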


Published In

ICML'23: Proceedings of the 40th International Conference on Machine Learning, July 2023, 43479 pages

Publisher

JMLR.org

Qualifiers

  • Research-article
  • Research
  • Refereed limited
