DOI: 10.5555/3618408.3619024
research-article

Learning unnormalized statistical models via compositional optimization

Published: 23 July 2023

Abstract

Learning unnormalized statistical models (e.g., energy-based models) is computationally challenging due to the complexity of handling the partition function. To sidestep this complexity, noise-contrastive estimation (NCE) has been proposed, which formulates the objective as a logistic loss for discriminating real data from artificial noise. However, as found in previous works, NCE may perform poorly in many tasks due to its flat loss landscape and slow convergence. In this paper, we study a direct approach for optimizing the negative log-likelihood of unnormalized models from the perspective of compositional optimization. To tackle the partition function, a noise distribution is introduced such that the log partition function can be written as a compositional function whose inner function can be estimated with stochastic samples. Hence, the objective can be optimized by stochastic compositional optimization algorithms. Despite being a simple method, we demonstrate that it is more favorable than NCE by (1) establishing a fast convergence rate and quantifying its dependence on the noise distribution through the variance of stochastic estimators; (2) developing better results for one-dimensional Gaussian mean estimation by showing that our objective has a much more favorable loss landscape and hence our method enjoys faster convergence; and (3) demonstrating better performance on multiple applications, including density estimation, out-of-distribution detection, and real image generation.
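
To make the compositional structure concrete, below is a minimal sketch (ours, not the authors' released code) of this idea on the abstract's one-dimensional Gaussian mean estimation example: the log partition function is rewritten as the log of an expectation under a known noise distribution q, the inner expectation is tracked with a moving average, and the resulting stochastic gradient drives a plain SGD update. All function names and hyperparameters are illustrative assumptions, not the authors' settings.

```python
import numpy as np

# Sketch of the compositional formulation on 1-D Gaussian mean estimation:
# the model is p_theta(x) ∝ exp(-(x - theta)^2 / 2) with the normalizer
# treated as unknown, and
#   log Z(theta) = log E_{y ~ q}[ exp(-(y - theta)^2 / 2) / q(y) ]
# for a known noise distribution q (standard normal here). The inner
# expectation is tracked with a moving average u, in the spirit of
# stochastic compositional gradient methods.

rng = np.random.default_rng(0)
true_mean = 2.0
data = rng.normal(true_mean, 1.0, size=10_000)    # observed samples

def q_sample(n):                                  # noise distribution q = N(0, 1)
    return rng.normal(0.0, 1.0, size=n)

def q_pdf(y):
    return np.exp(-0.5 * y ** 2) / np.sqrt(2.0 * np.pi)

def unnorm(y, theta):                             # unnormalized density exp(f_theta(y))
    return np.exp(-0.5 * (y - theta) ** 2)

theta, u = 0.0, 1.0                               # parameter and inner-function estimate
lr, beta = 0.05, 0.9                              # step size and moving-average weight

for t in range(2000):
    x = rng.choice(data, size=64)                 # minibatch of real data
    y = q_sample(64)                              # minibatch of noise

    w = unnorm(y, theta) / q_pdf(y)               # importance weights; E[w] = Z(theta)
    u = beta * u + (1.0 - beta) * w.mean()        # moving average of the inner expectation

    # Negative log-likelihood: -mean_x f_theta(x) + log Z(theta), where
    # d/dtheta f_theta(x) = (x - theta) and d/dtheta log Z = (d/dtheta Z) / Z.
    grad_inner = np.mean(w * (y - theta))         # stochastic estimate of d/dtheta Z
    grad = -np.mean(x - theta) + grad_inner / u

    theta -= lr * grad

print(f"estimated mean: {theta:.3f} (true mean: {true_mean})")
```

Note that with the variance fixed, Z(theta) is constant for this toy model, so the log-partition gradient is zero in expectation; the sketch nonetheless exercises the machinery the abstract describes, namely an outer log composed with an inner expectation estimated from stochastic samples.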


Cited By

  • Projection-free variance reduction methods for stochastic constrained multi-level compositional optimization. In Proceedings of the 41st International Conference on Machine Learning, pp. 21962-21987, 2024. DOI: 10.5555/3692070.3692952

Published In

ICML'23: Proceedings of the 40th International Conference on Machine Learning
July 2023
43479 pages

Publisher

JMLR.org

Qualifiers

  • Research-article
  • Research
  • Refereed limited
