Abstract
This paper presents a methodology for using varying sample sizes in batch-type optimization methods for large-scale machine learning problems. The first part of the paper deals with the delicate issue of dynamic sample selection in the evaluation of the function and gradient. We propose a criterion for increasing the sample size based on variance estimates obtained during the computation of a batch gradient. We establish an \(O(1/\epsilon)\) complexity bound on the total cost of a gradient method. The second part of the paper describes a practical Newton method that uses a smaller sample to compute Hessian-vector products than to evaluate the function and the gradient, and that also employs a dynamic sampling technique. In the third part, the focus shifts to \(\ell_1\)-regularized problems designed to produce sparse solutions. We propose a Newton-like method that consists of two phases: a (minimalistic) gradient projection phase that identifies zero variables, and a subspace phase that applies a subsampled Hessian Newton iteration in the free variables. Numerical tests on speech recognition problems illustrate the performance of the algorithms.
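To make the sample-size criterion described in the first part more concrete, the following is a minimal sketch of how a variance-based test of this kind can be organized. It is not the paper's exact formulation: the per-example gradient interface, the tolerance parameter `theta`, and the rule for enlarging the sample are illustrative assumptions.

```python
import numpy as np

def sample_size_test(per_example_grads, theta=0.9):
    """Variance-based check in the spirit of dynamic sample selection.

    Compares an estimate of the variance of the sampled (batch) gradient
    against the squared norm of that gradient on the current sample S.

    per_example_grads : array of shape (|S|, d), one gradient per example in S
    theta             : tolerance parameter (illustrative value)

    Returns (passes, suggested_size); if the test fails, suggested_size is a
    larger sample size obtained by scaling |S| so the inequality would hold.
    """
    S = per_example_grads.shape[0]
    g_S = per_example_grads.mean(axis=0)            # batch gradient on the sample S
    var = per_example_grads.var(axis=0, ddof=1)     # componentwise sample variance
    lhs = var.sum() / S                             # variance of the sample mean (1-norm)
    rhs = theta**2 * np.dot(g_S, g_S)               # theta^2 * ||grad_S||^2
    if lhs <= rhs:
        return True, S
    # Enlarge |S| proportionally so that the test would be satisfied.
    return False, int(np.ceil(S * lhs / rhs))
```

When the test fails, the gradient on the current sample is deemed too noisy to serve as a reliable descent direction, and the sample is enlarged before the next iteration.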
Additional information
This work was supported by National Science Foundation grant CMMI 0728190 and grant DMS-0810213, by Department of Energy grant DE-FG02-87ER25047-A004 and grant DE-SC0001774, and by an NSERC fellowship and a Google grant.
Cite this article
Byrd, R.H., Chin, G.M., Nocedal, J. et al. Sample size selection in optimization methods for machine learning. Math. Program. 134, 127–155 (2012). https://doi.org/10.1007/s10107-012-0572-5