DOI: 10.5555/3433701.3433776

Newton-ADMM: a distributed GPU-accelerated optimizer for multiclass classification problems

Published: 09 November 2020

Abstract

First-order optimization techniques, such as stochastic gradient descent (SGD) and its variants, are widely used in machine learning applications due to their simplicity and low per-iteration costs. However, they often require a large number of iterations to converge, which incurs substantial communication costs in distributed environments. In contrast, Newton-type methods, while having higher per-iteration computation costs, typically require significantly fewer iterations, which directly translates to reduced communication costs.
We present a novel distributed optimizer for classification problems that integrates a GPU-accelerated Newton-type solver with the global consensus formulation of the Alternating Direction Method of Multipliers (ADMM). By leveraging the communication efficiency of ADMM, a highly efficient GPU-accelerated inexact-Newton solver, and an effective spectral penalty parameter selection strategy, we show that our proposed method (i) yields better generalization performance on several classification problems; (ii) significantly outperforms state-of-the-art methods in distributed time to solution; and (iii) offers better scaling on large distributed platforms.
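For concreteness, a minimal sketch of the global consensus ADMM formulation referenced above, written in the standard Boyd et al. form with scaled dual variables (the paper's exact subproblem handling, regularizer placement, and spectral penalty update may differ): each of the N workers solves its local subproblem inexactly with a GPU-accelerated Newton-type method, and only the resulting iterates are exchanged once per outer iteration.

\begin{aligned}
x_i^{k+1} &= \operatorname*{arg\,min}_{x_i}\; f_i(x_i) + \frac{\rho_i^{k}}{2}\,\bigl\|x_i - z^{k} + u_i^{k}\bigr\|_2^2, \qquad i = 1,\dots,N, \\
z^{k+1}   &= \frac{1}{N}\sum_{i=1}^{N}\bigl(x_i^{k+1} + u_i^{k}\bigr), \\
u_i^{k+1} &= u_i^{k} + x_i^{k+1} - z^{k+1},
\end{aligned}

where f_i is the local (e.g., multinomial logistic) loss on worker i's data shard, z is the global consensus iterate, u_i is the scaled dual variable, and \rho_i^{k} is the per-worker penalty that a spectral selection strategy can adapt between iterations. Because synchronization happens once per outer ADMM iteration rather than once per minibatch, the number of communication rounds scales with the (small) Newton-type iteration count rather than with the SGD step count.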



Published In

SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2020
1454 pages
ISBN: 9781728199986


In-Cooperation

  • IEEE CS

Publisher

IEEE Press

Publication History

Published: 09 November 2020


Author Tags

  1. ADMM
  2. classification
  3. convex optimization
  4. machine learning
  5. Newton
  6. second-order method

Qualifiers

  • Research-article

Conference

SC '20

Acceptance Rates

Overall acceptance rate: 1,516 of 6,373 submissions (24%)
