DOI: 10.5555/3433701.3433776

Newton-ADMM: a distributed GPU-accelerated optimizer for multiclass classification problems

Published: 09 November 2020

Abstract

First-order optimization techniques, such as stochastic gradient descent (SGD) and its variants, are widely used in machine learning applications due to their simplicity and low per-iteration costs. However, they often require a large number of iterations to converge, which incurs substantial communication costs in distributed environments. In contrast, Newton-type methods, while having higher per-iteration computation costs, typically require significantly fewer iterations, which directly translates to reduced communication costs.
We present a novel distributed optimizer for classification problems that integrates a GPU-accelerated Newton-type solver with the global consensus formulation of the Alternating Direction Method of Multipliers (ADMM). By leveraging the communication efficiency of ADMM, a highly efficient GPU-accelerated inexact-Newton solver, and an effective spectral penalty parameter selection strategy, we show that our proposed method (i) yields better generalization performance on several classification problems; (ii) significantly outperforms state-of-the-art methods in distributed time to solution; and (iii) offers better scaling on large distributed platforms.
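For concreteness, a minimal sketch of the global consensus ADMM formulation referenced above, written in the standard Boyd et al. form with scaled dual variables (the paper's exact subproblem handling, regularizer placement, and spectral penalty update may differ): each of the N workers solves its local subproblem inexactly with a GPU-accelerated Newton-type method, and only the resulting iterates are exchanged once per outer iteration.

\begin{aligned}
x_i^{k+1} &= \operatorname*{arg\,min}_{x_i}\; f_i(x_i) + \frac{\rho_i^{k}}{2}\,\bigl\|x_i - z^{k} + u_i^{k}\bigr\|_2^2, \qquad i = 1,\dots,N, \\
z^{k+1}   &= \frac{1}{N}\sum_{i=1}^{N}\bigl(x_i^{k+1} + u_i^{k}\bigr), \\
u_i^{k+1} &= u_i^{k} + x_i^{k+1} - z^{k+1},
\end{aligned}

where f_i is the local (e.g., multinomial logistic) loss on worker i's data shard, z is the global consensus iterate, u_i is the scaled dual variable, and \rho_i^{k} is the per-worker penalty that a spectral selection strategy can adapt between iterations. Because synchronization happens once per outer ADMM iteration rather than once per minibatch, the number of communication rounds scales with the (small) Newton-type iteration count rather than with the SGD step count.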



Published In

SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2020
1454 pages
ISBN: 9781728199986


In-Cooperation

  • IEEE CS

Publisher

IEEE Press

Publication History

Published: 09 November 2020


Author Tags

  1. ADMM
  2. classification
  3. convex optimization
  4. machine learning
  5. Newton
  6. second-order method

Qualifiers

  • Research-article

Conference

SC '20

Acceptance Rates

Overall acceptance rate: 1,516 of 6,373 submissions (24%)
