Research Article | Open Access
DOI: 10.1145/3394486.3403265

Rich Information is Affordable: A Systematic Performance Analysis of Second-order Optimization Using K-FAC

Published: 20 August 2020

Abstract

Rich information matrices derived from first- and second-order derivatives have many potential applications in both theoretical and practical problems in deep learning. However, computing these information matrices is extremely expensive, and this cost currently limits their application to important problems regarding generalization, hyperparameter tuning, and optimization of deep neural networks. One of the most challenging use cases of information matrices is their use as a preconditioner for optimizers, since the matrices must be updated at every step. In this work, we conduct a step-by-step performance analysis of computing the Fisher information matrix during training of ResNet-50 on ImageNet, and show that the overhead can be reduced to the cost of a single SGD step. We also show that the resulting Fisher-preconditioned optimizer can converge in one third the number of epochs of SGD while achieving the same Top-1 validation accuracy. This is the first work to achieve such accuracy with K-FAC while reducing the training time to match that of SGD.
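
To make the preconditioning step concrete, the sketch below shows a K-FAC-style update for a single fully connected layer: the layer's Fisher block is approximated by a Kronecker product of two small covariance factors, and only those small factors are inverted. The shapes, damping value, and variable names here are illustrative assumptions, not the paper's implementation.

import numpy as np

# Minimal K-FAC-style preconditioning sketch for one fully connected layer.
# K-FAC approximates the layer's Fisher block as a Kronecker product
#   F  ~  A (x) G,  with  A = E[a a^T] (inputs),  G = E[g g^T] (output grads),
# so instead of inverting the large F, only the two small factors are
# inverted and the preconditioned gradient is  G^-1 (dL/dW) A^-1.

rng = np.random.default_rng(0)
batch, d_in, d_out = 32, 64, 16

a = rng.standard_normal((batch, d_in))    # layer inputs for a mini-batch
g = rng.standard_normal((batch, d_out))   # back-propagated output gradients
grad_W = g.T @ a / batch                  # ordinary gradient, shape (d_out, d_in)

A = a.T @ a / batch                       # Kronecker factor, (d_in, d_in)
G = g.T @ g / batch                       # Kronecker factor, (d_out, d_out)

damping = 1e-3                            # Tikhonov damping for stable inverses
A_inv = np.linalg.inv(A + damping * np.eye(d_in))
G_inv = np.linalg.inv(G + damping * np.eye(d_out))

precond_grad = G_inv @ grad_W @ A_inv     # approximate natural-gradient direction

lr = 0.1
W = rng.standard_normal((d_out, d_in))
W -= lr * precond_grad                    # one preconditioned optimizer step

Because only the small d_in x d_in and d_out x d_out factors are inverted, rather than the full (d_in * d_out) x (d_in * d_out) Fisher block, the per-step overhead analyzed in the paper comes mainly from constructing, inverting, and (in the distributed setting) communicating these factors for every layer.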



    Published In

    KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
    August 2020
    3664 pages
    ISBN: 9781450379984
    DOI: 10.1145/3394486


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. distributed training
    2. information matrix
    3. performance optimization

    Qualifiers

    • Research-article


    Conference

    KDD '20

    Acceptance Rates

    Overall acceptance rate: 1,133 of 8,635 submissions (13%)


    Cited By

    • MKOR. Proceedings of the 37th International Conference on Neural Information Processing Systems (2023), 17832-17853. DOI: 10.5555/3666122.3666905. Online: 10-Dec-2023
    • Accelerating Distributed K-FAC with Efficient Collective Communication and Scheduling. IEEE INFOCOM 2023 - IEEE Conference on Computer Communications (2023), 1-10. DOI: 10.1109/INFOCOM53939.2023.10228871. Online: 17-May-2023
    • Deep Neural Network Training With Distributed K-FAC. IEEE Transactions on Parallel and Distributed Systems 33(12) (2022), 3616-3627. DOI: 10.1109/TPDS.2022.3161187. Online: 22-Mar-2022
    • Scalable K-FAC Training for Deep Neural Networks With Distributed Preconditioning. IEEE Transactions on Cloud Computing (2022), 1-14. DOI: 10.1109/TCC.2022.3205918. Online: 2022
    • HyLo: A Hybrid Low-Rank Natural Gradient Descent Method. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis (2022), 1-16. DOI: 10.1109/SC41404.2022.00052. Online: Nov-2022
    • Fast Training Methods and Their Experiments for Deep Learning CNN Models. 2021 40th Chinese Control Conference (CCC) (2021), 8253-8260. DOI: 10.23919/CCC52363.2021.9549817. Online: 26-Jul-2021
    • KAISA. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (2021), 1-14. DOI: 10.1145/3458817.3476152. Online: 14-Nov-2021
    • Accelerating Distributed K-FAC with Smart Parallelism of Computing and Communication Tasks. 2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS) (2021), 550-560. DOI: 10.1109/ICDCS51616.2021.00059. Online: Jul-2021
