Research Article | Open Access
DOI: 10.1145/3394486.3403265

Rich Information is Affordable: A Systematic Performance Analysis of Second-order Optimization Using K-FAC

Published: 20 August 2020

Abstract

Rich information matrices derived from first- and second-order derivatives have many potential applications in both theoretical and practical problems in deep learning. However, computing these information matrices is extremely expensive, and this cost currently limits their application to important problems regarding generalization, hyperparameter tuning, and optimization of deep neural networks. One of the most challenging use cases of information matrices is their use as a preconditioner for optimizers, since the matrices must be updated at every step. In this work, we conduct a step-by-step performance analysis of computing the Fisher information matrix during training of ResNet-50 on ImageNet, and show that the overhead can be reduced to the cost of a single SGD step. We also show that the resulting Fisher-preconditioned optimizer can converge in one third the number of epochs of SGD while achieving the same Top-1 validation accuracy. This is the first work to achieve such accuracy with K-FAC while reducing the training time to match that of SGD.
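
To make the preconditioning step concrete, the sketch below shows a K-FAC-style update for a single fully connected layer: the layer's Fisher block is approximated by a Kronecker product of two small covariance factors, and only those small factors are inverted. The shapes, damping value, and variable names here are illustrative assumptions, not the paper's implementation.

import numpy as np

# Minimal K-FAC-style preconditioning sketch for one fully connected layer.
# K-FAC approximates the layer's Fisher block as a Kronecker product
#   F  ~  A (x) G,  with  A = E[a a^T] (inputs),  G = E[g g^T] (output grads),
# so instead of inverting the large F, only the two small factors are
# inverted and the preconditioned gradient is  G^-1 (dL/dW) A^-1.

rng = np.random.default_rng(0)
batch, d_in, d_out = 32, 64, 16

a = rng.standard_normal((batch, d_in))    # layer inputs for a mini-batch
g = rng.standard_normal((batch, d_out))   # back-propagated output gradients
grad_W = g.T @ a / batch                  # ordinary gradient, shape (d_out, d_in)

A = a.T @ a / batch                       # Kronecker factor, (d_in, d_in)
G = g.T @ g / batch                       # Kronecker factor, (d_out, d_out)

damping = 1e-3                            # Tikhonov damping for stable inverses
A_inv = np.linalg.inv(A + damping * np.eye(d_in))
G_inv = np.linalg.inv(G + damping * np.eye(d_out))

precond_grad = G_inv @ grad_W @ A_inv     # approximate natural-gradient direction

lr = 0.1
W = rng.standard_normal((d_out, d_in))
W -= lr * precond_grad                    # one preconditioned optimizer step

Because only the small d_in x d_in and d_out x d_out factors are inverted, rather than the full (d_in * d_out) x (d_in * d_out) Fisher block, the per-step overhead analyzed in the paper comes mainly from constructing, inverting, and (in the distributed setting) communicating these factors for every layer.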



    Published In

    KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
    August 2020
    3664 pages
    ISBN: 9781450379984
    DOI: 10.1145/3394486


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. distributed training
    2. information matrix
    3. performance optimization

    Qualifiers

    • Research-article


    Conference

    KDD '20

    Acceptance Rates

    Overall acceptance rate: 1,133 of 8,635 submissions (13%)


    Cited By

    • MKOR. Proceedings of the 37th International Conference on Neural Information Processing Systems (2023), 17832-17853. DOI: 10.5555/3666122.3666905. Online: 10-Dec-2023
    • Accelerating Distributed K-FAC with Efficient Collective Communication and Scheduling. IEEE INFOCOM 2023 - IEEE Conference on Computer Communications (2023), 1-10. DOI: 10.1109/INFOCOM53939.2023.10228871. Online: 17-May-2023
    • Deep Neural Network Training With Distributed K-FAC. IEEE Transactions on Parallel and Distributed Systems 33(12) (2022), 3616-3627. DOI: 10.1109/TPDS.2022.3161187. Online: 22-Mar-2022
    • Scalable K-FAC Training for Deep Neural Networks With Distributed Preconditioning. IEEE Transactions on Cloud Computing (2022), 1-14. DOI: 10.1109/TCC.2022.3205918. Online: 2022
    • HyLo: A Hybrid Low-Rank Natural Gradient Descent Method. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis (2022), 1-16. DOI: 10.1109/SC41404.2022.00052. Online: Nov-2022
    • Fast Training Methods and Their Experiments for Deep Learning CNN Models. 2021 40th Chinese Control Conference (CCC) (2021), 8253-8260. DOI: 10.23919/CCC52363.2021.9549817. Online: 26-Jul-2021
    • KAISA. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (2021), 1-14. DOI: 10.1145/3458817.3476152. Online: 14-Nov-2021
    • Accelerating Distributed K-FAC with Smart Parallelism of Computing and Communication Tasks. 2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS) (2021), 550-560. DOI: 10.1109/ICDCS51616.2021.00059. Online: Jul-2021
