Research article
Open access

Optimizing Convolutional Neural Networks on the Sunway TaihuLight Supercomputer

Published: 22 March 2018

Abstract

    The Sunway TaihuLight supercomputer is powered by the SW26010, a new 260-core processor designed with on-chip fusion of heterogeneous cores. In this article, we present our work on optimizing the training process of convolutional neural networks (CNNs) on the Sunway TaihuLight supercomputer. Specifically, we propose a highly efficient library (swDNN) and a customized Caffe framework (swCaffe). Architecture-oriented optimization methods targeting the many-core architecture of the SW26010 are introduced, achieving a 48× speedup for the convolution routine in swDNN and a 4× speedup for the complete training process of the VGG-16 network using swCaffe, compared to the unoptimized algorithm and framework. Compared to the cuDNN library and the Caffe framework on the NVIDIA K40m GPU, the proposed swDNN library and swCaffe framework on the SW26010 reach nearly half the performance of the K40m in single precision and achieve 3.6× and 1.8× speedups over it in double precision, respectively.
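    For readers unfamiliar with the routine behind the 48× figure, the sketch below gives only the textbook definition of a forward convolution layer (stride 1, no padding) — the computation that libraries such as swDNN and cuDNN optimize. The function and parameter names are illustrative, not from the article, and the SW26010-specific optimizations the paper describes (many-core blocking and on-chip communication) are deliberately not shown.

    #include <stdio.h>
    #include <stdlib.h>

    /* Naive forward convolution over NCHW tensors (stride 1, no padding).
     * This is the mathematical definition only; optimized libraries replace
     * these loops with architecture-specific blocked kernels. */
    static void conv_forward_naive(
        const float *in,  /* input:  [N][Ci][H][W] */
        const float *flt, /* filter: [Co][Ci][K][K] */
        float *out,       /* output: [N][Co][Ho][Wo], Ho = H-K+1, Wo = W-K+1 */
        int N, int Ci, int Co, int H, int W, int K)
    {
        int Ho = H - K + 1, Wo = W - K + 1;
        for (int n = 0; n < N; ++n)
          for (int co = 0; co < Co; ++co)
            for (int ho = 0; ho < Ho; ++ho)
              for (int wo = 0; wo < Wo; ++wo) {
                float acc = 0.0f;
                for (int ci = 0; ci < Ci; ++ci)
                  for (int kh = 0; kh < K; ++kh)
                    for (int kw = 0; kw < K; ++kw)
                      acc += in[((n * Ci + ci) * H + ho + kh) * W + wo + kw]
                           * flt[((co * Ci + ci) * K + kh) * K + kw];
                out[((n * Co + co) * Ho + ho) * Wo + wo] = acc;
              }
    }

    int main(void)
    {
        /* Tiny smoke test with all-ones tensors: every output element
         * should equal Ci * K * K = 18. */
        int N = 1, Ci = 2, Co = 2, H = 5, W = 5, K = 3;
        int Ho = H - K + 1, Wo = W - K + 1;
        float *in  = malloc((size_t)N * Ci * H * W * sizeof *in);
        float *flt = malloc((size_t)Co * Ci * K * K * sizeof *flt);
        float *out = malloc((size_t)N * Co * Ho * Wo * sizeof *out);
        for (int i = 0; i < N * Ci * H * W; ++i) in[i] = 1.0f;
        for (int i = 0; i < Co * Ci * K * K; ++i) flt[i] = 1.0f;
        conv_forward_naive(in, flt, out, N, Ci, Co, H, W, K);
        printf("out[0] = %.1f\n", out[0]); /* expect 18.0 */
        free(in); free(flt); free(out);
        return 0;
    }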




Published In
    ACM Transactions on Architecture and Code Optimization, Volume 15, Issue 1 (March 2018), 401 pages
    ISSN: 1544-3566 | EISSN: 1544-3973
    DOI: 10.1145/3199680
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 22 March 2018
    Accepted: 01 January 2018
    Revised: 01 December 2017
    Received: 01 June 2017
    Published in TACO Volume 15, Issue 1


    Author Tags

    1. Convolutional neural network
    2. Sunway TaihuLight supercomputer
    3. deep learning
    4. heterogeneous many-core architecture

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • National Natural Science Foundation of China
    • National Key R&D Program of China
    • China Postdoctoral Science Foundation

Article Metrics

    • Downloads (last 12 months): 99
    • Downloads (last 6 weeks): 16
    Reflects downloads up to 11 Aug 2024


    Cited By

    • (2023) YaConv: Convolution with Low Cache Footprint. ACM Transactions on Architecture and Code Optimization 20:1, 1-18. DOI: 10.1145/3570305. Online publication date: 10-Feb-2023.
    • (2023) Automatic Deep Learning Operator Fusion on Sunway SW26010 Many-Core Processor. In 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), 1943-1950. DOI: 10.1109/ICPADS60453.2023.00266. Online publication date: 17-Dec-2023.
    • (2022) Scaling Poisson Solvers on Many Cores via MMEwald. IEEE Transactions on Parallel and Distributed Systems 33:8, 1888-1901. DOI: 10.1109/TPDS.2021.3127138. Online publication date: 1-Aug-2022.
    • (2022) Optimizing small channel 3D convolution on GPU with tensor core. Parallel Computing 113, 102954. DOI: 10.1016/j.parco.2022.102954. Online publication date: Oct-2022.
    • (2021) Extreme-scale ab initio quantum Raman spectra simulations on the leadership HPC system in China. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-13. DOI: 10.1145/3458817.3487402. Online publication date: 14-Nov-2021.
    • (2021) Enable simultaneous DNN services based on deterministic operator overlap and precise latency prediction. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-15. DOI: 10.1145/3458817.3476143. Online publication date: 14-Nov-2021.
    • (2021) Parallelization and Optimization of NSGA-II on Sunway TaihuLight System. IEEE Transactions on Parallel and Distributed Systems 32:4, 975-987. DOI: 10.1109/TPDS.2020.3037082. Online publication date: 1-Apr-2021.
    • (2021) Reliability Analysis of the Cactus-Based Networks. Theoretical Computer Science. DOI: 10.1016/j.tcs.2021.07.029. Online publication date: Aug-2021.
    • (2020) An Efficient Method for Training Deep Learning Networks Distributed. IEICE Transactions on Information and Systems E103.D:12, 2444-2456. DOI: 10.1587/transinf.2020PAP0007. Online publication date: 1-Dec-2020.
    • (2020) Bandwidth-Aware Loop Tiling for DMA-Supported Scratchpad Memory. In Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, 97-109. DOI: 10.1145/3410463.3414637. Online publication date: 30-Sep-2020.
