Research article
Open access

Optimizing Convolutional Neural Networks on the Sunway TaihuLight Supercomputer

Published: 22 March 2018

Abstract

    The Sunway TaihuLight supercomputer is powered by the SW26010, a new 260-core processor designed with on-chip fusion of heterogeneous cores. In this article, we present our work on optimizing the training process of convolutional neural networks (CNNs) on the Sunway TaihuLight supercomputer. Specifically, we propose a highly efficient library (swDNN) and a customized Caffe framework (swCaffe). Architecture-oriented optimization methods targeting the many-core architecture of the SW26010 are introduced, achieving a 48× speedup for the convolution routine in swDNN and a 4× speedup for the complete training process of the VGG-16 network using swCaffe, compared to the unoptimized algorithm and framework. Compared to the cuDNN library and the Caffe framework on the NVIDIA K40m GPU, the proposed swDNN library and swCaffe framework on the SW26010 reach nearly half the performance of the K40m in single precision and achieve 3.6× and 1.8× speedups over it in double precision, respectively.
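    For readers unfamiliar with the routine behind the 48× figure, the sketch below gives only the textbook definition of a forward convolution layer (stride 1, no padding) — the computation that libraries such as swDNN and cuDNN optimize. The function and parameter names are illustrative, not from the article, and the SW26010-specific optimizations the paper describes (many-core blocking and on-chip communication) are deliberately not shown.

    #include <stdio.h>
    #include <stdlib.h>

    /* Naive forward convolution over NCHW tensors (stride 1, no padding).
     * This is the mathematical definition only; optimized libraries replace
     * these loops with architecture-specific blocked kernels. */
    static void conv_forward_naive(
        const float *in,  /* input:  [N][Ci][H][W] */
        const float *flt, /* filter: [Co][Ci][K][K] */
        float *out,       /* output: [N][Co][Ho][Wo], Ho = H-K+1, Wo = W-K+1 */
        int N, int Ci, int Co, int H, int W, int K)
    {
        int Ho = H - K + 1, Wo = W - K + 1;
        for (int n = 0; n < N; ++n)
          for (int co = 0; co < Co; ++co)
            for (int ho = 0; ho < Ho; ++ho)
              for (int wo = 0; wo < Wo; ++wo) {
                float acc = 0.0f;
                for (int ci = 0; ci < Ci; ++ci)
                  for (int kh = 0; kh < K; ++kh)
                    for (int kw = 0; kw < K; ++kw)
                      acc += in[((n * Ci + ci) * H + ho + kh) * W + wo + kw]
                           * flt[((co * Ci + ci) * K + kh) * K + kw];
                out[((n * Co + co) * Ho + ho) * Wo + wo] = acc;
              }
    }

    int main(void)
    {
        /* Tiny smoke test with all-ones tensors: every output element
         * should equal Ci * K * K = 18. */
        int N = 1, Ci = 2, Co = 2, H = 5, W = 5, K = 3;
        int Ho = H - K + 1, Wo = W - K + 1;
        float *in  = malloc((size_t)N * Ci * H * W * sizeof *in);
        float *flt = malloc((size_t)Co * Ci * K * K * sizeof *flt);
        float *out = malloc((size_t)N * Co * Ho * Wo * sizeof *out);
        for (int i = 0; i < N * Ci * H * W; ++i) in[i] = 1.0f;
        for (int i = 0; i < Co * Ci * K * K; ++i) flt[i] = 1.0f;
        conv_forward_naive(in, flt, out, N, Ci, Co, H, W, K);
        printf("out[0] = %.1f\n", out[0]); /* expect 18.0 */
        free(in); free(flt); free(out);
        return 0;
    }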




Published In
    ACM Transactions on Architecture and Code Optimization, Volume 15, Issue 1 (March 2018), 401 pages
    ISSN: 1544-3566 | EISSN: 1544-3973
    DOI: 10.1145/3199680
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 22 March 2018
    Accepted: 01 January 2018
    Revised: 01 December 2017
    Received: 01 June 2017
    Published in TACO Volume 15, Issue 1


    Author Tags

    1. Convolutional neural network
    2. Sunway TaihuLight supercomputer
    3. deep learning
    4. heterogeneous many-core architecture

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • National Natural Science Foundation of China
    • National Key R&D Program of China
    • China Postdoctoral Science Foundation

Article Metrics

    • Downloads (last 12 months): 99
    • Downloads (last 6 weeks): 16
    Reflects downloads up to 11 Aug 2024


    Cited By

    • (2023) YaConv: Convolution with Low Cache Footprint. ACM Transactions on Architecture and Code Optimization 20:1, 1-18. DOI: 10.1145/3570305. Online publication date: 10-Feb-2023.
    • (2023) Automatic Deep Learning Operator Fusion on Sunway SW26010 Many-Core Processor. In 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), 1943-1950. DOI: 10.1109/ICPADS60453.2023.00266. Online publication date: 17-Dec-2023.
    • (2022) Scaling Poisson Solvers on Many Cores via MMEwald. IEEE Transactions on Parallel and Distributed Systems 33:8, 1888-1901. DOI: 10.1109/TPDS.2021.3127138. Online publication date: 1-Aug-2022.
    • (2022) Optimizing small channel 3D convolution on GPU with tensor core. Parallel Computing 113, 102954. DOI: 10.1016/j.parco.2022.102954. Online publication date: Oct-2022.
    • (2021) Extreme-scale ab initio quantum Raman spectra simulations on the leadership HPC system in China. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-13. DOI: 10.1145/3458817.3487402. Online publication date: 14-Nov-2021.
    • (2021) Enable simultaneous DNN services based on deterministic operator overlap and precise latency prediction. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-15. DOI: 10.1145/3458817.3476143. Online publication date: 14-Nov-2021.
    • (2021) Parallelization and Optimization of NSGA-II on Sunway TaihuLight System. IEEE Transactions on Parallel and Distributed Systems 32:4, 975-987. DOI: 10.1109/TPDS.2020.3037082. Online publication date: 1-Apr-2021.
    • (2021) Reliability Analysis of the Cactus-Based Networks. Theoretical Computer Science. DOI: 10.1016/j.tcs.2021.07.029. Online publication date: Aug-2021.
    • (2020) An Efficient Method for Training Deep Learning Networks Distributed. IEICE Transactions on Information and Systems E103.D:12, 2444-2456. DOI: 10.1587/transinf.2020PAP0007. Online publication date: 1-Dec-2020.
    • (2020) Bandwidth-Aware Loop Tiling for DMA-Supported Scratchpad Memory. In Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, 97-109. DOI: 10.1145/3410463.3414637. Online publication date: 30-Sep-2020.
