DOI: 10.1145/3673038.3673093
research-article
Open access

High-Performance 3D convolution on the Latest Generation Sunway Processor

Published: 12 August 2024

Abstract

The emergence of High-Performance Computing (HPC) and Artificial Intelligence (AI) has significantly expanded the applications of three-dimensional convolutional neural networks (3D CNNs). At the same time, the next-generation Sunway supercomputer has demonstrated strong computational capabilities in the HPC+AI domain. However, the complex 3D convolution operator remains a primary performance bottleneck in many applications. Tensor-like operators on the Sunway processor are usually optimized with a multi-level blocking approach adapted to its architecture. Although this approach effectively mitigates the differences in access latency across the levels of the memory hierarchy, the performance of 3D convolutions is still frequently limited by transfer bandwidth.
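As a rough illustration of the blocking idea mentioned above, the following Python sketch tiles a direct 3D convolution over output channels, output depth, and input channels. It is not the algorithm proposed in this paper: the function names, block sizes, and loop order are assumptions chosen for readability, and a real Sunway implementation would stage each tile into on-chip memory instead of indexing nested lists.

```python
from itertools import product


def _shapes(x, w):
    """Return (Ci, Co, output dims, kernel dims) for nested-list tensors.

    x is laid out as [Ci][D][H][W] and w as [Co][Ci][Kd][Kh][Kw].
    Stride 1 and no padding are assumed throughout.
    """
    ci_n, co_n = len(x), len(w)
    d, h, wd = len(x[0]), len(x[0][0]), len(x[0][0][0])
    kd, kh, kw = len(w[0][0]), len(w[0][0][0]), len(w[0][0][0][0])
    return ci_n, co_n, (d - kd + 1, h - kh + 1, wd - kw + 1), (kd, kh, kw)


def _zeros(co_n, do, ho, wo):
    """Allocate a zero-filled output tensor [Co][Do][Ho][Wo]."""
    return [[[[0.0] * wo for _ in range(ho)] for _ in range(do)]
            for _ in range(co_n)]


def conv3d_naive(x, w):
    """Reference direct 3D convolution (cross-correlation form)."""
    ci_n, co_n, (do, ho, wo), (kd, kh, kw) = _shapes(x, w)
    y = _zeros(co_n, do, ho, wo)
    for co, z, r, c in product(range(co_n), range(do), range(ho), range(wo)):
        acc = 0.0
        for ci, a, b, e in product(range(ci_n), range(kd), range(kh), range(kw)):
            acc += x[ci][z + a][r + b][c + e] * w[co][ci][a][b][e]
        y[co][z][r][c] = acc
    return y


def conv3d_blocked(x, w, co_blk=2, do_blk=2, ci_blk=2):
    """Same result as conv3d_naive, with the loops tiled into blocks.

    Each (co0, z0, ci0) triple selects one tile; on a real machine the
    tile would be sized to fit a faster memory level (hypothetical here).
    """
    ci_n, co_n, (do, ho, wo), (kd, kh, kw) = _shapes(x, w)
    y = _zeros(co_n, do, ho, wo)
    for co0 in range(0, co_n, co_blk):
        for z0 in range(0, do, do_blk):
            for ci0 in range(0, ci_n, ci_blk):
                cos = range(co0, min(co0 + co_blk, co_n))
                zs = range(z0, min(z0 + do_blk, do))
                cis = range(ci0, min(ci0 + ci_blk, ci_n))
                for co, z, r, c in product(cos, zs, range(ho), range(wo)):
                    # Partial sums accumulate across input-channel blocks.
                    acc = y[co][z][r][c]
                    for ci, a, b, e in product(cis, range(kd), range(kh), range(kw)):
                        acc += x[ci][z + a][r + b][c + e] * w[co][ci][a][b][e]
                    y[co][z][r][c] = acc
    return y
```

Tiling leaves the arithmetic unchanged and only reorders memory accesses, so `conv3d_blocked(x, w)` should match `conv3d_naive(x, w)` for any choice of block sizes.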
In this paper, we propose a high-performance 3D convolution algorithm for the latest Sunway processor. We first derive a general performance model of three-level blocking algorithms for tensor-like operators on the Sunway processor. We then analyze the 3D convolution operator in detail, identifying the situations in which performance bottlenecks arise. Most importantly, we propose a novel scatter-communication scheme and a memory access optimization to fully exploit on-chip network bandwidth. In addition, pipeline optimizations improve execution efficiency and overlap memory access latency. Experimental results show that our 3D convolution implementation achieves up to 2.12 Tflop/s in single precision, 92% of the theoretical peak performance.
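To make the reported figures concrete, the short Python sketch below works through the arithmetic. It counts the floating-point operations of a direct 3D convolution with the standard formula, backs out the peak implied by "2.12 Tflop/s at 92%", and applies a generic roofline-style bound. The roofline bound is a stand-in and not the three-level performance model derived in the paper; the function names are hypothetical, and the implied peak is an inference from the two numbers in the abstract, not a value stated there.

```python
def conv3d_flops(n, ci, co, do, ho, wo, kd, kh, kw):
    """Flops of a direct 3D convolution, counting a multiply-add as 2 flops.

    n: batch size; ci/co: input/output channels;
    do, ho, wo: output depth/height/width; kd, kh, kw: kernel extents.
    """
    return 2 * n * ci * co * do * ho * wo * kd * kh * kw


def implied_peak(achieved_tflops, efficiency):
    """Peak throughput implied by an achieved rate and its fraction of peak."""
    return achieved_tflops / efficiency


def roofline_bound(peak_tflops, bandwidth_tb_s, flops_per_byte):
    """Generic roofline bound (an assumption, not the paper's model):
    performance is capped by either compute or data transfer."""
    return min(peak_tflops, bandwidth_tb_s * flops_per_byte)


if __name__ == "__main__":
    # Figures taken from the abstract: 2.12 Tflop/s at 92% of peak.
    print(f"implied peak: {implied_peak(2.12, 0.92):.2f} Tflop/s")
```

Under this bound, an operator whose arithmetic intensity is below `peak / bandwidth` is limited by transfer bandwidth, which matches the kind of bottleneck the abstract describes.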

Published In

ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing
August 2024
1279 pages
ISBN:9798400717932
DOI:10.1145/3673038
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. 3D convolution
  2. Sunway Supercomputer
  3. heterogeneous architectures
  4. high-performance computing

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP '24

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%
