High-Performance 3D convolution on the Latest Generation Sunway Processor

Authors:

Jian Zhang

ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing

Pages 241 - 251

https://doi.org/10.1145/3673038.3673093

Published: 12 August 2024

Abstract

The rise of High-Performance Computing (HPC) and Artificial Intelligence (AI) has significantly expanded the applications of three-dimensional convolutional neural networks (3D CNNs). At the same time, the next-generation Sunway supercomputer has demonstrated superior computational capabilities in the HPC+AI domain. However, complex 3D convolution remains a primary performance bottleneck in many applications. Tensor-like operators on the Sunway processor are usually optimized with a multi-level blocking approach adapted to its architecture. Although this approach effectively mitigates the differences in access latency across levels of the memory hierarchy, the performance of 3D convolutions is still frequently limited by transfer bandwidth.
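To make the term "blocking" concrete, the following is a minimal single-level sketch in plain Python. It is not the paper's algorithm and not Sunway code: the function names, the block-size parameters `b_co`, `b_ci`, and `b_z`, and the choice of which loops to tile are all assumptions made for illustration. It only shows that tiling the loops of a direct 3D convolution (valid padding, stride 1) changes the order of work, not the result.

```python
from itertools import product


def _shapes(x, w):
    """Return (c_in, c_out, kernel dims, output dims) for nested-list tensors.

    x is laid out as [C_in][D][H][W], w as [C_out][C_in][KD][KH][KW].
    """
    c_in, d, h, wd = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    c_out = len(w)
    kd, kh, kw = len(w[0][0]), len(w[0][0][0]), len(w[0][0][0][0])
    return c_in, c_out, (kd, kh, kw), (d - kd + 1, h - kh + 1, wd - kw + 1)


def _zeros(c_out, od, oh, ow):
    return [[[[0.0] * ow for _ in range(oh)] for _ in range(od)]
            for _ in range(c_out)]


def conv3d_naive(x, w):
    """Reference direct 3D convolution (valid padding, stride 1)."""
    c_in, c_out, (kd, kh, kw), (od, oh, ow) = _shapes(x, w)
    y = _zeros(c_out, od, oh, ow)
    for co, z, r, c in product(range(c_out), range(od), range(oh), range(ow)):
        acc = 0.0
        for ci, a, b, e in product(range(c_in), range(kd), range(kh), range(kw)):
            acc += x[ci][z + a][r + b][c + e] * w[co][ci][a][b][e]
        y[co][z][r][c] = acc
    return y


def conv3d_blocked(x, w, b_co=2, b_ci=2, b_z=2):
    """Same computation, tiled over output channels, output depth, and
    input channels. Block sizes are hypothetical tuning parameters."""
    c_in, c_out, (kd, kh, kw), (od, oh, ow) = _shapes(x, w)
    y = _zeros(c_out, od, oh, ow)
    for co0 in range(0, c_out, b_co):
        for z0 in range(0, od, b_z):
            for ci0 in range(0, c_in, b_ci):
                # Everything inside this point touches one tile of x, w, and y.
                # On hardware with software-managed local memory, such a tile
                # is what would be staged close to the compute units.
                for co in range(co0, min(co0 + b_co, c_out)):
                    for z in range(z0, min(z0 + b_z, od)):
                        for r, c in product(range(oh), range(ow)):
                            acc = y[co][z][r][c]  # running partial sum
                            for ci in range(ci0, min(ci0 + b_ci, c_in)):
                                for a, b, e in product(range(kd), range(kh), range(kw)):
                                    acc += x[ci][z + a][r + b][c + e] * w[co][ci][a][b][e]
                            y[co][z][r][c] = acc
    return y
```

Calling `conv3d_blocked(x, w)` and `conv3d_naive(x, w)` on the same inputs yields the same output tensor. A multi-level scheme would nest further tilings of this kind, one per level of the memory hierarchy, with block sizes chosen to fit each level.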

In this paper, we propose a high-performance 3D convolution algorithm for the latest Sunway processor. We first derive a general performance model of three-level blocking algorithms for tensor-like operators on the Sunway processor. We then analyze the 3D convolution operator in detail, identifying the cases in which performance bottlenecks arise. Most importantly, we propose a novel scatter-communication scheme and memory access optimization to fully exploit the on-chip network bandwidth. In addition, we apply pipeline optimizations that improve execution efficiency and overlap memory access latency. Experimental results show that our 3D convolution implementation achieves up to 2.12 Tflop/s in single precision, 92% of the theoretical peak performance.
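As a rough consistency check on the two figures quoted in the abstract, the theoretical peak they imply can be back-computed. This is a derivation from rounded numbers, not a value stated in the source, and the abstract does not say which hardware unit (for example, a core group or a full processor) the peak refers to.

```latex
P_{\text{peak}} \approx \frac{2.12\ \text{Tflop/s}}{0.92} \approx 2.30\ \text{Tflop/s}
```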

Index Terms

High-Performance 3D convolution on the Latest Generation Sunway Processor
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multicore architectures
2. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel algorithms

Published In

ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing

August 2024

1279 pages

ISBN: 9798400717932

DOI: 10.1145/3673038

Copyright © 2024 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

ICPP '24: the 53rd International Conference on Parallel Processing

August 12 - 15, 2024

Gotland, Sweden

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%
