DOI: 10.1109/SC.2018.00069
Research article

Anatomy of high-performance deep learning convolutions on SIMD architectures

Published: 26 July 2019

Abstract

Convolution layers are prevalent in many classes of deep neural networks, including Convolutional Neural Networks (CNNs), which provide state-of-the-art results for tasks like image recognition, neural machine translation, and speech recognition. The computational expense of the convolution operation has led to a proliferation of implementations, including matrix-matrix multiplication formulations and direct convolutions, primarily targeting GPUs. In this paper, we introduce direct convolution kernels for x86 architectures, in particular for Xeon and Xeon Phi systems, implemented via a dynamic compilation approach. Our JIT-based implementation achieves close to theoretical peak performance, depending on the setting and the CPU architecture at hand. We additionally demonstrate how these JIT-optimized kernels can be integrated into a lightweight multi-node graph execution model, showing that single- and multi-node runs yield high efficiency and high image throughput when executing state-of-the-art image recognition tasks on CPUs.



Information

Published In

SC '18: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis
November 2018, 932 pages

In-Cooperation: IEEE CS
Publisher: IEEE Press


Conference

SC18

Acceptance Rates

Overall acceptance rate: 1,516 of 6,373 submissions, 24%


Cited By

  • (2024) Detailed Analysis and Optimization of Irregular-Shaped Matrix Multiplication on Multi-Core DSPs. Proceedings of the 53rd International Conference on Parallel Processing, pp. 1176-1186. DOI: 10.1145/3673038.3673101. Online: 12-Aug-2024
  • (2024) High-Performance 3D Convolution on the Latest Generation Sunway Processor. Proceedings of the 53rd International Conference on Parallel Processing, pp. 241-251. DOI: 10.1145/3673038.3673093. Online: 12-Aug-2024
  • (2023) LookupFFN. Proceedings of the 40th International Conference on Machine Learning, pp. 40707-40718. DOI: 10.5555/3618408.3620114. Online: 23-Jul-2023
  • (2023) Optimizing Direct Convolutions on ARM Multi-Cores. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-13. DOI: 10.1145/3581784.3607107. Online: 12-Nov-2023
  • (2023) Efficient Direct Convolution Using Long SIMD Instructions. Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pp. 342-353. DOI: 10.1145/3572848.3577435. Online: 25-Feb-2023
  • (2023) YaConv: Convolution with Low Cache Footprint. ACM Transactions on Architecture and Code Optimization, 20(1), pp. 1-18. DOI: 10.1145/3570305. Online: 10-Feb-2023
  • (2023) Reformulating the direct convolution for high-performance deep learning inference on ARM processors. Journal of Systems Architecture, vol. 135. DOI: 10.1016/j.sysarc.2022.102806. Online: 1-Feb-2023
  • (2022) Performance portability in a real world application. International Journal of High Performance Computing Applications, 36(3), pp. 419-439. DOI: 10.1177/10943420221077107. Online: 1-May-2022
  • (2022) The Mozart reuse exposed dataflow processor for AI and beyond. Proceedings of the 49th Annual International Symposium on Computer Architecture, pp. 978-992. DOI: 10.1145/3470496.3533040. Online: 18-Jun-2022
  • (2021) LIBSHALOM. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-14. DOI: 10.1145/3458817.3476217. Online: 14-Nov-2021
