DOI: 10.1109/SC.2018.00069
Research article

Anatomy of high-performance deep learning convolutions on SIMD architectures

Published: 26 July 2019

Abstract

Convolution layers are prevalent in many classes of deep neural networks, including Convolutional Neural Networks (CNNs), which provide state-of-the-art results for tasks like image recognition, neural machine translation, and speech recognition. The computational expense of the convolution operation has led to a proliferation of implementations, including matrix-matrix multiplication formulations and direct convolutions, primarily targeting GPUs. In this paper, we introduce direct convolution kernels for x86 architectures, in particular for Xeon and Xeon Phi systems, implemented via a dynamic compilation approach. Our JIT-based implementation achieves close to theoretical peak performance, depending on the setting and the CPU architecture at hand. We additionally demonstrate how these JIT-optimized kernels can be integrated into a lightweight multi-node graph execution model, showing that single- and multi-node runs yield high efficiency and high image throughput when executing state-of-the-art image recognition tasks on CPUs.



Information

Published In

SC '18: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis
November 2018, 932 pages

In-Cooperation: IEEE CS
Publisher: IEEE Press


Conference

SC18

Acceptance Rates

Overall acceptance rate: 1,516 of 6,373 submissions, 24%


Cited By

  • (2024) Detailed Analysis and Optimization of Irregular-Shaped Matrix Multiplication on Multi-Core DSPs. Proceedings of the 53rd International Conference on Parallel Processing, pp. 1176-1186. DOI: 10.1145/3673038.3673101. Online: 12-Aug-2024
  • (2024) High-Performance 3D Convolution on the Latest Generation Sunway Processor. Proceedings of the 53rd International Conference on Parallel Processing, pp. 241-251. DOI: 10.1145/3673038.3673093. Online: 12-Aug-2024
  • (2023) LookupFFN. Proceedings of the 40th International Conference on Machine Learning, pp. 40707-40718. DOI: 10.5555/3618408.3620114. Online: 23-Jul-2023
  • (2023) Optimizing Direct Convolutions on ARM Multi-Cores. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-13. DOI: 10.1145/3581784.3607107. Online: 12-Nov-2023
  • (2023) Efficient Direct Convolution Using Long SIMD Instructions. Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pp. 342-353. DOI: 10.1145/3572848.3577435. Online: 25-Feb-2023
  • (2023) YaConv: Convolution with Low Cache Footprint. ACM Transactions on Architecture and Code Optimization, 20(1), pp. 1-18. DOI: 10.1145/3570305. Online: 10-Feb-2023
  • (2023) Reformulating the direct convolution for high-performance deep learning inference on ARM processors. Journal of Systems Architecture, vol. 135. DOI: 10.1016/j.sysarc.2022.102806. Online: 1-Feb-2023
  • (2022) Performance portability in a real world application. International Journal of High Performance Computing Applications, 36(3), pp. 419-439. DOI: 10.1177/10943420221077107. Online: 1-May-2022
  • (2022) The Mozart reuse exposed dataflow processor for AI and beyond. Proceedings of the 49th Annual International Symposium on Computer Architecture, pp. 978-992. DOI: 10.1145/3470496.3533040. Online: 18-Jun-2022
  • (2021) LIBSHALOM. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-14. DOI: 10.1145/3458817.3476217. Online: 14-Nov-2021
