
An implementation of matrix–matrix multiplication on the Intel KNL processor with AVX-512


Abstract

The second-generation Intel Xeon Phi processor, codenamed Knights Landing (KNL), has recently emerged with a 2D tile mesh architecture and the Intel AVX-512 instruction set. However, it is very difficult for general users to obtain the maximum performance from the new architecture, since doing so requires familiarity with optimal cache reuse, efficient vectorization, and assembly language. In this paper, we present several development strategies for achieving good performance in the C programming language, without the use of assembly language, using general matrix–matrix multiplication (GEMM) as the case study. Our implementation is based on blocked matrix multiplication, an optimization technique that improves data reuse. We apply data prefetching, loop unrolling, and the Intel AVX-512 instructions to optimize the blocked matrix multiplications. On a single core of the KNL, our implementation achieves up to 98% of the SGEMM performance and 99% of the DGEMM performance of the Intel MKL, the current state-of-the-art library. Our parallel DGEMM implementation using all 68 cores of the KNL achieves up to 90% of the performance of the Intel MKL DGEMM.
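As a rough illustration of the techniques named in the abstract, the following C sketch combines cache blocking, software prefetching, loop unrolling, and AVX-512 intrinsics in a single DGEMM. This is a minimal sketch under stated assumptions, not the authors' tuned kernel: the 4x8 micro-tile, the block sizes MC/KC/NC, the prefetch distance, and the function names are placeholder guesses, and matrices are assumed row-major with n divisible by all block sizes.

/*
 * Illustrative sketch only (not the paper's implementation):
 * blocked DGEMM, C += A * B, with an AVX-512 micro-kernel.
 * Compile with, e.g., gcc -O3 -mavx512f.
 */
#include <immintrin.h>

enum { MC = 64, KC = 256, NC = 64 };  /* assumed cache-block sizes */

/* 4x8 micro-kernel: C[0..3][0..7] += A[0..3][0..kc-1] * B[0..kc-1][0..7].
 * The four FMA statements are the loop over micro-tile rows, unrolled
 * so the C micro-tile stays in four ZMM registers. */
static void micro_4x8(int kc, const double *A, int lda,
                      const double *B, int ldb,
                      double *C, int ldc)
{
    __m512d c0 = _mm512_loadu_pd(&C[0 * ldc]);
    __m512d c1 = _mm512_loadu_pd(&C[1 * ldc]);
    __m512d c2 = _mm512_loadu_pd(&C[2 * ldc]);
    __m512d c3 = _mm512_loadu_pd(&C[3 * ldc]);

    for (int p = 0; p < kc; ++p) {
        /* Prefetch a row of B several iterations ahead; the distance
         * (8 rows) is a guess. Prefetch is a hint and does not fault,
         * so overshooting the block is harmless. */
        _mm_prefetch((const char *)&B[(p + 8) * ldb], _MM_HINT_T0);

        __m512d b = _mm512_loadu_pd(&B[p * ldb]);   /* 8 doubles of B */
        c0 = _mm512_fmadd_pd(_mm512_set1_pd(A[0 * lda + p]), b, c0);
        c1 = _mm512_fmadd_pd(_mm512_set1_pd(A[1 * lda + p]), b, c1);
        c2 = _mm512_fmadd_pd(_mm512_set1_pd(A[2 * lda + p]), b, c2);
        c3 = _mm512_fmadd_pd(_mm512_set1_pd(A[3 * lda + p]), b, c3);
    }

    _mm512_storeu_pd(&C[0 * ldc], c0);
    _mm512_storeu_pd(&C[1 * ldc], c1);
    _mm512_storeu_pd(&C[2 * ldc], c2);
    _mm512_storeu_pd(&C[3 * ldc], c3);
}

/* Blocked GEMM: the outer three loops walk cache-sized blocks so each
 * block of A and B is reused from cache; the inner two loops walk the
 * 4x8 register-resident micro-tiles of the current C block. */
void dgemm_blocked(int n, const double *A, const double *B, double *C)
{
    for (int jc = 0; jc < n; jc += NC)
        for (int pc = 0; pc < n; pc += KC)
            for (int ic = 0; ic < n; ic += MC)
                for (int i = ic; i < ic + MC; i += 4)
                    for (int j = jc; j < jc + NC; j += 8)
                        micro_4x8(KC, &A[i * n + pc], n,
                                  &B[pc * n + j], n,
                                  &C[i * n + j], n);
}

A production kernel, such as the paper's or the Intel MKL's, would additionally pack each block of A and B into contiguous buffers and tune the block sizes to the KNL's cache and MCDRAM hierarchy; the sketch omits packing for brevity.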



Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2015M3C4A7075662).

Author information


Correspondence to Jaeyoung Choi.


About this article


Cite this article

Lim, R., Lee, Y., Kim, R. et al. An implementation of matrix–matrix multiplication on the Intel KNL processor with AVX-512. Cluster Comput 21, 1785–1795 (2018). https://doi.org/10.1007/s10586-018-2810-y
