Abstract
The second-generation Intel Xeon Phi processor, codenamed Knights Landing (KNL), has recently emerged with a 2D tiled mesh architecture and the Intel AVX-512 instruction set. However, it is difficult for general users to extract maximum performance from this new architecture, since doing so requires familiarity with optimal cache reuse, efficient vectorization, and assembly language. In this paper, we present several development strategies for achieving good performance in the C programming language, using general matrix–matrix multiplication as a case study and without resorting to assembly language. Our implementation of matrix–matrix multiplication is based on blocked matrix multiplication, an optimization technique that improves data reuse. We apply data prefetching, loop unrolling, and Intel AVX-512 intrinsics to optimize the blocked matrix multiplications. On a single core of the KNL, our implementation achieves up to 98% of the SGEMM and 99% of the DGEMM performance of the Intel MKL, the current state-of-the-art library. Our parallel DGEMM implementation using all 68 cores of the KNL achieves up to 90% of the performance of the Intel MKL DGEMM.
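The blocking technique the abstract refers to can be illustrated with a minimal sketch. The paper's actual kernel additionally uses AVX-512 intrinsics, prefetching, and loop unrolling; the code below shows only the cache-blocking idea in portable C, with function names and the tile size `bs` chosen for illustration rather than taken from the paper:

```c
#include <math.h>
#include <stddef.h>
#include <string.h>

/* Naive triple loop over row-major n x n matrices: reference implementation. */
static void matmul_naive(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

/* Blocked (tiled) multiply: process bs x bs tiles so that each tile of
 * A, B, and C stays resident in cache while it is being reused. */
static void matmul_blocked(int n, const double *A, const double *B,
                           double *C, int bs) {
    memset(C, 0, (size_t)n * n * sizeof(double));
    for (int ii = 0; ii < n; ii += bs)
        for (int kk = 0; kk < n; kk += bs)
            for (int jj = 0; jj < n; jj += bs)
                /* Accumulate the product of the (ii,kk) tile of A and the
                 * (kk,jj) tile of B into the (ii,jj) tile of C. */
                for (int i = ii; i < ii + bs && i < n; i++)
                    for (int k = kk; k < kk + bs && k < n; k++) {
                        double a = A[i * n + k];
                        for (int j = jj; j < jj + bs && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

In a tuned kernel the innermost tile product would be replaced by a hand-vectorized micro-kernel (e.g. AVX-512 fused multiply-adds) and the tile sizes chosen to match the KNL cache hierarchy; the reordering above is what enables that reuse.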
Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2015M3C4A7075662).
Cite this article
Lim, R., Lee, Y., Kim, R. et al. An implementation of matrix–matrix multiplication on the Intel KNL processor with AVX-512. Cluster Comput 21, 1785–1795 (2018). https://doi.org/10.1007/s10586-018-2810-y