DOI: 10.1145/3176364.3176374
HPC Asia Conference Proceedings · Short paper

OpenMP-based parallel implementation of matrix-matrix multiplication on the Intel Knights Landing

Published: 31 January 2018

Abstract

The second-generation Intel Xeon Phi processor, codenamed Knights Landing (KNL), has emerged with a 2D tiled mesh architecture. Implementing general matrix-matrix multiplication on a new architecture is an important exercise, yet to date there has been no sufficient description of a parallel implementation of general matrix-matrix multiplication on this processor. In this study, we describe a parallel implementation of double-precision general matrix-matrix multiplication (DGEMM) with OpenMP on the KNL. The implementation is based on blocked matrix-matrix multiplication. We propose a method for choosing the cache block sizes and discuss the parallelism within the implementation of DGEMM. We show that the performance of DGEMM varies with the thread-affinity environment variables. We conducted performance experiments on the Intel Xeon Phi 7210 and 7250; the results validate our method.



        Published In

HPCAsia '18 Workshops: Proceedings of Workshops of HPC Asia
January 2018, 86 pages
ISBN: 9781450363471
DOI: 10.1145/3176364

        Sponsors

        • IPSJ: Information Processing Society of Japan

Publisher

Association for Computing Machinery, New York, NY, United States


        Author Tags

1. high-performance
2. Knights Landing
3. manycore
4. matrix-matrix multiplication
5. OpenMP affinity

        Qualifiers

        • Short-paper

        Funding Sources

        • Korea Ministry of Science and ICT (MSIT)

        Conference

HPC Asia 2018 WS: Workshops of HPC Asia 2018
Sponsor: IPSJ
January 31, 2018
Chiyoda, Tokyo, Japan

        Acceptance Rates

        Overall Acceptance Rate 69 of 143 submissions, 48%


        Cited By

• (2022) Seamless optimization of the GEMM kernel for task-based programming models. In Proceedings of the 36th ACM International Conference on Supercomputing, 1-11. DOI: 10.1145/3524059.3532385. Online publication date: 28 June 2022.
• (2022) Optimization of Matrix-Matrix Multiplication Algorithm for Matrix-Panel Multiplication on Intel KNL. In 2022 IEEE/ACS 19th International Conference on Computer Systems and Applications (AICCSA), 1-7. DOI: 10.1109/AICCSA56895.2022.10017947. Online publication date: December 2022.
• (2021) Improving blocked matrix-matrix multiplication routine by utilizing AVX-512 instructions on Intel Knights Landing and Xeon Scalable processors. Cluster Computing 26, 5, 2539-2549. DOI: 10.1007/s10586-021-03274-8. Online publication date: 12 April 2021.
• (2020) Evaluating performance of Parallel Matrix Multiplication Routine on Intel KNL and Xeon Scalable Processors. In 2020 IEEE International Conference on Autonomic Computing and Self-Organizing Systems Companion (ACSOS-C), 42-47. DOI: 10.1109/ACSOS-C51401.2020.00027. Online publication date: August 2020.
• (2019) Optimizing parallel GEMM routines using auto-tuning with Intel AVX-512. In Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 101-110. DOI: 10.1145/3293320.3293334. Online publication date: 14 January 2019.
• (2019) Optimizing Xeon Phi for Interactive Data Analysis. In 2019 IEEE High Performance Extreme Computing Conference (HPEC), 1-6. DOI: 10.1109/HPEC.2019.8916300. Online publication date: September 2019.
• (2018) Auto-tuning GEMM kernels on the Intel KNL and Intel Skylake-SP processors. The Journal of Supercomputing 75, 12, 7895-7908. DOI: 10.1007/s11227-018-2702-1. Online publication date: 26 November 2018.
