DOI: 10.1145/3293320.3293334

Optimizing parallel GEMM routines using auto-tuning with Intel AVX-512

Published: 14 January 2019

Abstract

This paper presents optimized implementations of single- and double-precision general matrix-matrix multiplication (GEMM) routines for the Intel Xeon Phi Processor code-named Knights Landing (KNL) and the Intel Xeon Scalable Processors, based on an auto-tuning approach using Intel AVX-512 intrinsic functions. Our auto-tuning approach determines parameters that precisely reflect the features of the target architecture. It significantly reduces the search space and derives optimal parameter sets, including the sizes of submatrices, prefetch distances, loop unrolling depths, and the parallelization scheme. Without a single line of assembly code, our GEMM kernels achieve performance comparable to the Intel MKL and outperform other open-source BLAS libraries.
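To make the approach concrete, here is a minimal sketch of what a GEMM micro-kernel written purely with AVX-512 intrinsics can look like. This is an illustration under assumed parameters, not the paper's kernel: the register-block sizes MR = 8 and NR = 4, the prefetch distance of 8 iterations, and the column-major layout of C are all assumptions, and values of exactly this kind are what the paper's auto-tuner searches over.

#include <immintrin.h>

/* Illustrative double-precision micro-kernel (compile with -mavx512f).
 * Computes C[0:8, 0:4] += A * B, where A is a packed 8 x kc micro-panel
 * stored column by column and B is a packed kc x 4 micro-panel stored
 * row by row. One __m512d register holds 8 doubles, i.e. one column slice. */
enum { MR = 8, NR = 4, PF_DIST = 8 };   /* PF_DIST: assumed prefetch distance */

static void dgemm_ukernel_8x4(long kc, const double *A, const double *B,
                              double *C, long ldc)   /* C is column-major */
{
    __m512d c0 = _mm512_loadu_pd(&C[0 * ldc]);
    __m512d c1 = _mm512_loadu_pd(&C[1 * ldc]);
    __m512d c2 = _mm512_loadu_pd(&C[2 * ldc]);
    __m512d c3 = _mm512_loadu_pd(&C[3 * ldc]);

    for (long p = 0; p < kc; ++p) {
        /* Software prefetch of a future slice of A; the distance is one of
         * the parameters an auto-tuner would derive per architecture. */
        _mm_prefetch((const char *)&A[(p + PF_DIST) * MR], _MM_HINT_T0);

        __m512d a = _mm512_loadu_pd(&A[p * MR]);      /* 8 rows of A */

        /* Rank-1 update: broadcast each B element, one FMA per C column. */
        c0 = _mm512_fmadd_pd(a, _mm512_set1_pd(B[p * NR + 0]), c0);
        c1 = _mm512_fmadd_pd(a, _mm512_set1_pd(B[p * NR + 1]), c1);
        c2 = _mm512_fmadd_pd(a, _mm512_set1_pd(B[p * NR + 2]), c2);
        c3 = _mm512_fmadd_pd(a, _mm512_set1_pd(B[p * NR + 3]), c3);
    }

    _mm512_storeu_pd(&C[0 * ldc], c0);
    _mm512_storeu_pd(&C[1 * ldc], c1);
    _mm512_storeu_pd(&C[2 * ldc], c2);
    _mm512_storeu_pd(&C[3 * ldc], c3);
}

In an auto-tuning setup like the one the abstract describes, MR, NR, the cache-block size kc, the prefetch distance, the unrolling depth of the p-loop, and the thread-level partitioning of the outer loops would form the parameter set, with hardware limits (for instance, the 32 zmm registers available with AVX-512) pruning the search space before any empirical measurement.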


Published In

HPCAsia '19: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region
January 2019
143 pages
ISBN:9781450366328
DOI:10.1145/3293320

In-Cooperation

  • Sun Yat-Sen University
  • CCF: China Computer Federation

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. AVX-512
  2. Autotuning
  3. Intel Xeon
  4. Intel Xeon Phi
  5. Manycore
  6. matrix-matrix multiplication


Conference

HPC Asia 2019

Acceptance Rates

HPCAsia '19 paper acceptance rate: 15 of 32 submissions, 47%.
Overall acceptance rate: 69 of 143 submissions, 48%.


