DOI: 10.1145/3293320.3293334

Optimizing parallel GEMM routines using auto-tuning with Intel AVX-512

Published: 14 January 2019

Abstract

This paper presents optimized implementations of single- and double-precision general matrix-matrix multiplication (GEMM) routines for the Intel Xeon Phi Processor code-named Knights Landing (KNL) and the Intel Xeon Scalable Processors, based on an auto-tuning approach using Intel AVX-512 intrinsic functions. Our auto-tuning approach determines parameters that precisely reflect the features of the target architecture. It significantly reduces the search space and derives optimal parameter sets, including the sizes of submatrices, prefetch distances, loop unrolling depths, and the parallelization scheme. Without a single line of assembly code, our GEMM kernels achieve performance comparable to the Intel MKL and outperform other open-source BLAS libraries.
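To make the approach concrete, here is a minimal sketch of what a GEMM micro-kernel written purely with AVX-512 intrinsics can look like. This is an illustration under assumed parameters, not the paper's kernel: the register-block sizes MR = 8 and NR = 4, the prefetch distance of 8 iterations, and the column-major layout of C are all assumptions, and values of exactly this kind are what the paper's auto-tuner searches over.

#include <immintrin.h>

/* Illustrative double-precision micro-kernel (compile with -mavx512f).
 * Computes C[0:8, 0:4] += A * B, where A is a packed 8 x kc micro-panel
 * stored column by column and B is a packed kc x 4 micro-panel stored
 * row by row. One __m512d register holds 8 doubles, i.e. one column slice. */
enum { MR = 8, NR = 4, PF_DIST = 8 };   /* PF_DIST: assumed prefetch distance */

static void dgemm_ukernel_8x4(long kc, const double *A, const double *B,
                              double *C, long ldc)   /* C is column-major */
{
    __m512d c0 = _mm512_loadu_pd(&C[0 * ldc]);
    __m512d c1 = _mm512_loadu_pd(&C[1 * ldc]);
    __m512d c2 = _mm512_loadu_pd(&C[2 * ldc]);
    __m512d c3 = _mm512_loadu_pd(&C[3 * ldc]);

    for (long p = 0; p < kc; ++p) {
        /* Software prefetch of a future slice of A; the distance is one of
         * the parameters an auto-tuner would derive per architecture. */
        _mm_prefetch((const char *)&A[(p + PF_DIST) * MR], _MM_HINT_T0);

        __m512d a = _mm512_loadu_pd(&A[p * MR]);      /* 8 rows of A */

        /* Rank-1 update: broadcast each B element, one FMA per C column. */
        c0 = _mm512_fmadd_pd(a, _mm512_set1_pd(B[p * NR + 0]), c0);
        c1 = _mm512_fmadd_pd(a, _mm512_set1_pd(B[p * NR + 1]), c1);
        c2 = _mm512_fmadd_pd(a, _mm512_set1_pd(B[p * NR + 2]), c2);
        c3 = _mm512_fmadd_pd(a, _mm512_set1_pd(B[p * NR + 3]), c3);
    }

    _mm512_storeu_pd(&C[0 * ldc], c0);
    _mm512_storeu_pd(&C[1 * ldc], c1);
    _mm512_storeu_pd(&C[2 * ldc], c2);
    _mm512_storeu_pd(&C[3 * ldc], c3);
}

In an auto-tuning setup like the one the abstract describes, MR, NR, the cache-block size kc, the prefetch distance, the unrolling depth of the p-loop, and the thread-level partitioning of the outer loops would form the parameter set, with hardware limits (for instance, the 32 zmm registers available with AVX-512) pruning the search space before any empirical measurement.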


Published In

HPCAsia '19: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region
January 2019
143 pages
ISBN:9781450366328
DOI:10.1145/3293320

In-Cooperation

  • Sun Yat-Sen University
  • CCF: China Computer Federation

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. AVX-512
  2. Autotuning
  3. Intel Xeon
  4. Intel Xeon Phi
  5. Manycore
  6. matrix-matrix multiplication


Conference

HPC Asia 2019

Acceptance Rates

HPCAsia '19 paper acceptance rate: 15 of 32 submissions, 47%.
Overall acceptance rate: 69 of 143 submissions, 48%.


