KernelFaRer: Replacing Native-Code Idioms with High-Performance Library Calls

Authors: João P. L. De Carvalho, Ivan Korostelev, José Nelson Amaral, Christopher Barton, Guido Araujo

ACM Transactions on Architecture and Code Optimization (TACO), Volume 18, Issue 3

Article No.: 38, Pages 1 - 22

https://doi.org/10.1145/3459010

Published: 28 June 2021

Abstract

Well-crafted libraries deliver much higher performance than code generated by sophisticated application programmers using advanced optimizing compilers. When a code pattern for which a well-tuned library implementation exists is found in the source code of an application, the highest performing solution is to replace the pattern with a call to the library. Past idiom-recognition solutions either required pattern-matching machinery outside the compilation framework or were so brittle that they failed even for minor variants of the pattern in the source code. This article introduces Kernel Find & Replacer (KernelFaRer), an idiom recognizer implemented entirely in the existing LLVM compiler framework. The versatility of KernelFaRer is demonstrated by matching and replacing two linear algebra idioms: general matrix-matrix multiplication (GEMM) and symmetric rank-2k update (SYR2K). Both GEMM and SYR2K are used extensively in scientific computation, and GEMM is also a central building block for deep learning and computer graphics algorithms. The idiom recognition in KernelFaRer is much more robust than alternative solutions, has a much lower compilation overhead, and is fully integrated into the broadly used LLVM compilation tools. KernelFaRer replaces existing GEMM and SYR2K idioms with computations performed by BLAS, Eigen, MKL (Intel’s x86), ESSL (IBM’s PowerPC), and BLIS (AMD). Performance gains that reach 2000× over hand-crafted source code compiled at the highest optimization level demonstrate that replacing application code with library calls is a performant solution.
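
To make the two idioms concrete, the following is a minimal sketch of the kind of hand-written loop nest the abstract refers to. It uses the standard BLAS definitions, GEMM: C ← αAB + βC and SYR2K: C ← αABᵀ + αBAᵀ + βC. It is written in Python only for brevity; the function names `gemm_naive` and `syr2k_naive` are hypothetical and do not come from the article, and the code illustrates the computation pattern rather than KernelFaRer's matching logic.

```python
# Illustrative sketch only: naive loop nests for the two idioms named in the
# abstract. Function names are hypothetical, not taken from the article.

def gemm_naive(alpha, A, B, beta, C):
    """Return alpha * A * B + beta * C, with A (m x k), B (k x n), C (m x n)."""
    m, k, n = len(A), len(B), len(B[0])
    # Start from the scaled copy of C so the input matrix is left untouched.
    out = [[beta * C[i][j] for j in range(n)] for i in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):  # reduction over the shared dimension
                acc += A[i][p] * B[p][j]
            out[i][j] += alpha * acc
    return out


def syr2k_naive(alpha, A, B, beta, C):
    """Return alpha*A*B^T + alpha*B*A^T + beta*C, with A, B (n x k), C (n x n)."""
    n, k = len(A), len(A[0])
    out = [[beta * C[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                # Both rank-k products are accumulated in the same inner loop.
                acc += A[i][p] * B[j][p] + B[i][p] * A[j][p]
            out[i][j] += alpha * acc
    return out
```

In compiled code, loop nests of this shape are the candidates that an idiom recognizer would replace with a single call to a tuned library routine, for example a BLAS `gemm` or `syr2k` routine.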

Index Terms

1. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Runtime environments
2. Theory of computation
  1. Design and analysis of algorithms
    1. Parallel algorithms
      1. Massively parallel algorithms

Published In

ACM Transactions on Architecture and Code Optimization Volume 18, Issue 3

September 2021

370 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/3460978

Editor:
David Kaeli
Northeastern University, USA

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 June 2021

Accepted: 01 March 2021

Revised: 01 March 2021

Received: 01 October 2020

Published in TACO Volume 18, Issue 3

Qualifiers

Research-article
Research
Refereed

Funding Sources

FAPESP

Bibliometrics

Total Citations: 9
Total Downloads: 1,571
Downloads (Last 12 months): 434
Downloads (Last 6 weeks): 40

Reflects downloads up to 11 Feb 2025

Cited By

Jungmair M, Engelke A, Giceva J (2024). HiPy: Extracting High-Level Semantics from Python Code for Data Processing. Proceedings of the ACM on Programming Languages 8:OOPSLA2 (736-762). Online publication date: 8-Oct-2024
https://dl.acm.org/doi/10.1145/3689737
Laird A, Liu B, Bjørner N, Dehnavi M (2024). SpEQ: Translation of Sparse Codes using Equivalences. Proceedings of the ACM on Programming Languages 8:PLDI (1680-1703). Online publication date: 20-Jun-2024
https://dl.acm.org/doi/10.1145/3656445
Cai Q, Tan G, Yang W, He X, Yan Y, Li K, Li K (2024). COALA: A Compiler-Assisted Adaptive Library Routines Allocation Framework for Heterogeneous Systems. IEEE Transactions on Computers 73:7 (1724-1737). Online publication date: 9-Apr-2024
https://dl.acm.org/doi/10.1109/TC.2024.3385269
Van der Cruysse J, Dubach C, Grosser T, Dubach C, Steuwer M, Xue J, Ottoni G, Quintão Pereira F (2024). Latent Idiom Recognition for a Minimalist Functional Array Language Using Equality Saturation. Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization (270-282). Online publication date: 2-Mar-2024
https://dl.acm.org/doi/10.1109/CGO57630.2024.10444879
Martínez P, Bernabé G, García J (2024). Code Detection for Hardware Acceleration Using Large Language Models. IEEE Access 12 (35271-35281). Online publication date: 2024
https://doi.org/10.1109/ACCESS.2024.3372853
Espindola V, Zago L, Yviquel H, Araujo G (2023). Source Matching and Rewriting for MLIR Using String-Based Automata. ACM Transactions on Architecture and Code Optimization 20:2 (1-26). Online publication date: 1-Mar-2023
https://dl.acm.org/doi/10.1145/3571283
Brauckmann A, Polgreen E, Grosser T, O'Boyle M (2023). mlirSynth: Automatic, Retargetable Program Raising in Multi-Level IR Using Program Synthesis. 2023 32nd International Conference on Parallel Architectures and Compilation Techniques (PACT) (39-50). Online publication date: 21-Oct-2023
https://doi.org/10.1109/PACT58117.2023.00012
Martínez P, Bernabé G, García J (2022). HDNN: a cross-platform MLIR dialect for deep neural networks. The Journal of Supercomputing 78:11 (13814-13830). Online publication date: 1-Jul-2022
https://dl.acm.org/doi/10.1007/s11227-022-04417-3
Moreira J, Barton K, Bergner P, Bhat P, Fossum G, Ivanovic N, Sadasivam S, Saleil B, Schmidt B, Srinivasaraghavan R (2022). Exploiting the New Power ISA™ Matrix Math Instructions Through Compiler Built-ins. Languages and Compilers for Parallel Computing (106-122). Online publication date: 12-Oct-2022
https://dl.acm.org/doi/10.1007/978-3-031-31445-2_8
