research-article
Open access

KernelFaRer: Replacing Native-Code Idioms with High-Performance Library Calls

Published: 28 June 2021

Abstract

Well-crafted libraries deliver much higher performance than code generated by sophisticated application programmers using advanced optimizing compilers. When a code pattern for which a well-tuned library implementation exists is found in the source code of an application, the highest-performing solution is to replace the pattern with a call to the library. Past idiom-recognition solutions either required pattern-matching machinery outside the compilation framework or were brittle, failing even for minor variants of the pattern in the source code. This article introduces Kernel Find & Replacer (KernelFaRer), an idiom recognizer implemented entirely in the existing LLVM compiler framework. The versatility of KernelFaRer is demonstrated by matching and replacing two linear algebra idioms: general matrix-matrix multiplication (GEMM) and symmetric rank-2k update (SYR2K). Both GEMM and SYR2K are used extensively in scientific computation, and GEMM is also a central building block for deep learning and computer graphics algorithms. The idiom recognition in KernelFaRer is much more robust than alternative solutions, has a much lower compilation overhead, and is fully integrated into the widely used LLVM compilation tools. KernelFaRer replaces existing GEMM and SYR2K idioms with computations performed by BLAS, Eigen, MKL (Intel’s x86), ESSL (IBM’s PowerPC), and BLIS (AMD). Performance gains of up to 2000× over hand-crafted source code compiled at the highest optimization level demonstrate that replacing application code with library calls is a performant solution.
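To make the two idioms named in the abstract concrete, the following minimal sketch writes GEMM (C ← αAB + βC) and SYR2K (C ← αABᵀ + αBAᵀ + βC) as plain loop nests, which is the kind of hand-written code a library call could stand in for. It is an illustration only: the function names `naive_gemm` and `naive_syr2k` are hypothetical, Python is used purely for brevity, and nothing here reflects how KernelFaRer itself matches or rewrites code inside LLVM.

```python
# Illustrative loop nests for the two idioms; names are hypothetical and
# do not come from the article. Matrices are row-major lists of lists.

def naive_gemm(alpha, A, B, beta, C):
    """Return alpha * (A x B) + beta * C, with A of size m x k and B of size k x n."""
    m, k, n = len(A), len(B), len(B[0])
    # Start from the scaled copy of C so the input is not modified.
    out = [[beta * C[i][j] for j in range(n)] for i in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[p][j]
            out[i][j] += alpha * acc
    return out


def naive_syr2k(alpha, A, B, beta, C):
    """Return alpha * (A x B^T) + alpha * (B x A^T) + beta * C, with A and B of size n x k.

    The full symmetric result is computed here; BLAS SYR2K routines
    update only one triangle of C.
    """
    n, k = len(A), len(A[0])
    out = [[beta * C[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[j][p] + B[i][p] * A[j][p]
            out[i][j] += alpha * acc
    return out
```

In a BLAS-style library, the same computations correspond to the GEMM and SYR2K routines, so a recognizer that proves a loop nest has one of these shapes can substitute the tuned implementation.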


Published In

ACM Transactions on Architecture and Code Optimization, Volume 18, Issue 3
September 2021
370 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/3460978

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 June 2021
Accepted: 01 March 2021
Revised: 01 March 2021
Received: 01 October 2020
Published in TACO Volume 18, Issue 3

Author Tags

  1. GEMM
  2. Idiom recognition
  3. LLVM
  4. compiler analysis and transformations

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • FAPESP
