research-article

SIMD intrinsics on managed language runtimes

Authors:

Markus Püschel

CGO 2018: Proceedings of the 2018 International Symposium on Code Generation and Optimization

Pages 2 - 15

https://doi.org/10.1145/3168810

Published: 24 February 2018

Abstract

Managed language runtimes such as the Java Virtual Machine (JVM) provide adequate performance for a wide range of applications, but at the same time, they lack much of the low-level control that performance-minded programmers appreciate in languages like C/C++. One important example is the intrinsics interface that exposes instructions of SIMD (Single Instruction Multiple Data) vector ISAs (Instruction Set Architectures). In this paper we present an automatic approach for including native intrinsics in the runtime of a managed language. Our implementation consists of two parts. First, for each vector ISA, we automatically generate the intrinsics API from the vendor-provided XML specification. Second, we employ a metaprogramming approach that enables programmers to generate and load native code at runtime. In this setting, programmers can use the entire high-level language as a kind of macro system to define new high-level vector APIs with zero overhead. As an example use case we show a variable precision API. We provide an end-to-end implementation of our approach in the HotSpot VM that supports all 5912 Intel SIMD intrinsics from MMX to AVX-512. Our benchmarks demonstrate that this combination of SIMD and metaprogramming enables developers to write high-performance, vectorized code on an unmodified JVM that outperforms the auto-vectorizing HotSpot just-in-time (JIT) compiler and provides tight integration between vectorized native code and the managed JVM ecosystem.
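The idea of using the host language as a macro system over native intrinsics can be illustrated with a small sketch. The Java class below is hypothetical and is not the paper's API: it only emits C source text for a SAXPY kernel built from standard AVX intrinsics (`_mm256_set1_ps`, `_mm256_loadu_ps`, `_mm256_mul_ps`, `_mm256_add_ps`, `_mm256_storeu_ps`), with an ordinary Java loop deciding how often the vector body is unrolled. Compiling the emitted source and loading it into the running JVM, which the abstract describes as the second part of the approach, is deliberately left out here.

```java
// Illustrative sketch only: the class and method names are assumptions,
// not part of the implementation described in the abstract.
class SaxpyKernelGenerator {

    // A 256-bit AVX register holds 8 single-precision floats.
    private static final int FLOATS_PER_VECTOR = 8;

    /** Returns C source for y[i] += a * x[i], with the vector loop body unrolled `unroll` times. */
    static String generate(int unroll) {
        if (unroll < 1) {
            throw new IllegalArgumentException("unroll must be >= 1");
        }
        int step = FLOATS_PER_VECTOR * unroll;
        StringBuilder c = new StringBuilder();
        c.append("#include <immintrin.h>\n\n");
        c.append("void saxpy(float a, const float *x, float *y, int n) {\n");
        c.append("  __m256 va = _mm256_set1_ps(a);\n");
        c.append("  int i = 0;\n");
        c.append("  for (; i + " + step + " <= n; i += " + step + ") {\n");
        // The Java loop acts as the "macro": it runs at generation time
        // and leaves only straight-line intrinsic calls in the C code.
        for (int u = 0; u < unroll; u++) {
            int off = FLOATS_PER_VECTOR * u;
            c.append("    __m256 x" + u + " = _mm256_loadu_ps(x + i + " + off + ");\n");
            c.append("    __m256 y" + u + " = _mm256_loadu_ps(y + i + " + off + ");\n");
            c.append("    y" + u + " = _mm256_add_ps(y" + u + ", _mm256_mul_ps(va, x" + u + "));\n");
            c.append("    _mm256_storeu_ps(y + i + " + off + ", y" + u + ");\n");
        }
        c.append("  }\n");
        // Scalar tail for the elements that do not fill a whole unrolled step.
        c.append("  for (; i < n; i++) y[i] += a * x[i];\n");
        c.append("}\n");
        return c.toString();
    }

    public static void main(String[] args) {
        // Print a kernel whose vector loop handles 16 floats per iteration.
        System.out.print(generate(2));
    }
}
```

Calling `generate` with a different unroll factor yields a different kernel, while the abstraction itself leaves no trace in the emitted C code. This is one way to read the "zero overhead" claim, under the assumption that the generated source is then compiled by a native compiler.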

Cited By

Löff J, Schiavio F, Rosà A, Basso M, Binder W, Balsamo S, Knottenbelt W, Abad C, Shang W (2024). Vectorized Intrinsics Can Be Replaced with Pure Java Code without Impairing Steady-State Performance. Proceedings of the 15th ACM/SPEC International Conference on Performance Engineering, 14-24. DOI: 10.1145/3629526.3645051. Online publication date: 7-May-2024
https://dl.acm.org/doi/10.1145/3629526.3645051
Stpiczyński P (2024). Parallel Vectorized Algorithms for Computing Trigonometric Sums Using AVX-512 Extensions. Computational Science – ICCS 2024, 158-172. DOI: 10.1007/978-3-031-63778-0_12. Online publication date: 2-Jul-2024
https://dl.acm.org/doi/10.1007/978-3-031-63778-0_12
Basso M, Rosà A, Omini L, Binder W, Verbrugge C, Lhoták O, Shen X (2023). Java Vector API: Benchmarking and Performance Analysis. Proceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction, 1-12. DOI: 10.1145/3578360.3580265. Online publication date: 17-Feb-2023
https://dl.acm.org/doi/10.1145/3578360.3580265

Index Terms

SIMD intrinsics on managed language runtimes
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Single instruction, multiple data
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
  2. Software organization and properties
    1. Contextual software domains
      1. Software infrastructure
        1. Virtual machines

Information

Published In

CGO 2018: Proceedings of the 2018 International Symposium on Code Generation and Optimization

February 2018

377 pages

ISBN:9781450356176

DOI:10.1145/3179541

General Chairs: Jens Knoop (Vienna University of Technology, Austria), Markus Schordan (Lawrence Livermore National Laboratory, USA)

Program Chairs: Teresa Johnson (Google, USA), Michael O'Boyle (University of Edinburgh, UK)

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages
SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Publication Notes

Badge change: Article originally badged under Version 1.0 guidelines https://www.acm.org/publications/policies/artifact-review-badging

Qualifiers

Research-article

Conference

CGO '18: 16th Annual IEEE/ACM International Symposium on Code Generation and Optimization

February 24 - 28, 2018

Vienna, Austria

Acceptance Rates

Overall Acceptance Rate 312 of 1,061 submissions, 29%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 12
Total Downloads: 280
Downloads (Last 12 months): 20
Downloads (Last 6 weeks): 0

Reflects downloads up to 11 Aug 2024

Citations

Cited By

Dmitruk B, Stpiczyński P (2023). Parallel Vectorized Implementations of Compensated Summation Algorithms. Parallel Processing and Applied Mathematics, 63-74. DOI: 10.1007/978-3-031-30445-3_6. Online publication date: 27-Apr-2023
https://doi.org/10.1007/978-3-031-30445-3_6
Noor M, Kent K, Konno K, Maier D, Sterpone L, Bartolini A, Butko A (2022). SIMD support to improve eclipse OpenJ9 performance on the AArch64 platform. Proceedings of the 19th ACM International Conference on Computing Frontiers, 49-57. DOI: 10.1145/3528416.3530233. Online publication date: 17-May-2022
https://dl.acm.org/doi/10.1145/3528416.3530233
Wu J, Dong J, Fang R, Zhao Z, Gong X, Wang W, Zuo D, Titzer B, Xu H, Zhang I (2021). Effective exploitation of SIMD resources in cross-ISA virtualization. Proceedings of the 17th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, 84-97. DOI: 10.1145/3453933.3454016. Online publication date: 7-Apr-2021
https://dl.acm.org/doi/10.1145/3453933.3454016
Rivera J, Franchetti F, Püschel M, Lee J (2021). An interval compiler for sound floating-point computations. Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization, 52-64. DOI: 10.1109/CGO51591.2021.9370307. Online publication date: 27-Feb-2021
https://dl.acm.org/doi/10.1109/CGO51591.2021.9370307
Mohebbi H, Haspel N, Simovici D, Quach J (2020). Fusion Transcript Detection from RNA-Seq using Jaccard Distance. Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, 1-6. DOI: 10.1145/3388440.3415585. Online publication date: 21-Sep-2020
https://dl.acm.org/doi/10.1145/3388440.3415585
Eidt C, Gooding T, Mars J, Tang L, Xue J, Wu P (2020). SIMD support in .NET: abstract and concrete vector types and operations. Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization, 229-241. DOI: 10.1145/3368826.3377926. Online publication date: 22-Feb-2020
https://dl.acm.org/doi/10.1145/3368826.3377926
Stojanov A, Rompf T, Püschel M, Schaefer I, Reichenbach C, Storm T (2019). A stage-polymorphic IR for compiling MATLAB-style dynamic tensor expressions. Proceedings of the 18th ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences, 34-47. DOI: 10.1145/3357765.3359514. Online publication date: 21-Oct-2019
https://dl.acm.org/doi/10.1145/3357765.3359514
