DOI: 10.1145/3168810
research-article

SIMD intrinsics on managed language runtimes

Published: 24 February 2018
    Abstract

    Managed language runtimes such as the Java Virtual Machine (JVM) provide adequate performance for a wide range of applications, but at the same time, they lack much of the low-level control that performance-minded programmers appreciate in languages like C/C++. One important example is the intrinsics interface that exposes instructions of SIMD (Single Instruction Multiple Data) vector ISAs (Instruction Set Architectures). In this paper we present an automatic approach for including native intrinsics in the runtime of a managed language. Our implementation consists of two parts. First, for each vector ISA, we automatically generate the intrinsics API from the vendor-provided XML specification. Second, we employ a metaprogramming approach that enables programmers to generate and load native code at runtime. In this setting, programmers can use the entire high-level language as a kind of macro system to define new high-level vector APIs with zero overhead. As an example use case we show a variable precision API. We provide an end-to-end implementation of our approach in the HotSpot VM that supports all 5912 Intel SIMD intrinsics from MMX to AVX-512. Our benchmarks demonstrate that this combination of SIMD and metaprogramming enables developers to write high-performance, vectorized code on an unmodified JVM that outperforms the auto-vectorizing HotSpot just-in-time (JIT) compiler and provides tight integration between vectorized native code and the managed JVM ecosystem.
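To make the "high-level language as a macro system" idea concrete, here is a minimal Java sketch. It is not the paper's implementation, whose API is not shown on this page. It only illustrates the generation step: a host-language loop runs at generation time and emits C source that calls AVX intrinsics, so the unrolling leaves no abstraction overhead in the emitted code. The class name `StagedSaxpy`, the method `generateSaxpy`, and the choice of SAXPY as the kernel are assumptions made for illustration. Compiling the emitted C and loading it into the JVM (for example through a C compiler and JNI) is deliberately left out.

```java
/**
 * Hypothetical sketch: Java acts as a macro system that emits C source
 * using AVX intrinsics. Names here are illustrative, not from the paper.
 */
public class StagedSaxpy {

    /** Number of 32-bit floats held by one 256-bit AVX register. */
    static final int LANES = 8;

    /**
     * Emits C source for y[i] += a * x[i]. The unroll factor is a
     * generation-time constant, so the Java loop below disappears and
     * only straight-line intrinsic calls remain in the emitted loop body.
     */
    static String generateSaxpy(int unroll) {
        if (unroll < 1) {
            throw new IllegalArgumentException("unroll must be >= 1");
        }
        int step = LANES * unroll;
        StringBuilder c = new StringBuilder();
        c.append("#include <immintrin.h>\n\n");
        c.append("void saxpy(float a, const float *x, float *y, int n) {\n");
        c.append("  __m256 va = _mm256_set1_ps(a);\n");
        c.append("  int i = 0;\n");
        c.append("  for (; i + ").append(step)
         .append(" <= n; i += ").append(step).append(") {\n");
        // Generation-time loop: one load/multiply/add/store group per unrolled step.
        for (int u = 0; u < unroll; u++) {
            int off = u * LANES;
            c.append("    __m256 x").append(u)
             .append(" = _mm256_loadu_ps(x + i + ").append(off).append(");\n");
            c.append("    __m256 y").append(u)
             .append(" = _mm256_loadu_ps(y + i + ").append(off).append(");\n");
            c.append("    _mm256_storeu_ps(y + i + ").append(off)
             .append(", _mm256_add_ps(_mm256_mul_ps(va, x").append(u)
             .append("), y").append(u).append("));\n");
        }
        c.append("  }\n");
        // Scalar loop for the elements that do not fill a whole vector step.
        c.append("  for (; i < n; i++) y[i] += a * x[i];\n");
        c.append("}\n");
        return c.toString();
    }

    public static void main(String[] args) {
        // Print the generated C source for an unroll factor of 2.
        System.out.print(generateSaxpy(2));
    }
}
```

Calling `generateSaxpy` with a different unroll factor yields a different specialized kernel from the same Java code, which is the sense in which the host language plays the role of a macro system in this sketch.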

    Cited By

    • (2024) Vectorized Intrinsics Can Be Replaced with Pure Java Code without Impairing Steady-State Performance. Proceedings of the 15th ACM/SPEC International Conference on Performance Engineering, pp. 14-24. DOI: 10.1145/3629526.3645051. Online publication date: 7-May-2024
    • (2024) Parallel Vectorized Algorithms for Computing Trigonometric Sums Using AVX-512 Extensions. Computational Science – ICCS 2024, pp. 158-172. DOI: 10.1007/978-3-031-63778-0_12. Online publication date: 2-Jul-2024
    • (2023) Java Vector API: Benchmarking and Performance Analysis. Proceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction, pp. 1-12. DOI: 10.1145/3578360.3580265. Online publication date: 17-Feb-2023

    Published In

    CGO 2018: Proceedings of the 2018 International Symposium on Code Generation and Optimization
    February 2018
    377 pages
    ISBN:9781450356176
    DOI:10.1145/3179541

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication Notes

    Badge change: Article originally badged under Version 1.0 guidelines https://www.acm.org/publications/policies/artifact-review-badging

    Author Tags

    1. JVM
    2. Managed Languages
    3. Metaprogramming
    4. SIMD instruction set
    5. Scala
    6. Staging

    Qualifiers

    • Research-article

    Conference

    CGO '18

    Acceptance Rates

    Overall Acceptance Rate 312 of 1,061 submissions, 29%

    Article Metrics

    • Downloads (last 12 months): 20
    • Downloads (last 6 weeks): 0
    Reflects downloads up to 11 Aug 2024

    Citations

    • (2023) Parallel Vectorized Implementations of Compensated Summation Algorithms. Parallel Processing and Applied Mathematics, pp. 63-74. DOI: 10.1007/978-3-031-30445-3_6. Online publication date: 27-Apr-2023
    • (2022) SIMD support to improve eclipse OpenJ9 performance on the AArch64 platform. Proceedings of the 19th ACM International Conference on Computing Frontiers, pp. 49-57. DOI: 10.1145/3528416.3530233. Online publication date: 17-May-2022
    • (2021) Effective exploitation of SIMD resources in cross-ISA virtualization. Proceedings of the 17th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, pp. 84-97. DOI: 10.1145/3453933.3454016. Online publication date: 7-Apr-2021
    • (2021) An interval compiler for sound floating-point computations. Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization, pp. 52-64. DOI: 10.1109/CGO51591.2021.9370307. Online publication date: 27-Feb-2021
    • (2020) Fusion Transcript Detection from RNA-Seq using Jaccard Distance. Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, pp. 1-6. DOI: 10.1145/3388440.3415585. Online publication date: 21-Sep-2020
    • (2020) SIMD support in .NET: abstract and concrete vector types and operations. Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization, pp. 229-241. DOI: 10.1145/3368826.3377926. Online publication date: 22-Feb-2020
    • (2019) A stage-polymorphic IR for compiling MATLAB-style dynamic tensor expressions. Proceedings of the 18th ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences, pp. 34-47. DOI: 10.1145/3357765.3359514. Online publication date: 21-Oct-2019