DOI: 10.1145/3629526.3645051
research-article
Open access

Vectorized Intrinsics Can Be Replaced with Pure Java Code without Impairing Steady-State Performance

Published: 07 May 2024

Abstract

Several methods of the Java Class Library (JCL) rely on vectorized intrinsics. While these intrinsics undoubtedly lead to better performance, implementing them is challenging, tedious, and error-prone, and it significantly increases the effort needed to understand and maintain the code. Moreover, their implementation is platform-dependent. An unexplored, easier-to-implement alternative is to replace vectorized intrinsics with portable Java code that uses the Java Vector API. However, this is attractive only if the Java code achieves steady-state performance similar to that of the intrinsics. This paper shows that this is the case. We focus on the hashCode and equals computations for byte arrays. We replace the platform-dependent vectorized intrinsics with pure-Java code employing the Java Vector API, resulting in similar steady-state performance. We show that our Java implementations are easy to fine-tune by exploiting characteristics of the input (i.e., the array length), whereas such tuning would be much more difficult and cumbersome in a vectorized intrinsic. Additionally, we propose a new vectorized hashCode computation for long arrays, for which a corresponding intrinsic is currently missing. We evaluate the tuned implementations on four popular benchmark suites, showing that their performance is in line with that of the original OpenJDK 21 with intrinsics. Finally, we describe a general approach to integrating code that uses the Java Vector API into the core classes of the JCL, which is challenging because premature use of the Java Vector API would crash the JVM during its fragile initialization phase. Developers can adopt our approach to modify JCL classes without any changes to the native codebase.
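To make the hashCode computation for byte arrays concrete, the following is a minimal scalar sketch of the algebraic regrouping that makes the polynomial hash amenable to processing several elements per step. It is not the paper's implementation and does not use the Java Vector API; the class name UnrolledHash, the method name hashCode4, and the block size of four are assumptions chosen for illustration. The only fact relied upon is that Arrays.hashCode(byte[]) computes result = 31 * result + element starting from 1.

```java
import java.util.Arrays;

// Illustrative sketch only: names and the block size of 4 are hypothetical,
// not taken from the paper or from the JDK sources.
public final class UnrolledHash {
    private static final int P1 = 31;
    private static final int P2 = P1 * 31;   // 31^2
    private static final int P3 = P2 * 31;   // 31^3
    private static final int P4 = P3 * 31;   // 31^4

    // Computes the same value as Arrays.hashCode(byte[]), but consumes four
    // elements per iteration. The four products in each step are independent
    // of one another, which is the property a SIMD version would exploit.
    static int hashCode4(byte[] a) {
        if (a == null) {
            return 0;
        }
        int h = 1;
        int i = 0;
        int bound = a.length & ~3;            // largest multiple of 4 <= length
        for (; i < bound; i += 4) {
            h = P4 * h + P3 * a[i] + P2 * a[i + 1] + P1 * a[i + 2] + a[i + 3];
        }
        for (; i < a.length; i++) {           // scalar tail for leftover elements
            h = 31 * h + a[i];
        }
        return h;
    }

    public static void main(String[] args) {
        byte[] data = "vectorized intrinsics".getBytes();
        // Both values are computed with wrapping int arithmetic, so they agree.
        System.out.println(hashCode4(data) == Arrays.hashCode(data));
    }
}
```

Because int arithmetic wraps modulo 2^32, regrouping the multiplications does not change the result. A tail loop like the one above is also where tuning by array length would naturally fit, for example by sending very short arrays straight to the scalar path.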


Published In

ICPE '24: Proceedings of the 15th ACM/SPEC International Conference on Performance Engineering
May 2024
310 pages
ISBN: 9798400704444
DOI: 10.1145/3629526
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. compiler intrinsics
  2. java vector api
  3. java virtual machine
  4. portability
  5. simd
  6. vector instructions

Qualifiers

  • Research-article

Funding Sources

  • Oracle
  • Swiss National Science Foundation

Conference

ICPE '24

Acceptance Rates

Overall Acceptance Rate 252 of 851 submissions, 30%
