Vc: A C++ library for explicit vectorization

Authors:

Matthias Kretz,

Volker Lindenstruth

Software—Practice & Experience, Volume 42, Issue 11

Pages 1409 - 1430

https://doi.org/10.1002/spe.1149

Published: 01 November 2012

Abstract

It is an established trend that CPU development takes advantage of Moore's Law to improve in parallelism much more than in scalar execution speed. This results in higher hardware thread counts (MIMD) and improved vector units (SIMD), of which the MIMD developments have received the focus of library research and development in recent years. To make use of the latest hardware improvements, SIMD must receive a stronger focus of API research and development because the computational power can no longer be neglected and often auto-vectorizing compilers cannot generate the necessary SIMD code, as will be shown in this paper. Nowadays, the SIMD capabilities are sufficiently significant to warrant vectorization of algorithms requiring more conditional execution than was originally expected for Streaming SIMD Extensions to handle. The Vc library (http://compeng.uni-frankfurt.de/?vc) was designed to support developers in the creation of portable vectorized code. Its capabilities and performance have been thoroughly tested. Vc provides portability of the source code, allowing full utilization of the hardware's SIMD capabilities, without introducing any overhead. Copyright © 2011 John Wiley & Sons, Ltd.

Index Terms

Vc: A C++ library for explicit vectorization
1. Networks
  1. Network performance evaluation

Index terms have been assigned to the content through auto-classification.

Published In

Software—Practice & Experience Volume 42, Issue 11

November 2012

115 pages

ISSN:0038-0644

Publisher

John Wiley & Sons, Inc.

United States

Qualifiers

Article

Bibliometrics

Total Citations: 22
Total Downloads: 0
Downloads (Last 12 months): 0
Downloads (Last 6 weeks): 0

Reflects downloads up to 10 Aug 2024

Cited By

Diehl P, Daiß G, Huck K, Marcello D, Shiber S, Kaiser H, Pflüger D (2024). Simulating stellar merger using HPX/Kokkos on A64FX on Supercomputer Fugaku. The Journal of Supercomputing 80:12 (16947-16978). DOI: 10.1007/s11227-024-06113-w. Online publication date: 1-Aug-2024
https://dl.acm.org/doi/10.1007/s11227-024-06113-w
Daiß G, Diehl P, Kaiser H, Pflüger D (2023). Stellar Mergers with HPX-Kokkos and SYCL: Methods of using an Asynchronous Many-Task Runtime System with SYCL. Proceedings of the 2023 International Workshop on OpenCL (1-12). DOI: 10.1145/3585341.3585354. Online publication date: 18-Apr-2023
https://dl.acm.org/doi/10.1145/3585341.3585354
Phipps E, Pawlowski R, Trott C (2022). Automatic Differentiation of C++ Codes on Emerging Manycore Architectures with Sacado. ACM Transactions on Mathematical Software 48:4 (1-29). DOI: 10.1145/3560262. Online publication date: 19-Dec-2022
https://dl.acm.org/doi/10.1145/3560262
Incardona P, Bianucci T, Sbalzarini I (2021). Distributed Sparse Block Grids on GPUs. High Performance Computing (272-290). DOI: 10.1007/978-3-030-78713-4_15. Online publication date: 24-Jun-2021
https://dl.acm.org/doi/10.1007/978-3-030-78713-4_15
Kempf D, Heß R, Müthing S, Bastian P (2020). Automatic Code Generation for High-performance Discontinuous Galerkin Methods on Modern Architectures. ACM Transactions on Mathematical Software 47:1 (1-31). DOI: 10.1145/3424144. Online publication date: 8-Dec-2020
https://dl.acm.org/doi/10.1145/3424144
Amiri H, Shahbahrami A (2020). SIMD programming using Intel vector extensions. Journal of Parallel and Distributed Computing 135:C (83-100). DOI: 10.1016/j.jpdc.2019.09.012. Online publication date: 1-Jan-2020
https://dl.acm.org/doi/10.1016/j.jpdc.2019.09.012
Mendis C, Jain A, Jain P, Amarasinghe S, Amaral J, Kulkarni M (2019). Revec: program rejuvenation through revectorization. Proceedings of the 28th International Conference on Compiler Construction (29-41). DOI: 10.1145/3302516.3307357. Online publication date: 16-Feb-2019
https://dl.acm.org/doi/10.1145/3302516.3307357
Kempf D, Bastian P (2019). An HPC perspective on generative programming. Proceedings of the 14th International Workshop on Software Engineering for Science (9-16). DOI: 10.1109/SE4Science.2019.00008. Online publication date: 28-May-2019
https://dl.acm.org/doi/10.1109/SE4Science.2019.00008
Funke D, Lamm S, Meyer U, Penschuck M, Sanders P, Schulz C, Strash D, von Looz M (2019). Communication-free massively distributed graph generation. Journal of Parallel and Distributed Computing 131:C (200-217). DOI: 10.1016/j.jpdc.2019.03.011. Online publication date: 1-Sep-2019
https://dl.acm.org/doi/10.1016/j.jpdc.2019.03.011
Dohr S, Zapletal J, Of G, Merta M, Kravčenko M (2019). A parallel space–time boundary element method for the heat equation. Computers & Mathematics with Applications 78:9 (2852-2866). DOI: 10.1016/j.camwa.2018.12.031. Online publication date: 1-Nov-2019
https://dl.acm.org/doi/10.1016/j.camwa.2018.12.031
