Article

Whole-function vectorization

Authors:

Ralf Karrenberg,

Sebastian Hack

CGO '11: Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization

Pages 141 - 150

Published: 02 April 2011

Abstract

Data-parallel programming languages are an important component in today's parallel computing landscape. Among those are domain-specific languages like shading languages in graphics (HLSL, GLSL, RenderMan, etc.) and "general-purpose" languages like CUDA or OpenCL. Current implementations of those languages on CPUs rely solely on multi-threading to implement parallelism and ignore the additional intra-core parallelism provided by the SIMD instruction sets of those processors (like Intel's SSE and the upcoming AVX or Larrabee instruction sets). In this paper, we discuss several aspects of implementing data-parallel languages on machines with SIMD instruction sets. Our main contribution is a language- and platform-independent code transformation that performs whole-function vectorization on low-level intermediate code given by a control flow graph in SSA form. We evaluate our technique in two scenarios: first, incorporated in a compiler for a domain-specific language used in real-time ray tracing; second, in a stand-alone OpenCL driver. We observe average speedup factors of 3.9 for the ray tracer and factors between 0.6 and 5.2 for different OpenCL kernels.
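
To make the idea of vectorizing a whole function concrete, the following is a minimal sketch in Python and is not taken from the paper. It assumes a SIMD width of four lanes and models a vector register as a plain list. The names `kernel_scalar`, `kernel_vectorized`, and `select` are hypothetical, and the mask-and-blend treatment of the branch is one common way to turn control flow into data flow; the abstract itself does not spell out how the transformation handles branches.

```python
W = 4  # assumed SIMD width, e.g. four 32-bit lanes in one SSE register


def kernel_scalar(x):
    """Scalar kernel: one instance runs per input element."""
    if x > 0.0:
        y = x * 2.0
    else:
        y = -x
    return y


def select(mask, a, b):
    """Lane-wise blend: take a[i] where mask[i] is true, otherwise b[i]."""
    return [ai if m else bi for m, ai, bi in zip(mask, a, b)]


def kernel_vectorized(xs):
    """Evaluate W instances of the kernel at once (hypothetical sketch).

    The branch is replaced by a per-lane mask: both sides are computed
    for every lane, and the mask picks the result each lane keeps.
    """
    assert len(xs) == W
    mask = [x > 0.0 for x in xs]    # per-lane branch condition
    y_then = [x * 2.0 for x in xs]  # "then" side, all lanes
    y_else = [-x for x in xs]       # "else" side, all lanes
    return select(mask, y_then, y_else)


inputs = [1.5, -2.0, -0.5, 4.0]
scalar_results = [kernel_scalar(x) for x in inputs]
vector_results = kernel_vectorized(inputs)
```

Here the four scalar calls and the single vectorized call produce the same values per lane. On real hardware the list comprehensions would correspond to single SIMD instructions, which is where the speedup would come from; lanes that diverge still pay for both sides of the branch.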


Recommendations

Function Call Re-Vectorization
PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Programming languages such as C for CUDA, OpenCL or ISPC have contributed to increase the programmability of SIMD accelerators and graphics processing units. However, these languages still lack the flexibility offered by low-level SIMD programming on ...
Practical SIMD Vectorization Techniques for Intel® Xeon Phi Coprocessors
IPDPSW '13: Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum

Intel® Xeon Phi™ coprocessor is based on the Intel® Many Integrated Core (Intel® MIC) architecture, which is an innovative new processor architecture that combines abundant thread parallelism with long SIMD vector units. Efficiently exploiting SIMD ...

Published In

CGO '11: Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization

April 2011

324 pages

ISBN: 9781612843568

Sponsors

SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing

Publisher

IEEE Computer Society

United States

Qualifiers

Article

Acceptance Rates

CGO '11 Paper Acceptance Rate 28 of 105 submissions, 27%;

Overall Acceptance Rate 312 of 1,061 submissions, 29%

Bibliometrics

Total Citations: 55
Total Downloads: 468
Downloads (Last 12 months): 15
Downloads (Last 6 weeks): 3

Reflects downloads up to 09 Nov 2024

Cited By

Chitre K, Kedia P, Purandare R (2023). Rapid: Region-Based Pointer Disambiguation. Proceedings of the ACM on Programming Languages 7:OOPSLA2 (1729-1757). Online publication date: 16-Oct-2023.
https://dl.acm.org/doi/10.1145/3622859
Chitre K, Kedia P, Purandare R (2022). The road not taken: exploring alias analysis based optimizations missed by the compiler. Proceedings of the ACM on Programming Languages 6:OOPSLA2 (786-810). Online publication date: 31-Oct-2022.
https://dl.acm.org/doi/10.1145/3563316
Liu B, Laird A, Tsang W, Mahjour B, Dehnavi M, Kloeckner A, Moreira J (2022). Combining Run-Time Checks and Compile-Time Analysis to Improve Control Flow Auto-Vectorization. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (439-450). Online publication date: 8-Oct-2022.
https://dl.acm.org/doi/10.1145/3559009.3569663
Wang S, Yu L, Her L, Hwang Y, Lee J (2021). Pointer-Based Divergence Analysis for OpenCL 2.0 Programs. ACM Transactions on Parallel Computing 8:4 (1-23). Online publication date: 15-Oct-2021.
https://dl.acm.org/doi/10.1145/3470644
Kruppe R, Oppermann J, Sommer L, Koch A, Kandemir M, Jimborean A, Moseley T (2019). Extending LLVM for lightweight SPMD vectorization: using SIMD and vector instructions easily from any language. Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization (278-279). Online publication date: 16-Feb-2019.
https://dl.acm.org/doi/10.5555/3314872.3314912
Porpodas V, Rocha R, Brevnov E, Góes L, Mattson T, Kandemir M, Jimborean A, Moseley T (2019). Super-Node SLP: optimized vectorization for code sequences containing operators and their inverse elements. Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization (206-216). Online publication date: 16-Feb-2019.
https://dl.acm.org/doi/10.5555/3314872.3314897
Sun H, Fey F, Zhao J, Gorlatch S, Eigenmann R, Ding C, McKee S (2019). WCCV. Proceedings of the ACM International Conference on Supercomputing (319-329). Online publication date: 26-Jun-2019.
https://dl.acm.org/doi/10.1145/3330345.3331059
Moll S, Hack S (2018). Partial control-flow linearization. ACM SIGPLAN Notices 53:4 (543-556). Online publication date: 11-Jun-2018.
https://dl.acm.org/doi/10.1145/3296979.3192413
Mendis C, Amarasinghe S (2018). goSLP: globally optimized superword level parallelism framework. Proceedings of the ACM on Programming Languages 2:OOPSLA (1-28). Online publication date: 24-Oct-2018.
https://dl.acm.org/doi/10.1145/3276480
Hückelheim J, Hovland P, Narayanan S, Velesko P (2018). Vectorised Computation of Diverging Ensembles. Proceedings of the 47th International Conference on Parallel Processing (1-10). Online publication date: 13-Aug-2018.
https://dl.acm.org/doi/10.1145/3225058.3225138
