DOI: 10.5555/2190025.2190061

Whole-function vectorization

Published: 02 April 2011

Abstract

Data-parallel programming languages are an important component in today's parallel computing landscape. Among those are domain-specific languages like shading languages in graphics (HLSL, GLSL, RenderMan, etc.) and "general-purpose" languages like CUDA or OpenCL. Current implementations of those languages on CPUs rely solely on multi-threading to implement parallelism and ignore the additional intra-core parallelism provided by the SIMD instruction sets of those processors (like Intel's SSE and the upcoming AVX or Larrabee instruction sets). In this paper, we discuss several aspects of implementing data-parallel languages on machines with SIMD instruction sets. Our main contribution is a language- and platform-independent code transformation that performs whole-function vectorization on low-level intermediate code given by a control flow graph in SSA form. We evaluate our technique in two scenarios: first, incorporated in a compiler for a domain-specific language used in real-time ray tracing; second, in a stand-alone OpenCL driver. We observe average speedup factors of 3.9 for the ray tracer and factors between 0.6 and 5.2 for different OpenCL kernels.
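To make the idea of vectorizing a whole function more concrete, the following is a minimal sketch and not the paper's algorithm, which the abstract does not spell out. It assumes a SIMD width of 4 and simulates vector registers with plain Python lists. The kernel, the helper `vselect`, and all other names are hypothetical. The sketch shows one common way to handle a branch when several instances of a scalar function run in lockstep: both sides of the branch are computed for all lanes, and a per-lane mask selects the result.

```python
from typing import List

WIDTH = 4  # assumed SIMD width, e.g. four 32-bit lanes in one SSE register


def kernel_scalar(x: float) -> float:
    """Hypothetical data-parallel kernel, written for a single instance."""
    if x < 0.0:
        return -x * 2.0
    return x + 1.0


def vselect(mask: List[bool], a: List[float], b: List[float]) -> List[float]:
    """Lane-wise blend: take a[i] where mask[i] is set, otherwise b[i]."""
    return [ai if m else bi for m, ai, bi in zip(mask, a, b)]


def kernel_vectorized(xs: List[float]) -> List[float]:
    """The same kernel applied to WIDTH instances at once.

    The branch is replaced by a mask and a select, so control flow
    becomes data flow and all lanes execute the same instructions.
    """
    assert len(xs) == WIDTH
    mask = [x < 0.0 for x in xs]        # vector compare produces the mask
    then_val = [-x * 2.0 for x in xs]   # 'then' side, computed for every lane
    else_val = [x + 1.0 for x in xs]    # 'else' side, computed for every lane
    return vselect(mask, then_val, else_val)


lanes = [-1.5, 0.0, 2.0, -3.0]
result = kernel_vectorized(lanes)
print(result)  # [3.0, 1.0, 3.0, 6.0]
```

Because both sides of the branch are evaluated for every lane, a vectorized kernel can do more work than its scalar counterpart when lanes diverge. This is one plausible reason why a vectorized kernel might run slower than the scalar version, as the lower end of the reported speedup range (0.6) indicates can happen.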



Published In

CGO '11: Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
April 2011
324 pages
ISBN: 9781612843568

Publisher

IEEE Computer Society

United States

Qualifiers

  • Article

Acceptance Rates

CGO '11 paper acceptance rate: 28 of 105 submissions (27%)
Overall acceptance rate: 312 of 1,061 submissions (29%)
