research-article

Dynamic trace-based analysis of vectorization potential of applications

Authors: Justin Holewinski, Ragavendar Ramamurthi, Mahesh Ravishankar, Louis-Noël Pouchet, Atanas Rountev, P. Sadayappan

ACM SIGPLAN Notices, Volume 47, Issue 6

Pages 371 - 382

https://doi.org/10.1145/2345156.2254108

Published: 11 June 2012

Abstract

Recent hardware trends with GPUs and the increasing vector lengths of SSE-like ISA extensions for multicore CPUs imply that effective exploitation of SIMD parallelism is critical for achieving high performance on emerging and future architectures. The vast majority of existing applications were developed without attention to how effectively their code can be vectorized. While developers of production compilers such as GNU gcc, Intel icc, PGI pgcc, and IBM xlc have invested considerable effort and made significant advances in automatic vectorization, these compilers still cannot effectively vectorize many existing scientific and engineering codes. It is therefore of considerable interest to analyze existing applications to assess their latent potential for SIMD parallelism, exploitable through further compiler advances and/or manual code changes.

In this paper we develop an approach to infer a program's SIMD parallelization potential by analyzing the dynamic data-dependence graph derived from a sequential execution trace. By considering only the observed run-time data dependences for the trace, and by relaxing the execution order of operations to allow any dependence-preserving reordering, we can detect potential SIMD parallelism that may otherwise be missed by more conservative compile-time analyses. We show that for several benchmarks our tool discovers regions of code within computationally-intensive loops that exhibit high potential for SIMD parallelism but are not vectorized by state-of-the-art compilers. We present several case studies of the use of the tool, both in identifying opportunities to enhance the transformation capabilities of vectorizing compilers and in pointing to code regions that can be manually modified to enable auto-vectorization and performance improvement by existing compilers.
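The idea of reordering a trace subject only to its observed dependences can be illustrated with a small sketch. This is not the paper's tool or algorithm. It assumes a hypothetical trace format in which each record is `(instr_id, reads, write)`, tracks only flow (read-after-write) dependences, and treats dynamic instances of the same static instruction that land on the same dependence level as candidates for one SIMD operation. The function name `simd_candidates` and the grouping criterion are assumptions made for illustration.

```python
from collections import defaultdict


def simd_candidates(trace):
    """Group trace entries by (static instruction, dependence level).

    trace: list of (instr_id, reads, write) tuples in sequential order.
    This record layout is a hypothetical format chosen for the sketch.
    """
    last_writer_level = {}      # location -> level of the op that last wrote it
    groups = defaultdict(list)  # (instr_id, level) -> indices into the trace
    for idx, (instr_id, reads, write) in enumerate(trace):
        # Assumption: only observed flow dependences constrain the schedule.
        # An op is placed one level after the latest producer of its inputs.
        level = 1 + max((last_writer_level.get(loc, 0) for loc in reads), default=0)
        last_writer_level[write] = level
        groups[(instr_id, level)].append(idx)
    return dict(groups)


# a[i] = b[i] + c[i]: no instance depends on another.
independent = [("S1", (("b", i), ("c", i)), ("a", i)) for i in range(4)]

# a[i] = a[i-1] + b[i]: each instance consumes the previous result.
chain = [("S2", (("a", i - 1), ("b", i)), ("a", i)) for i in range(1, 5)]

print(simd_candidates(independent))  # {('S1', 1): [0, 1, 2, 3]}
print(simd_candidates(chain))        # four groups, one instance each
```

In the first trace all four instances share one level, which suggests they could execute as a single vector operation. In the second, the recurrence places every instance on its own level, so no grouping is possible. A realistic analysis would also need to examine memory access patterns within each group, which this sketch leaves out.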

Published In

ACM SIGPLAN Notices Volume 47, Issue 6

PLDI '12

June 2012

534 pages

ISSN:0362-1340

EISSN:1558-1160

DOI:10.1145/2345156

PLDI '12: Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2012
572 pages
ISBN:9781450312059
DOI:10.1145/2254064
General Chairs: Jan Vitek (Purdue University), Haibo Lin (Microsoft China)
Program Chair: Frank Tip (IBM T.J. Watson Research Center)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States
