Dynamic trace-based analysis of vectorization potential of applications

Published: 11 June 2012

Abstract

Recent hardware trends with GPUs and the increasing vector lengths of SSE-like ISA extensions for multicore CPUs imply that effective exploitation of SIMD parallelism is critical for achieving high performance on emerging and future architectures. The vast majority of existing applications were developed with no attention to how effectively the code could be vectorized. While developers of production compilers such as GNU gcc, Intel icc, PGI pgcc, and IBM xlc have invested considerable effort and made significant advances in automatic vectorization, these compilers still cannot effectively vectorize many existing scientific and engineering codes. It is therefore of considerable interest to analyze existing applications to assess their latent potential for SIMD parallelism, exploitable through further compiler advances and/or manual code changes.
In this paper we develop an approach to infer a program's SIMD parallelization potential by analyzing the dynamic data-dependence graph derived from a sequential execution trace. By considering only the run-time data dependences observed in the trace, and by relaxing the execution order of operations to allow any dependence-preserving reordering, we can detect potential SIMD parallelism that may otherwise be missed by more conservative compile-time analyses. We show that for several benchmarks our tool discovers regions of code within computationally intensive loops that exhibit high potential for SIMD parallelism but are not vectorized by state-of-the-art compilers. We present several case studies of the use of the tool, both in identifying opportunities to enhance the transformation capabilities of vectorizing compilers and in pointing to code regions that can be manually modified to enable auto-vectorization and performance improvement by existing compilers.
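
The idea of reordering a trace subject only to its observed dependences can be illustrated with a small sketch. This is not the paper's tool: the trace format, the function name `simd_groups`, and the decision to track only flow (read-after-write) dependences are assumptions made for illustration. Each dynamic operation gets the earliest level at which it could run, and instances of the same static statement that share a level are candidates for execution as one SIMD operation.

```python
from collections import defaultdict

def simd_groups(trace):
    """Group dynamic instances of each static statement by earliest level.

    trace: list of (static_id, reads, write) in sequential execution order,
    where reads is a tuple of locations and write is a location or None.
    Hypothetical format; only flow dependences are tracked (an assumption).
    """
    last_write_level = {}          # location -> level of its latest writer
    groups = defaultdict(list)     # (static_id, level) -> trace indices
    for idx, (static_id, reads, write) in enumerate(trace):
        # An operation can run one level after the latest producer it reads.
        level = 1 + max((last_write_level.get(loc, 0) for loc in reads),
                        default=0)
        if write is not None:
            last_write_level[write] = level
        groups[(static_id, level)].append(idx)
    return dict(groups)

# Trace of two loops over i = 0..3:
#   S1: a[i] = b[i] + 1      (iterations are independent)
#   S2: s = s + a[i]         (each iteration reads the previous s)
trace = [("S1", (("b", i),), ("a", i)) for i in range(4)]
trace += [("S2", (("s",), ("a", i)), ("s",)) for i in range(4)]

groups = simd_groups(trace)
for (static_id, level), members in sorted(groups.items()):
    print(static_id, "level", level, "instances", members)
```

In this toy trace all four instances of S1 land on the same level, so they could in principle form one vector operation, while the instances of S2 form a chain with one instance per level. A fuller analysis would also need to check memory access patterns (for example, contiguous addresses) before calling a group vectorizable.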


Published In

ACM SIGPLAN Notices, Volume 47, Issue 6
PLDI '12
June 2012
534 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2345156
  • PLDI '12: Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation
    June 2012
    572 pages
    ISBN:9781450312059
    DOI:10.1145/2254064
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 June 2012
Published in SIGPLAN Volume 47, Issue 6

Author Tags

  1. dynamic analysis
  2. performance analysis
  3. vectorization

Qualifiers

  • Research-article

Cited By

  • (2021) Speculative vectorisation with selective replay. Proceedings of the 48th Annual International Symposium on Computer Architecture, pages 223-236. DOI: 10.1109/ISCA52012.2021.00026. Online publication date: 14-Jun-2021
  • (2020) ReCoTOS: A Methodology for Vectorization-based Resource-saving Computing Task Optimization. 2020 61st International Scientific Conference on Information Technology and Management Science of Riga Technical University (ITMS), pages 1-6. DOI: 10.1109/ITMS51158.2020.9259289. Online publication date: 15-Oct-2020
  • (2020) A Portable SIMD Primitive Using Kokkos for Heterogeneous Architectures. Accelerator Programming Using Directives, pages 140-163. DOI: 10.1007/978-3-030-49943-3_7. Online publication date: 9-Jun-2020
  • (2019) Building a Polyhedral Representation from an Instrumented Execution. ACM Transactions on Architecture and Code Optimization 16:4, pages 1-26. DOI: 10.1145/3363785. Online publication date: 17-Dec-2019
  • (2019) Data-flow/dependence profiling for structured transformations. Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, pages 173-185. DOI: 10.1145/3293883.3295737. Online publication date: 16-Feb-2019
  • (2018) An Evaluation of Vectorization and Cache Reuse Tradeoffs on Modern CPUs. Proceedings of the 9th International Workshop on Programming Models and Applications for Multicores and Manycores, pages 21-30. DOI: 10.1145/3178442.3178445. Online publication date: 24-Feb-2018
  • (2016) Vector data flow analysis for SIMD optimizations on OpenCL programs. Concurrency and Computation: Practice & Experience 28:5, pages 1629-1654. DOI: 10.1002/cpe.3714. Online publication date: 10-Apr-2016
  • (2015) Unification of Static and Dynamic Analyses to Enable Vectorization. Languages and Compilers for Parallel Computing, pages 367-381. DOI: 10.1007/978-3-319-17473-0_24. Online publication date: 1-May-2015
  • (2014) Author retrospective for PTRAN's analysis and optimization techniques. ACM International Conference on Supercomputing 25th Anniversary Volume, pages 1-3. DOI: 10.1145/2591635.2591638. Online publication date: 10-Jun-2014
  • (2013) Breaking SIMD shackles with an exposed flexible microarchitecture and the access execute PDG. Proceedings of the 22nd international conference on Parallel architectures and compilation techniques, pages 341-352. DOI: 10.5555/2523721.2523767. Online publication date: 7-Oct-2013