DOI: 10.1145/2903150.2903169
research-article

Automated compiler optimization of multiple vector loads/stores

Published: 16 May 2016

Abstract

With widening vectors and the proliferation of advanced vector instructions in today's processors, vectorization plays an ever-increasing role in delivering application performance. Achieving the performance potential of this vector hardware has required significant software support, such as new explicit vector programming models and advanced vectorizing compilers. Today, with the combination of these software tools and new SIMD ISA extensions like gather/scatter instructions, it is not uncommon to find that even codes with complex and irregular data access patterns can be vectorized.
In this paper we focus on these vectorized codes with irregular accesses, and show that while the best-in-class Intel Compiler Vectorizer does indeed provide speedup through efficient vectorization, there remain cases where program transformations can increase performance further. After identifying these cases, this paper describes two automatic compiler optimizations that target them. The first optimization improves the performance of a group of adjacent gathers/scatters. The second optimization improves the performance of a group of stencil vector accesses by using more efficient SIMD instructions. Both optimizations are now implemented in a pre-release version of the Intel Compiler. We evaluate the optimizations using an extensive set of micro-kernels, representative benchmarks and application kernels. On these benchmarks, we demonstrate performance gains of 3-750% on the Intel® Xeon processor (Haswell-HSW) and up to 25% on the Intel® Xeon Phi™ coprocessor (Knights Corner-KNC).
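
The abstract does not spell out the transformations, so the following is only an illustrative sketch of the access pattern behind the first optimization, not the paper's implementation. It assumes a hypothetical array-of-structures type `Point`, a hypothetical vector length `VL`, and two made-up function names. The first function reads four adjacent fields through an index array, which a vectorizer could handle with one gather per field. The second function stages each element with contiguous reads into a transposed tile and then works at unit stride, which is the general idea of replacing a group of adjacent gathers with plain loads plus a transpose.

```c
#include <assert.h>
#include <stddef.h>

enum { VL = 8 };  /* assumed vector length in floats; illustrative only */

/* Hypothetical array-of-structures element with four adjacent fields. */
typedef struct { float x, y, z, w; } Point;

/* Baseline: each iteration reads four adjacent fields of pts[idx[i]].
   Vectorized field by field, this pattern needs one gather per field. */
void sum_fields_gather(const Point *pts, const int *idx, size_t n, float *out)
{
    assert(n == 0 || (pts && idx && out));
    for (size_t i = 0; i < n; i++) {
        const Point *p = &pts[idx[i]];
        out[i] = p->x + p->y + p->z + p->w;
    }
}

/* Restructured: read each element's four fields contiguously, store them
   transposed into a small tile, then compute at unit stride over the tile. */
void sum_fields_adjacent(const Point *pts, const int *idx, size_t n, float *out)
{
    float tile[4][VL];
    assert(n == 0 || (pts && idx && out));
    for (size_t base = 0; base < n; base += VL) {
        size_t lanes = (n - base < VL) ? n - base : VL;

        /* Step 1: one contiguous read per lane, written in transposed form. */
        for (size_t l = 0; l < lanes; l++) {
            const Point *p = &pts[idx[base + l]];
            tile[0][l] = p->x;
            tile[1][l] = p->y;
            tile[2][l] = p->z;
            tile[3][l] = p->w;
        }

        /* Step 2: unit-stride arithmetic over the staged fields. */
        for (size_t l = 0; l < lanes; l++)
            out[base + l] = tile[0][l] + tile[1][l] + tile[2][l] + tile[3][l];
    }
}
```

Both functions evaluate the same expression in the same order, so they produce identical results; whether the restructured form is actually faster depends on the target hardware and compiler, which this sketch does not measure.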

    Published In

    CF '16: Proceedings of the ACM International Conference on Computing Frontiers
    May 2016
    487 pages
    ISBN:9781450341288
    DOI:10.1145/2903150
    General Chairs: Gianluca Palermo, John Feo
    Program Chairs: Antonino Tumeo, Hubertus Franke

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. SIMD
    2. adjacent access
    3. gather
    4. scatter
    5. software write combining
    6. stencil codes
    7. vectorization

    Qualifiers

    • Research-article

    Conference

    CF'16: Computing Frontiers Conference
    May 16 - 19, 2016
    Como, Italy

    Acceptance Rates

    CF '16 Paper Acceptance Rate 30 of 94 submissions, 32%;
    Overall Acceptance Rate 273 of 785 submissions, 35%
