Automated compiler optimization of multiple vector loads/stores

Authors:

Vyacheslav P. Zakharin,

Rakesh Krishnaiyer,

David Kreitzer,

Chang-Sun Lin, Jr.

CF '16: Proceedings of the ACM International Conference on Computing Frontiers

Pages 82 - 91

https://doi.org/10.1145/2903150.2903169

Published: 16 May 2016

Abstract

With widening vectors and the proliferation of advanced vector instructions in today's processors, vectorization plays an ever-increasing role in delivering application performance. Achieving the performance potential of this vector hardware has required significant software support, such as new explicit vector programming models and advanced vectorizing compilers. Today, with the combination of these software tools and new SIMD ISA extensions such as gather/scatter instructions, even codes with complex and irregular data access patterns can often be vectorized.

In this paper we focus on these vectorized codes with irregular accesses and show that, while the best-in-class Intel Compiler Vectorizer does provide speedup through efficient vectorization, further program transformations can increase performance in some cases. After identifying these opportunities, we describe two automatic compiler optimizations that target them. The first improves the performance of a group of adjacent gathers/scatters. The second improves the performance of a group of stencil vector accesses by using more efficient SIMD instructions. Both optimizations are implemented in a pre-release version of the Intel Compiler. We evaluate them on an extensive set of micro-kernels, representative benchmarks, and application kernels, and demonstrate performance gains of 3-750% on the Intel® Xeon processor (Haswell, HSW) and up to 25% on the Intel® Xeon Phi™ coprocessor (Knights Corner, KNC).
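The abstract names the two optimizations without spelling out the transformations, so the following is only a scalar C sketch of the general ideas and not the Intel Compiler's implementation. It assumes a hypothetical `Point` structure with four adjacent `float` fields: reading each field through an index array corresponds to four gathers over neighbouring addresses, which can instead be served by one contiguous load per index followed by a de-interleave. For the stencil case it assumes a three-point sum, where overlapping neighbour loads can be replaced by reuse of values already loaded. All type and function names here are illustrative assumptions.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical array-of-structures element: four adjacent floats. */
typedef struct { float x, y, z, w; } Point;
_Static_assert(sizeof(Point) == 4 * sizeof(float), "Point must be packed");

/* Baseline: one indexed access per field. When vectorized, each field
   becomes its own gather, and the four gathers touch adjacent addresses. */
void gather_per_field(const Point *pts, const int *idx, size_t n,
                      float *xs, float *ys, float *zs, float *ws)
{
    for (size_t i = 0; i < n; ++i) {
        xs[i] = pts[idx[i]].x;
        ys[i] = pts[idx[i]].y;
        zs[i] = pts[idx[i]].z;
        ws[i] = pts[idx[i]].w;
    }
}

/* Grouped form: one contiguous 16-byte load per index, then de-interleave.
   In SIMD code the de-interleave would be a shuffle/transpose step. */
void gather_grouped(const Point *pts, const int *idx, size_t n,
                    float *xs, float *ys, float *zs, float *ws)
{
    for (size_t i = 0; i < n; ++i) {
        float row[4];
        memcpy(row, &pts[idx[i]], sizeof row);  /* single contiguous load */
        xs[i] = row[0];
        ys[i] = row[1];
        zs[i] = row[2];
        ws[i] = row[3];
    }
}

/* Baseline three-point stencil: three overlapping loads per output. */
void stencil_naive(const float *a, float *out, size_t n)
{
    for (size_t i = 1; i + 1 < n; ++i)
        out[i] = a[i - 1] + a[i] + a[i + 1];
}

/* Reuse form: keep a sliding window, so each iteration loads one new value. */
void stencil_reuse(const float *a, float *out, size_t n)
{
    if (n < 3)
        return;
    float left = a[0], mid = a[1];
    for (size_t i = 1; i + 1 < n; ++i) {
        float right = a[i + 1];          /* the only new load */
        out[i] = left + mid + right;     /* same summation order as above */
        left = mid;
        mid = right;
    }
}

/* Returns 1 if both forms of each kernel produce identical results. */
int strategies_agree(void)
{
    Point pts[5] = {{0, 1, 2, 3}, {10, 11, 12, 13}, {20, 21, 22, 23},
                    {30, 31, 32, 33}, {40, 41, 42, 43}};
    int idx[4] = {3, 0, 4, 1};
    float a[4][4], b[4][4];
    gather_per_field(pts, idx, 4, a[0], a[1], a[2], a[3]);
    gather_grouped(pts, idx, 4, b[0], b[1], b[2], b[3]);
    if (memcmp(a, b, sizeof a) != 0)
        return 0;

    float in[6] = {1, 2, 3, 4, 5, 6}, s1[6] = {0}, s2[6] = {0};
    stencil_naive(in, s1, 6);
    stencil_reuse(in, s2, 6);
    return memcmp(s1, s2, sizeof s1) == 0;
}
```

Calling `strategies_agree()` checks that the grouped and reuse forms compute the same values as their baselines; whether either form is actually faster depends on the target hardware and is not something this sketch measures.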

Index Terms

Software and its engineering → Software notations and tools → Compilers

Published In

CF '16: Proceedings of the ACM International Conference on Computing Frontiers

May 2016

487 pages

ISBN: 9781450341288

DOI: 10.1145/2903150

General Chairs: Gianluca Palermo (Politecnico di Milano, IT), John Feo (Pacific Northwest National Laboratory and Northwest Institute for Advanced Computing)

Program Chairs: Antonino Tumeo (Pacific Northwest National Laboratory, USA), Hubertus Franke (New York University and IBM Research, USA)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

Micron Foundation: Micron Technology Foundation, Inc.
ACM: Association for Computing Machinery
Politecnico di Milano: Politecnico di Milano
SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing
IBM: IBM

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

CF'16

CF'16: Computing Frontiers Conference

May 16 - 19, 2016

Como, Italy

Acceptance Rates

CF '16 paper acceptance rate: 30 of 94 submissions (32%)

Overall acceptance rate: 273 of 785 submissions (35%)
