Improving Loop Dependence Analysis

Authors: Nicklas Bo Jensen, Sven Karlsson

ACM Transactions on Architecture and Code Optimization (TACO), Volume 14, Issue 3

Article No.: 22, Pages 1 - 24

https://doi.org/10.1145/3095754

Published: 22 August 2017

Abstract

Programmers can no longer depend on new processors to have significantly improved single-thread performance. Instead, gains have to come from other sources such as the compiler and its optimization passes. Advanced passes make use of information on the dependencies related to loops. We improve the quality of that information by reusing the information given by the programmer for parallelization. We have implemented a prototype based on GCC, into which we also add a new optimization pass. Our approach increases the number of correctly classified dependencies, resulting in a 46% average improvement in single-thread performance for kernel benchmarks compared to GCC 6.1.
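To make the idea of reusing programmer-given parallelization information concrete, here is a minimal sketch. It assumes that information takes the form of an OpenMP worksharing annotation; the abstract does not name the mechanism, and the function names `scale_plain` and `scale_annotated` are hypothetical. It is not the authors' implementation, only an illustration of the kind of loop where such an annotation could tell a dependence analysis more than the code alone.

```c
#include <assert.h>

/* Hypothetical kernel without annotation: dst and src are plain pointers,
   so a compiler that cannot prove they do not overlap has to assume a
   possible loop-carried dependence between iterations. */
void scale_plain(float *dst, const float *src, float k, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        dst[i] = k * src[i];
}

/* Same kernel with an OpenMP annotation (assumed form of the
   "information given by the programmer"). In a correct OpenMP program the
   iterations of this loop may run concurrently, so a compiler could in
   principle treat them as free of loop-carried dependences even when the
   loop ends up running on a single thread. Without -fopenmp the pragma is
   ignored and the loop runs sequentially. */
void scale_annotated(float *dst, const float *src, float k, int n)
{
    assert(n >= 0);
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        dst[i] = k * src[i];
}
```

Both functions compute the same result. The only difference is the extra fact the annotation states about the loop, which is the sort of information the abstract describes reusing for single-thread optimization.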

Supplementary Material

TACO1403-22 (taco1403-22.pdf)

Slide deck associated with this paper


Cited By

Maramzin A, Vasiladiotis C, Lozano R, Cole M, Franke B, Atoofian E, Takizawa H (2019). “It looks like you’re writing a parallel loop”: a machine learning based parallelization assistant. Proceedings of the 6th ACM SIGPLAN International Workshop on AI-Inspired and Empirical Methods for Software Engineering on Parallel Computing Systems, pp. 1-10. Online publication date: 22-Oct-2019. https://dl.acm.org/doi/10.1145/3358500.3361567

Published In

ACM Transactions on Architecture and Code Optimization Volume 14, Issue 3

September 2017

278 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/3132652

Editor:
Koen De Bosschere
Ghent University

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 August 2017

Accepted: 01 May 2017

Revised: 01 April 2017

Received: 01 October 2016

Published in TACO Volume 14, Issue 3

Qualifiers

Research-article
Research
Refereed

Funding Sources

ARTEMIS
