Open access

goSLP: globally optimized superword level parallelism framework

Authors:

Charith Mendis,

Saman Amarasinghe

Proceedings of the ACM on Programming Languages, Volume 2, Issue OOPSLA

Article No.: 110, Pages 1 - 28

https://doi.org/10.1145/3276480

Published: 24 October 2018

Abstract

Modern microprocessors are equipped with single instruction multiple data (SIMD) or vector instruction sets which allow compilers to exploit superword level parallelism (SLP), a type of fine-grained parallelism. Current SLP auto-vectorization techniques use heuristics to discover vectorization opportunities in high-level language code. These heuristics are fragile, local and typically only present one vectorization strategy that is either accepted or rejected by a cost model. We present goSLP, a novel SLP auto-vectorization framework which solves the statement packing problem in a pairwise optimal manner. Using an integer linear programming (ILP) solver, goSLP searches the entire space of statement packing opportunities for a whole function at a time, while limiting total compilation time to a few minutes. Furthermore, goSLP optimally solves the vector permutation selection problem using dynamic programming. We implemented goSLP in the LLVM compiler infrastructure, achieving a geometric mean speedup of 7.58% on SPEC2017fp, 2.42% on SPEC2006fp and 4.07% on NAS benchmarks compared to LLVM’s existing SLP auto-vectorizer.
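
To make the contrast between a local heuristic and a global search concrete, the following minimal sketch models statement packing as a tiny 0-1 optimization problem and solves it by exhaustive enumeration. It is not goSLP's actual ILP formulation or cost model: the statement names, the candidate pairs, the savings values, and the `best_packing` helper are all hypothetical, and brute force stands in for an ILP solver only because the instance is tiny.

```python
from itertools import combinations

def best_packing(candidates):
    """Exhaustively solve a toy 0-1 statement packing problem.

    candidates maps a pair of statement ids (a, b) to a hypothetical
    cost saving obtained by packing a and b into one vector operation.
    Returns (total_saving, chosen_pairs).
    """
    pairs = list(candidates)
    best = (0, [])
    # Each subset of pairs is one assignment of the 0-1 decision variables.
    for k in range(1, len(pairs) + 1):
        for subset in combinations(pairs, k):
            used = [s for pair in subset for s in pair]
            if len(used) != len(set(used)):
                continue  # constraint: a statement joins at most one pack
            saving = sum(candidates[p] for p in subset)
            if saving > best[0]:
                best = (saving, list(subset))
    return best

# Hypothetical savings for four isomorphic statements S0..S3.
candidates = {
    ("S0", "S1"): 3,
    ("S1", "S2"): 5,
    ("S2", "S3"): 3,
}

total, chosen = best_packing(candidates)
print(total, chosen)
```

In this toy instance, a greedy strategy that commits to the single most profitable pair first, (S1, S2), reaches a saving of 5 and blocks both remaining pairs, whereas searching the whole space finds the two packs (S0, S1) and (S2, S3) with a combined saving of 6.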

Supplementary Material

WEBM File (a110-mendis.webm)


Index Terms

goSLP: globally optimized superword level parallelism framework
1. Software and its engineering
  1. Software notations and tools
    1. Compilers
  2. Software organization and properties
    1. Extra-functional properties
      1. Software performance

Published In

Proceedings of the ACM on Programming Languages Volume 2, Issue OOPSLA

November 2018

1656 pages

EISSN:2475-1421

DOI:10.1145/3288538

Copyright © 2018 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Badges

Artifacts Evaluated & Functional

Qualifiers

Research-article

Funding Sources

U.S. Department of Energy
Toyota Research Institute
Defense Advanced Research Projects Agency
Application Driving Architectures Research Center
