Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article
Open access

goSLP: globally optimized superword level parallelism framework

Published: 24 October 2018 Publication History

Abstract

Modern microprocessors are equipped with single instruction multiple data (SIMD) or vector instruction sets which allow compilers to exploit superword level parallelism (SLP), a type of fine-grained parallelism. Current SLP auto-vectorization techniques use heuristics to discover vectorization opportunities in high-level language code. These heuristics are fragile, local and typically only present one vectorization strategy that is either accepted or rejected by a cost model. We present goSLP, a novel SLP auto-vectorization framework which solves the statement packing problem in a pairwise optimal manner. Using an integer linear programming (ILP) solver, goSLP searches the entire space of statement packing opportunities for a whole function at a time, while limiting total compilation time to a few minutes. Furthermore, goSLP optimally solves the vector permutation selection problem using dynamic programming. We implemented goSLP in the LLVM compiler infrastructure, achieving a geometric mean speedup of 7.58% on SPEC2017fp, 2.42% on SPEC2006fp and 4.07% on NAS benchmarks compared to LLVM’s existing SLP auto-vectorizer.

Supplementary Material

WEBM File (a110-mendis.webm)

References

[1]
Randy Allen and Ken Kennedy. 1987. Automatic Translation of FORTRAN Programs to Vector Form. ACM Trans. Program. Lang. Syst. 9, 4 (Oct. 1987), 491–542.
[2]
Andrew W. Appel and Lal George. 2001. Optimal Spilling for CISC Machines with Few Registers. In Proceedings of the ACM SIGPLAN 2001 Conference on Programming Language Design and Implementation (PLDI ’01). ACM, New York, NY, USA, 243–253.
[3]
Sara S. Baghsorkhi, Nalini Vasudevan, and Youfeng Wu. 2016. FlexVec: Auto-vectorization for Irregular Loops. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’16). ACM, New York, NY, USA, 697–710.
[4]
Rajkishore Barik, Christian Grothoff, Rahul Gupta, Vinayaka Pandit, and Raghavendra Udupa. 2007. Optimal Bitwise Register Allocation Using Integer Linear Programming. In Proceedings of the 19th International Conference on Languages and Compilers for Parallel Computing (LCPC’06). Springer-Verlag, Berlin, Heidelberg, 267–282. http://dl.acm.org/citation. cfm?id=1757112.1757140
[5]
Rajkishore Barik, Jisheng Zhao, and Vivek Sarkar. 2010. Efficient Selection of Vector Instructions Using Dynamic Programming. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO ’43). IEEE Computer Society, Washington, DC, USA, 201–212.
[6]
Derek Bruening, Qin Zhao, and Saman Amarasinghe. 2012. Transparent Dynamic Instrumentation. In Proceedings of the 8th ACM SIGPLAN/SIGOPS Conference on Virtual Execution Environments (VEE ’12). ACM, New York, NY, USA, 133–144.
[7]
Chia-Ming Chang, Chien-Ming Chen, and Chung-Ta King. 1997. Using integer linear programming for instruction scheduling and register allocation in multi-issue processors. Computers & Mathematics with Applications 34, 9 (1997), 1 – 14.
[8]
Alexandre E. Eichenberger, Peng Wu, and Kevin O’Brien. 2004. Vectorization for SIMD Architectures with Alignment Constraints. In Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation (PLDI ’04). ACM, New York, NY, USA, 82–93.
[9]
John L. Henning. 2006. SPEC CP U2006 Benchmark Descriptions. SIGARCH Comput. Archit. News 34, 4 (Sept. 2006), 1–17.
[10]
IBM. 2006. PowerPC microprocessor family: Vector/SIMD multimedia extension technology programming environments manual. IBM Systems and Technology Group (2006).
[11]
IBM. 2017. IBM CPLEX ILP solver. https://www- 01.ibm.com/software/commerce/optimization/cplex- optimizer/
[12]
Intel. 2017a. Intel Software Developer’s manuals. https://www.intel.com/content/www/us/en/architecture- and- technology/ 64- ia- 32- architectures- software- developer- manual- 325462.html
[13]
Intel. 2017b. Intel VTune Amplifier. https://software.intel.com/en- us/intel- vtune- amplifier- xe
[14]
Ralf Karrenberg and Sebastian Hack. 2011. Whole-function Vectorization. In Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO ’11). IEEE Computer Society, Washington, DC, USA, 141–150. http://dl.acm.org/citation.cfm?id=2190025.2190061
[15]
Martin Kong, Richard Veras, Kevin Stock, Franz Franchetti, Louis-Noël Pouchet, and P. Sadayappan. 2013. When Polyhedral Transformations Meet SIMD Code Generation. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’13). ACM, New York, NY, USA, 127–138.
[16]
Alexei Kudriavtsev and Peter Kogge. 2005. Generation of Permutations for SIMD Processors. In Proceedings of the 2005 ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES ’05). ACM, New York, NY, USA, 147–156.
[17]
Samuel Larsen. 2000. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. S.M. Thesis. Massachusetts Institute of Technology, Cambridge, MA. http://groups.csail.mit.edu/commit/papers/00/SLarsen- SM.pdf
[18]
Samuel Larsen and Saman Amarasinghe. 2000. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. In Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation (PLDI ’00). ACM, New York, NY, USA, 145–156.
[19]
Samuel Larsen, Emmett Witchel, and Saman P. Amarasinghe. 2002. Increasing and Detecting Memory Address Congruence. In Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques (PACT ’02). IEEE Computer Society, Washington, DC, USA, 18–29. http://dl.acm.org/citation.cfm?id=645989.674329
[20]
Rainer Leupers. 2000. Code Selection for Media Processors with SIMD Instructions. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE ’00). ACM, New York, NY, USA, 4–8.
[21]
Chen Linchuan, Jiang Peng, and Agrawal Gagan. 2016. Exploiting recent SIMD architectural advances for irregular applications. In Proceedings of the 2016 International Symposium on Code Generation and Optimization, CGO 2016, Barcelona, Spain, March 12-18, 2016. 47–58.
[22]
Jun Liu, Yuanrui Zhang, Ohyoung Jang, Wei Ding, and Mahmut Kandemir. 2012. A Compiler Framework for Extracting Superword Level Parallelism. In Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’12). ACM, New York, NY, USA, 347–358.
[23]
LLVM. 2017. LLVM Compiler Infrastructure. https://llvm.org
[24]
Roberto Castañeda Lozano, Mats Carlsson, Gabriel Hjort Blindell, and Christian Schulte. 2018. Combinatorial Register Allocation and Instruction Scheduling. CoRR abs/1804.02452 (2018). arXiv: 1804.02452 http://arxiv.org/abs/1804.02452
[25]
Charith Mendis, Saman Amarasinghe, and Michael Carbin. 2018. Ithemal: Accurate, Portable and Fast Basic Block Throughput Estimation using Deep Neural Networks. ArXiv e-prints (Aug. 2018). arXiv: cs.DC/1808.07412
[26]
S. Muthukrishnan. 2005. Data Streams: Algorithms and Applications. Found. Trends Theor. Comput. Sci. 1, 2 (Aug. 2005), 117–236.
[27]
Santosh G. Nagarakatte and R. Govindarajan. 2007. Register Allocation and Optimal Spill Code Scheduling in Software Pipelined Loops Using 0-1 Integer Linear Programming Formulation. In Compiler Construction, Shriram Krishnamurthi and Martin Odersky (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 126–140.
[28]
Division NASA Advanced Supercomputing. 1991–2014. NAS C Benchmark Suite 3.0. https://github.com/ benchmark- subsetting/NPB3.0- omp- C/
[29]
Dorit Nuzman, Ira Rosen, and Ayal Zaks. 2006. Auto-vectorization of Interleaved Data for SIMD. In Proceedings of the 27th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’06). ACM, New York, NY, USA, 132–143.
[30]
Dorit Nuzman and Ayal Zaks. 2008. Outer-loop Vectorization: Revisited for Short SIMD Architectures. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT ’08). ACM, New York, NY, USA, 2–11.
[31]
Stuart Oberman, Greg Favor, and Fred Weber. 1999. AMD 3DNow! Technology: Architecture and Implementations. IEEE Micro 19, 2 (March 1999), 37–48.
[32]
Vasileios Porpodas and Timothy M. Jones. 2015. Throttling Automatic Vectorization: When Less is More. In Proceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT) (PACT ’15). IEEE Computer Society, Washington, DC, USA, 432–444.
[33]
Vasileios Porpodas, Alberto Magni, and Timothy M. Jones. 2015. PSLP: Padded SLP Automatic Vectorization. In Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO ’15). IEEE Computer Society, Washington, DC, USA, 190–201. http://dl.acm.org/citation.cfm?id=2738600.2738625
[34]
Fernando Magno Quintão Pereira and Jens Palsberg. 2008. Register Allocation by Puzzle Solving. SIGPLAN Not. 43, 6 (June 2008), 216–226.
[35]
Gang Ren, Peng Wu, and David Padua. 2006. Optimizing Data Permutations for SIMD Devices. In Proceedings of the 27th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’06). ACM, New York, NY, USA, 118–131.
[36]
Jaewook Shin, Jacqueline Chame, and Mary W. Hall. 2003. Exploiting superword-level locality in multimedia extension architectures. Vol. 5.
[37]
Jaewook Shin, Jacqueline Chame, and Mary W. Hall. 2002. Compiler-Controlled Caching in Superword Register Files for Multimedia Extension Architectures. In Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques (PACT ’02). IEEE Computer Society, Washington, DC, USA, 45–55. http://dl.acm.org/citation. cfm?id=645989.674318
[38]
Jaewook Shin, Mary Hall, and Jacqueline Chame. 2005. Superword-Level Parallelism in the Presence of Control Flow. In Proceedings of the International Symposium on Code Generation and Optimization (CGO ’05). IEEE Computer Society, Washington, DC, USA, 165–175.
[39]
Corporation SPEC. 2017. SPEC CP U2017 Benchmark Suite. https://www.spec.org/cpu2017/
[40]
N. Sreraman and R. Govindarajan. 2000. A Vectorizing Compiler for Multimedia Extensions. Int. J. Parallel Program. 28, 4 (Aug. 2000), 363–400.
[41]
Konrad Trifunovic, Dorit Nuzman, Albert Cohen, Ayal Zaks, and Ira Rosen. 2009. Polyhedral-Model Guided Loop-Nest Auto-Vectorization. In Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques (PACT ’09). IEEE Computer Society, Washington, DC, USA, 327–337.
[42]
Hao Zhou and Jingling Xue. 2016. Exploiting Mixed SIMD Parallelism by Reducing Data Reorganization Overhead. In Proceedings of the 2016 International Symposium on Code Generation and Optimization (CGO ’16). ACM, New York, NY, USA, 59–69.

Cited By

View all
  • (2024)A Tensor Compiler with Automatic Data Packing for Simple and Efficient Fully Homomorphic EncryptionProceedings of the ACM on Programming Languages10.1145/36563828:PLDI(126-150)Online publication date: 20-Jun-2024
  • (2024)YFlows: Systematic Dataflow Exploration and Code Generation for Efficient Neural Network Inference using SIMD Architectures on CPUsProceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction10.1145/3640537.3641566(212-226)Online publication date: 17-Feb-2024
  • (2024)Boost Linear Algebra Computation Performance via Efficient VNNI UtilizationProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 310.1145/3620666.3651333(149-163)Online publication date: 27-Apr-2024
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image Proceedings of the ACM on Programming Languages
Proceedings of the ACM on Programming Languages  Volume 2, Issue OOPSLA
November 2018
1656 pages
EISSN:2475-1421
DOI:10.1145/3288538
Issue’s Table of Contents
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 October 2018
Published in PACMPL Volume 2, Issue OOPSLA

Permissions

Request permissions for this article.

Check for updates

Badges

Author Tags

  1. Auto-vectorization
  2. Dynamic Programming
  3. Integer Linear Programming
  4. Statement Packing
  5. Superword Level Parallelism
  6. Vector Permutation

Qualifiers

  • Research-article

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)127
  • Downloads (Last 6 weeks)15
Reflects downloads up to 13 Sep 2024

Other Metrics

Citations

Cited By

View all
  • (2024)A Tensor Compiler with Automatic Data Packing for Simple and Efficient Fully Homomorphic EncryptionProceedings of the ACM on Programming Languages10.1145/36563828:PLDI(126-150)Online publication date: 20-Jun-2024
  • (2024)YFlows: Systematic Dataflow Exploration and Code Generation for Efficient Neural Network Inference using SIMD Architectures on CPUsProceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction10.1145/3640537.3641566(212-226)Online publication date: 17-Feb-2024
  • (2024)Boost Linear Algebra Computation Performance via Efficient VNNI UtilizationProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 310.1145/3620666.3651333(149-163)Online publication date: 27-Apr-2024
  • (2024)Automatic Generation of Vectorizing Compilers for Customizable Digital Signal ProcessorsProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 110.1145/3617232.3624873(19-34)Online publication date: 27-Apr-2024
  • (2024)Lightweight, Modular Verification for WebAssembly-to-Native Instruction SelectionProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 110.1145/3617232.3624862(231-248)Online publication date: 27-Apr-2024
  • (2024)The Strategic Random Search (SRS) – A new global optimizer for calibrating hydrological modelsEnvironmental Modelling & Software10.1016/j.envsoft.2023.105914172:COnline publication date: 17-Apr-2024
  • (2023)HECOProceedings of the 32nd USENIX Conference on Security Symposium10.5555/3620237.3620501(4715-4732)Online publication date: 9-Aug-2023
  • (2023)Fast Instruction Selection for Fast Digital Signal ProcessingProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 410.1145/3623278.3624768(125-137)Online publication date: 25-Mar-2023
  • (2023)Coyote: A Compiler for Vectorizing Encrypted Arithmetic CircuitsProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 310.1145/3582016.3582057(118-133)Online publication date: 25-Mar-2023
  • (2023)Facile: Fast, Accurate, and Interpretable Basic-Block Throughput Prediction2023 IEEE International Symposium on Workload Characterization (IISWC)10.1109/IISWC59245.2023.00023(87-99)Online publication date: 1-Oct-2023
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Get Access

Login options

Full Access

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media