Extracting SIMD Parallelism from Recursive Task-Parallel Programs

Authors:

Shruthi Balakrishna,

Sriram Krishnamoorthy,

Milind Kulkarni

ACM Transactions on Parallel Computing (TOPC), Volume 6, Issue 4

Article No.: 24, Pages 1 - 37

https://doi.org/10.1145/3365663

Published: 26 December 2019

Abstract

The pursuit of computational efficiency has led to the proliferation of throughput-oriented hardware, from GPUs to increasingly wide vector units on commodity processors and accelerators. This hardware is designed to execute data-parallel computations in a vectorized manner efficiently. However, many algorithms are more naturally expressed as divide-and-conquer, recursive, task-parallel computations. In the absence of data parallelism, it seems that such algorithms are not well suited to throughput-oriented architectures. This article presents a set of novel code transformations that expose the data parallelism latent in recursive, task-parallel programs. These transformations facilitate straightforward vectorization of task-parallel programs on commodity hardware. We also present scheduling policies that maintain high utilization of vector resources while limiting space usage. Across several task-parallel benchmarks, we demonstrate both efficient vector resource utilization and substantial speedup on chips using Intel’s SSE4.2 vector units, as well as accelerators using Intel’s AVX512 units. We then show through rigorous sampling that, in practice, our vectorization techniques are effective for a much larger class of programs.
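To make the idea of latent data parallelism concrete, here is a minimal sketch in Python. It is not the transformation from the article; it only illustrates the general observation that independent recursive tasks can be gathered into blocks and advanced with one uniform step per block. The names `fib_recursive`, `fib_blocked`, and `max_block` are hypothetical, and the block-splitting rule is an assumed stand-in for a space-limiting scheduling policy.

```python
def fib_recursive(n):
    """Reference recursive form: each call would spawn two child tasks."""
    if n < 2:
        return n
    return fib_recursive(n - 1) + fib_recursive(n - 2)


def fib_blocked(n, max_block=64):
    """Same computation, restructured to work on blocks of independent tasks.

    max_block is a hypothetical knob (not from the article): child blocks
    larger than it are split, so no single block grows without bound.
    """
    total = 0
    stack = [[n]]  # stack of blocks; each block is a list of task arguments
    while stack:
        block = stack.pop()
        # One uniform step over the whole block. Every element gets the same
        # treatment, which is the shape of work a vector unit could run in lanes.
        base = [a for a in block if a < 2]
        rest = [a for a in block if a >= 2]
        total += sum(base)  # base-case tasks contribute their result
        children = [a - 1 for a in rest] + [a - 2 for a in rest]
        # Split the children into chunks of at most max_block tasks and
        # process the chunks depth-first via the stack.
        for i in range(0, len(children), max_block):
            stack.append(children[i:i + max_block])
    return total
```

Calling `fib_blocked(k)` should return the same value as `fib_recursive(k)`. A smaller `max_block` trades lane occupancy for smaller blocks, which is the kind of utilization versus space tension the abstract attributes to its scheduling policies.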

Cited By

Loreti D, Visani G (2024). Parallel approaches for a decision tree-based explainability algorithm. Future Generation Computer Systems 158, 308-322. Online publication date: Sep-2024.
https://doi.org/10.1016/j.future.2024.04.044

Index Terms

Extracting SIMD Parallelism from Recursive Task-Parallel Programs
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel programming languages

Information

Published In

ACM Transactions on Parallel Computing Volume 6, Issue 4

December 2019

188 pages

ISSN:2329-4949

EISSN:2329-4957

DOI:10.1145/3372747

Editor:
David A. Bader
New Jersey Institute of Technology, USA

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 26 December 2019

Accepted: 01 September 2019

Revised: 01 September 2019

Received: 01 July 2015

Qualifiers

Research-article
Research
Refereed

Funding Sources

NSF
Battelle for DOE
U.S. Department of Energy's (DOE) Office of Science, Office of Advanced Scientific Computing Research, under DOE Early Career
