research-article

Optimizing recursive task parallel programs

Authors:

Rahul Shrivastava,

V Krishna Nandivada

ICS '17: Proceedings of the International Conference on Supercomputing

Article No.: 11, Pages 1 - 11

https://doi.org/10.1145/3079079.3079102

Published: 14 June 2017

Abstract

We present DECAF, a new optimization for recursive task parallel (RTP) programs that reduces task creation and termination overheads. DECAF reduces the number of task termination (join) operations by aggressively widening the scope of join operations in a semantics-preserving way and eliminating the redundant join operations it discovers along the way. Further, DECAF extends the traditional loop chunking technique to perform load-balanced chunking at runtime, based on the number of available worker threads. This helps reduce redundant parallel tasks at different levels of recursion. We also discuss the impact of exceptions on our techniques and extend them to handle RTP programs that may throw exceptions. We implemented DECAF in the X10 v2.3 compiler and tested it on a set of benchmark kernels on two hardware platforms (a 16-core Intel system and a 64-core AMD system). Relative to the base X10 compiler extended with the loop chunking of Nandivada et al. [26] (LC), DECAF achieved geometric mean speedups of 2.14× and 2.53× on the Intel and AMD systems, respectively. We also evaluate energy consumption on the Intel system and show that, on average, the DECAF versions consume 71.2% less energy than the LC versions.
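The two ideas in the abstract, load-balanced chunking sized by the number of worker threads and a join whose scope is widened to cover the whole recursion, can be illustrated with a small sketch. This is not the DECAF implementation, which is a compiler transformation for X10; it is a hand-written Python approximation under stated assumptions. The tree encoding (a node is a list of its children), the fixed `workers` budget standing in for "available worker threads", and the helper names `balanced_chunks`, `visit`, `run_chunk`, and `count_nodes` are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def balanced_chunks(n, workers):
    """Split range(n) into at most `workers` contiguous chunks whose sizes differ by at most one."""
    k = min(n, workers)
    if k <= 0:
        return []
    base, extra = divmod(n, k)
    chunks, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # the first `extra` chunks take one more
        chunks.append(range(start, start + size))
        start += size
    return chunks


def count_serial(node):
    """Sequential baseline: a node is a list of child nodes."""
    return 1 + sum(count_serial(child) for child in node)


def visit(node, workers, pool, futures):
    """Count `node` and hand its children to at most `workers` tasks.

    Returns only the part counted in the calling thread. The rest arrives
    through `futures`, which is joined once at the outermost scope.
    """
    if workers <= 1 or not node:
        return count_serial(node)  # no spare workers (assumed budget): create no tasks
    chunks = balanced_chunks(len(node), workers)
    share = max(1, workers // len(chunks))  # worker budget passed down to each chunk
    for chunk in chunks:
        children = [node[i] for i in chunk]
        futures.append(pool.submit(run_chunk, children, share, pool, futures))
    return 1


def run_chunk(children, workers, pool, futures):
    """One task processes a whole chunk of sibling subtrees."""
    return sum(visit(child, workers, pool, futures) for child in children)


def count_nodes(root, workers):
    """Parallel node count with a single join for the entire recursion."""
    futures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        total = visit(root, workers, pool, futures)
        i = 0
        # A task appends the futures it spawns before it finishes, so walking
        # the list by index reaches every task exactly once.
        while i < len(futures):
            total += futures[i].result()
            i += 1
    return total


if __name__ == "__main__":
    tree = [[[], []], [[]], []]          # 7 nodes in total
    print(count_nodes(tree, workers=4))  # prints 7
```

In this sketch no task ever waits on another task: each recursion level only submits work, and the only join is the loop in `count_nodes`. The number of tasks created at each level is bounded by the worker budget rather than by the number of children, which is the effect the abstract attributes to load-balanced chunking.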

References

[1] G. Aharoni, D. G. Feitelson, and A. Barak. 1992. A Run-Time Algorithm for Managing the Granularity of Parallel Functional Programs. JFP 2, 4 (Oct 1992), 387–405.

[2] L. Bergstrom and J. H. Reppy. 2012. Nested data-parallelism on the GPU. In ICFP. 247–258.

[3] G. Bikshandi, J. G. Castanos, S. B. Kodali, V. K. Nandivada, I. Peshansky, V. A. Saraswat, S. Sur, P. Varma, and T. Wen. 2009. Efficient, portable implementation of asynchronous multi-place programs. In PPoPP. ACM, 271–282.

[4] G. E. Blelloch and G. Sabot. 1990. Compiling Collection-Oriented Languages onto Massively Parallel Computers. J. Parallel Distrib. Comput. 8, 2 (1990), 119–134.

[5] V. Cavé, J. Zhao, J. Shirako, and V. Sarkar. 2011. Habanero-Java: The New Adventures of Old X10. In PPPJ. ACM, 51–61.

[6] O. Certner, Z. Li, P. Palatin, O. Temam, F. Arzel, and N. Drach. 2008. A Practical Approach for Reconciling High and Predictable Performance in Non-Regular Parallel Programs. In DATE. 740–745.

[7] B. L. Chamberlain, D. Callahan, and H. P. Zima. 2007. Parallel Programmability and the Chapel Language. IJHPCA 21, 3 (Aug 2007), 291–312.

[8] R. Cytron, J. Lipkis, and E. Schonberg. 1990. A Compiler-Assisted Approach to SPMD Execution. In SC. IEEE, 398–406.

[9] A. Duran, J. Corbalán, and E. Ayguadé. 2008. An Adaptive Cut-off for Task Parallelism. In SC. IEEE Press, Article 36, 11 pages.

[10] A. Duran, X. Teruel, R. Ferrer, X. Martorell, and E. Ayguadé. 2009. Barcelona OpenMP Tasks Suite: A Set of Benchmarks Targeting the Exploitation of Task Parallelism in OpenMP. In ICPP. IEEE Computer Society, 124–131.

[11] R. Ferrer, A. Duran, X. Martorell, and E. Ayguadé. 2010. Unrolling Loops Containing Task Parallelism. In LCPC. 416–423.

[12] M. Frigo, C. E. Leiserson, and K. H. Randall. 1998. The Implementation of the Cilk-5 Multithreaded Language. In PLDI. 212–223.

[13] Y. Guo, R. Barik, R. Raman, and V. Sarkar. 2009. Work-first and help-first scheduling policies for async-finish task parallelism. In IPDPS. 1–12.

[14] S. Gupta and V. Krishna Nandivada. 2015. IMSuite: A Benchmark Suite for Simulating Distributed Algorithms. JPDC 75, 0 (Jan 2015), 1–19.

[15] I. E. Hajj, J. Gómez-Luna, C. Li, L. Chang, D. S. Milojicic, and W. W. Hwu. 2016. KLAP: Kernel launch aggregation and promotion for optimizing dynamic parallelism. In MICRO. 1–12.

[16] M. W. Hall and M. Martonosi. 1998. Adaptive Parallelism in Compiler-Parallelized Code. Concurrency-Pract Ex 10, 14 (1998), 1235–1250.

[17] E. A. Heinz and M. Philippsen. 1993. Synchronization Barrier Elimination in Synchronous FORALL Statements. Technical Report No. 13/93. University of Karlsruhe, Department of Informatics.

[18] L. Huelsbergen, J. R. Larus, and A. Aiken. 1994. Using the Run-time Sizes of Data Structures to Guide Parallel-thread Creation. In LFP. ACM, 79–90.

[19] Intel. 2014. Intel 64 and IA-32 Architectures Software Developer's Manual. (2014).

[20] S. Iwasaki and K. Taura. 2016. A Static Cut-off for Task Parallel Programs. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation. ACM, 139–150.

[21] K. Kennedy and J. R. Allen. 2002. Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann Publishers Inc.

[22] D. A. Kranz, R. H. Halstead, Jr., and E. Mohr. 1989. Mul-T: A High-performance Parallel Lisp. In PLDI. ACM, 81–90.

[23] J. Lifflander, S. Krishnamoorthy, and L. V. Kale. 2014. Optimizing Data Locality for Fork/Join Programs Using Constrained Work Stealing. In SC. 857–868.

[24] S. S. Muchnick. 1997. Advanced Compiler Design and Implementation. Morgan Kaufmann.

[25] V. Nagarajan and R. Gupta. 2010. Speculative Optimizations for Parallel Programs on Multicores. In LCPC. 323–337.

[26] V. K. Nandivada, J. Shirako, J. Zhao, and V. Sarkar. 2013. A Transformation Framework for Optimizing Task-Parallel Programs. ACM Trans. Program. Lang. Syst. 35, 1 (April 2013), 3:1–3:48.

[27] A. Noll and T. R. Gross. 2012. An Infrastructure for Dynamic Optimization of Parallel Programs. In PPoPP. ACM, 325–326.

[28] OpenMP. 2008. OpenMP Application Program Interface, ver 3.0. (May 2008). http://www.openmp.org/mp-documents/spec30.pdf

[29] L. Prechelt and S. U. Hänßgen. 2002. Efficient Parallel Execution of Irregular Recursive Programs. IEEE TPDS 13, 2 (Feb. 2002), 167–178.

[30] J. Reinders. 2007. Intel Threading Building Blocks. O'Reilly Media.

[31] V. Saraswat, B. Bloom, I. Peshansky, O. Tardieu, and D. Grove. 2012. X10 Language Specification Version 2.3. Technical Report. IBM.

[32] P. Thoman, H. Jordan, and T. Fahringer. 2014. Compiler Multiversioning for Automatic Task Granularity Control. Concurr. Comput.: Pract. Exper. 26, 14 (Sept. 2014), 2367–2385.

[33] Chau-Wen Tseng. 1995. Compiler Optimizations for Eliminating Barrier Synchronization. In PPoPP. ACM, 144–155.

[34] A. Tzannes, G. C. Caragea, R. Barua, and U. Vishkin. 2010. Lazy Binary-Splitting: A Run-time Adaptive Work-stealing Scheduler. In PPoPP. ACM, 179–190.

[35] M. Voss and R. Eigenmann. 1999. Reducing Parallel Overheads through Dynamic Serialization. In IPPS/SPDP. 88–92.

[36] R. P. Wilson, R. S. French, C. S. Wilson, S. P. Amarasinghe, J. M. Anderson, S. W. K. Tjiang, S. Liao, C. Tseng, M. W. Hall, M. S. Lam, and J. L. Hennessy. 1994. SUIF: An Infrastructure for Research on Parallelizing and Optimizing Compilers. SIGPLAN Not. 29, 12 (Dec. 1994), 31–37.

[37] H. Wu, D. Li, and M. Becchi. 2016. Compiler-Assisted Workload Consolidation for Efficient Dynamic Parallelism on GPU. In IPDPS. 534–543.

[38] N. Yonezawa, K. Wada, and T. Aida. 2006. Barrier Elimination Based on Access Dependency Analysis for OpenMP. In ISPA. 362–373.

[39] K. K. Yue and D. J. Lilja. 1996. Efficient Execution of Parallel Applications in Multiprogrammed Multiprocessor Systems. In IPPS. 448–456.

Cited By

Nougrahiya A, Nandivada V. (2024). Homeostasis: Design and Implementation of a Self-Stabilizing Compiler. ACM Transactions on Programming Languages and Systems 46:2 (1-58). Online publication date: 1-May-2024. https://dl.acm.org/doi/10.1145/3649308

Prabhu I, Nandivada V, Ayguadé E, Hwu W, Badia R, Hofstee H. (2020). Chunking loops with non-uniform workloads. Proceedings of the 34th ACM International Conference on Supercomputing (1-12). Online publication date: 29-Jun-2020. https://dl.acm.org/doi/10.1145/3392717.3392763

Kobeissi S, Ketterlin A, Clauss P. (2020). Rec2Poly: Converting Recursions to Polyhedral Optimized Loops Using an Inspector-Executor Strategy. Embedded Computer Systems: Architectures, Modeling, and Simulation (96-109). Online publication date: 7-Oct-2020. https://doi.org/10.1007/978-3-030-60939-9_7

Yoga A, Nagarakatte S, McKinley K, Fisher K. (2019). Parallelism-centric what-if and differential analyses. Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (485-501). Online publication date: 8-Jun-2019. https://dl.acm.org/doi/10.1145/3314221.3314621

Published In

ICS '17: Proceedings of the International Conference on Supercomputing

June 2017

300 pages

ISBN:9781450350204

DOI:10.1145/3079079

General Chairs: William D. Gropp (University of Illinois at Urbana-Champaign, Illinois), Pete Beckman (Argonne National Laboratory/Northwestern University, Illinois)

Program Chairs: Zhiyuan Li (Purdue University, West Lafayette, Indiana), Francisco J. Cazorla (IIIA-CSIC and Barcelona Supercomputing Center, Barcelona, Spain)

Copyright © 2017 ACM.

© 2017 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ICS '17

Sponsor:

SIGARCH

ICS '17: 2017 International Conference on Supercomputing

June 14 - 16, 2017

Chicago, Illinois

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%
