DOI: 10.1145/3079079.3079102
research-article

Optimizing recursive task parallel programs

Published: 14 June 2017

Abstract

We present DECAF, a new optimization for recursive task parallel (RTP) programs that reduces task creation and termination overheads. DECAF reduces the number of task termination (join) operations by aggressively widening the scope of join operations in a semantics-preserving way and eliminating the redundant join operations discovered along the way. Further, DECAF extends the traditional loop chunking technique to perform load-balanced chunking at runtime, based on the number of available worker threads. This helps reduce redundant parallel tasks at different levels of recursion. We also discuss the impact of exceptions on our techniques and extend them to handle RTP programs that may throw exceptions. We implemented DECAF in the X10 v2.3 compiler and tested it on a set of benchmark kernels on two hardware platforms (a 16-core Intel system and a 64-core AMD system). Relative to the base X10 compiler extended with the loop chunking of Nandivada et al. [26] (LC), DECAF achieved geometric mean speedups of 2.14× and 2.53× on the Intel and AMD systems, respectively. We also evaluate energy consumption on the Intel system and show that, on average, the DECAF versions consume 71.2% less energy than the LC versions.
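
To make the two ideas in the abstract concrete, the following is a minimal Java sketch, not DECAF itself (which is a compiler transformation for X10). It assumes a plain data-parallel loop rather than a recursive program, and every name in it (ChunkingSketch, balancedChunks, sumSquares) is hypothetical. It illustrates (a) splitting an iteration space into one chunk per available worker, with chunk sizes differing by at most one, and (b) waiting for all chunk tasks in a single join phase instead of joining after each task.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only; this is not the DECAF implementation.
class ChunkingSketch {

    // Split the range [0, n) into at most `workers` contiguous chunks
    // whose sizes differ by at most one. Each chunk is {lo, hi}, hi exclusive.
    static List<int[]> balancedChunks(int n, int workers) {
        List<int[]> chunks = new ArrayList<>();
        int parts = Math.min(n, Math.max(workers, 1));
        int lo = 0;
        for (int i = 0; i < parts; i++) {
            // The first (n % parts) chunks receive one extra iteration.
            int size = n / parts + (i < n % parts ? 1 : 0);
            chunks.add(new int[] {lo, lo + size});
            lo += size;
        }
        return chunks;
    }

    // Sum i*i for i in [0, n): one task per chunk, then one join phase.
    static long sumSquares(int n, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(workers, 1));
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (int[] c : balancedChunks(n, workers)) {
                final int lo = c[0];
                final int hi = c[1];
                // Task creation: one task per chunk rather than per iteration.
                futures.add(pool.submit(() -> {
                    long s = 0;
                    for (int i = lo; i < hi; i++) {
                        s += (long) i * i;
                    }
                    return s;
                }));
            }
            // Single join phase: wait for all chunk tasks here, after all
            // of them have been submitted.
            long total = 0;
            for (Future<Long> f : futures) {
                total += f.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int workers = Runtime.getRuntime().availableProcessors();
        System.out.println(sumSquares(1000, workers));
    }
}
```

Choosing the chunk count from the worker count at runtime, as sketched here, keeps the number of tasks proportional to the available parallelism; how DECAF applies this across levels of recursion is described in the paper, not in this sketch.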

References

[1]
G. Aharoni, D. G. Feitelson, and A. Barak. 1992. A Run-Time Algorithm for Managing the Granularity of Parallel Functional Programs. JFP 2, 4 (Oct 1992), 387--405.
[2]
L. Bergstrom and J. H. Reppy. 2012. Nested data-parallelism on the GPU. In ICFP. 247--258.
[3]
G. Bikshandi, J. G. Castanos, S. B. Kodali, V. K. Nandivada, I. Peshansky, V. A. Saraswat, S. Sur, P. Varma, and T. Wen. 2009. Efficient, portable implementation of asynchronous multi-place programs. In PPoPP. ACM, 271--282.
[4]
G. E. Blelloch and G. Sabot. 1990. Compiling Collection-Oriented Languages onto Massively Parallel Computers. J. Parallel Distrib. Comput. 8, 2 (1990), 119--134.
[5]
V. Cavé, J. Zhao, J. Shirako, and V. Sarkar. 2011. Habanero-Java: The New Adventures of Old X10. In PPPJ. ACM, 51--61.
[6]
O. Certner, Z. Li, P. Palatin, O. Temam, F. Arzel, and N. Drach. 2008. A Practical Approach for Reconciling High and Predictable Performance in Non-Regular Parallel Programs. In DATE. 740--745.
[7]
B.L. Chamberlain, D. Callahan, and H.P. Zima. 2007. Parallel Programmability and the Chapel Language. IJHPCA 21, 3 (Aug 2007), 291--312.
[8]
R. Cytron, J. Lipkis, and E. Schonberg. 1990. A Compiler-Assisted Approach to SPMD Execution. In SC. IEEE, 398--406.
[9]
A. Duran, J. Corbalán, and E. Ayguadé. 2008. An Adaptive Cut-off for Task Parallelism. In SC. IEEE Press, Article 36, 11 pages.
[10]
A. Duran, X. Teruel, R. Ferrer, X. Martorell, and E. Ayguadé. 2009. Barcelona OpenMP Tasks Suite: A Set of Benchmarks Targeting the Exploitation of Task Parallelism in OpenMP. In ICPP. IEEE Computer Society, 124--131.
[11]
R. Ferrer, A. Duran, X. Martorell, and E. Ayguadé. 2010. Unrolling Loops Containing Task Parallelism. In LCPC. 416--423.
[12]
M. Frigo, C. E. Leiserson, and K. H. Randall. 1998. The Implementation of the Cilk-5 Multithreaded Language. In PLDI. 212--223.
[13]
Y. Guo, R. Barik, R. Raman, and V. Sarkar. 2009. Work-first and help-first scheduling policies for async-finish task parallelism. In IPDPS. 1--12.
[14]
S. Gupta and V. Krishna Nandivada. 2015. IMSuite: A Benchmark Suite for Simulating Distributed Algorithms. JPDC 75, 0 (Jan 2015), 1--19.
[15]
I. E. Hajj, J. Gómez-Luna, C. Li, L. Chang, D. S. Milojicic, and W. W. Hwu. 2016. KLAP: Kernel launch aggregation and promotion for optimizing dynamic parallelism. In MICRO. 1--12.
[16]
M. W. Hall and M. Martonosi. 1998. Adaptive Parallelism in Compiler-Parallelized Code. Concurrency-Pract Ex 10, 14 (1998), 1235--1250.
[17]
E. A. Heinz and M. Philippsen. 1993. Synchronization Barrier Elimination in Synchronous FORALL Statements. Technical Report No. 13/93. University of Karlsruhe, Department of Informatics.
[18]
L. Huelsbergen, J. R. Larus, and A. Aiken. 1994. Using the Run-time Sizes of Data Structures to Guide Parallel-thread Creation. In LFP. ACM, 79--90.
[19]
Intel. 2014. Intel 64 and IA-32 Architectures Software Developer's Manual. (2014).
[20]
S. Iwasaki and K. Taura. 2016. A Static Cut-off for Task Parallel Programs. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation. ACM, 139--150.
[21]
K. Kennedy and J. R. Allen. 2002. Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann Publishers Inc.
[22]
D. A. Kranz, R. H. Halstead, Jr., and E. Mohr. 1989. Mul-T: A High-performance Parallel Lisp. In PLDI. ACM, 81--90.
[23]
J. Lifflander, S. Krishnamoorthy, and L. V. Kale. 2014. Optimizing Data Locality for Fork/Join Programs Using Constrained Work Stealing. In SC. 857--868.
[24]
S. S. Muchnick. 1997. Advanced Compiler Design and Implementation. Morgan Kaufmann.
[25]
V. Nagarajan and R. Gupta. 2010. Speculative Optimizations for Parallel Programs on Multicores. In LCPC. 323--337.
[26]
V. K. Nandivada, J. Shirako, J. Zhao, and V. Sarkar. 2013. A Transformation Framework for Optimizing Task-Parallel Programs. ACM Trans. Program. Lang. Syst. 35, 1 (April 2013), 3:1--3:48.
[27]
A. Noll and T. R. Gross. 2012. An Infrastructure for Dynamic Optimization of Parallel Programs. In PPoPP. ACM, 325--326.
[28]
OpenMP. 2008. OpenMP Application Program Interface, ver 3.0. (May 2008). http://www.openmp.org/mp-documents/spec30.pdf
[29]
L. Prechelt and S. U. Hänßgen. 2002. Efficient Parallel Execution of Irregular Recursive Programs. IEEE TPDS 13, 2 (Feb. 2002), 167--178.
[30]
J. Reinders. 2007. Intel Threading Building Blocks. O'Reilly Media.
[31]
V. Saraswat, B. Bloom, I. Peshansky, O. Tardieu, and D. Grove. 2012. X10 Language Specification Version 2.3. Technical Report. IBM.
[32]
P. Thoman, H. Jordan, and T. Fahringer. 2014. Compiler Multiversioning for Automatic Task Granularity Control. Concurr. Comput.: Pract. Exper. 26, 14 (Sept. 2014), 2367--2385.
[33]
Chau-Wen Tseng. 1995. Compiler Optimizations for Eliminating Barrier Synchronization. In PPoPP. ACM, 144--155.
[34]
A. Tzannes, G. C. Caragea, R. Barua, and U. Vishkin. 2010. Lazy Binary-Splitting: A Run-time Adaptive Work-stealing Scheduler. In PPoPP. ACM, 179--190.
[35]
M. Voss and R. Eigenmann. 1999. Reducing Parallel Overheads through Dynamic Serialization. In IPPS/SPDP. 88--92.
[36]
R. P. Wilson, R. S. French, C. S. Wilson, S. P. Amarasinghe, J. M. Anderson, S. W. K. Tjiang, S. Liao, C. Tseng, M. W. Hall, M. S. Lam, and J. L. Hennessy. 1994. SUIF: An Infrastructure for Research on Parallelizing and Optimizing Compilers. SIGPLAN Not. 29, 12 (Dec. 1994), 31--37.
[37]
H. Wu, D. Li, and M. Becchi. 2016. Compiler-Assisted Workload Consolidation for Efficient Dynamic Parallelism on GPU. In IPDPS. 534--543.
[38]
N. Yonezawa, K. Wada, and T. Aida. 2006. Barrier Elimination Based on Access Dependency Analysis for OpenMP. In ISPA. 362--373.
[39]
K.K. Yue and D.J. Lilja. 1996. Efficient Execution of Parallel Applications in Multiprogrammed Multiprocessor Systems. In IPPS. 448--456.

Published In

ICS '17: Proceedings of the International Conference on Supercomputing
June 2017
300 pages
ISBN: 9781450350204
DOI: 10.1145/3079079
General Chairs: William D. Gropp, Pete Beckman
Program Chairs: Zhiyuan Li, Francisco J. Cazorla
© 2017 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data parallel
  2. recursive task parallel
  3. useful parallelism

Qualifiers

  • Research-article

Conference

ICS '17
Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Cited By

  • (2024) Homeostasis: Design and Implementation of a Self-Stabilizing Compiler. ACM Transactions on Programming Languages and Systems 46, 2 (1--58). DOI: 10.1145/3649308. Online publication date: 1-May-2024
  • (2020) Chunking loops with non-uniform workloads. Proceedings of the 34th ACM International Conference on Supercomputing (1--12). DOI: 10.1145/3392717.3392763. Online publication date: 29-Jun-2020
  • (2020) Rec2Poly: Converting Recursions to Polyhedral Optimized Loops Using an Inspector-Executor Strategy. Embedded Computer Systems: Architectures, Modeling, and Simulation (96--109). DOI: 10.1007/978-3-030-60939-9_7. Online publication date: 7-Oct-2020
  • (2019) Parallelism-centric what-if and differential analyses. Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (485--501). DOI: 10.1145/3314221.3314621. Online publication date: 8-Jun-2019
