Contech: Efficiently Generating Dynamic Task Graphs for Arbitrary Parallel Programs

Authors:

Brian P. Railing,

Eric R. Hein, and

Thomas M. Conte

ACM Transactions on Architecture and Code Optimization (TACO), Volume 12, Issue 2

Article No.: 25, Pages 1 - 24

https://doi.org/10.1145/2776893

Published: 08 July 2015

Abstract

Parallel programs can be characterized by task graphs encoding instructions, memory accesses, and the parallel work’s dependencies, while representing any threading library and architecture. This article presents Contech, a high performance framework for generating dynamic task graphs from arbitrary parallel programs, and a novel representation enabling programmers and compiler optimizations to understand and exploit program aspects. The Contech framework supports a variety of languages (including C, C++, and Fortran), parallelization libraries, and ISAs (including x86 and ARM). Running natively for collection speed and minimizing program perturbation, the instrumentation shows 4× improvement over a Pin-based implementation on PARSEC and NAS benchmarks.
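
To make the idea of a task graph concrete, the following is a minimal illustrative sketch, not Contech's actual data format or API. It assumes a hypothetical `Task` record that holds an instruction count, the memory addresses touched, and dependency edges to successor tasks. The names `Task`, `add_edge`, `topological_order`, `critical_path_instructions`, and `build_example`, as well as the task kinds, are all invented here for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Task:
    """One node of a hypothetical task graph (not Contech's real format)."""
    task_id: int
    kind: str                       # assumed labels: "create", "basic_blocks", "join"
    instructions: int = 0           # instructions executed inside this task
    mem_addrs: List[int] = field(default_factory=list)   # addresses accessed
    successors: List[int] = field(default_factory=list)  # ids of dependent tasks


def add_edge(graph: Dict[int, Task], src: int, dst: int) -> None:
    """Record that task `dst` cannot start until task `src` has finished."""
    graph[src].successors.append(dst)


def topological_order(graph: Dict[int, Task]) -> List[int]:
    """Kahn's algorithm: list task ids so every dependency comes first."""
    indegree = {tid: 0 for tid in graph}
    for task in graph.values():
        for succ in task.successors:
            indegree[succ] += 1
    ready = sorted(tid for tid, deg in indegree.items() if deg == 0)
    order: List[int] = []
    while ready:
        tid = ready.pop(0)
        order.append(tid)
        for succ in graph[tid].successors:
            indegree[succ] -= 1
            if indegree[succ] == 0:
                ready.append(succ)
    if len(order) != len(graph):
        raise ValueError("dependency cycle: not a valid task graph")
    return order


def critical_path_instructions(graph: Dict[int, Task]) -> int:
    """Instruction count along the longest dependency chain."""
    longest: Dict[int, int] = {}
    # Visit tasks in reverse topological order so successors are done first.
    for tid in reversed(topological_order(graph)):
        task = graph[tid]
        tail = max((longest[s] for s in task.successors), default=0)
        longest[tid] = task.instructions + tail
    return max(longest.values(), default=0)


def build_example() -> Dict[int, Task]:
    """A fork-join program: one creator, two parallel workers, one join."""
    graph = {
        0: Task(0, "create", instructions=10),
        1: Task(1, "basic_blocks", instructions=100, mem_addrs=[0x1000]),
        2: Task(2, "basic_blocks", instructions=40, mem_addrs=[0x2000]),
        3: Task(3, "join", instructions=5),
    }
    for src, dst in [(0, 1), (0, 2), (1, 3), (2, 3)]:
        add_edge(graph, src, dst)
    return graph
```

In this toy fork-join graph the total work is 155 instructions and the critical path is 115, so the ratio 155/115 gives a rough upper bound on speedup. This is the kind of analysis a task-graph representation is meant to enable.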

Published In

ACM Transactions on Architecture and Code Optimization Volume 12, Issue 2

July 2015

410 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/2775085

Editor:
Koen De Bosschere
Ghent University

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 July 2015

Accepted: 01 May 2015

Revised: 01 May 2015

Received: 01 January 2015

Published in TACO Volume 12, Issue 2

Qualifiers

Research-article
Research
Refereed

Funding Sources

National Science Foundation
