Grain graphs: OpenMP performance analysis made easy

Published: 27 February 2016 Publication History
  • Abstract

    Average programmers struggle to solve performance problems in OpenMP programs with tasks and parallel for-loops. Existing performance analysis tools visualize OpenMP task performance from the runtime system's perspective, where task execution is interleaved with other tasks in an unpredictable order. Problems with OpenMP parallel for-loops are similarly difficult to resolve, since tools only visualize aggregate thread-level statistics such as load imbalance without zooming in to per-chunk granularity. This runtime-system- and thread-oriented visualization provides poor support for understanding problems with task and chunk execution time, parallelism, and memory hierarchy utilization, forcing average programmers to rely on experts or on tedious trial-and-error tuning. We present grain graphs, a new OpenMP performance analysis method that visualizes grains -- the computation performed by a single task or parallel for-loop chunk instance -- and highlights problems such as low parallelism, work inflation, and poor parallelization benefit at the grain level. We demonstrate that grain graphs can quickly reveal performance problems in standard OpenMP programs that are difficult to detect and characterize in fine detail using existing visualizations, simplifying OpenMP performance analysis. This enables average programmers to make portable optimizations to poorly performing OpenMP programs, reducing pressure on experts and removing the need for tedious trial-and-error tuning.
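    The abstract's grain-level metrics can be made concrete with a small sketch. The data model below (the `Grain` record and both functions are hypothetical illustrations, not the paper's implementation) shows how per-grain timing and work records could yield two of the quantities a grain graph highlights: parallelism over a measured span and work inflation relative to a serial run.

    ```python
    # Hypothetical sketch of grain-level metrics; field names and the
    # recording mechanism are assumptions, not the paper's actual format.
    from dataclasses import dataclass

    @dataclass
    class Grain:
        """One task or parallel for-loop chunk instance."""
        name: str
        start: float   # wall-clock start time (seconds)
        end: float     # wall-clock end time (seconds)
        work: float    # work attributed to the grain, e.g. cycles

        @property
        def duration(self) -> float:
            return self.end - self.start

    def work_inflation(grain: Grain, serial_work: float) -> float:
        """Ratio of work done by the grain in a parallel run to the work
        the same computation needs serially; values > 1 indicate inflation
        (e.g. from poor memory hierarchy utilization)."""
        return grain.work / serial_work

    def parallelism(grains: list[Grain]) -> float:
        """Average number of grains executing concurrently over the span
        from the earliest start to the latest end."""
        total = sum(g.duration for g in grains)
        span = max(g.end for g in grains) - min(g.start for g in grains)
        return total / span
    ```

    A grain-graph-style analysis would then flag grains whose inflation is high or whose subtree parallelism is low, rather than reporting only per-thread aggregates.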


    Cited By

    • (2023) Measuring Thread Timing to Assess the Feasibility of Early-bird Message Delivery. Proceedings of the 52nd International Conference on Parallel Processing Workshops, pp. 119-126. DOI: 10.1145/3605731.3605884.
    • (2023) Traveler: Navigating Task Parallel Traces for Performance Analysis. IEEE Transactions on Visualization and Computer Graphics, 29(1):788-797. DOI: 10.1109/TVCG.2022.3209375.
    • (2021) EasyPAP: A Framework for Learning Parallel Programming. Journal of Parallel and Distributed Computing. DOI: 10.1016/j.jpdc.2021.07.018.
    • (2021) Providing In-depth Performance Analysis for Heterogeneous Task-based Applications with StarVZ. 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 16-25. DOI: 10.1109/IPDPSW52791.2021.00013.
    • (2020) EASYPAP: a Framework for Learning Parallel Programming. 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 276-283. DOI: 10.1109/IPDPSW50202.2020.00059.
    • (2020) AfterOMPT: An OMPT-Based Tool for Fine-Grained Tracing of Tasks and Loops. OpenMP: Portable Multi-Level Parallelism on Modern Systems, pp. 165-180. DOI: 10.1007/978-3-030-58144-2_11.
    • (2020) STHEM: Productive Implementation of High-Performance Embedded Image Processing Applications. Towards Ubiquitous Low-power Image Processing Platforms, pp. 79-91. DOI: 10.1007/978-3-030-53532-2_5.
    • (2020) Extending High-Level Synthesis with High-Performance Computing Performance Visualization. 2020 IEEE International Conference on Cluster Computing (CLUSTER), pp. 371-380. DOI: 10.1109/CLUSTER49012.2020.00047.
    • (2019) Analysis and Optimization of Task Granularity on the Java Virtual Machine. ACM Transactions on Programming Languages and Systems, 41(3):1-47. DOI: 10.1145/3338497.

      Published In

      ACM SIGPLAN Notices, Volume 51, Issue 8 (PPoPP '16), August 2016. 405 pages. ISSN: 0362-1340. EISSN: 1558-1160. DOI: 10.1145/3016078.

      PPoPP '16: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, February 2016. 420 pages. ISBN: 9781450340922. DOI: 10.1145/2851141.
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. OpenMP
      2. performance analysis
      3. performance visualization
      4. task-based programs

      Qualifiers

      • Research-article

      Funding Sources

      • ARTEMIS-JU

