research-article

Open access

Data-Driven Concurrency for High Performance Computing

Authors:

George Matheou,

Paraskevas Evripidou

ACM Transactions on Architecture and Code Optimization (TACO), Volume 14, Issue 4

Article No.: 53, Pages 1 - 26

https://doi.org/10.1145/3162014

Published: 20 December 2017

Abstract

In this work, we utilize dynamic dataflow/data-driven techniques to improve the performance of high performance computing (HPC) systems. The proposed techniques are implemented and evaluated through an efficient, portable, and robust programming framework that enables data-driven concurrency on HPC systems. The proposed framework is based on data-driven multithreading (DDM), a hybrid control-flow/dataflow model that schedules threads based on data availability on sequential processors. The proposed framework was evaluated using several benchmarks, with different characteristics, on two different systems: a 4-node AMD system with a total of 128 cores and a 64-node Intel HPC system with a total of 768 cores. The performance evaluation shows that the proposed framework scales well and tolerates scheduling overheads and memory latencies effectively. We also compare our framework to MPI, DDM-VM, and OmpSs@Cluster. The comparison results show that the proposed framework obtains comparable or better performance.
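To make the scheduling idea in the abstract concrete, the following is a minimal, single-process sketch of data-driven scheduling, not the framework's actual API. It assumes each thread carries a count of producers it still waits for and becomes runnable once that count reaches zero. The names `run_data_driven`, `consumers`, and `ready_count` are hypothetical and chosen only for illustration.

```python
from collections import deque


def run_data_driven(threads, consumers, ready_count):
    """Run each thread body once, only after all of its producers have finished.

    threads:     dict name -> zero-argument callable (the thread body)
    consumers:   dict name -> list of thread names that consume its output
    ready_count: dict name -> number of producers the thread waits for
    (All three structures are illustrative assumptions, not the paper's API.)
    """
    remaining = dict(ready_count)  # copy so the caller's counts are not mutated
    ready = deque(name for name in threads if remaining.get(name, 0) == 0)
    order = []
    while ready:
        name = ready.popleft()
        threads[name]()            # the body itself runs sequentially, control-flow style
        order.append(name)
        for consumer in consumers.get(name, []):
            remaining[consumer] -= 1          # one more input is now available
            if remaining[consumer] == 0:      # all inputs present: schedule it
                ready.append(consumer)
    return order


# Usage: "add" depends on both loads, so it is scheduled only after they finish.
results = {}
threads = {
    "load_a": lambda: results.update(a=2),
    "load_b": lambda: results.update(b=3),
    "add": lambda: results.update(sum=results["a"] + results["b"]),
}
consumers = {"load_a": ["add"], "load_b": ["add"]}
ready_count = {"load_a": 0, "load_b": 0, "add": 2}
order = run_data_driven(threads, consumers, ready_count)
```

In this sketch the dependency graph, rather than program order, decides when "add" runs. A distributed runtime would additionally have to forward the availability updates between nodes, which the sketch leaves out.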

Cited By

H. Ahmadvand, F. Foroutan, and M. Fathy. 2021. DV-DVFS: Merging data variety and DVFS technique to manage the energy consumption of big data processing. Journal of Big Data 8, 1. Online publication date: 10-Mar-2021. https://doi.org/10.1186/s40537-021-00437-7

R. Giorgi, F. Khalili, and M. Procaccini. 2018. Energy efficiency exploration on the ZYNQ Ultrascale+. In 2018 30th International Conference on Microelectronics (ICM). 48--54. Online publication date: Dec-2018. https://doi.org/10.1109/ICM.2018.8704092

Index Terms

Data-Driven Concurrency for High Performance Computing
1. Computer systems organization
  1. Architectures
2. Computing methodologies
  1. Distributed computing methodologies
    1. Distributed programming languages
  2. Parallel computing methodologies
    1. Parallel programming languages

Published In

ACM Transactions on Architecture and Code Optimization Volume 14, Issue 4

December 2017

600 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/3154814

Editor:
Koen De Bosschere
Ghent University

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 December 2017

Accepted: 01 November 2017

Revised: 01 October 2017

Received: 01 May 2017

Published in TACO Volume 14, Issue 4

Qualifiers

Research-article
Research
Refereed

Funding Sources

Cyprus State Scholarship Foundation (IKYK)
University of Cyprus through the Processor project
