research-article
Open access

Data-Driven Concurrency for High Performance Computing

Published: 20 December 2017

Abstract

In this work, we utilize dynamic dataflow/data-driven techniques to improve the performance of high performance computing (HPC) systems. The proposed techniques are implemented and evaluated through an efficient, portable, and robust programming framework that enables data-driven concurrency on HPC systems. The proposed framework is based on data-driven multithreading (DDM), a hybrid control-flow/dataflow model that schedules threads based on data availability on sequential processors. The proposed framework was evaluated using several benchmarks, with different characteristics, on two different systems: a 4-node AMD system with a total of 128 cores and a 64-node Intel HPC system with a total of 768 cores. The performance evaluation shows that the proposed framework scales well and tolerates scheduling overheads and memory latencies effectively. We also compare our framework to MPI, DDM-VM, and OmpSs@Cluster. The comparison results show that the proposed framework obtains comparable or better performance.
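
The scheduling idea named in the abstract, running a thread only once the data it depends on is available, can be illustrated with a minimal sketch. This is not the framework's API: the function `run_data_driven`, the `threads` and `consumers` dictionaries, and the per-thread "ready count" bookkeeping are hypothetical names chosen for illustration, and the sketch runs on a single core rather than across HPC nodes.

```python
from collections import deque


def run_data_driven(threads, consumers):
    """Run callables in an order dictated only by data availability.

    threads:   dict mapping thread id -> callable taking no arguments
    consumers: dict mapping thread id -> list of thread ids that consume its output
    Returns the list of thread ids in the order they were executed.
    (Hypothetical helper for illustration; not part of any real framework.)
    """
    # Ready count = number of producers a thread is still waiting for.
    ready_count = {tid: 0 for tid in threads}
    for outs in consumers.values():
        for consumer in outs:
            ready_count[consumer] += 1

    # Threads with no pending inputs can be scheduled immediately.
    ready = deque(tid for tid in threads if ready_count[tid] == 0)
    order = []
    while ready:
        tid = ready.popleft()
        threads[tid]()  # the thread body itself executes sequentially
        order.append(tid)
        # Completing a thread makes its output available to each consumer.
        for consumer in consumers.get(tid, []):
            ready_count[consumer] -= 1
            if ready_count[consumer] == 0:
                ready.append(consumer)
    return order


# Usage: "multiply" needs both loads; "report" needs "multiply".
results = {}
threads = {
    "load_a": lambda: results.update(a=2),
    "load_b": lambda: results.update(b=3),
    "multiply": lambda: results.update(c=results["a"] * results["b"]),
    "report": lambda: results.update(msg=f"c={results['c']}"),
}
consumers = {
    "load_a": ["multiply"],
    "load_b": ["multiply"],
    "multiply": ["report"],
}
order = run_data_driven(threads, consumers)
```

Each thread body still runs as ordinary sequential code; only the order between threads is decided by the dependency counts, which is what makes the model a hybrid of control flow and dataflow. A real runtime would dispatch ready threads to multiple cores or nodes instead of draining a single queue.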

Cited By

  • (2021) DV-DVFS: merging data variety and DVFS technique to manage the energy consumption of big data processing. Journal of Big Data 8:1. DOI: 10.1186/s40537-021-00437-7. Online publication date: 10-Mar-2021
  • (2018) Energy Efficiency Exploration on the ZYNQ Ultrascale+. 2018 30th International Conference on Microelectronics (ICM), 48-54. DOI: 10.1109/ICM.2018.8704092. Online publication date: Dec-2018

Published In

ACM Transactions on Architecture and Code Optimization, Volume 14, Issue 4
December 2017
600 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3154814
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 December 2017
Accepted: 01 November 2017
Revised: 01 October 2017
Received: 01 May 2017
Published in TACO Volume 14, Issue 4

Author Tags

  1. Data-driven multithreading
  2. distributed execution
  3. high performance computing
  4. runtime system

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Cyprus State Scholarship Foundation (IKYK)
  • University of Cyprus through the Processor project
