research-article

Open access

Reducing Data Movement on Large Shared Memory Systems by Exploiting Computation Dependencies

Authors: Isaac Sánchez Barrera, Miquel Moretó, Eduard Ayguadé, Jesús Labarta, Marc Casas

ICS '18: Proceedings of the 2018 International Conference on Supercomputing

Pages 207 - 217

https://doi.org/10.1145/3205289.3205310

Published: 12 June 2018

Abstract

Shared memory systems are becoming increasingly complex as they typically integrate several storage devices. As a result, access latency and bandwidth vary depending on the proximity between the cores that issue memory accesses and the storage devices holding the requested data. In this context, techniques to manage and mitigate non-uniform memory access (NUMA) effects consist of migrating threads, memory pages, or both, and are generally applied by the system software.

We propose techniques at the runtime system level to further mitigate the impact of NUMA effects on parallel applications' performance. We leverage runtime system metadata expressed in terms of a task dependency graph, where nodes are pieces of serial code and edges are control or data dependencies between them, to efficiently reduce data transfers. Our approach, based on graph partitioning, adds negligible overhead and provides performance improvements of up to 1.52X, and of 1.12X on average, with respect to the best state-of-the-art approach when deployed on a 288-core shared-memory system. It also reduces coherence traffic by 2.28X on average with respect to the state of the art.
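The idea of partitioning a task dependency graph so that dependent tasks share a NUMA node can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: the abstract does not describe the partitioner, so a simple greedy heuristic is swapped in, and the task names, edge weights (standing in for bytes shared between tasks), and the functions `edge_cut` and `greedy_partition` are all hypothetical.

```python
from collections import defaultdict


def edge_cut(edges, part):
    """Sum the weights of dependencies whose endpoints sit in different partitions."""
    return sum(w for (u, v), w in edges.items() if part[u] != part[v])


def greedy_partition(tasks, edges, num_parts):
    """Place each task in the non-full partition it shares the most data with.

    This is an illustrative heuristic (an assumption), not the paper's method.
    """
    capacity = -(-len(tasks) // num_parts)  # ceiling division keeps parts balanced
    neighbours = defaultdict(dict)
    for (u, v), w in edges.items():
        neighbours[u][v] = w
        neighbours[v][u] = w

    part = {}
    load = [0] * num_parts
    for task in tasks:
        # Weight of already-placed dependencies towards each partition.
        gain = [0] * num_parts
        for other, w in neighbours[task].items():
            if other in part:
                gain[part[other]] += w
        candidates = [p for p in range(num_parts) if load[p] < capacity]
        # Prefer high gain, then low load, then the lowest partition index.
        best = max(candidates, key=lambda p: (gain[p], -load[p], -p))
        part[task] = best
        load[best] += 1
    return part


# Hypothetical task graph: two chains with heavy internal dependencies
# and one light dependency between them.
tasks = ["a0", "a1", "a2", "a3", "b0", "b1", "b2", "b3"]
edges = {
    ("a0", "a1"): 8, ("a1", "a2"): 8, ("a2", "a3"): 8,
    ("b0", "b1"): 8, ("b1", "b2"): 8, ("b2", "b3"): 8,
    ("a3", "b3"): 1,
}

# Baseline: dependency-oblivious round-robin placement over two NUMA nodes.
round_robin = {t: i % 2 for i, t in enumerate(tasks)}
greedy = greedy_partition(tasks, edges, num_parts=2)

print("round-robin cut:", edge_cut(edges, round_robin))
print("greedy cut:     ", edge_cut(edges, greedy))
```

In this toy graph the edge cut is a rough proxy for the data that must cross NUMA nodes: the round-robin placement cuts every chain dependency, while the greedy placement keeps each chain on one node and cuts only the light cross-chain dependency.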




Published In

ICS '18: Proceedings of the 2018 International Conference on Supercomputing

June 2018, 407 pages

ISBN: 9781450357838

DOI: 10.1145/3205289

Copyright © 2018 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article
Research
Refereed limited

Conference

ICS '18: 2018 International Conference on Supercomputing

June 12 - 15, 2018

Beijing, China

Sponsor: SIGARCH

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Article Metrics

Total Citations: 13

Total Downloads: 642

Downloads (Last 12 months): 104

Downloads (Last 6 weeks): 15

Reflects downloads up to 16 Oct 2024

Cited By

Maroñas M, Navarro A, Ayguadé E, Beltran V (2023). Mitigating the NUMA effect on task-based runtime systems. The Journal of Supercomputing 79:13 (14287-14312). Online publication date: 6-Apr-2023. https://doi.org/10.1007/s11227-023-05164-9

Caheny P, Alvarez L, Casas M, Moreto M (2022). TD-NUCA: Runtime Driven Management of NUCA Caches in Task Dataflow Programming Models. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis (1-15). Online publication date: Nov-2022. https://doi.org/10.1109/SC41404.2022.00085

Dominico S, de Almeida E, Alves M (2022). On the performance limits of thread placement for array databases in non-uniform memory architectures. Computing 105:5 (1059-1075). Online publication date: 17-Jan-2022. https://doi.org/10.1007/s00607-021-01043-4

Dominico S, de Almeida E, Alves M, Meira J (2021). Performance Analysis of Array Database Systems in Non-Uniform Memory Architecture. 2021 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP) (169-176). Online publication date: Mar-2021. https://doi.org/10.1109/PDP52278.2021.00034

Sultana T, Allen B, Qasem A, Sarkar V, Kim H (2020). Intelligent Data Placement on Discrete GPU Nodes with Unified Memory. Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques (139-151). Online publication date: 30-Sep-2020. https://dl.acm.org/doi/10.1145/3410463.3414651

Sánchez Barrera I, Black-Schaffer D, Casas M, Moretó M, Stupnikova A, Popov M, Ayguadé E, Hwu W, Badia R, Hofstee H (2020). Modeling and optimizing NUMA effects and prefetching with machine learning. Proceedings of the 34th ACM International Conference on Supercomputing (1-13). Online publication date: 29-Jun-2020. https://dl.acm.org/doi/10.1145/3392717.3392765

Han M, Park J, Baek W (2020). Design and Implementation of a Criticality- and Heterogeneity-Aware Runtime System for Task-Parallel Applications. IEEE Transactions on Parallel and Distributed Systems (1-1). Online publication date: 2020. https://doi.org/10.1109/TPDS.2020.3031911

Imes C, Hofmeyr S, Kang D, Walters J (2020). A Case Study and Characterization of a Many-socket, Multi-tier NUMA HPC Platform. 2020 IEEE/ACM 6th Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC) and Workshop on Hierarchical Parallelism for Exascale Computing (HiPar) (74-84). Online publication date: Nov-2020. https://doi.org/10.1109/LLVMHPCHiPar51896.2020.00013

Chen L, Tang S, Fu Y, Gao X, Guo J, Jiang S (2020). AceMesh: a structured data driven programming language for high performance computing. CCF Transactions on High Performance Computing. Online publication date: 27-Aug-2020. https://doi.org/10.1007/s42514-020-00047-4

Hope J, Nag T, Qasem A (2019). Energy-Efficient GPU Graph Processing with On-Demand Page Migration. 2019 Tenth International Green and Sustainable Computing Conference (IGSC) (1-8). Online publication date: Oct-2019. https://doi.org/10.1109/IGSC48788.2019.8957183
