Research article, open access. DOI: 10.1145/3205289.3205310

Reducing Data Movement on Large Shared Memory Systems by Exploiting Computation Dependencies

Published: 12 June 2018

Abstract

Shared memory systems are becoming increasingly complex as they typically integrate several storage devices. As a result, access latency and bandwidth vary with the proximity between the cores that issue memory accesses and the storage devices that hold the requested data. In this context, techniques to manage and mitigate non-uniform memory access (NUMA) effects consist of migrating threads, memory pages, or both, and are generally applied by the system software.
We propose techniques at the runtime system level to further mitigate the impact of NUMA effects on the performance of parallel applications. To reduce data transfers efficiently, we leverage runtime system metadata expressed as a task dependency graph, in which nodes are pieces of serial code and edges are control or data dependencies between them. Our approach, based on graph partitioning, adds negligible overhead and provides performance improvements of up to 1.52X, and 1.12X on average, with respect to the best state-of-the-art approach when deployed on a 288-core shared-memory system. It also reduces coherence traffic by 2.28X on average with respect to the state of the art.
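
The idea of partitioning a task dependency graph to keep communicating tasks on the same NUMA node can be illustrated with a small sketch. This is not the algorithm from the paper, which the abstract does not detail; it is a simple greedy heuristic written for illustration only. The function names (`partition_tasks`, `cut_bytes`), the edge format `(producer, consumer, bytes)`, and the equal-task-count balance rule are all assumptions made here.

```python
from collections import defaultdict


def partition_tasks(num_tasks, edges, num_nodes):
    """Greedily map tasks to NUMA nodes (illustrative heuristic, not the paper's method).

    edges: list of (producer, consumer, nbytes) with producer < consumer,
    so task ids are already in a valid topological order (an assumption).
    Returns a dict task_id -> node_id.
    """
    # Balance constraint: each node receives at most ceil(num_tasks / num_nodes) tasks.
    capacity = -(-num_tasks // num_nodes)

    # Index incoming dependencies of every task.
    preds = defaultdict(list)
    for src, dst, nbytes in edges:
        preds[dst].append((src, nbytes))

    placement = {}
    load = [0] * num_nodes
    for task in range(num_tasks):
        # Bytes this task would read locally on each node.
        affinity = [0] * num_nodes
        for src, nbytes in preds[task]:
            affinity[placement[src]] += nbytes

        # Among nodes with spare capacity, prefer the most local data,
        # then the least loaded node, then the lowest node id.
        candidates = [n for n in range(num_nodes) if load[n] < capacity]
        best = max(candidates, key=lambda n: (affinity[n], -load[n], -n))
        placement[task] = best
        load[best] += 1
    return placement


def cut_bytes(edges, placement):
    """Total bytes carried by dependencies that cross NUMA nodes."""
    return sum(nbytes for src, dst, nbytes in edges
               if placement[src] != placement[dst])


if __name__ == "__main__":
    # Two independent chains of three tasks; every dependency carries 10 bytes.
    edges = [(0, 1, 10), (1, 2, 10), (3, 4, 10), (4, 5, 10)]
    greedy = partition_tasks(6, edges, num_nodes=2)
    round_robin = {task: task % 2 for task in range(6)}
    print("greedy cut:", cut_bytes(edges, greedy))
    print("round-robin cut:", cut_bytes(edges, round_robin))
```

In this toy input the greedy mapping keeps each chain on a single node, so no dependency crosses nodes, whereas a round-robin mapping splits every dependency. A production runtime would more likely hand the graph to a dedicated partitioner and weigh tasks by their cost rather than count them.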



Published In

ICS '18: Proceedings of the 2018 International Conference on Supercomputing
June 2018
407 pages
ISBN: 9781450357838
DOI: 10.1145/3205289
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. NUMA
  2. scheduling
  3. shared memory
  4. task-based programming model

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICS '18
Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%
