Multi-tasking execution in PGAS language XcalableMP and communication optimization on many-core clusters

K Tsugane, J Lee, H Murai, M Sato - Proceedings of the International …, 2018 - dl.acm.org
Large-scale clusters based on many-core processors such as the Intel Xeon Phi have recently been deployed. Multi-tasking execution using task dependencies in OpenMP 4.0 is a promising candidate for parallelizing applications on such many-core processors, because it enables users to avoid global synchronization through fine-grained task-to-task synchronization expressed via user-specified data dependencies. Recently, the partitioned global address space (PGAS) model has emerged as a usable distributed-memory programming model.

In this paper, we propose a multi-tasking execution model in the PGAS language XcalableMP (XMP) for many-core clusters. The model provides a method to describe interactions between tasks based on point-to-point communication over the global address space; communication is performed non-collectively among nodes. We implemented the proposed execution model in XMP and designed a simple algorithm that transforms the code to MPI and OpenMP. For a preliminary evaluation, we implemented two benchmarks using our model: blocked Cholesky factorization and a Laplace equation solver. Most of the implementations using our model outperform the conventional barrier-based data-parallel model.

To further improve performance on many-core clusters, we propose a communication optimization that dedicates a single thread to communication, avoiding the performance problems of current multi-threaded MPI execution. With this optimization, the performance of blocked Cholesky factorization and the Laplace equation solver improves to 138% and 119%, respectively, of the barrier-based implementation on Intel Xeon Phi KNL clusters. From the viewpoint of productivity, a program written in our XMP model is almost the same as an implementation based on the OpenMP task depend clause, because XMP parallelizes serial source code with additional directives and small changes, just as OpenMP does.