Efficient heterogeneous execution on large multicore and accelerator platforms: Case study using a block tridiagonal solver

Authors:

Alfred J. Park,

Kalyan S. Perumalla

Journal of Parallel and Distributed Computing, Volume 73, Issue 12

Pages 1578 - 1591

https://doi.org/10.1016/j.jpdc.2013.07.012

Published: 01 December 2013

Abstract

This article explores algorithmic and implementation principles for gainfully exploiting GPU accelerators in conjunction with multicore processors on high-end systems with large numbers of compute nodes, and evaluates them in an implementation of a scalable block tridiagonal solver. Within each compute node, the accelerator is used together with that node's multicore processors to perform the block-level linear algebra operations of the overall distributed solver algorithm. The optimizations incorporated include: (1) an efficient memory mapping and synchronization interface to minimize data movement, (2) multi-process sharing of the accelerator within a node to balance load with the multicore processors, and (3) an automatic memory management system that uses accelerator memory efficiently when sub-matrices exceed the limits of device memory. Results are reported from our novel implementation, which uses the MAGMA and CUBLAS accelerator software systems simultaneously with ACML (2013) [2] for multithreaded execution on processors. Overall, using 940 NVIDIA Tesla X2090 accelerators and 15,040 cores, the best heterogeneous execution delivers a 10.9-fold reduction in run time relative to an already efficient parallel multicore-only baseline implementation that is highly optimized with intra-node and inter-node concurrency and computation-communication overlap. Detailed quantitative results are presented to explain all critical runtime components contributing to hybrid performance.
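To make "block-level linear algebra operations" concrete, the following is a minimal sketch of a block tridiagonal solve. It assumes NumPy and uses the sequential block Thomas algorithm, which is a swapped-in textbook method chosen for brevity and is not the distributed solver evaluated in the article. The function name `block_thomas_solve` and its argument layout are hypothetical. The dense solves and matrix products inside the loops are the kind of per-block work that a hybrid implementation could hand to an accelerator library.

```python
import numpy as np


def block_thomas_solve(lower, diag, upper, rhs):
    """Solve a block tridiagonal system with the block Thomas algorithm.

    Hypothetical layout (an assumption of this sketch):
      lower[i] couples block row i to row i-1 (lower[0] is ignored),
      diag[i]  is the diagonal block of row i,
      upper[i] couples block row i to row i+1 (upper[-1] is ignored),
      rhs[i]   is the right-hand-side block of row i.
    No pivoting across blocks is done, so the diagonal blocks (and the
    updated blocks formed during elimination) must be well conditioned.
    """
    n = len(diag)
    lo = [np.asarray(b, dtype=float) for b in lower]
    up = [np.asarray(b, dtype=float) for b in upper]
    d = [np.array(b, dtype=float) for b in diag]   # copies, updated in place
    r = [np.array(b, dtype=float) for b in rhs]    # copies, updated in place

    # Forward elimination: one dense solve and two products per block row.
    for i in range(1, n):
        # factor = lo[i] @ inv(d[i-1]), formed with a solve instead of an inverse.
        factor = np.linalg.solve(d[i - 1].T, lo[i].T).T
        d[i] = d[i] - factor @ up[i - 1]
        r[i] = r[i] - factor @ r[i - 1]

    # Back substitution: one dense solve and one product per block row.
    x = [None] * n
    x[-1] = np.linalg.solve(d[-1], r[-1])
    for i in range(n - 2, -1, -1):
        x[i] = np.linalg.solve(d[i], r[i] - up[i] @ x[i + 1])
    return x
```

The function returns one solution block per block row. A simple check is to assemble the equivalent dense matrix for a small, diagonally dominant test system and confirm that it maps the concatenated solution back to the concatenated right-hand side.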

References

[1]

Aaby, B.G., Perumalla, K.S. and Seal, S.K., Efficient simulation of agent-based models on multi-gpu and multi-core clusters. In: Proceedings of the 3rd International ICST Conference on Simulation Tools and Techniques, ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), Brussels, Belgium. pp. 29:1-29:10.

[2]

Acml: Advanced micro devices core math library, 2013. http://developer.amd.com.

[3]

Agullo, E., Augonnet, C., Dongarra, J., Faverge, M., Langou, J., Ltaief, H. and Tomov, S., LU factorization for accelerator-based systems. In: 2011 9th IEEE/ACS International Conference on Computer Systems and Applications, AICCSA. pp. 217-224.

[4]

Agullo, E., Augonnet, C., Dongarra, J., Ltaief, H., Namyst, R., Thibault, S. and Tomov, S., Chapter 34-a hybridization methodology for high-performance linear algebra software for gpus. In: Hwu, W.-m.W. (Ed.), GPU Computing Gems Jade Edition, Morgan Kaufmann, Boston. pp. 473-484.

[5]

Agullo, E., Demmel, J., Dongarra, J., Hadri, B., Kurzak, J., Langou, J., Ltaief, H., Luszczek, P. and Tomov, S., Numerical linear algebra on emerging architectures: the plasma and magma projects. Journal of Physics: Conference Series. v180. 012037

[6]

Bai, S. and Nicol, D., Acceleration of wireless channel simulation using gpus. In: 2010 European Wireless Conference, EW. pp. 841-848.

[7]

Clarke, D., Ilic, A., Lastovetsky, A. and Sousa, L., Hierarchical partitioning algorithm for scientific computing on highly heterogeneous cpu + gpu clusters. In: Proceedings of the 18th international conference on Parallel Processing, Springer-Verlag, Berlin, Heidelberg. pp. 489-501.

[8]

Cublas: Cuda basic linear algebra subroutines, 2012. http://developer.nvidia.com/cublas.

[9]

Cuda: Compute unified device architecture, 2012. http://www.nvidia.com/object/cuda_home_new.html.

[10]

Cula: Gpu-accelerated linear algebra libraries, 2012. http://www.culatools.com.

[11]

Davidson, A., Zhang, Y. and Owens, J.D., An auto-tuned method for solving large tridiagonal systems on the gpu. In: Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium, IEEE Computer Society, Washington, DC, USA. pp. 956-965.

[12]

D'Azevedo, E. and Hill, J., Parallel lu factorization on gpu cluster. Procedia Computer Science. v9. 67-75.

[13]

DePrince, A. and Hammond, J., Quantum chemical many-body theory on heterogeneous nodes. In: 2011 Symposium on Application Accelerators in High-Performance Computing, SAAHPC. pp. 131-140.

[14]

Gpu-based parallel algorithms for sparse nonlinear systems. Journal of Parallel and Distributed Computing. v72. 1098-1105.

[15]

Gelado, I., Stone, J.E., Cabezas, J., Patel, S., Navarro, N. and Hwu, W.m.W., An asymmetric distributed shared memory model for heterogeneous parallel systems. In: Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems, ACM, New York, NY, USA. pp. 347-358.

[16]

Genovese, L., Ospici, M., Deutsch, T., Méhaut, J.F., Neelov, A. and Goedecker, S., Density functional theory calculation on many-cores hybrid central processing unit-graphic processing unit architectures. The Journal of Chemical Physics. v131. 034103

[17]

Goddeke, D. and Strzodka, R., Cyclic reduction tridiagonal solvers on gpus applied to mixed-precision multigrid. IEEE Transactions on Parallel and Distributed Systems. v22. 22-32.

[18]

Goto, K. and Van De Geijn, R., High-performance implementation of the level-3 blas. ACM Transactions on Mathematical Software. v35. 4:1-4:14.

[19]

Hirshman, S., Perumalla, K., Lynch, V. and Sanchez, R., Bcyclic: a parallel block tridiagonal matrix cyclic solver. Journal of Computational Physics. v229. 6392-6404.

[20]

Hirshman, S.P., Sanchez, R. and Cook, C.R., Siesta: a scalable iterative equilibrium solver for toroidal applications. Physics of Plasmas. v18. 062504

[21]

Hong, S., Oguntebi, T. and Olukotun, K., Efficient parallel graph exploration on multi-core cpu and gpu. In: Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques, IEEE Computer Society, Washington, DC, USA. pp. 78-88.

[22]

Horton, M., Tomov, S. and Dongarra, J., A class of hybrid lapack algorithms for multicore and gpu architectures. In: Proceedings of the 2011 Symposium on Application Accelerators in High-Performance Computing, IEEE Computer Society, Washington, DC, USA. pp. 150-158.

[23]

memCUDA: map device memory to host memory on gpgpu platform. In: Ding, C., Shao, Z., Zheng, R. (Eds.), Lecture Notes in Computer Science, vol. 6289. Springer, Berlin, Heidelberg. pp. 299-313.

[24]

Kim, H.S., Wu, S., Chang, L.w. and Hwu, W.m.W., A scalable tridiagonal solver for gpus. In: Proceedings of the 2011 International Conference on Parallel Processing, IEEE Computer Society, Washington, DC, USA. pp. 444-453.

[25]

Liu, W., Du, Z., Xiao, Y., Bader, D. and Xu, C., A waterfall model to achieve energy efficient tasks mapping for large scale gpu clusters. In: 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum, IPDPSW. pp. 82-92.

[26]

Miyamoto, Y., Wada, A., Iizuka, T., Suzuki, T., Nakayama, T., Yanagawa, K., Aoki, S., Ozaki, Y., Toriu, T., Kawana, M., Nishino, T. and Takeda, M., Realtime 3D profilometer using gpu and multicore cpu. In: Digital Holography and Three-Dimensional Imaging, Optical Society of America. pp. DTuC10

[27]

Opencl: The open standard for parallel programming of heterogeneous systems, 2012. http://www.khronos.org/opencl.

[28]

Rossinelli, D., Hejazialhosseini, B., Spampinato, D.G. and Koumoutsakos, P., Multicore/multi-gpu accelerated simulations of multiphase compressible flows using wavelet adapted grids. SIAM Journal on Scientific Computing. v33. 512-540.

[29]

Saha, B., Zhou, X., Chen, H., Gao, Y., Yan, S., Rajagopalan, M., Fang, J., Zhang, P., Ronen, R. and Mendelson, A., Programming model for a heterogeneous x86 platform. In: Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation, ACM, New York, NY, USA. pp. 431-440.

[30]

Stone, C.P., Duque, E.P.N., Zhang, Y., Car, D., Owens, J.D. and Davis, R.L., Gpgpu parallel algorithms for structured-grid cfd codes. In: Proceedings of the 20th AIAA Computational Fluid Dynamics Conference, 2011-3221.

[31]

Stpiczynski, P. and Potiopa, J., Solving a kind of bvp for odes on heterogeneous cpu + cuda-enabled gpu systems. In: Proceedings of the 2010 International Multiconference on Computer Science and Information Technology, IMCSIT. pp. 349-353.

[32]

Yan, S., Zhou, X., Gao, Y., Chen, H., Wu, G., Luo, S. and Saha, B., Optimizing a shared virtual memory system for a heterogeneous cpu-accelerator platform. SIGOPS Operating Systems Review. v45. 92-100.

[33]

Zhang, Y., Cohen, J. and Owens, J.D., Fast tridiagonal solvers on the gpu. In: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ACM, New York, NY, USA. pp. 127-136.

Cited By

Köster T, Perumalla K, Uhrmacher A, Cai W, Meng T, Wilsey P, Jin K (2017). Efficient Simulation of Nested Hollow Sphere Intersections. Proceedings of the 2017 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, pp. 173-183. doi:10.1145/3064911.3064920. Online publication date: 16-May-2017.
https://dl.acm.org/doi/10.1145/3064911.3064920

Published In

Journal of Parallel and Distributed Computing Volume 73, Issue 12

December, 2013

193 pages

ISSN:0743-7315

Copyright © 2013 Elsevier Inc.

Publisher

Academic Press, Inc.

United States
