Efficient heterogeneous execution on large multicore and accelerator platforms: Case study using a block tridiagonal solver

Published: 01 December 2013
    Abstract

    Algorithmic and implementation principles are explored for effectively exploiting GPU accelerators in conjunction with multicore processors on high-end systems with large numbers of compute nodes, and are evaluated in an implementation of a scalable block tridiagonal solver. The accelerator of each compute node is used in combination with the multicore processors of that node to perform the block-level linear algebra operations of the overall distributed solver algorithm. The optimizations incorporated include: (1) an efficient memory mapping and synchronization interface that minimizes data movement, (2) multi-process sharing of the accelerator within a node to balance load with the multicore processors, and (3) an automatic memory management system that uses accelerator memory efficiently when sub-matrices exceed the limits of device memory. Results are reported from our novel implementation, which uses the MAGMA and CUBLAS accelerator libraries simultaneously with ACML [2] for multithreaded execution on the processors. Overall, using 940 NVIDIA Tesla X2090 accelerators and 15,040 cores, the best heterogeneous execution delivers a 10.9-fold reduction in run time relative to an already efficient parallel multicore-only baseline implementation that is highly optimized with intra-node and inter-node concurrency and computation-communication overlap. Detailed quantitative results are presented to explain the critical runtime components that contribute to hybrid performance.
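
Optimization (3) above can be pictured with a small host-side model. The sketch below is illustrative only and is not the paper's implementation: the class name `DeviceBlockCache`, the block labels, and the least-recently-used eviction policy are all assumptions, since the abstract does not say how blocks are chosen for eviction. It tracks which matrix blocks are resident in a fixed-size device memory pool and counts the host-to-device copies that would be needed.

```python
from collections import OrderedDict


class DeviceBlockCache:
    """Toy model of an accelerator memory pool holding matrix blocks.

    Hypothetical illustration: LRU eviction is an assumption made here,
    not a policy stated in the abstract.
    """

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.resident = OrderedDict()  # block_id -> size, least recent first
        self.uploads = 0               # host-to-device copies performed
        self.evictions = 0             # blocks pushed out of device memory

    def acquire(self, block_id, size_bytes):
        """Make a block resident; return True if a transfer was needed."""
        if size_bytes > self.capacity:
            raise ValueError("block larger than device memory")
        if block_id in self.resident:
            # Already on the device: refresh its recency, no copy needed.
            self.resident.move_to_end(block_id)
            return False
        # Evict least recently used blocks until the new block fits.
        while self.used + size_bytes > self.capacity:
            _, freed = self.resident.popitem(last=False)
            self.used -= freed
            self.evictions += 1
        self.resident[block_id] = size_bytes
        self.used += size_bytes
        self.uploads += 1
        return True


if __name__ == "__main__":
    # Device memory holds three 1 KiB blocks; the access pattern touches five.
    cache = DeviceBlockCache(capacity_bytes=3 * 1024)
    for block in ["L1", "D1", "U1", "D1", "L2", "D2", "D1"]:
        cache.acquire(block, 1024)
    print(cache.uploads, cache.evictions, list(cache.resident))
```

In this model, a block that is reused while still resident (here `D1`) costs no extra transfer, which is the kind of data-movement saving an automatic manager aims for when the working set spills over device memory.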

    References

    [1]
    Aaby, B.G., Perumalla, K.S. and Seal, S.K., Efficient simulation of agent-based models on multi-gpu and multi-core clusters. In: Proceedings of the 3rd International ICST Conference on Simulation Tools and Techniques, ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), Brussels, Belgium. pp. 29:1-29:10.
    [2]
    Acml: Advanced micro devices core math library, 2013. http://developer.amd.com.
    [3]
    E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, J. Langou, H. Ltaief, S. Tomov, Lu factorization for accelerator-based systems, in: 2011 9th IEEE/ACS International Conference on Computer Systems and Applications, AICCSA, pp. 217-224.
    [4]
    Agullo, E., Augonnet, C., Dongarra, J., Ltaief, H., Namyst, R., Thibault, S. and Tomov, S., Chapter 34 - a hybridization methodology for high-performance linear algebra software for gpus. In: Hwu, W.m.W. (Ed.), GPU Computing Gems Jade Edition, Morgan Kaufmann, Boston. pp. 473-484.
    [5]
    Agullo, E., Demmel, J., Dongarra, J., Hadri, B., Kurzak, J., Langou, J., Ltaief, H., Luszczek, P. and Tomov, S., Numerical linear algebra on emerging architectures: the plasma and magma projects. Journal of Physics: Conference Series. v180. 012037
    [6]
    S. Bai, D. Nicol, Acceleration of wireless channel simulation using gpus, in: Wireless Conference, EW, 2010 European, pp. 841-848.
    [7]
    Clarke, D., Ilic, A., Lastovetsky, A. and Sousa, L., Hierarchical partitioning algorithm for scientific computing on highly heterogeneous cpu + gpu clusters. In: Proceedings of the 18th international conference on Parallel Processing, Springer-Verlag, Berlin, Heidelberg. pp. 489-501.
    [8]
    Cublas: Cuda basic linear algebra subroutines, 2012. http://developer.nvidia.com/cublas.
    [9]
    Cuda: Compute unified device architecture, 2012. http://www.nvidia.com/object/cuda_home_new.html.
    [10]
    Cula: Gpu-accelerated linear algebra libraries, 2012. http://www.culatools.com.
    [11]
    Davidson, A., Zhang, Y. and Owens, J.D., An auto-tuned method for solving large tridiagonal systems on the gpu. In: Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium, IEEE Computer Society, Washington, DC, USA. pp. 956-965.
    [12]
    D'Azevedo, E. and Hill, J., Parallel lu factorization on gpu cluster. Procedia Computer Science. v9. 67-75.
    [13]
    A. DePrince, J. Hammond, Quantum chemical many-body theory on heterogeneous nodes, in: 2011 Symposium on Application Accelerators in High-Performance Computing, SAAHPC, pp. 131-140.
    [14]
    Gpu-based parallel algorithms for sparse nonlinear systems. Journal of Parallel and Distributed Computing. v72. 1098-1105.
    [15]
    Gelado, I., Stone, J.E., Cabezas, J., Patel, S., Navarro, N. and Hwu, W.m.W., An asymmetric distributed shared memory model for heterogeneous parallel systems. In: Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems, ACM, New York, NY, USA. pp. 347-358.
    [16]
    Genovese, L., Ospici, M., Deutsch, T., Méhaut, J.F., Neelov, A. and Goedecker, S., Density functional theory calculation on many-cores hybrid central processing unit-graphic processing unit architectures. The Journal of Chemical Physics. v131. 034103
    [17]
    Goddeke, D. and Strzodka, R., Cyclic reduction tridiagonal solvers on gpus applied to mixed-precision multigrid. IEEE Transactions on Parallel and Distributed Systems. v22. 22-32.
    [18]
    Goto, K. and Van De Geijn, R., High-performance implementation of the level-3 blas. ACM Transactions on Mathematical Software. v35. 4:1-4:14.
    [19]
    Hirshman, S., Perumalla, K., Lynch, V. and Sanchez, R., Bcyclic: a parallel block tridiagonal matrix cyclic solver. Journal of Computational Physics. v229. 6392-6404.
    [20]
    Hirshman, S.P., Sanchez, R. and Cook, C.R., Siesta: a scalable iterative equilibrium solver for toroidal applications. Physics of Plasmas. v18. 062504
    [21]
    Hong, S., Oguntebi, T. and Olukotun, K., Efficient parallel graph exploration on multi-core cpu and gpu. In: Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques, IEEE Computer Society, Washington, DC, USA. pp. 78-88.
    [22]
    Horton, M., Tomov, S. and Dongarra, J., A class of hybrid lapack algorithms for multicore and gpu architectures. In: Proceedings of the 2011 Symposium on Application Accelerators in High-Performance Computing, IEEE Computer Society, Washington, DC, USA. pp. 150-158.
    [23]
    memCUDA: map device memory to host memory on gpgpu platform. In: Ding, C., Shao, Z., Zheng, R. (Eds.), Lecture Notes in Computer Science, vol. 6289. Springer, Berlin, Heidelberg. pp. 299-313.
    [24]
    Kim, H.S., Wu, S., Chang, L.w. and Hwu, W.m.W., A scalable tridiagonal solver for gpus. In: Proceedings of the 2011 International Conference on Parallel Processing, IEEE Computer Society, Washington, DC, USA. pp. 444-453.
    [25]
    W. Liu, Z. Du, Y. Xiao, D. Bader, C. Xu, A waterfall model to achieve energy efficient tasks mapping for large scale gpu clusters, in: 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum, IPDPSW, pp. 82-92.
    [26]
    Miyamoto, Y., Wada, A., Iizuka, T., Suzuki, T., Nakayama, T., Yanagawa, K., Aoki, S., Ozaki, Y., Toriu, T., Kawana, M., Nishino, T. and Takeda, M., Realtime 3D profilometer using gpu and multicore cpu. In: Digital Holography and Three-Dimensional Imaging, Optical Society of America. pp. DTuC10
    [27]
    Opencl: The open standard for parallel programming of heterogeneous systems, 2012. http://www.khronos.org/opencl.
    [28]
    Rossinelli, D., Hejazialhosseini, B., Spampinato, D.G. and Koumoutsakos, P., Multicore/multi-gpu accelerated simulations of multiphase compressible flows using wavelet adapted grids. SIAM Journal on Scientific Computing. v33. 512-540.
    [29]
    Saha, B., Zhou, X., Chen, H., Gao, Y., Yan, S., Rajagopalan, M., Fang, J., Zhang, P., Ronen, R. and Mendelson, A., Programming model for a heterogeneous x86 platform. In: Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation, ACM, New York, NY, USA. pp. 431-440.
    [30]
    C.P. Stone, E.P.N. Duque, Y. Zhang, D. Car, J.D. Owens, R.L. Davis, Gpgpu parallel algorithms for structured-grid cfd codes, in: Proceedings of the 20th AIAA Computational Fluid Dynamics Conference, 2011-3221.
    [31]
    P. Stpiczynski, J. Potiopa, Solving a kind of bvp for odes on heterogeneous cpu + cuda-enabled gpu systems, in: Proceedings of the 2010 International Multiconference on Computer Science and Information Technology, IMCSIT, pp. 349-353.
    [32]
    Yan, S., Zhou, X., Gao, Y., Chen, H., Wu, G., Luo, S. and Saha, B., Optimizing a shared virtual memory system for a heterogeneous cpu-accelerator platform. SIGOPS Operating Systems Review. v45. 92-100.
    [33]
    Zhang, Y., Cohen, J. and Owens, J.D., Fast tridiagonal solvers on the gpu. In: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ACM, New York, NY, USA. pp. 127-136.

    Cited By

    • (2017) Efficient Simulation of Nested Hollow Sphere Intersections. Proceedings of the 2017 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, pp. 173-183. doi:10.1145/3064911.3064920. Online publication date: 16-May-2017

            Published In

            Journal of Parallel and Distributed Computing  Volume 73, Issue 12
            December, 2013
            193 pages

            Publisher

            Academic Press, Inc.

            United States

            Author Tags

            1. Accelerator
            2. GPU
            3. Heterogeneous execution
            4. Linear algebra
            5. Memory management
            6. Tridiagonal solver

            Qualifiers

            • Article
