DOI: 10.1145/2578948.2560686

Vectorizing Unstructured Mesh Computations for Many-core Architectures

Published: 07 February 2014

Abstract

Achieving optimal performance on the latest multi-core and many-core architectures increasingly depends on making efficient use of the hardware's vector processing capabilities. While auto-vectorizing compilers do not require the use of explicit vector processing constructs, they are only effective on a few classes of applications with regular memory access and computational patterns. Irregular application classes require the explicit use of parallel programming models; CUDA and OpenCL are well established for programming GPUs, but it is not obvious which model to use to exploit the vector units of architectures such as CPUs or the Xeon Phi. It is therefore of growing interest which programming models are available, such as Single Instruction Multiple Threads (SIMT) and Single Instruction Multiple Data (SIMD), and how they map to vector units.
This paper presents results on achieving high performance through vectorization on CPUs and the Xeon Phi on a key class of applications: unstructured mesh computations. By exploring the SIMT and SIMD execution and parallel programming models, we show how abstract unstructured grid computations map to OpenCL or vector intrinsics through the use of code generation techniques, and how these in turn utilize the hardware.
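The computational pattern at the heart of this paper can be illustrated with a minimal sketch (not taken from the paper; the mesh, names, and flux formula are invented for illustration, and OP2's actual implementations are generated C/OpenCL/intrinsics code): an unstructured-mesh edge loop gathers node data through an indirection array, computes a per-edge quantity, and scatter-adds the result back to the nodes. Two edges sharing a node must not execute in the same vector lane, which is what defeats auto-vectorization and motivates the code-generation approach described above.

```python
# Hypothetical 4-node ring mesh; edge2node is the indirection (mapping) array.
def edge_loop(x, edge2node):
    """Gather endpoint values per edge, compute a flux, scatter-add back."""
    res = [0.0] * len(x)
    for a, b in edge2node:
        flux = 0.5 * (x[a] - x[b])  # gather through the indirection array
        res[a] += flux              # scatter-add: edges sharing a node would
        res[b] -= flux              # conflict if executed in parallel lanes
    return res

if __name__ == "__main__":
    edges = [(0, 1), (1, 2), (2, 3), (3, 0)]       # a 4-node ring
    print(edge_loop([1.0, 2.0, 3.0, 4.0], edges))  # [-2.0, 0.0, 0.0, 2.0]
```

The serial loop is trivially correct; the difficulty the paper addresses is executing many such edges per vector instruction without the scatter-adds racing.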
We benchmark a number of systems, including Intel Xeon CPUs and the Intel Xeon Phi, using an industrially representative CFD application and compare the results against previous work on CPUs and NVIDIA GPUs to provide a contrasting comparison of what could be achieved on current many-core systems. By carrying out a performance analysis study, we identify key performance bottlenecks due to computational, control and bandwidth limitations.
We show that the OpenCL SIMT model does not map efficiently to CPU vector units due to auto-vectorization issues and threading overheads. We demonstrate that while the use of SIMD vector intrinsics imposes some restrictions and requires more involved programming techniques, it does yield efficient code and near-optimal performance, up to 2 times faster than the non-vectorized code. We observe that while the Xeon Phi does not provide good performance for this class of applications, it is still on par with a pair of high-end Xeon chips. CPUs and GPUs saturate the available resources, giving performance very near the optimum.
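One standard way to make the conflicting scatter-adds safe for SIMD execution is multi-coloring: partition the edges so that no two edges of the same color share a node, then the per-color inner loop is conflict-free and can be vectorized. The sketch below (illustrative only; the colors are hand-picked for a hypothetical 4-node ring, whereas real frameworks compute them, and the paper's vectorized loops are generated AVX intrinsics rather than Python) shows the restructured loop producing the same result as the serial version.

```python
# Hypothetical ring mesh, colored so edges within a color touch disjoint nodes.
def colored_edge_loop(x, edge2node, colors):
    """Same edge computation, executed color by color."""
    res = [0.0] * len(x)
    for color in colors:        # colors run sequentially
        for e in color:         # conflict-free within a color: vectorizable
            a, b = edge2node[e]
            flux = 0.5 * (x[a] - x[b])
            res[a] += flux
            res[b] -= flux
    return res

if __name__ == "__main__":
    edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
    colors = [[0, 2], [1, 3]]   # (0,1),(2,3) disjoint; (1,2),(3,0) disjoint
    print(colored_edge_loop([1.0, 2.0, 3.0, 4.0], edges, colors))
    # [-2.0, 0.0, 0.0, 2.0], identical to the serial loop
```

The trade-off is that coloring scatters memory accesses within a color, which is one source of the bandwidth and control bottlenecks analyzed in the paper.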




Published In

PMAM'14: Proceedings of Programming Models and Applications on Multicores and Manycores
February 2014
156 pages
ISBN:9781450326575
DOI:10.1145/2578948
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. AVX
  2. CUDA
  3. Domain Specific Library
  4. OP2
  5. Unstructured Grid
  6. Vectorization
  7. Xeon Phi

Qualifiers

  • Tutorial
  • Research
  • Refereed limited

Conference

PPoPP '14

Acceptance Rates

Overall Acceptance Rate 53 of 97 submissions, 55%

Cited By

  • (2018) The VOLNA-OP2 tsunami code (version 1.5). Geoscientific Model Development 11(11):4621-4635. DOI: 10.5194/gmd-11-4621-2018. Published 19 Nov 2018.
  • (2018) Comparative analysis of coprocessors. Concurrency and Computation: Practice and Experience 31(1). DOI: 10.1002/cpe.4756. Published 4 Sep 2018.
  • (2017) Experimentation of vision algorithm performance using custom OpenCL™ vector language extensions for a graphical accelerator with vector architecture. 2017 13th IEEE International Conference on Intelligent Computer Communication and Processing (ICCP), 339-346. DOI: 10.1109/ICCP.2017.8117027. Published Sep 2017.
  • (2016) Acceleration of a Full-Scale Industrial CFD Application with OP2. IEEE Transactions on Parallel and Distributed Systems 27(5):1265-1278. DOI: 10.1109/TPDS.2015.2453972. Published 1 May 2016.
  • (2016) High Performance Computing on the IBM Power8 Platform. High Performance Computing, 235-254. DOI: 10.1007/978-3-319-46079-6_17. Published 6 Oct 2016.
  • (2014) The OPS domain specific abstraction for multi-block structured grid computations. Proceedings of the Fourth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing, 58-67. DOI: 10.5555/2691166.2691173. Published 16 Nov 2014.
  • (2014) The OPS Domain Specific Abstraction for Multi-block Structured Grid Computations. 2014 Fourth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing, 58-67. DOI: 10.1109/WOLFHPC.2014.7. Published Nov 2014.
