research-article

Register optimizations for stencils on GPUs

Authors:

Prashant Singh Rawat,

Fabrice Rastello,

Aravind Sukumaran-Rajam,

Louis-Noël Pouchet,

Atanas Rountev,

P. SadayappanAuthors Info & Claims

PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Pages 168 - 182

https://doi.org/10.1145/3178487.3178500

Published: 10 February 2018 Publication History

Abstract

The recent advent of compute-intensive GPU architecture has allowed application developers to explore high-order 3D stencils for better computational accuracy. A common optimization strategy for such stencils is to expose sufficient data reuse by means such as loop unrolling, with the expectation of register-level reuse. However, the resulting code is often highly constrained by register pressure. While current state-of-the-art register allocators are satisfactory for most applications, they are unable to effectively manage register pressure for such complex high-order stencils, resulting in sub-optimal code with a large number of register spills. In this paper, we develop a statement reordering framework that models stencil computations as a DAG of trees with shared leaves, and adapts an optimal scheduling algorithm for minimizing register usage for expression trees. The effectiveness of the approach is demonstrated through experimental results on a range of stencils extracted from application codes.

Supplementary Material

Artifacts Available (ppopp-artifact-1.0.0.zip)

The is the software accompanying the paper "Register Optimizations for Stencils on GPUs" that appears in PPoPP'18. The software proposes a source-to-source register reordering framework that can help alleviate register pressure for high-order stencils on GPUs.

The package contains: a. the source code for the reordering framework b. the examples used in the paper in the examples/ directory c. scripts for code installation and benchmarking.

Download
1.10 MB

References

[1]

A. Aho, M. Lam, R. Sethi, and J. Ullman. 2007. Compilers: Principles, Techniques, and Tools (2nd ed). Pearson.

Digital Library

[2]

A. V. Aho and S. C. Johnson. 1975. Optimal Code Generation for Expression Trees. In Proceedings of Seventh Annual ACM Symposium on Theory of Computing (STOC '75). ACM, New York, NY, USA, 207--217.

Digital Library

[3]

A. V. Aho, S. C. Johnson, and J. D. Ullman. 1977. Code Generation for Expressions with Common Subexpressions. J. ACM 24, 1 (Jan. 1977), 146--160.

Digital Library

[4]

A. W. Appel and K. J. Supowit. 1987. Generalization of the Sethi-Ullman Algorithm for Register Allocation. Softw. Pract. Exper. 17, 6 (June 1987), 417--421.

Digital Library

[5]

Vinayaka Bandishti, Irshad Pananilath, and Uday Bondhugula. 2012. Tiling Stencil Computations to Maximize Parallelism. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '12). IEEE Computer Society Press, Los Alamitos, CA, USA, Article 40, 11 pages.

Digital Library

[6]

P. Basu, M. Hall, S. Williams, B. V. Straalen, L. Oliker, and P. Colella. 2015. Compiler-Directed Transformation for Higher-Order Stencils. In Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International. 313--323.

Digital Library

[7]

David A. Berson, Rajiv Gupta, and Mary Lou Soffa. 1999. Integrated Instruction Scheduling and Register Allocation Techniques. In Proceedings of the 11th International Workshop on Languages and Compilers for Parallel Computing (LCPC '98). Springer-Verlag, London, UK, UK, 247--262.

Digital Library

[8]

Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. 2008. A Practical Automatic Polyhedral Parallelizer and Locality Optimizer. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '08). ACM, New York, NY, USA, 101--113.

Digital Library

[9]

Preston Briggs, Keith D. Cooper, and Linda Torczon. 1992. Rematerialization. In Proceedings of the ACM SIGPLAN 1992 Conference on Programming Language Design and Implementation (PLDI '92). ACM, New York, NY, USA, 311--321.

Digital Library

[10]

Preston Briggs, Keith D. Cooper, and Linda Torczon. 1994. Improvements to Graph Coloring Register Allocation. ACM Trans. Program. Lang. Syst. 16, 3 (May 1994), 428--455.

Digital Library

[11]

G. J. Chaitin. 1982. Register Allocation & Spilling via Graph Coloring. In Proceedings of the 1982 SIGPLAN Symposium on Compiler Construction (SIGPLAN '82). ACM, New York, NY, USA, 98--105.

Digital Library

[12]

J. M. Codina, J. Sanchez, and A. Gonzalez. 2001. A unified modulo scheduling and register allocation technique for clustered processors. In Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques. 175--184.

Digital Library

[13]

Q. Colombet, B. Boissinot, P. Brisk, S. Hack, and F. Rastello. 2011. Graph-coloring and treescan register allocation using repairing. In 2011 Proceedings of the 14th International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES). 45--54.

Digital Library

[14]

Raúl De La Cruz, Mauricio Araya-Polo, and José María Cela. 2010. Introducing the Semi-stencil Algorithm. In Proceedings of the 8th International Conference on Parallel Processing and Applied Mathematics: Part I (PPAM'09). Springer-Verlag, Berlin, Heidelberg, 496--506.

Digital Library

[15]

Steven J. Deitz, Bradford L. Chamberlain, and Lawrence Snyder. 2001. Eliminating Redundancies in Sum-of-product Array Computations. In Proceedings of the 15th International Conference on Supercomputing (ICS '01). ACM, New York, NY, USA, 65--77.

Digital Library

[16]

Lukasz Domagala, Duco van Amstel, Fabrice Rastello, and P. Sadayappan. 2016. Register Allocation and Promotion Through Combined Instruction Scheduling and Loop Unrolling. In Proceedings of the 25th International Conference on Compiler Construction (CC 2016). ACM, New York, NY, USA, 143--151.

Digital Library

[17]

ExaCT 2013. ExaCT: Center for Exascale Simulation of Combustion in Turbulence: Proxy App Software. https://exactcodesign.org/proxy-app-software/. (2013).

[18]

M. Frigo and S. G. Johnson. 2005. The Design and Implementation of FFTW3. Proc. IEEE 93, 2 (Feb 2005), 216--231.

[19]

Kazushige Goto and Robert A. van de Geijn. 2008. Anatomy of High-performance Matrix Multiplication. ACM Trans. Math. Softw. 34, 3, Article 12 (May 2008), 25 pages.

Digital Library

[20]

Ramaswamy Govindarajan, H. Yang, Chihong Zhang, José N. Amaral, and Guang R. Gao. 2001. Minimum Register Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs. In Proceedings of the 15th International Parallel & Distributed Processing Symposium (IPDPS '01). IEEE Computer Society, Washington, DC, USA, 26--33.

Digital Library

[21]

Tobias Grosser, Albert Cohen, Justin Holewinski, P. Sadayappan, and Sven Verdoolaege. 2014. Hybrid Hexagonal/Classical Tiling for GPUs. In Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO '14). ACM, Article 66, 10 pages.

Digital Library

[22]

Tobias Gysi, Tobias Grosser, and Torsten Hoefler. 2015. MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures. In Proceedings of the 29th ACM on International Conference on Supercomputing (ICS '15). ACM, 177--186.

Digital Library

[23]

Mary Hall, Jacqueline Chame, Chun Chen, Jaewook Shin, Gabe Rudy, and Malik Murtaza Khan. 2010. Loop Transformation Recipes for Code Generation and Auto-tuning. In Proceedings of the 22Nd International Conference on Languages and Compilers for Parallel Computing (LCPC'09). Springer-Verlag, Berlin, Heidelberg, 50--64.

Digital Library

[24]

Ari B. Hayes, Lingda Li, Daniel Chavarría-Miranda, Shuaiwen Leon Song, and Eddy Z. Zhang. 2016. Orion: A Framework for GPU Occupancy Tuning. In Proceedings of the 17th International Middleware Conference (Middleware '16). ACM, New York, NY, USA, 18:1--18:13.

Digital Library

[25]

Tom Henretty, Richard Veras, Franz Franchetti, Louis-Noël Pouchet, J. Ramanujam, and P. Sadayappan. 2013. A Stencil Compiler for Short-vector SIMD Architectures. In Proceedings of the 27th International ACM Conference on International Conference on Supercomputing (ICS '13). ACM, 13--24.

Digital Library

[26]

HPGMG 2016. High-Performance Geometric Multigrid. https://hpgmg.org/. (2016).

[27]

Hong Jia-Wei and H. T. Kung. 1981. I/O Complexity: The Red-blue Pebble Game. In Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing (STOC '81). ACM, New York, NY, USA, 326--333.

Digital Library

[28]

Mengyao Jin, Haohuan Fu, Zihong Lv, and Guangwen Yang. 2016. Libra: An Automated Code Generation and Tuning Framework for Register-limited Stencils on GPUs. In Proceedings of the ACM International Conference on Computing Frontiers (CF '16). ACM, New York, NY, USA, 92--99.

Digital Library

[29]

David Ryan Koes and Seth Copen Goldstein. 2006. A Global Progressive Register Allocator. In Proceedings of the 27th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '06). ACM, New York, NY, USA, 204--215.

Digital Library

[30]

Stefan Kral, Franz Franchetti, Juergen Lorenz, Christoph W. Ueberhuber, and Peter Wurzinger. 2004. FFT Compiler Techniques. In Compiler Construction: 13th International Conference, CC 2004. Springer Berlin Heidelberg, 217--231.

[31]

Chris Lattner and Vikram Adve. 2004. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization (CGO '04). IEEE Computer Society, Washington, DC, USA, 75-.

Digital Library

[32]

A. Li, S. L. Song, A. Kumar, E. Z. Zhang, D. Chavarría-Miranda, and H. Corporaal. 2016. Critical points based register-concurrency auto-tuning for GPUs. In 2016 Design, Automation Test in Europe Conference Exhibition (DATE). 1273--1278.

Digital Library

[33]

Guei-Yuan Lueh, Thomas Gross, and Ali-Reza Adl-Tabatabai. 2000. Fusion-based Register Allocation. ACM Trans. Program. Lang. Syst. 22, 3 (May 2000), 431--470.

Digital Library

[34]

Hanspeter Mössenböck and Michael Pfeiffer. 2002. Linear Scan Register Allocation in the Context of SSA Form and Register Constraints. Springer Berlin Heidelberg, 229--246.

Digital Library

[35]

Rajeev Motwani, Krishna V. Palem, Vivek Sarkar, and Salem Reyen. 1995. Combining Register Allocation and Instruction Scheduling. Technical Report. Stanford, CA, USA.

Digital Library

[36]

Ravi Teja Mullapudi, Vinay Vasista, and Uday Bondhugula. 2015. Poly-Mage: Automatic Optimization for Image Processing Pipelines. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '15). ACM, 429--443.

Digital Library

[37]

C. Norris and L. L. Pollock. 1993. A scheduler-sensitive global register allocator. In Supercomputing '93. Proceedings. 804--813.

Digital Library

[38]

NVCC 2017. NVIDIA CUDA Compiler Driver NVCC. docs.nvidia.com/cuda/cuda-compiler-driver-nvcc. (2017).

[39]

NVprof 2017. NVIDIA Profiler. http://docs.nvidia.com/cuda/profiler-users-guide. (2017).

[40]

Shlomit S. Pinter. 1993. Register Allocation with Instruction Scheduling. In Proceedings of the ACM SIGPLAN 1993 Conference on Programming Language Design and Implementation (PLDI '93). ACM, New York, NY, USA, 248--257.

Digital Library

[41]

Massimiliano Poletto and Vivek Sarkar. 1999. Linear Scan Register Allocation. ACM Trans. Program. Lang. Syst. 21, 5 (Sept. 1999), 895--913.

Digital Library

[42]

Fernando Magno Quintão Pereira and Jens Palsberg. 2008. Register Allocation by Puzzle Solving. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '08). ACM, New York, NY, USA, 216--226.

Digital Library

[43]

Mahesh Ravishankar, Justin Holewinski, and Vinod Grover. 2015. Forma: A DSL for Image Processing Applications to Target GPUs and Multi-core CPUs. In Proc. 8th Workshop on General Purpose Processing Using GPUs. 109--120.

Digital Library

[44]

Prashant Singh Rawat, Changwan Hong, Mahesh Ravishankar, Vinod Grover, Louis-Noel Pouchet, Atanas Rountev, and P. Sadayappan. 2016. Resource Conscious Reuse-Driven Tiling for GPUs. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation (PACT '16). ACM, 99--111.

Digital Library

[45]

Hongbo Rong. 2009. Tree Register Allocation. In Proceedings of the 42Nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 42). ACM, New York, NY, USA, 67--77.

Digital Library

[46]

Vivek Sarkar and Rajkishore Barik. 2007. Extended Linear Scan: An Alternate Foundation for Global Register Allocation. In Proceedings of the 16th International Conference on Compiler Construction (CC'07). Springer-Verlag, Berlin, Heidelberg, 141--155.

Digital Library

[47]

Ravi Sethi and J. D. Ullman. 1970. The Generation of Optimal Code for Arithmetic Expressions. J. ACM 17, 4 (Oct. 1970), 715--728.

Digital Library

[48]

Michael D. Smith, Norman Ramsey, and Glenn Holloway. 2004. A Generalized Algorithm for Graph-coloring Register Allocation. In Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation (PLDI '04). ACM, New York, NY, USA, 277--288.

Digital Library

[49]

Richard M. Stallman and GCC Developer Community. 2009. Using The GNU Compiler Collection: A GNU Manual For GCC Version 4.3.3. CreateSpace, Paramount, CA.

Digital Library

[50]

Kevin Stock, Martin Kong, Tobias Grosser, Louis-Noël Pouchet, Fabrice Rastello, J. Ramanujam, and P. Sadayappan. 2014. A Framework for Enhancing Data Reuse via Associative Reordering. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '14). ACM, New York, NY, USA, 65--76.

Digital Library

[51]

SW4 2014. Seismic Wave Modelling (SW4) - Computational Infrastructure for Geodynamics. https://geodynamics.org/cig/software/sw4/. (2014).

[52]

Sid Touati and Christine Eisenbeis. 2004. Early Periodic Register Allocation on ILP Processors. Parallel Processing Letters 14, 2 (June 2004), 287--313.

[53]

Swapneela Unkule, Christopher Shaltz, and Apan Qasem. 2012. Automatic Restructuring of GPU Kernels for Exploiting Inter-thread Data Locality. In Proceedings of the 21st International Conference on Compiler Construction (CC'12). Springer-Verlag, Berlin, Heidelberg, 21--40.

Digital Library

[54]

Mohamed Wahib and Naoya Maruyama. 2014. Scalable Kernel Fusion for Memory-bound GPU Applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '14). IEEE Press, 191--202.

Digital Library

[55]

Mohamed Wahib and Naoya Maruyama. 2015. Automated GPU Kernel Transformations in Large-Scale Production Stencil Applications. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '15). ACM, New York, NY, USA, 259--270.

Digital Library

[56]

Jian Wang, Andreas Krall, M. Anton Ertl, and Christine Eisenbeis. 1994. Software Pipelining with Register Allocation and Spilling. In Proceedings of the 27th Annual International Symposium on Microarchitecture (MICRO 27). ACM, New York, NY, USA, 95--99.

Digital Library

[57]

Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, and Robert Hundt. 2016. gpucc: An Open-source GPGPU Compiler. In Proceedings of the 2016 International Symposium on Code Generation and Optimization (CGO '16). 105--116.

Digital Library

[58]

Xiaolong Xie, Yun Liang, Xiuhong Li, Yudong Wu, Guangyu Sun, Tao Wang, and Dongrui Fan. 2015. Enabling Coordinated Register Allocation and Thread-level Parallelism Optimization for GPUs. In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48). ACM, New York, NY, USA, 395--406.

Digital Library

[59]

Jingling Xue. 1997. On Tiling as a Loop Transformation. Parallel Processing Letters 07, 04 (1997), 409--424.

Cited By

Deng CChen ZShi YMa YWen MLuo L(2024)Optimizing VLIW Instruction Scheduling via a Two-Dimensional Constrained Dynamic ProgrammingACM Transactions on Design Automation of Electronic Systems10.1145/364313529:5(1-20)Online publication date: 25-Jan-2024
https://dl.acm.org/doi/10.1145/3643135
Sun QLiu YYang HJiang ZLuan ZQian D(2024)Adaptive Auto-Tuning Framework for Global Exploration of Stencil Optimization on GPUsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2023.332563035:1(20-33)Online publication date: Jan-2024
https://doi.org/10.1109/TPDS.2023.3325630
Shobaki GMuyan-Özçelik PHutton JLinck BMalyshenko VKerbow ARamirez-Ortega RGordon V(2024)Instruction Scheduling for the GPU on the GPU2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)10.1109/CGO57630.2024.10444869(435-447)Online publication date: 2-Mar-2024
https://doi.org/10.1109/CGO57630.2024.10444869
Show More Cited By

Index Terms

Register optimizations for stencils on GPUs
1. Software and its engineering
  1. Software notations and tools
    1. Compilers

Recommendations

Register optimizations for stencils on GPUs
PPoPP '18

The recent advent of compute-intensive GPU architecture has allowed application developers to explore high-order 3D stencils for better computational accuracy. A common optimization strategy for such stencils is to expose sufficient data reuse by means ...
Performance evaluation of OpenMP's target construct on GPUs-exploring compiler optimisations

OpenMP is a directive-based shared memory parallel programming model and has been widely used for many years. From OpenMP 4.0 onwards, GPU platforms are supported by extending OpenMP's high-level parallel abstractions with accelerator programming. This ...
Register file optimizations for superscalar microprocessors

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 2018

442 pages

ISBN:9781450349826

DOI:10.1145/3178487

General Chair:
Andreas Krall
Vienna University of Technology, Austria
,
Program Chair:
Thomas R. Gross
ETH Zürich, Switzerland

ACM SIGPLAN Notices Volume 53, Issue 1
PPoPP '18
January 2018
426 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/3200691
Editor:
Matthew Fluet
Rodchester Institude of Technology
Issue’s Table of Contents

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication Notes

Badge change: Article originally badged under Version 1.0 guidelines https://www.acm.org/publications/policies/artifact-review-badging

Publication History

Published: 10 February 2018

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Badges

Qualifiers

Research-article

Funding Sources

U.S. National Science Foundation
U.S. Department of Energy Exascale Computing Project

Conference

PPoPP '18

Sponsor:

PPoPP '18: 23nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 24 - 28, 2018

Vienna, Austria

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

39
Total Citations
View Citations
987
Total Downloads

Downloads (Last 12 months)67
Downloads (Last 6 weeks)11

Reflects downloads up to 13 Sep 2024

Other Metrics

View Author Metrics

Citations

Cited By

Deng CChen ZShi YMa YWen MLuo L(2024)Optimizing VLIW Instruction Scheduling via a Two-Dimensional Constrained Dynamic ProgrammingACM Transactions on Design Automation of Electronic Systems10.1145/364313529:5(1-20)Online publication date: 25-Jan-2024
https://dl.acm.org/doi/10.1145/3643135
Sun QLiu YYang HJiang ZLuan ZQian D(2024)Adaptive Auto-Tuning Framework for Global Exploration of Stencil Optimization on GPUsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2023.332563035:1(20-33)Online publication date: Jan-2024
https://doi.org/10.1109/TPDS.2023.3325630
Shobaki GMuyan-Özçelik PHutton JLinck BMalyshenko VKerbow ARamirez-Ortega RGordon V(2024)Instruction Scheduling for the GPU on the GPU2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)10.1109/CGO57630.2024.10444869(435-447)Online publication date: 2-Mar-2024
https://doi.org/10.1109/CGO57630.2024.10444869
Matsumura KDe Gonzalo SPeña AVerbrugge CLhoták OShen X(2023)A Symbolic Emulator for Shuffle Synthesis on the NVIDIA PTX CodeProceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction10.1145/3578360.3580253(110-121)Online publication date: 17-Feb-2023
https://dl.acm.org/doi/10.1145/3578360.3580253
Chang CZhang XHsiao H(2023)Query Regrouping Problem on Tree Structure for GPUs Accelerated PlatformNew Trends in Computer Technologies and Applications10.1007/978-981-19-9582-8_18(207-211)Online publication date: 10-Feb-2023
https://doi.org/10.1007/978-981-19-9582-8_18
Liu XLiu YYang HLiao JLi MLuan ZQian DRauchwerger LCameron KNikolopoulos DPnevmatikatos D(2022)Toward accelerated stencil computation by adapting tensor core unit on GPUProceedings of the 36th ACM International Conference on Supercomputing10.1145/3524059.3532392(1-12)Online publication date: 28-Jun-2022
https://dl.acm.org/doi/10.1145/3524059.3532392
Hanindhito BGourounas DFathi ATrenev DGerstlauer AJohn LRauchwerger LCameron KNikolopoulos DPnevmatikatos D(2022)GAPSProceedings of the 36th ACM International Conference on Supercomputing10.1145/3524059.3532373(1-13)Online publication date: 28-Jun-2022
https://dl.acm.org/doi/10.1145/3524059.3532373
Shobaki GBassett JHeffernan MKerbow AEgger BSmith A(2022)Graph transformations for register-pressure-aware instruction schedulingProceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction10.1145/3497776.3517771(41-53)Online publication date: 19-Mar-2022
https://dl.acm.org/doi/10.1145/3497776.3517771
Li JYe HTian SLi XZhang J(2022)A Fine-grained Prefetching Scheme for DGEMM Kernels on GPU with Auto-tuning Compatibility2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS53621.2022.00089(863-874)Online publication date: May-2022
https://doi.org/10.1109/IPDPS53621.2022.00089
Li KYuan LZhang YYue YCao H(2022)An Efficient Vectorization Scheme for Stencil Computation2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS53621.2022.00069(650-660)Online publication date: May-2022
https://doi.org/10.1109/IPDPS53621.2022.00069
Show More Cited By

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents