DOI: 10.1145/2628071.2628101

Warp-aware trace scheduling for GPUs

Abstract

    GPU performance depends not only on thread/warp-level parallelism (TLP) but also on instruction-level parallelism (ILP). It is not enough to schedule instructions within basic blocks; opportunities for ILP optimization beyond branch boundaries must also be exploited. Unfortunately, modern GPUs cannot carry out such optimizations dynamically: they lack hardware branch prediction and cannot speculatively execute instructions beyond a branch.
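    The compile-time speculation this points at can be pictured with a small, hand-written example. The sketch below is not taken from the paper: the kernels and all identifiers (scale_before, scale_after, in, bias, out, flag) are hypothetical, and it only illustrates the general idea of hoisting an independent, long-latency load above a branch that the hardware itself will never speculate past.
    ```cuda
    // Hypothetical kernels (not from the paper) illustrating compile-time
    // speculation across a branch boundary.

    // Before: the load of bias[i] cannot issue until the branch resolves,
    // because the GPU will not speculate past the branch in hardware.
    __global__ void scale_before(const float* in, const float* bias,
                                 float* out, const int* flag, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (flag[i]) {
            out[i] = in[i] + bias[i];   // long-latency load waits on the branch
        } else {
            out[i] = in[i];
        }
    }

    // After: the load is hoisted above the branch (safe here because bias[i]
    // is read-only and in bounds), so its latency overlaps the comparison and
    // more independent work is in flight on either path.
    __global__ void scale_after(const float* in, const float* bias,
                                float* out, const int* flag, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float b = bias[i];              // speculatively loaded before the branch
        float v = in[i];
        out[i] = flag[i] ? v + b : v;   // branch-dependent use comes later
    }
    ```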
    We propose to circumvent these limitations by adapting Trace Scheduling, a technique originally developed for microcode optimization. Trace Scheduling divides code into traces (or paths) and optimizes each trace in a context-independent way. Adapting it to GPU code requires revisiting and revising each step of microcode Trace Scheduling to account for branch and warp behavior: identifying instructions on the critical path, avoiding warp divergence, and reducing divergence time.
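    As a reference point for what is being adapted, the sketch below shows only the generic, profile-guided trace-selection step from the original microcode formulation: grow a trace from a seed block by repeatedly following the hottest outgoing edge. It is a minimal sketch under assumed data structures (BasicBlock, succs, counts are hypothetical placeholders), not the paper's compiler internals, and it does not model the warp-aware criteria (critical path, divergence) that the paper adds on top.
    ```cuda
    // Host-side sketch of classic trace selection over a profiled CFG.
    // All types and field names are hypothetical placeholders.
    #include <cstddef>
    #include <unordered_set>
    #include <vector>

    struct BasicBlock {
        int id;
        std::vector<BasicBlock*> succs;   // CFG successors
        std::vector<long long> counts;    // profiled execution count per outgoing edge
    };

    // Grow a trace forward from `seed`, always taking the likeliest edge,
    // stopping when the trace would revisit a block or run off the region.
    std::vector<BasicBlock*> selectTrace(BasicBlock* seed) {
        std::vector<BasicBlock*> trace;
        std::unordered_set<int> visited;
        for (BasicBlock* bb = seed; bb && !visited.count(bb->id); ) {
            trace.push_back(bb);
            visited.insert(bb->id);
            BasicBlock* next = nullptr;
            long long best = -1;
            for (std::size_t s = 0; s < bb->succs.size(); ++s) {
                if (bb->counts[s] > best) {   // follow the hottest edge
                    best = bb->counts[s];
                    next = bb->succs[s];
                }
            }
            bb = next;
        }
        return trace;  // the trace is then scheduled as if it were straight-line code
    }
    ```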
    Here, we propose "Warp-Aware Trace Scheduling" for GPUs. As evaluated on the Rodinia Benchmark Suite using dynamic profiling, our fully-automatic optimization achieves a geometric mean speedup of 1.10x on a real system by increasing instructions executed per cycle (IPC) by a harmonic mean of 1.12x and reducing instruction serialization and total instructions executed.
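    For reference, the aggregates quoted above are the standard geometric and harmonic means; assuming those standard definitions, with n benchmarks, per-benchmark speedups s_i, and per-benchmark IPC ratios r_i, they are:
    ```latex
    % Standard geometric and harmonic means (assumed definitions,
    % not quoted from the paper's text):
    \[
      \mathrm{GeoMean}(s_1,\dots,s_n) = \Bigl(\prod_{i=1}^{n} s_i\Bigr)^{1/n},
      \qquad
      \mathrm{HarMean}(r_1,\dots,r_n) = \frac{n}{\sum_{i=1}^{n} 1/r_i}.
    \]
    ```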

    Published In

    PACT '14: Proceedings of the 23rd international conference on Parallel architectures and compilation
    August 2014
    514 pages
    ISBN:9781450328098
    DOI:10.1145/2628071
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 24 August 2014

    Author Tags

    1. compiler optimization
    2. global instruction scheduling
    3. gpu
    4. instruction-level parallelism
    5. trace scheduling

    Qualifiers

    • Research-article

    Conference

    PACT '14
    Sponsor:
    • IFIP WG 10.3
    • SIGARCH
    • IEEE CS TCPP
    • IEEE CS TCAA

    Acceptance Rates

    PACT '14 paper acceptance rate: 54 of 144 submissions (38%)
    Overall acceptance rate: 121 of 471 submissions (26%)

    Cited By

    • (2023) Efficient SpMM with Kernel Switching on GPUs for Graph Neural Networks. Proceedings of the 2023 7th International Conference on High Performance Compilation, Computing and Communications, 56-61. DOI: 10.1145/3606043.3606051. Online publication date: 17-Jun-2023.
    • (2023) rNdN: Fast Query Compilation for NVIDIA GPUs. ACM Transactions on Architecture and Code Optimization, 20(3), 1-25. DOI: 10.1145/3603503. Online publication date: 19-Jul-2023.
    • (2023) WSMP: a warp scheduling strategy based on MFQ and PPF. The Journal of Supercomputing, 79(11), 12317-12340. DOI: 10.1007/s11227-023-05127-0. Online publication date: 10-Mar-2023.
    • (2022) Visualization of profiling and tracing in CPU-GPU programs. Concurrency and Computation: Practice and Experience, 34(23). DOI: 10.1002/cpe.7188. Online publication date: 19-Jul-2022.
    • (2019) A case study on machine learning for synthesizing benchmarks. Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, 38-46. DOI: 10.1145/3315508.3329976. Online publication date: 22-Jun-2019.
    • (2018) Design Principles for Sparse Matrix Multiplication on the GPU. Euro-Par 2018: Parallel Processing, 672-687. DOI: 10.1007/978-3-319-96983-1_48. Online publication date: 1-Aug-2018.
    • (2017) TwinKernels: an execution model to improve GPU hardware scheduling at compile time. Proceedings of the 2017 International Symposium on Code Generation and Optimization, 39-49. DOI: 10.5555/3049832.3049838. Online publication date: 4-Feb-2017.
    • (2017) Regless. Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, 151-164. DOI: 10.1145/3123939.3123974. Online publication date: 14-Oct-2017.
    • (2017) Efficient Kernel Management on GPUs. ACM Transactions on Embedded Computing Systems, 16(4), 1-24. DOI: 10.1145/3070710. Online publication date: 26-May-2017.
    • (2017) TwinKernels: An execution model to improve GPU hardware scheduling at compile time. 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 39-49. DOI: 10.1109/CGO.2017.7863727. Online publication date: Feb-2017.
