DOI: 10.1145/2628071.2628101

Warp-aware trace scheduling for GPUs

Abstract

    GPU performance depends not only on thread/warp-level parallelism (TLP) but also on instruction-level parallelism (ILP). It is not enough to schedule instructions within basic blocks; opportunities for ILP optimization beyond branch boundaries must also be exploited. Unfortunately, modern GPUs cannot carry out such optimizations dynamically: they lack hardware branch prediction and cannot speculatively execute instructions beyond a branch.
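    The compile-time speculation this points at can be pictured with a small, hand-written example. The sketch below is not taken from the paper: the kernels and all identifiers (scale_before, scale_after, in, bias, out, flag) are hypothetical, and it only illustrates the general idea of hoisting an independent, long-latency load above a branch that the hardware itself will never speculate past.
    ```cuda
    // Hypothetical kernels (not from the paper) illustrating compile-time
    // speculation across a branch boundary.

    // Before: the load of bias[i] cannot issue until the branch resolves,
    // because the GPU will not speculate past the branch in hardware.
    __global__ void scale_before(const float* in, const float* bias,
                                 float* out, const int* flag, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (flag[i]) {
            out[i] = in[i] + bias[i];   // long-latency load waits on the branch
        } else {
            out[i] = in[i];
        }
    }

    // After: the load is hoisted above the branch (safe here because bias[i]
    // is read-only and in bounds), so its latency overlaps the comparison and
    // more independent work is in flight on either path.
    __global__ void scale_after(const float* in, const float* bias,
                                float* out, const int* flag, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float b = bias[i];              // speculatively loaded before the branch
        float v = in[i];
        out[i] = flag[i] ? v + b : v;   // branch-dependent use comes later
    }
    ```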
    We propose to circumvent these limitations by adapting Trace Scheduling, a technique originally developed for microcode optimization. Trace Scheduling divides code into traces (or paths) and optimizes each trace in a context-independent way. Adapting it to GPU code requires revisiting and revising each step of microcode Trace Scheduling to account for branch and warp behavior: identifying instructions on the critical path, avoiding warp divergence, and reducing divergence time.
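    As a reference point for what is being adapted, the sketch below shows only the generic, profile-guided trace-selection step from the original microcode formulation: grow a trace from a seed block by repeatedly following the hottest outgoing edge. It is a minimal sketch under assumed data structures (BasicBlock, succs, counts are hypothetical placeholders), not the paper's compiler internals, and it does not model the warp-aware criteria (critical path, divergence) that the paper adds on top.
    ```cuda
    // Host-side sketch of classic trace selection over a profiled CFG.
    // All types and field names are hypothetical placeholders.
    #include <cstddef>
    #include <unordered_set>
    #include <vector>

    struct BasicBlock {
        int id;
        std::vector<BasicBlock*> succs;   // CFG successors
        std::vector<long long> counts;    // profiled execution count per outgoing edge
    };

    // Grow a trace forward from `seed`, always taking the likeliest edge,
    // stopping when the trace would revisit a block or run off the region.
    std::vector<BasicBlock*> selectTrace(BasicBlock* seed) {
        std::vector<BasicBlock*> trace;
        std::unordered_set<int> visited;
        for (BasicBlock* bb = seed; bb && !visited.count(bb->id); ) {
            trace.push_back(bb);
            visited.insert(bb->id);
            BasicBlock* next = nullptr;
            long long best = -1;
            for (std::size_t s = 0; s < bb->succs.size(); ++s) {
                if (bb->counts[s] > best) {   // follow the hottest edge
                    best = bb->counts[s];
                    next = bb->succs[s];
                }
            }
            bb = next;
        }
        return trace;  // the trace is then scheduled as if it were straight-line code
    }
    ```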
    Here, we propose "Warp-Aware Trace Scheduling" for GPUs. As evaluated on the Rodinia Benchmark Suite using dynamic profiling, our fully-automatic optimization achieves a geometric mean speedup of 1.10x on a real system by increasing instructions executed per cycle (IPC) by a harmonic mean of 1.12x and reducing instruction serialization and total instructions executed.
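    For reference, the aggregates quoted above are the standard geometric and harmonic means; assuming those standard definitions, with n benchmarks, per-benchmark speedups s_i, and per-benchmark IPC ratios r_i, they are:
    ```latex
    % Standard geometric and harmonic means (assumed definitions,
    % not quoted from the paper's text):
    \[
      \mathrm{GeoMean}(s_1,\dots,s_n) = \Bigl(\prod_{i=1}^{n} s_i\Bigr)^{1/n},
      \qquad
      \mathrm{HarMean}(r_1,\dots,r_n) = \frac{n}{\sum_{i=1}^{n} 1/r_i}.
    \]
    ```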

    Published In

    PACT '14: Proceedings of the 23rd international conference on Parallel architectures and compilation
    August 2014
    514 pages
    ISBN:9781450328098
    DOI:10.1145/2628071
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 24 August 2014

    Author Tags

    1. compiler optimization
    2. global instruction scheduling
    3. gpu
    4. instruction-level parallelism
    5. trace scheduling

    Qualifiers

    • Research-article

    Conference

    PACT '14
    Sponsor:
    • IFIP WG 10.3
    • SIGARCH
    • IEEE CS TCPP
    • IEEE CS TCAA

    Acceptance Rates

    PACT '14 paper acceptance rate: 54 of 144 submissions (38%)
    Overall acceptance rate: 121 of 471 submissions (26%)

    Cited By

    • (2023) Efficient SpMM with Kernel Switching on GPUs for Graph Neural Networks. Proceedings of the 2023 7th International Conference on High Performance Compilation, Computing and Communications, 56-61. DOI: 10.1145/3606043.3606051. Online publication date: 17-Jun-2023.
    • (2023) rNdN: Fast Query Compilation for NVIDIA GPUs. ACM Transactions on Architecture and Code Optimization, 20(3), 1-25. DOI: 10.1145/3603503. Online publication date: 19-Jul-2023.
    • (2023) WSMP: a warp scheduling strategy based on MFQ and PPF. The Journal of Supercomputing, 79(11), 12317-12340. DOI: 10.1007/s11227-023-05127-0. Online publication date: 10-Mar-2023.
    • (2022) Visualization of profiling and tracing in CPU-GPU programs. Concurrency and Computation: Practice and Experience, 34(23). DOI: 10.1002/cpe.7188. Online publication date: 19-Jul-2022.
    • (2019) A case study on machine learning for synthesizing benchmarks. Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, 38-46. DOI: 10.1145/3315508.3329976. Online publication date: 22-Jun-2019.
    • (2018) Design Principles for Sparse Matrix Multiplication on the GPU. Euro-Par 2018: Parallel Processing, 672-687. DOI: 10.1007/978-3-319-96983-1_48. Online publication date: 1-Aug-2018.
    • (2017) TwinKernels: an execution model to improve GPU hardware scheduling at compile time. Proceedings of the 2017 International Symposium on Code Generation and Optimization, 39-49. DOI: 10.5555/3049832.3049838. Online publication date: 4-Feb-2017.
    • (2017) Regless. Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, 151-164. DOI: 10.1145/3123939.3123974. Online publication date: 14-Oct-2017.
    • (2017) Efficient Kernel Management on GPUs. ACM Transactions on Embedded Computing Systems, 16(4), 1-24. DOI: 10.1145/3070710. Online publication date: 26-May-2017.
    • (2017) TwinKernels: An execution model to improve GPU hardware scheduling at compile time. 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 39-49. DOI: 10.1109/CGO.2017.7863727. Online publication date: Feb-2017.
