research-article

Register allocation for Intel processor graphics

Authors: Guei-Yuan Lueh, Buqi Cheng

CGO '18: Proceedings of the 2018 International Symposium on Code Generation and Optimization

Pages 352 - 364

https://doi.org/10.1145/3168806

Published: 24 February 2018

Abstract

Register allocation is a well-studied problem, but surprisingly little work has been published on assigning registers for GPU architectures. In this paper we present the register allocator in the production compiler for Intel HD and Iris Graphics. Intel GPUs feature a large byte-addressable register file organized into banks, an expressive instruction set that supports variable SIMD-sizes and divergent control flow, and high spill overhead due to relatively long memory latencies. These distinctive characteristics impose challenges for register allocation, as input programs may have arbitrarily-sized variables, partial updates, and complex control flow. Not only should the allocator make a program spill-free, but it must also reduce the number of register bank conflicts and anti-dependencies. Since compilation occurs in a JIT environment, the allocator also needs to incur little overhead.

To manage compilation overhead, our register allocation framework adopts a hybrid approach that separates the assignment of local and global variables. Several extensions are introduced to the traditional graph-coloring algorithm to support variables with different sizes and to accurately model liveness under divergent branches. Different assignment policies are applied to exploit the trade-offs between minimizing register usage and avoiding bank conflicts and anti-dependencies. Experimental results show our framework produces very few spilling kernels and can improve register allocation (RA) JIT time by up to 4x over pure graph-coloring. Our round-robin and bank-conflict-reduction assignment policies can also achieve up to 20% runtime improvements.
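The trade-off between minimizing register usage and spreading assignments out can be illustrated with a small sketch. This is not the allocator described in the paper: it is a simplified greedy assignment over an interference graph, written under the assumption that each variable needs a contiguous range of registers and that a round-robin policy simply resumes its search where the previous assignment ended. The names `build_interference`, `assign_registers`, and the `policy` values are hypothetical and chosen only for this illustration.

```python
from typing import Dict, Iterable, List, Sequence, Set, Tuple


def build_interference(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build a symmetric interference graph from pairs of simultaneously live variables."""
    graph: Dict[str, Set[str]] = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def assign_registers(
    order: Sequence[str],
    size: Dict[str, int],
    graph: Dict[str, Set[str]],
    num_regs: int,
    policy: str = "first_fit",
) -> Tuple[Dict[str, int], List[str]]:
    """Greedily give each variable a contiguous range [start, start + size).

    policy == "first_fit":   always search from register 0 (compact usage).
    policy == "round_robin": search from just past the previous assignment,
                             so recently used registers are reused later.
    Returns (start register per placed variable, list of spilled variables).
    """
    placed: Dict[str, int] = {}
    spilled: List[str] = []
    cursor = 0  # where the round-robin search resumes

    for var in order:
        # Registers held by already-placed variables that interfere with var.
        busy: Set[int] = set()
        for other in graph.get(var, ()):
            if other in placed:
                busy.update(range(placed[other], placed[other] + size[other]))

        first = cursor if policy == "round_robin" else 0
        start = None
        for k in range(num_regs):
            s = (first + k) % num_regs
            fits = s + size[var] <= num_regs  # the range may not wrap around
            if fits and busy.isdisjoint(range(s, s + size[var])):
                start = s
                break

        if start is None:
            spilled.append(var)  # no free contiguous range: variable must spill
        else:
            placed[var] = start
            cursor = (start + size[var]) % num_regs

    return placed, spilled


# Hypothetical example: a (2 registers) and b (1 register) are live together;
# c (1 register) is live only after both are dead.
graph = build_interference([("a", "b")])
size = {"a": 2, "b": 1, "c": 1}
first_fit, _ = assign_registers(["a", "b", "c"], size, graph, num_regs=8)
round_robin, _ = assign_registers(["a", "b", "c"], size, graph, num_regs=8,
                                  policy="round_robin")
```

With these inputs, `first_fit` is `{'a': 0, 'b': 2, 'c': 0}` and `round_robin` is `{'a': 0, 'b': 2, 'c': 3}`. First-fit packs `c` into the register that `a` just vacated, which keeps register usage low but forces the write to `c` to wait for the last read of `a`. Round-robin uses one more register and avoids that anti-dependency.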

Index Terms

Register allocation for Intel processor graphics
1. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Just-in-time compilers

Published In

CGO '18: Proceedings of the 2018 International Symposium on Code Generation and Optimization

February 2018

377 pages

ISBN:9781450356176

DOI:10.1145/3179541

General Chairs: Jens Knoop (Vienna University of Technology, Austria), Markus Schordan (Lawrence Livermore National Laboratory, USA)

Program Chairs: Teresa Johnson (Google, USA), Michael O'Boyle (University of Edinburgh, UK)

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages
SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

CGO '18: 16th Annual IEEE/ACM International Symposium on Code Generation and Optimization

February 24 - 28, 2018

Vienna, Austria

Acceptance Rates

Overall Acceptance Rate 312 of 1,061 submissions, 29%

Bibliometrics

Total Citations: 16
Total Downloads: 757
Downloads (Last 12 months): 51
Downloads (Last 6 weeks): 2

Reflects downloads up to 10 Nov 2024

Cited By

Guan X, Zhou H, Bao G, Li H, Zhu L, Yao J, Grosser T, Dubach C, Steuwer M, Xue J, Ottoni G, Quintão Pereira F (2024). PresCount: Effective Register Allocation for Bank Conflict Reduction. Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization (170-181). Online publication date: 2-Mar-2024
https://dl.acm.org/doi/10.1109/CGO57630.2024.10444841
Brighen A, Chouikh A, Ikhlef H, Slimani H, Rezgui A, Kheddouci H (2024). Giraph-Based Distributed Algorithms for Coloring Large-Scale Graphs. International Journal of Parallel Programming 53:1. Online publication date: 17-Oct-2024
https://dl.acm.org/doi/10.1007/s10766-024-00781-0
Krolik A, Verbrugge C, Hendren L (2023). rNdN: Fast Query Compilation for NVIDIA GPUs. ACM Transactions on Architecture and Code Optimization 20:3 (1-25). Online publication date: 19-Jul-2023
https://dl.acm.org/doi/10.1145/3603503
VenkataKeerthy S, Jain S, Kundu A, Aggarwal R, Cohen A, Upadrasta R, Verbrugge C, Lhoták O, Shen X (2023). RL4ReAl: Reinforcement Learning for Register Allocation. Proceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction (133-144). Online publication date: 17-Feb-2023
https://dl.acm.org/doi/10.1145/3578360.3580273
Brighen A, Slimani H, Rezgui A, Kheddouci H (2023). A new distributed graph coloring algorithm for large graphs. Cluster Computing 27:1 (875-891). Online publication date: 23-Mar-2023
https://dl.acm.org/doi/10.1007/s10586-023-03988-x
Ji Z, Wang C (2022). Compiler-Directed Incremental Checkpointing for Low Latency GPU Preemption. 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (751-761). Online publication date: May-2022
https://doi.org/10.1109/IPDPS53621.2022.00078
Giannoula C, Peppas A, Goumas G, Koziris N (2022). High-performance and balanced parallel graph coloring on multicore platforms. The Journal of Supercomputing 79:6 (6373-6421). Online publication date: 7-Nov-2022
https://dl.acm.org/doi/10.1007/s11227-022-04894-6
Xiao S, He J, Yang J, Hong X, Luo J (2022). Energy Reduction Method by Compiler Optimization. Artificial Intelligence and Security (672-683). Online publication date: 15-Jul-2022
https://dl.acm.org/doi/10.1007/978-3-031-06794-5_54
Hu Y, Zhang X, Wang S, Liang W, Li K (2022). Research on global register allocation for code containing array-unit dual-usage register names. Concurrency and Computation: Practice and Experience 35:19. Online publication date: 5-Dec-2022
https://doi.org/10.1002/cpe.7519
Lueh G, Chen K, Chen G, Fuentes J, Chen W, Fu F, Jiang H, Li H, Rhee D, Lee J (2021). C-for-metal. Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization (289-300). Online publication date: 27-Feb-2021
https://dl.acm.org/doi/10.1109/CGO51591.2021.9370324
