research-article

COMPASS: a programmable data prefetcher using idle GPU shaders

Published: 13 March 2010

Abstract

Over the last few years, the traditional fixed-function graphics accelerator has evolved into a programmable general-purpose graphics processing unit (GPU). These powerful computing cores are mainly used for accelerating graphics applications or enabling low-cost scientific computing. To further reduce cost and form factor, an emerging trend is to integrate the GPU, along with the memory controllers, onto the same die as the processor cores. However, in such a system-on-chip, the GPU, while occupying a substantial part of the silicon, sits idle and contributes nothing to overall system performance when running non-graphics workloads or applications that lack data-level parallelism. In this paper, we propose COMPASS, a compute shader-assisted data prefetching scheme that leverages the GPU resources to improve single-threaded performance on an integrated system. By harnessing the GPU shader cores with very lightweight architectural support, COMPASS can emulate the functionality of a hardware-based prefetcher using the idle GPU and successfully improve the memory performance of single-threaded applications. Moreover, thanks to its flexibility and programmability, one can implement the best-performing prefetch scheme for each specific application, as demonstrated in this paper. With COMPASS, we envision that a future application vendor can provide a custom-designed COMPASS shader bundled with its software, to be loaded at runtime to optimize performance. Our simulation results show that COMPASS can improve the single-thread performance of memory-intensive applications by 68% on average.
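
The abstract does not spell out a particular prefetch algorithm. As a rough illustration of the kind of hardware prefetcher logic that a programmable shader could plausibly emulate in software, the following is a minimal Python sketch of a PC-indexed stride prefetcher. It is not the COMPASS shader code; the names `StridePrefetcher`, `StrideEntry`, `observe`, and `degree` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class StrideEntry:
    last_addr: int           # last address seen for this load PC
    stride: int = 0          # most recently observed address delta
    confident: bool = False  # True once the same non-zero stride repeats


class StridePrefetcher:
    """Toy PC-indexed stride prefetcher (illustrative assumption only)."""

    def __init__(self, degree=2):
        self.degree = degree  # how many strides ahead to prefetch
        self.table = {}       # load PC -> StrideEntry

    def observe(self, pc, addr):
        """Record one access and return the addresses to prefetch."""
        entry = self.table.get(pc)
        if entry is None:
            # First access from this PC: nothing to predict yet.
            self.table[pc] = StrideEntry(last_addr=addr)
            return []
        stride = addr - entry.last_addr
        # Predict only when the same non-zero stride is seen twice in a row.
        entry.confident = stride != 0 and stride == entry.stride
        entry.stride = stride
        entry.last_addr = addr
        if not entry.confident:
            return []
        return [addr + stride * i for i in range(1, self.degree + 1)]


if __name__ == "__main__":
    pf = StridePrefetcher(degree=2)
    for a in (1000, 1064, 1128, 1192):
        print(a, pf.observe(0x400, a))
```

In this sketch the table is keyed by the load instruction's program counter, so each load tracks its own stride. Because the policy lives in ordinary code, it could be swapped for a different scheme per application, which is the flexibility the abstract attributes to a programmable prefetcher.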

Published In

ACM SIGPLAN Notices, Volume 45, Issue 3
ASPLOS '10
March 2010
399 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/1735971

  • ASPLOS XV: Proceedings of the fifteenth International Conference on Architectural support for programming languages and operating systems
    March 2010
    422 pages
    ISBN: 9781605588391
    DOI: 10.1145/1736020
    General Chair: James C. Hoe
    Program Chair: Vikram S. Adve

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. GPU
  2. compute shader
  3. prefetch

Qualifiers

  • Research-article

Cited By

  • (2019) A survey of architectural approaches for improving GPGPU performance, programmability and heterogeneity. Journal of Parallel and Distributed Computing. DOI: 10.1016/j.jpdc.2018.11.012. Online publication date: Jan-2019
  • (2018) Accelerated bulk memory operations on heterogeneous multi-core systems. The Journal of Supercomputing 74:12 (6898-6922). DOI: 10.1007/s11227-018-2589-x. Online publication date: 1-Dec-2018
  • (2016) Hardware Support for Concurrent Detection of Multiple Concurrency Bugs on Fused CPU-GPU Architectures. IEEE Transactions on Computers 65:10 (3083-3095). DOI: 10.1109/TC.2015.2512860. Online publication date: 1-Oct-2016
  • (2016) Cache locking vs. partitioning for real-time computing on integrated CPU-GPU processors. 2016 IEEE 35th International Performance Computing and Communications Conference (IPCCC) (1-8). DOI: 10.1109/PCCC.2016.7820644. Online publication date: Dec-2016
  • (2016) A reuse distance based performance analysis on GPU L1 data cache. 2016 IEEE 35th International Performance Computing and Communications Conference (IPCCC) (1-8). DOI: 10.1109/PCCC.2016.7820638. Online publication date: Dec-2016
  • (2016) LLC Buffer for Arbitrary Data Sharing in Heterogeneous Systems. 2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS) (260-267). DOI: 10.1109/HPCC-SmartCity-DSS.2016.0046. Online publication date: Dec-2016
  • (2016) Warped-preexecution: A GPU pre-execution approach for improving latency hiding. 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA) (163-175). DOI: 10.1109/HPCA.2016.7446062. Online publication date: Mar-2016
  • (2015) Power-efficient prefetching on GPGPUs. The Journal of Supercomputing 71:8 (2808-2829). DOI: 10.1007/s11227-014-1331-6. Online publication date: 1-Aug-2015
  • (2014) ad-heap. Proceedings of Workshop on General Purpose Processing Using GPUs (54-63). DOI: 10.1145/2588768.2576786. Online publication date: 1-Mar-2014
  • (2014) Hardware/Software Helper Thread Prefetching on Heterogeneous Many Cores. Proceedings of the 2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing (214-221). DOI: 10.1109/SBAC-PAD.2014.39. Online publication date: 22-Oct-2014
