research-article

COMPASS: a programmable data prefetcher using idle GPU shaders

Authors:

Hsien-Hsin S. Lee

ACM SIGPLAN Notices, Volume 45, Issue 3

Pages 297 - 310

https://doi.org/10.1145/1735971.1736054

Published: 13 March 2010

Abstract

Over the last few years, the traditional fixed-function graphics accelerator has evolved into a programmable general-purpose graphics processing unit (GPU). These powerful computing cores are mainly used for accelerating graphics applications or enabling low-cost scientific computing. To further reduce cost and form factor, an emerging trend is to integrate the GPU, along with the memory controllers, onto the same die as the processor cores. In such a system-on-chip, however, the GPU occupies a substantial part of the silicon yet sits idle and contributes nothing to overall system performance when running non-graphics workloads or applications that lack data-level parallelism. In this paper, we propose COMPASS, a compute shader-assisted data prefetching scheme that leverages the GPU resource to improve single-threaded performance on an integrated system. By harnessing the GPU shader cores with very lightweight architectural support, COMPASS can emulate the functionality of a hardware-based prefetcher on the idle GPU and improve the memory performance of single-threaded applications. Moreover, because it is flexible and programmable, one can implement the best-performing prefetch scheme for each specific application, as demonstrated in this paper. With COMPASS, we envision that a future application vendor can bundle a custom-designed COMPASS shader with its software, to be loaded at runtime to optimize performance. Our simulation results show that COMPASS can improve the single-thread performance of memory-intensive applications by 68% on average.
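To make the idea of emulating a hardware prefetcher in software more concrete, the following is a minimal sketch of one classic scheme that a programmable prefetcher could run: a stride prefetcher indexed by the program counter of the missing load. It is an illustration only, written in Python rather than shader code, and it is not taken from the paper. The class name `StridePrefetcher`, the `on_miss` hook, and the `degree` parameter are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class StrideEntry:
    """Per-load state: last miss address and last observed stride."""
    last_addr: int
    stride: int = 0


class StridePrefetcher:
    """Illustrative PC-indexed stride prefetcher (hypothetical, not the paper's code)."""

    def __init__(self, degree=2):
        self.table = {}        # maps load PC -> StrideEntry
        self.degree = degree   # how many addresses to prefetch ahead

    def on_miss(self, pc, addr):
        """Record a cache miss and return the addresses to prefetch, if any."""
        entry = self.table.get(pc)
        if entry is None:
            # First miss seen for this load: just start tracking it.
            self.table[pc] = StrideEntry(last_addr=addr)
            return []
        stride = addr - entry.last_addr
        # Only predict when the same non-zero stride is seen twice in a row.
        confident = stride != 0 and stride == entry.stride
        entry.stride = stride
        entry.last_addr = addr
        if not confident:
            return []
        return [addr + stride * i for i in range(1, self.degree + 1)]


pf = StridePrefetcher(degree=2)
for miss_addr in (1000, 1064, 1128):
    prefetches = pf.on_miss(pc=0x400, addr=miss_addr)
print(prefetches)  # addresses predicted after the third miss
```

In this sketch the first two misses only train the table; the third miss confirms a 64-byte stride, so the next two addresses along that stride are returned. Swapping in a different prediction policy only requires changing `on_miss`, which is the kind of per-application flexibility the abstract attributes to a programmable prefetcher.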

Cited By

Zhang W, Yu S, Wang H, Dai Z, Chen H (2016). Hardware Support for Concurrent Detection of Multiple Concurrency Bugs on Fused CPU-GPU Architectures. IEEE Transactions on Computers, 65:10 (3083-3095). DOI: 10.1109/TC.2015.2512860. Online publication date: 1-Oct-2016
https://dl.acm.org/doi/10.1109/TC.2015.2512860
Wang X, Zhang W (2016). Cache locking vs. partitioning for real-time computing on integrated CPU-GPU processors. 2016 IEEE 35th International Performance Computing and Communications Conference (IPCCC) (1-8). DOI: 10.1109/PCCC.2016.7820644. Online publication date: Dec-2016
https://doi.org/10.1109/PCCC.2016.7820644
Wang D, Xiao W (2016). A reuse distance based performance analysis on GPU L1 data cache. 2016 IEEE 35th International Performance Computing and Communications Conference (IPCCC) (1-8). DOI: 10.1109/PCCC.2016.7820638. Online publication date: Dec-2016
https://doi.org/10.1109/PCCC.2016.7820638
Index Terms

COMPASS: a programmable data prefetcher using idle GPU shaders
1. Computing methodologies
  1. Computer graphics
    1. Graphics systems and interfaces
      1. Graphics processors
2. Hardware
  1. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory

Published In

ACM SIGPLAN Notices Volume 45, Issue 3

ASPLOS '10

March 2010

399 pages

ISSN:0362-1340

EISSN:1558-1160

DOI:10.1145/1735971

ASPLOS XV: Proceedings of the fifteenth International Conference on Architectural support for programming languages and operating systems
March 2010
422 pages
ISBN:9781605588391
DOI:10.1145/1736020
General Chair: James C. Hoe, Carnegie Mellon University, USA
Program Chair: Vikram S. Adve, University of Illinois at Urbana-Champaign, USA

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Article Metrics

Total Citations: 18
Total Downloads: 840
Downloads (Last 12 months): 14
Downloads (Last 6 weeks): 1

Reflects downloads up to 06 Oct 2024

Cited By

Licheng Y, Yulong P, Tianzhou C, Xueqing L, Minghui W, Tiefei Z (2016). LLC Buffer for Arbitrary Data Sharing in Heterogeneous Systems. 2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS) (260-267). DOI: 10.1109/HPCC-SmartCity-DSS.2016.0046. Online publication date: Dec-2016
https://doi.org/10.1109/HPCC-SmartCity-DSS.2016.0046
Khairy M, Wassal A, Zahran M (2019). A survey of architectural approaches for improving GPGPU performance, programmability and heterogeneity. Journal of Parallel and Distributed Computing. DOI: 10.1016/j.jpdc.2018.11.012. Online publication date: Jan-2019
https://doi.org/10.1016/j.jpdc.2018.11.012
Lee J, Shi W, Gil J (2018). Accelerated bulk memory operations on heterogeneous multi-core systems. The Journal of Supercomputing, 74:12 (6898-6922). DOI: 10.1007/s11227-018-2589-x. Online publication date: 1-Dec-2018
https://dl.acm.org/doi/10.1007/s11227-018-2589-x
Kim K, Lee S, Yoon M, Koo G, Ro W, Annavaram M (2016). Warped-preexecution: A GPU pre-execution approach for improving latency hiding. 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA) (163-175). DOI: 10.1109/HPCA.2016.7446062. Online publication date: Mar-2016
https://doi.org/10.1109/HPCA.2016.7446062
Falahati H, Hessabi S, Abdi M, Baniasadi A (2015). Power-efficient prefetching on GPGPUs. The Journal of Supercomputing, 71:8 (2808-2829). DOI: 10.1007/s11227-014-1331-6. Online publication date: 1-Aug-2015
https://dl.acm.org/doi/10.1007/s11227-014-1331-6
Liu W, Vinter B, Cavazos J, Gong X, Kaeli D (2014). ad-heap. Proceedings of Workshop on General Purpose Processing Using GPUs (54-63). DOI: 10.1145/2588768.2576786. Online publication date: 1-Mar-2014
https://dl.acm.org/doi/10.1145/2588768.2576786
Swamy B, Ketterlin A, Seznec A (2014). Hardware/Software Helper Thread Prefetching on Heterogeneous Many Cores. Proceedings of the 2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing (214-221). DOI: 10.1109/SBAC-PAD.2014.39. Online publication date: 22-Oct-2014
https://dl.acm.org/doi/10.1109/SBAC-PAD.2014.39