research-article

Open access

R-GPU: A Reconfigurable GPU Architecture

Authors:

Gert-Jan Van Den Braak,

Henk Corporaal

ACM Transactions on Architecture and Code Optimization (TACO), Volume 13, Issue 1

Article No.: 12, Pages 1 - 24

https://doi.org/10.1145/2890506

Published: 07 March 2016

Abstract

Over the last decade, Graphics Processing Unit (GPU) architectures have evolved from a fixed-function graphics pipeline to a programmable, energy-efficient compute accelerator for massively parallel applications. The compute power arises from the GPU’s Single Instruction/Multiple Threads (SIMT) architecture, which runs many threads concurrently and executes them as Single Instruction/Multiple Data (SIMD)-style vectors. However, compute power is still lost to cycles spent on data movement and control instructions rather than on data computations. Even more cycles are lost to pipeline stalls caused by long-latency (memory) operations.

To improve not only performance but also energy efficiency, we introduce R-GPU: a reconfigurable GPU architecture with communicating cores. R-GPU is an addition to a GPU: the GPU can still be used as such, but its cores can also be reorganized into a reconfigurable network. In R-GPU, data movement and control are implicit in the configuration of the network. Each core executes a fixed instruction, which reduces the instruction decode count and increases energy efficiency. On a number of benchmarks, we show an average performance improvement of 2.1× over the same GPU without modifications. We further make a conservative power estimate of R-GPU, which shows that power consumption can be reduced by 6%, leading to an energy consumption reduction of 55%, while area increases by only 4%.
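The energy figure can be sanity-checked from the other two numbers in the abstract. The sketch below assumes that energy is average power multiplied by execution time, and that the average speedup and the average power reduction can simply be combined; the abstract does not say how the 55% was derived, so this is only a plausibility check. The function name `energy_reduction` is illustrative and does not come from the paper.

```python
def energy_reduction(speedup: float, power_reduction: float) -> float:
    """Fractional energy saving, assuming energy = average power x execution time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    relative_power = 1.0 - power_reduction  # new power / baseline power
    relative_time = 1.0 / speedup           # new runtime / baseline runtime
    relative_energy = relative_power * relative_time
    return 1.0 - relative_energy


if __name__ == "__main__":
    # Figures quoted in the abstract: 2.1x speedup, 6% lower power.
    saving = energy_reduction(speedup=2.1, power_reduction=0.06)
    print(f"estimated energy reduction: {saving:.1%}")
```

Under these assumptions the result is roughly 55%, consistent with the reported energy reduction: most of the saving comes from the shorter run time rather than from the lower power draw.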



Index Terms

R-GPU: A Reconfigurable GPU Architecture
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Data flow architectures
    2. Parallel architectures



Published In

ACM Transactions on Architecture and Code Optimization Volume 13, Issue 1

April 2016

347 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/2899032

Editor:
Koen De Bosschere
Ghent University

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 March 2016

Accepted: 01 February 2016

Revised: 01 December 2015

Received: 01 May 2015

Published in TACO Volume 13, Issue 1


Qualifiers

Research-article
Research
Refereed


Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 4
Total Downloads: 1,149
Downloads (Last 12 months): 153
Downloads (Last 6 weeks): 16

Reflects downloads up to 18 Jan 2025

Citations

Cited By

Voitsechov D, Port O, Etsion Y, Oskin M, Inoue K. (2018). Inter-thread communication in multithreaded, reconfigurable coarse-grain arrays. Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, pp. 42-54. Online publication date: 20-Oct-2018.
https://dl.acm.org/doi/10.1109/MICRO.2018.00013
Sinha Hang DRaj G. (2018). Elastic Search in Cache Based Service Management for Healthcare Automation. 2018 International Conference on Communications (COMM), pp. 01-06. Online publication date: 14-Jun-2018.
https://dl.acm.org/doi/10.1109/ICComm.2018.8430162
Sinha HDewang Raj G. (2018). Elastic Search in Cache Based Service Management For Healthcare Automation. 2018 International Conference on Advances in Computing and Communication Engineering (ICACCE), pp. 445-450. Online publication date: Jun-2018.
https://doi.org/10.1109/ICACCE.2018.8441722
Ben Amor N, Lahbib K, Frikha T. (2017). Design of a dynamically reconfigurable architecture for the 3D image synthesis. 2017 International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), pp. 1-5. Online publication date: May-2017.
https://doi.org/10.1109/ATSIP.2017.8075556
