
Exploiting uniform vector instructions for GPGPU performance, energy efficiency, and opportunistic reliability enhancement

Published: 10 June 2013
Abstract

State-of-the-art graphics processing units (GPUs) employ single-instruction, multiple-data (SIMD) execution to achieve both high computational throughput and energy efficiency. As prior work has shown, SIMD execution contains significant computational redundancy, where different execution lanes operate on the same operand values; such value locality is referred to as uniform vectors. In this paper, we first show that, besides the redundancy within a uniform vector, different vectors can also hold identical values. We then propose detailed architecture designs to exploit both types of redundancy. For redundancy within a uniform vector, we propose to either extend the vector register file with token bits or add a small separate scalar register file, eliminating both redundant computation and redundant data storage. For redundancy across different uniform vectors, we adopt instruction reuse, originally proposed for CPU architectures, to detect and eliminate repeated computations. Eliminating redundant computation and storage yields both significant energy savings and performance improvement. Furthermore, we propose to leverage such redundancy to protect arithmetic-logic units (ALUs) and register files against hardware errors. Our detailed evaluation shows that the proposed design has low hardware overhead and achieves performance gains of up to 23.9% (12.0% on average), energy savings of up to 24.8% (12.6% on average), and 21.1% and 14.1% error-protection coverage for ALUs and register files, respectively.
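To make the two kinds of redundancy concrete, the following is a minimal, illustrative sketch in Python, not the authors' hardware design or simulator code: a warp-wide register carries a "uniform" token bit when every lane holds the same value, so the operation is issued once on a scalar path, and a small reuse buffer keyed by opcode and scalar operand values catches repeated computations across different uniform-vector instructions, in the spirit of dynamic instruction reuse. The warp size, register layout, and all names below are assumptions made for illustration.

# Illustrative sketch only: uniform-vector detection plus a reuse buffer.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

WARP_SIZE = 32  # assumed SIMD width

@dataclass
class VectorReg:
    """A vector register with a per-register 'uniform' token bit.

    When `uniform` is set, only `scalar` is valid, so per-lane storage
    and per-lane ALU work can be skipped.
    """
    lanes: List[int] = field(default_factory=lambda: [0] * WARP_SIZE)
    uniform: bool = False
    scalar: int = 0

    def read(self) -> List[int]:
        return [self.scalar] * WARP_SIZE if self.uniform else list(self.lanes)

    def write(self, values: List[int]) -> None:
        # Detect uniformity at write-back and set the token bit.
        if all(v == values[0] for v in values):
            self.uniform, self.scalar = True, values[0]
        else:
            self.uniform, self.lanes = False, list(values)

# Reuse buffer keyed by (opcode, scalar operands): if two different
# uniform-vector instructions present identical inputs, the second
# result is read out instead of recomputed.
ReuseKey = Tuple[str, int, int]
reuse_buffer: Dict[ReuseKey, int] = {}

ALU_OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

def execute(op: str, src0: VectorReg, src1: VectorReg, dst: VectorReg) -> str:
    """Execute one vector instruction, exploiting uniform operands when possible."""
    if src0.uniform and src1.uniform:
        key = (op, src0.scalar, src1.scalar)
        if key in reuse_buffer:                 # redundancy across vectors
            result, path = reuse_buffer[key], "reused"
        else:                                   # redundancy within a vector
            result = ALU_OPS[op](src0.scalar, src1.scalar)
            reuse_buffer[key] = result
            path = "scalar"
        dst.uniform, dst.scalar = True, result
        return path
    # Divergent operands: fall back to full per-lane SIMD execution.
    dst.write([ALU_OPS[op](a, b) for a, b in zip(src0.read(), src1.read())])
    return "simd"

# Usage: r2 = r0 + r1 runs once on the scalar path because every lane of
# r0 and r1 holds the same value; repeating the operation hits the buffer.
r0, r1, r2 = VectorReg(), VectorReg(), VectorReg()
r0.write([7] * WARP_SIZE)
r1.write([3] * WARP_SIZE)
print(execute("add", r0, r1, r2))   # -> "scalar"
print(execute("add", r0, r1, r2))   # -> "reused"

Because a uniform instruction leaves the remaining SIMD lanes idle, the same detection logic could, as the paper's opportunistic reliability scheme suggests, re-execute the scalar operation on otherwise idle lanes and compare results to catch ALU and register-file errors at little additional cost.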





      Information

      Published In

      ICS '13: Proceedings of the 27th international ACM conference on International conference on supercomputing
      June 2013
      512 pages
      ISBN:9781450321303
      DOI:10.1145/2464996
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 10 June 2013


      Author Tags

      1. GPGPU
      2. redundancy

      Qualifiers

      • Research-article

      Conference

ICS'13: International Conference on Supercomputing
June 10 - 14, 2013
Eugene, Oregon, USA

      Acceptance Rates

ICS '13 Paper Acceptance Rate: 43 of 202 submissions, 21%
Overall Acceptance Rate: 629 of 2,180 submissions, 29%


      Article Metrics

• Downloads (last 12 months): 9
• Downloads (last 6 weeks): 1


      Cited By

• (2023) R2D2: Removing ReDunDancy Utilizing Linearity of Address Generation in GPUs. Proceedings of the 50th Annual International Symposium on Computer Architecture, pp. 1-14. DOI: 10.1145/3579371.3589039. Published online: 17-Jun-2023.
• (2022) ValueExpert: exploring value patterns in GPU-accelerated applications. Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 171-185. DOI: 10.1145/3503222.3507708. Published online: 28-Feb-2022.
• (2020) Approximate Cache in GPGPUs. ACM Transactions on Embedded Computing Systems, 19(5), pp. 1-22. DOI: 10.1145/3407904. Published online: 26-Sep-2020.
• (2020) GVPROF: A Value Profiler for GPU-Based Clusters. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-16. DOI: 10.1109/SC41405.2020.00093. Published online: Nov-2020.
• (2020) Duplo: Lifting Redundant Memory Accesses of Deep Neural Networks for GPU Tensor Cores. 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 725-737. DOI: 10.1109/MICRO50266.2020.00065. Published online: Oct-2020.
• (2020) DC-Patch: A Microarchitectural Fault Patching Technique for GPU Register Files. IEEE Access, 8, pp. 173276-173288. DOI: 10.1109/ACCESS.2020.3025899. Published online: 2020.
• (2019) An Aging-Aware GPU Register File Design Based on Data Redundancy. IEEE Transactions on Computers, 68(1), pp. 4-20. DOI: 10.1109/TC.2018.2849376. Published online: 1-Jan-2019.
• (2018) An efficient control flow validation method using redundant computing capacity of dual-processor architecture. PLOS ONE, 13(8), e0201127. DOI: 10.1371/journal.pone.0201127. Published online: 1-Aug-2018.
• (2018) Efficiently Managing the Impact of Hardware Variability on GPUs' Streaming Processors. ACM Transactions on Design Automation of Electronic Systems, 24(1), pp. 1-15. DOI: 10.1145/3287308. Published online: 21-Dec-2018.
• (2018) Scratch That (But Cache This): A Hybrid Register Cache/Scratchpad for GPUs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 37(11), pp. 2779-2789. DOI: 10.1109/TCAD.2018.2857043. Published online: Nov-2018.
