DOI: 10.5555/3018076.3018083
research-article

Characterizing power and performance of GPU memory access

Published: 13 November 2016

Abstract

Power is a major limiting factor for the future of HPC and for realizing exascale computing under a power budget. GPUs have become mainstream parallel computation devices in HPC, and optimizing their power usage is critical to achieving future goals. GPU memory is seldom studied, especially with respect to power usage; nevertheless, memory accesses draw significant power and are critical to understanding and optimizing GPU power usage. In this work we investigate the power and performance characteristics of various GPU memory accesses. We take an empirical approach, experimentally examining how GPU power and performance vary with data access patterns and with software parameters including GPU thread block size. In addition, we take into account dynamic voltage and frequency scaling (DVFS), an advanced power-saving technology, applied to both GPU processing units and global memory. We analyze power and performance and suggest optimal parameter choices for applications that heavily use specific memory operations.
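The thread block size studied in the abstract also determines how many blocks can reside on each streaming multiprocessor, and hence the occupancy available for hiding memory latency. As a rough illustration (not taken from the paper), the sketch below computes theoretical occupancy from per-SM resource limits; the limit values are illustrative, loosely modeled on a Kepler-class GPU, and NVIDIA's CUDA Occupancy Calculator performs the real computation, including allocation granularities this sketch ignores.

```python
# Simplified theoretical-occupancy sketch. The per-SM limits below are
# illustrative values loosely modeled on a Kepler-class GPU; real hardware
# additionally rounds resource allocations up to fixed granularities.
WARP_SIZE = 32
MAX_THREADS_PER_SM = 2048
MAX_BLOCKS_PER_SM = 16
REGISTERS_PER_SM = 65536
SHARED_MEM_PER_SM = 49152  # bytes

def occupancy(block_size, regs_per_thread, smem_per_block):
    """Fraction of the SM's warp slots kept busy by resident blocks."""
    limits = [
        MAX_THREADS_PER_SM // block_size,                     # thread-count limit
        MAX_BLOCKS_PER_SM,                                    # block-count limit
        REGISTERS_PER_SM // (regs_per_thread * block_size),   # register limit
    ]
    if smem_per_block > 0:
        limits.append(SHARED_MEM_PER_SM // smem_per_block)    # shared-memory limit
    resident_blocks = min(limits)
    active_warps = resident_blocks * block_size // WARP_SIZE
    return active_warps / (MAX_THREADS_PER_SM // WARP_SIZE)

# 256-thread blocks at 32 registers/thread fill every warp slot (8 blocks)...
print(occupancy(256, 32, 0))    # 1.0
# ...while 1024-thread blocks at 64 registers/thread fit only one block.
print(occupancy(1024, 64, 0))   # 0.5
```

Higher occupancy gives the scheduler more warps to swap in while others wait on global memory, which is one reason block size shows up as a first-order parameter in power/performance studies like this one.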

Cited By

  • (2024) NDRec: A Near-Data Processing System for Training Large-Scale Recommendation Models. IEEE Transactions on Computers, 73(5), 1248-1261. doi:10.1109/TC.2024.3365939
  • (2019) Time-energy analysis of multilevel parallelism in heterogeneous clusters. The Journal of Supercomputing, 75(7), 3397-3425. doi:10.1007/s11227-019-02908-4
  • (2018) Speedup and Energy Analysis of EEG Classification for BCI Tasks on CPU-GPU Clusters. Proceedings of the 6th International Workshop on Parallelism in Bioinformatics, 33-43. doi:10.1145/3235830.3235834

Published In

E2SC '16: Proceedings of the 4th International Workshop on Energy Efficient Supercomputing
November 2016, 91 pages
ISBN: 9781509038565

Publisher

IEEE Press

Author Tags

  1. GPU memory access
  2. heterogeneous computing
  3. high performance computing
  4. power and performance characterization

Qualifiers

  • Research-article

Conference

SC16

Acceptance Rates

Overall Acceptance Rate: 17 of 33 submissions, 52%
