DOI: 10.5555/3018076.3018083
research-article

Characterizing power and performance of GPU memory access

Published: 13 November 2016

Abstract

Power is a major limiting factor for the future of HPC and for realizing exascale computing under a power budget. GPUs have become mainstream parallel computation devices in HPC, and optimizing their power usage is critical to achieving future goals. GPU memory is seldom studied, especially with respect to power usage; nevertheless, memory accesses draw significant power and are critical to understanding and optimizing GPU power usage. In this work we investigate the power and performance characteristics of various GPU memory accesses. We take an empirical approach, experimentally examining how GPU power and performance vary with data access patterns and with software parameters including GPU thread block size. In addition, we take into account dynamic voltage and frequency scaling (DVFS), an advanced power-saving technology, applied to both GPU processing units and global memory. We analyze power and performance and suggest optimal parameter choices for applications that heavily use specific memory operations.
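The thread block size studied in the abstract also determines how many blocks can reside on each streaming multiprocessor, and hence the occupancy available for hiding memory latency. As a rough illustration (not taken from the paper), the sketch below computes theoretical occupancy from per-SM resource limits; the limit values are illustrative, loosely modeled on a Kepler-class GPU, and NVIDIA's CUDA Occupancy Calculator performs the real computation, including allocation granularities this sketch ignores.

```python
# Simplified theoretical-occupancy sketch. The per-SM limits below are
# illustrative values loosely modeled on a Kepler-class GPU; real hardware
# additionally rounds resource allocations up to fixed granularities.
WARP_SIZE = 32
MAX_THREADS_PER_SM = 2048
MAX_BLOCKS_PER_SM = 16
REGISTERS_PER_SM = 65536
SHARED_MEM_PER_SM = 49152  # bytes

def occupancy(block_size, regs_per_thread, smem_per_block):
    """Fraction of the SM's warp slots kept busy by resident blocks."""
    limits = [
        MAX_THREADS_PER_SM // block_size,                     # thread-count limit
        MAX_BLOCKS_PER_SM,                                    # block-count limit
        REGISTERS_PER_SM // (regs_per_thread * block_size),   # register limit
    ]
    if smem_per_block > 0:
        limits.append(SHARED_MEM_PER_SM // smem_per_block)    # shared-memory limit
    resident_blocks = min(limits)
    active_warps = resident_blocks * block_size // WARP_SIZE
    return active_warps / (MAX_THREADS_PER_SM // WARP_SIZE)

# 256-thread blocks at 32 registers/thread fill every warp slot (8 blocks)...
print(occupancy(256, 32, 0))    # 1.0
# ...while 1024-thread blocks at 64 registers/thread fit only one block.
print(occupancy(1024, 64, 0))   # 0.5
```

Higher occupancy gives the scheduler more warps to swap in while others wait on global memory, which is one reason block size shows up as a first-order parameter in power/performance studies like this one.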

Cited By

  • (2024) NDRec: A Near-Data Processing System for Training Large-Scale Recommendation Models. IEEE Transactions on Computers, 73(5), 1248-1261. doi:10.1109/TC.2024.3365939
  • (2019) Time-energy analysis of multilevel parallelism in heterogeneous clusters. The Journal of Supercomputing, 75(7), 3397-3425. doi:10.1007/s11227-019-02908-4
  • (2018) Speedup and Energy Analysis of EEG Classification for BCI Tasks on CPU-GPU Clusters. Proceedings of the 6th International Workshop on Parallelism in Bioinformatics, 33-43. doi:10.1145/3235830.3235834

Published In

E2SC '16: Proceedings of the 4th International Workshop on Energy Efficient Supercomputing
November 2016, 91 pages
ISBN: 9781509038565

Publisher

IEEE Press

Author Tags

  1. GPU memory access
  2. heterogeneous computing
  3. high performance computing
  4. power and performance characterization

Qualifiers

  • Research-article

Conference

SC16

Acceptance Rates

Overall Acceptance Rate: 17 of 33 submissions, 52%
