DOI: 10.1145/2668930.2688046
research-article

NUPAR: A Benchmark Suite for Modern GPU Architectures

Published: 31 January 2015

Abstract

Heterogeneous systems consisting of multi-core CPUs, Graphics Processing Units (GPUs), and many-core accelerators have gained widespread use among application developers and data-center platform developers. Modern heterogeneous systems have evolved to include advanced hardware and software features that support a spectrum of application patterns. Heterogeneous programming frameworks such as CUDA, OpenCL, and OpenACC have all introduced new interfaces that let developers use the new features on these platforms. In emerging applications, performance optimization is no longer limited to exploiting data-level parallelism effectively; it also includes leveraging new degrees of concurrency and parallelism to accelerate the entire application.
To aid hardware architects and application developers in effectively tuning performance on GPUs, we have developed the NUPAR benchmark suite. The NUPAR applications belong to a number of different scientific and commercial computing domains. These benchmarks exhibit a range of GPU computing characteristics, covering memory-bandwidth limitations, device occupancy and resource utilization, synchronization latency, and device-specific compute optimizations. The NUPAR applications are specifically designed to stress new hardware and software features, including nested parallelism, concurrent kernel execution, shared host-device memory, and new instructions for precise computation and data movement. In this paper, we focus our discussion on applications developed in CUDA and OpenCL and on high-end server-class GPUs. We describe these benchmarks and evaluate their interaction with different architectural features on a GPU. Our evaluation examines the behavior of the advanced hardware features on recently released GPU architectures.


Published In

ICPE '15: Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering
January 2015
366 pages
ISBN:9781450332484
DOI:10.1145/2668930
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. benchmark suite
  2. cuda
  3. gpus
  4. opencl

Qualifiers

  • Research-article

Funding Sources

  • NSF

Conference

ICPE'15
ICPE'15: ACM/SPEC International Conference on Performance Engineering
January 28 - February 4, 2015
Austin, Texas, USA

Acceptance Rates

ICPE '15 Paper Acceptance Rate 23 of 74 submissions, 31%;
Overall Acceptance Rate 252 of 851 submissions, 30%


Article Metrics

  • Downloads (last 12 months): 184
  • Downloads (last 6 weeks): 22
Reflects downloads up to 12 Jan 2025.


Cited By

  • (2024) RAJA Performance Suite: Performance Portability Analysis with Caliper and Thicket. SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1206-1218. DOI: 10.1109/SCW63240.2024.00162. Online publication date: 17-Nov-2024.
  • (2024) Nearest data processing in GPU. Sustainable Computing: Informatics and Systems, 44, article 101047. DOI: 10.1016/j.suscom.2024.101047. Online publication date: Dec-2024.
  • (2023) A Full-System Perspective on UPMEM Performance. Proceedings of the 1st Workshop on Disruptive Memory Systems, pages 1-7. DOI: 10.1145/3609308.3625266. Online publication date: 23-Oct-2023.
  • (2023) Exploring OpenMP GPU Offloading for Implementing Convolutional Neural Networks. Proceedings of the 14th International Workshop on Programming Models and Applications for Multicores and Manycores, pages 60-69. DOI: 10.1145/3582514.3582523. Online publication date: 25-Feb-2023.
  • (2023) Systematically Exploring High-Performance Representations of Vector Fields Through Compile-Time Composition. Proceedings of the 2023 ACM/SPEC International Conference on Performance Engineering, pages 55-66. DOI: 10.1145/3578244.3583723. Online publication date: 15-Apr-2023.
  • (2022) Experimental Findings on the Sources of Detected Unrecoverable Errors in GPUs. IEEE Transactions on Nuclear Science, 69(3), pages 436-443. DOI: 10.1109/TNS.2022.3141341. Online publication date: Mar-2022.
  • (2022) A compiler framework for optimizing dynamic parallelism on GPUs. Proceedings of the 20th IEEE/ACM International Symposium on Code Generation and Optimization, pages 1-13. DOI: 10.1109/CGO53902.2022.9741284. Online publication date: 2-Apr-2022.
  • (2021) Benchmarking the Nvidia GPU Lineage. Proceedings of the 11th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies, pages 1-6. DOI: 10.1145/3468044.3468053. Online publication date: 21-Jun-2021.
  • (2021) GNNMark: A Benchmark Suite to Characterize Graph Neural Network Training on GPUs. 2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 13-23. DOI: 10.1109/ISPASS51385.2021.00013. Online publication date: Mar-2021.
  • (2020) ArmorAll. ACM Transactions on Architecture and Code Optimization, 17(2), pages 1-24. DOI: 10.1145/3382132. Online publication date: 29-May-2020.
