NUPAR: A Benchmark Suite for Modern GPU Architectures

Authors:

Fanny Nina Paravecino,

Zhongliang Chen,

Perhaad Mistry,

David Kaeli

ICPE '15: Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering

Pages 253 - 264

https://doi.org/10.1145/2668930.2688046

Published: 31 January 2015

Abstract

Heterogeneous systems consisting of multi-core CPUs, Graphics Processing Units (GPUs), and many-core accelerators have gained widespread use among application developers and data-center platform developers. Modern-day heterogeneous systems have evolved to include advanced hardware and software features that support a spectrum of application patterns. Heterogeneous programming frameworks such as CUDA, OpenCL, and OpenACC have all introduced new interfaces that enable developers to use the new features on these platforms. In emerging applications, performance optimization is not limited to effectively exploiting data-level parallelism; it also includes leveraging new degrees of concurrency and parallelism to accelerate the entire application.

To aid hardware architects and application developers in effectively tuning performance on GPUs, we have developed the NUPAR benchmark suite. The NUPAR applications belong to a number of different scientific and commercial computing domains. These benchmarks exhibit a range of GPU computing characteristics, covering memory-bandwidth limitations, device occupancy and resource utilization, synchronization latency, and device-specific compute optimizations. The NUPAR applications are specifically designed to stress new hardware and software features, including nested parallelism, concurrent kernel execution, shared host-device memory, and new instructions for precise computation and data movement. In this paper, we focus our discussion on applications developed in CUDA and OpenCL and on high-end server-class GPUs. We describe these benchmarks and evaluate their interaction with different architectural features on a GPU. Our evaluation examines the behavior of the advanced hardware features on recently released GPU architectures.

Index Terms

NUPAR: A Benchmark Suite for Modern GPU Architectures
1. General and reference
  1. Cross-computing tools and techniques
    1. Metrics
2. Human-centered computing
  1. Human computer interaction (HCI)
    1. HCI design and evaluation methods

Published In

ICPE '15: Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering

January 2015

366 pages

ISBN:9781450332484

DOI:10.1145/2668930

General Chairs: Lizy K. John (UT Austin, USA), Connie U. Smith (L&S Computer Technology, Inc., USA)

Program Chairs: Kai Sachs (SAP SE, Germany), Catalina M. Lladó (University of the Balearic Islands, Spain)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMETRICS: ACM Special Interest Group on Measurement and Evaluation
SIGSOFT: ACM Special Interest Group on Software Engineering
SPEC: SPEC Research Group

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

NSF

Conference

ICPE'15: ACM/SPEC International Conference on Performance Engineering

January 28 - February 4, 2015

Austin, Texas, USA

Acceptance Rates

ICPE '15 Paper Acceptance Rate: 23 of 74 submissions, 31%

Overall Acceptance Rate: 252 of 851 submissions, 30%
