Scan primitives for GPU computing

Authors:

Shubhabrata Sengupta,

John D. Owens

GH '07: Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware

Pages 97 - 106

Published: 04 August 2007

Abstract

The scan primitives are powerful, general-purpose data-parallel primitives that are building blocks for a broad range of applications. We describe GPU implementations of these primitives, specifically an efficient formulation and implementation of segmented scan, on NVIDIA GPUs using the CUDA API. Using the scan primitives, we show novel GPU implementations of quicksort and sparse matrix-vector multiply, and analyze the performance of the scan primitives, several sort algorithms that use the scan primitives, and a graphical shallow-water fluid simulation using the scan framework for a tridiagonal matrix solver.
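
To make the two primitives named in the abstract concrete, here is a minimal sequential sketch of what a scan and a segmented scan compute. It is a reference for the semantics only, not the paper's parallel CUDA implementation. The function names, the choice of addition as the operator, and the head-flag encoding of segments are illustrative assumptions.

```python
from typing import List


def exclusive_scan(values: List[int]) -> List[int]:
    """Exclusive prefix sum: out[i] is the sum of values[0..i-1]."""
    out: List[int] = []
    running = 0
    for v in values:
        out.append(running)  # emit the sum of everything before v
        running += v
    return out


def segmented_inclusive_scan(values: List[int], head_flags: List[int]) -> List[int]:
    """Inclusive prefix sum that restarts at every position whose head flag is 1.

    Assumption: segments are marked by a flag array of the same length as
    values, with 1 at the first element of each segment.
    """
    if len(values) != len(head_flags):
        raise ValueError("values and head_flags must have the same length")
    out: List[int] = []
    running = 0
    for v, flag in zip(values, head_flags):
        if flag:
            running = 0  # a new segment begins here, so drop the old total
        running += v
        out.append(running)
    return out


if __name__ == "__main__":
    data = [3, 1, 7, 0, 4, 1, 6, 3]
    flags = [1, 0, 1, 0, 0, 1, 0, 0]  # three segments: [3,1] [7,0,4] [1,6,3]
    print(exclusive_scan(data))
    print(segmented_inclusive_scan(data, flags))
```

With one segment per matrix row and the per-element products as values, the last entry of each segment in a segmented sum scan is that row's dot product, which is one way to see why the primitive suits sparse matrix-vector multiply.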

Information

Published In

GH '07: Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware

August 2007

119 pages

ISBN:9781595936257

Editors: Dieter Fellner (TU Braunschweig, Germany), Stephen Spencer (The University of Washington)

Sponsors

SIGGRAPH: ACM Special Interest Group on Computer Graphics and Interactive Techniques
EUROGRAPHICS: The European Association for Computer Graphics

Publisher

Eurographics Association

Goslar, Germany

Qualifiers

Article

Conference

GH07

Sponsor:

SIGGRAPH
EUROGRAPHICS

GH07: Graphics Hardware

August 4 - 5, 2007

San Diego, California

Acceptance Rates

GH '07 Paper Acceptance Rate 12 of 30 submissions, 40%;

Overall Acceptance Rate 37 of 94 submissions, 39%
