research-article

Compiling generalized histograms for GPU

Authors: Troels Henriksen, Sune Hellfritzsch, Ponnuswamy Sadayappan, Cosmin Oancea

SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Article No.: 97, Pages 1 - 14

Published: 09 November 2020

Abstract

We present and evaluate an implementation technique for histogram-like computations on GPUs that ensures work-efficient asymptotic cost, support for arbitrary associative and commutative operators, and efficient use of hardware-supported atomic operations when applicable. Based on a systematic empirical examination of the design space, we develop a technique that balances conflict rates and memory footprint.

We demonstrate our technique both as a library implementation in CUDA and as an extension of the parallel array language Futhark with a new construct for expressing generalized histograms, supported by several compiler optimizations. We show that our histogram implementation, taken in isolation, outperforms similar primitives from CUB, and that it is competitive with or outperforms the hand-written code of several application benchmarks, even when the latter is specialized for a class of datasets.
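To make the term "generalized histogram" concrete, the following is a minimal sequential sketch in Python of what such a computation could mean: each value is folded into the bin named by its index using a user-supplied operator. It is not the paper's GPU implementation or the Futhark construct. The names `generalized_histogram` and `chunked_histogram` are hypothetical, and the choice to ignore out-of-range indices is an assumption made for illustration. The second function shows why associativity and commutativity matter: with such an operator, private sub-histograms computed over separate chunks can be combined bin by bin in any order, which is one way to trade memory footprint against update conflicts.

```python
def generalized_histogram(num_bins, op, neutral, indices, values):
    """Sequential reference: fold each value into the bin named by its index."""
    hist = [neutral] * num_bins
    for i, v in zip(indices, values):
        if 0 <= i < num_bins:  # assumption: out-of-range indices are ignored
            hist[i] = op(hist[i], v)
    return hist


def chunked_histogram(num_bins, op, neutral, indices, values, num_chunks):
    """Build one private sub-histogram per chunk, then combine them bin-wise.

    The result matches the sequential reference only when `op` is
    associative and commutative and `neutral` is its neutral element.
    """
    n = len(indices)
    step = max(1, -(-n // num_chunks))  # ceiling division, at least 1
    subs = [
        generalized_histogram(num_bins, op, neutral,
                              indices[s:s + step], values[s:s + step])
        for s in range(0, n, step)
    ]
    result = [neutral] * num_bins
    for sub in subs:
        result = [op(a, b) for a, b in zip(result, sub)]
    return result


# Illustrative data (hypothetical): the index 5 falls outside the 3 bins.
indices = [0, 2, 2, 1, 0, 2, 5]
values = [3, 1, 4, 1, 5, 9, 2]

sums = generalized_histogram(3, lambda a, b: a + b, 0, indices, values)
maxes = generalized_histogram(3, max, 0, indices, values)
chunked = chunked_histogram(3, lambda a, b: a + b, 0, indices, values, 3)
```

With addition as the operator this reduces to an ordinary weighted histogram; swapping in `max` changes the meaning of each bin without changing the structure of the computation.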

Compiling generalized histograms for GPU
1. Computing methodologies
  1. Parallel computing methodologies
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types

Published In

SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

November 2020

1454 pages

ISBN:9781728199986

General Chair: Christine Cuicchi

Program Chairs: Irene Qualters, William Kramer

Sponsors

SIGHPC: ACM Special Interest Group on High Performance Computing

In-Cooperation

IEEE CS

Publisher

IEEE Press

Conference

SC '20: The International Conference for High Performance Computing, Networking, Storage and Analysis

November 9 - 19, 2020

Atlanta, Georgia

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Cited By

Henriksen T, Rainey M, Scholz S (2024). A Comparison of OpenCL, CUDA, and HIP as Compilation Targets for a Functional Array Language. Proceedings of the 1st ACM SIGPLAN International Workshop on Functional Programming for Productivity and Performance, pp. 1-9. Online publication date: 28-Aug-2024.
https://dl.acm.org/doi/10.1145/3677997.3678226

Bruun L, Larsen U, Hinnerskov N, Oancea C (2023). Reverse-Mode AD of Multi-Reduce and Scan in Futhark. Proceedings of the 35th Symposium on Implementation and Application of Functional Languages, pp. 1-14. Online publication date: 29-Aug-2023.
https://dl.acm.org/doi/10.1145/3652561.3652575

Dong X, Wu Y, Wang Z, Dhulipala L, Gu Y, Sun Y, Agrawal K, Shun J (2023). High-Performance and Flexible Parallel Algorithms for Semisort and Related Problems. Proceedings of the 35th ACM Symposium on Parallelism in Algorithms and Architectures, pp. 341-353. Online publication date: 17-Jun-2023.
https://dl.acm.org/doi/10.1145/3558481.3591071

El Kharroubi M, Coudray B, Malaspinas O, Henriksen T, Low T (2022). Distributed parallel computing with Futhark: a functional language to generate distributed parallel code. Proceedings of the 8th ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming, pp. 12-24. Online publication date: 13-Jun-2022.
https://dl.acm.org/doi/10.1145/3520306.3534501

Paszke A, Johnson D, Duvenaud D, Vytiniotis D, Radul A, Johnson M, Ragan-Kelley J, Maclaurin D (2021). Getting to the point: index sets and parallelism-preserving autodiff for pointful array programming. Proceedings of the ACM on Programming Languages, 5:ICFP, pp. 1-29. Online publication date: 19-Aug-2021.
https://dl.acm.org/doi/10.1145/3473593