DOI: 10.5555/3433701.3433830
research-article

Compiling generalized histograms for GPU

Published: 09 November 2020

Abstract

We present and evaluate an implementation technique for histogram-like computations on GPUs that ensures work-efficient asymptotic cost, supports arbitrary associative and commutative operators, and makes efficient use of hardware-supported atomic operations when applicable. Based on a systematic empirical examination of the design space, we develop a technique that balances conflict rates and memory footprint.
We demonstrate our technique both as a library implementation in CUDA and as an extension of the parallel array language Futhark with a new construct for expressing generalized histograms, supported by several compiler optimizations. We show that our histogram implementation, taken in isolation, outperforms similar primitives from CUB, and that it is competitive with or outperforms the hand-written code of several application benchmarks, even when the latter is specialized for a class of datasets.
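
To make the term "generalized histogram" concrete, the following is a minimal sequential sketch in Python of what such a computation produces: each input value is combined into the bin selected by its index, using a caller-supplied associative and commutative operator with a neutral element. The function names, the parameter order, and the choice to skip out-of-range indices are assumptions made for illustration; this is not the paper's CUDA library or the Futhark construct. The second function illustrates, in generic form, the trade-off between conflict rate and memory footprint: it builds one private sub-histogram per chunk of the input and then merges them bin by bin, which is valid only because the operator is associative and commutative.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def generalized_histogram(
    op: Callable[[T, T], T],
    neutral: T,
    num_bins: int,
    indices: Sequence[int],
    values: Sequence[T],
) -> List[T]:
    """Sequential reference: combine each value into its bin with `op`.

    `op` is assumed to be associative and commutative, with `neutral`
    as its neutral element (immutable values are assumed).
    """
    hist = [neutral] * num_bins
    for i, v in zip(indices, values):
        # Skipping out-of-range indices is an assumption of this sketch.
        if 0 <= i < num_bins:
            hist[i] = op(hist[i], v)
    return hist


def chunked_histogram(
    op: Callable[[T, T], T],
    neutral: T,
    num_bins: int,
    indices: Sequence[int],
    values: Sequence[T],
    num_chunks: int,
) -> List[T]:
    """Build one private sub-histogram per input chunk, then merge them.

    More chunks mean fewer updates competing for the same bin copy,
    at the cost of `num_chunks * num_bins` intermediate storage.
    `num_chunks` must be at least 1.
    """
    n = len(indices)
    step = max(1, -(-n // num_chunks))  # ceiling division
    partials = [
        generalized_histogram(
            op, neutral, num_bins, indices[s:s + step], values[s:s + step]
        )
        for s in range(0, n, step)
    ]
    result = [neutral] * num_bins
    for part in partials:
        # Bin-wise merge; the order of merging does not matter because
        # `op` is associative and commutative.
        result = [op(a, b) for a, b in zip(result, part)]
    return result
```

For example, with `op` set to integer addition and `neutral` set to `0`, both functions compute a weighted count per bin; with `op` set to `max`, they compute the per-bin maximum. On a GPU the per-chunk loops would run concurrently, and addition or maximum on machine integers could map to hardware atomic operations, whereas an arbitrary operator would need another mechanism to make each bin update atomic.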

Published In

SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2020
1454 pages
ISBN:9781728199986

Sponsors

In-Cooperation

  • IEEE CS

Publisher

IEEE Press

Author Tags

  1. GPU
  2. functional programming
  3. parallelism

Qualifiers

  • Research-article

Conference

SC '20
Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
