Nested data-parallelism on the GPU

Authors: Lars Bergstrom, John Reppy

ICFP '12: Proceedings of the 17th ACM SIGPLAN international conference on Functional programming

Pages 247 - 258

https://doi.org/10.1145/2364527.2364563

Published: 09 September 2012

Abstract

Graphics processing units (GPUs) provide both memory bandwidth and arithmetic performance far greater than that available on CPUs but, because of their Single-Instruction-Multiple-Data (SIMD) architecture, they are hard to program. Most of the programs ported to GPUs thus far use traditional data-level parallelism, performing only operations that operate uniformly over vectors.

NESL is a first-order functional language that was designed to allow programmers to write irregular-parallel programs - such as parallel divide-and-conquer algorithms - for wide-vector parallel computers. This paper presents our port of the NESL implementation to work on GPUs and provides empirical evidence that nested data-parallelism (NDP) on GPUs significantly outperforms CPU-based implementations and matches or beats newer GPU languages that support only flat parallelism. While our performance does not match that of hand-tuned CUDA programs, we argue that the notational conciseness of NESL is worth the loss in performance. This work provides the first language implementation that directly supports NDP on a GPU.
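
To make the idea of nested data-parallelism concrete, the following is a minimal sketch, not the paper's implementation. It assumes the commonly described flattening representation, in which a nested sequence is stored as one flat data vector plus a vector of segment lengths, so that an irregular per-segment operation becomes uniform operations over flat vectors. The names `flatten` and `segmented_sum` are hypothetical and chosen only for illustration.

```python
from itertools import accumulate


def flatten(nested):
    """Split a list of lists into a flat data vector and per-segment lengths."""
    data = [x for seg in nested for x in seg]
    seg_lens = [len(seg) for seg in nested]
    return data, seg_lens


def segmented_sum(data, seg_lens):
    """Sum every segment using a single prefix sum over the flat data vector."""
    prefix = [0] + list(accumulate(data))      # prefix[i] = sum of data[:i]
    ends = list(accumulate(seg_lens))          # end offset of each segment
    starts = [e - n for e, n in zip(ends, seg_lens)]
    return [prefix[e] - prefix[s] for s, e in zip(starts, ends)]


# Irregular input: segments of different lengths, including an empty one.
nested = [[3, 1, 4], [], [1, 5], [9, 2, 6, 5]]
data, seg_lens = flatten(nested)
sums = segmented_sum(data, seg_lens)
print(data, seg_lens, sums)
```

Every step here is a flat map or scan, which is the kind of operation a wide-vector machine can execute uniformly even though the segments differ in length.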

Index Terms

Nested data-parallelism on the GPU
1. Software and its engineering
  1. Software notations and tools
    1. Compilers
    2. General programming languages
      1. Language types
        Concurrent programming languages
        Distributed programming languages
        Parallel programming languages

Published In

ICFP '12: Proceedings of the 17th ACM SIGPLAN international conference on Functional programming

September 2012

392 pages

ISBN:9781450310543

DOI:10.1145/2364527

General Chair: Peter Thiemann, University of Freiburg, Germany
Program Chair: Robby Findler, Northwestern University, USA

ACM SIGPLAN Notices Volume 47, Issue 9
ICFP '12
September 2012
368 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2398856

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ICFP'12

Sponsor:

SIGPLAN

ICFP'12: ACM SIGPLAN International Conference on Functional Programming

September 9 - 15, 2012

Copenhagen, Denmark

Acceptance Rates

ICFP '12 Paper Acceptance Rate 32 of 88 submissions, 36%;

Overall Acceptance Rate 333 of 1,064 submissions, 31%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 65
Total Downloads: 587
Downloads (Last 12 months): 19
Downloads (Last 6 weeks): 5

Reflects downloads up to 11 Feb 2025

Citations

Cited By

van Balen D, Keller G, de Wolff I, McDonell T, Rainey M, Scholz S (2024). Fusing Gathers with Integer Linear Programming. Proceedings of the 1st ACM SIGPLAN International Workshop on Functional Programming for Productivity and Performance, pp. 10-23. Online publication date: 28-Aug-2024.
https://dl.acm.org/doi/10.1145/3677997.3678227
Huang M, Yang W (2020). PFACC: An OpenACC-like programming model for irregular nested parallelism. Software: Practice and Experience 50:10, pp. 1877-1904. Online publication date: 9-Jul-2020.
https://doi.org/10.1002/spe.2868
Pawlak W, Elsman M, Oancea C (2019). A functional approach to accelerating Monte Carlo based American option pricing. Proceedings of the 31st Symposium on Implementation and Application of Functional Languages, pp. 1-12. Online publication date: 25-Sep-2019.
https://dl.acm.org/doi/10.1145/3412932.3412937
Elsman M, Henriksen T, Serup N, Gibbons J (2019). Data-parallel flattening by expansion. Proceedings of the 6th ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming, pp. 14-24. Online publication date: 8-Jun-2019.
https://dl.acm.org/doi/10.1145/3315454.3329955
Henriksen T, Thorøe F, Elsman M, Oancea C, Hollingsworth J, Keidar I (2019). Incremental flattening for nested data parallelism. Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, pp. 53-67. Online publication date: 16-Feb-2019.
https://dl.acm.org/doi/10.1145/3293883.3295707
Yan M, Huang M, Yang W (2019). Order Analysis for Translating NESL Programs into Efficient GPU Code. New Trends in Computer Technologies and Applications, pp. 330-337. Online publication date: 11-Jul-2019.
https://doi.org/10.1007/978-981-13-9190-3_34
SuB T, Doring N, Brinkmann A, Nagel L (2018). And Now for Something Completely Different: Running Lisp on GPUs. 2018 IEEE International Conference on Cluster Computing (CLUSTER), pp. 434-444. Online publication date: Sep-2018.
https://doi.org/10.1109/CLUSTER.2018.00060
Barwell A, Brown C, Hammond K (2018). Finding parallel functional pearls. Future Generation Computer Systems 79:P2, pp. 669-686. Online publication date: 1-Feb-2018.
https://dl.acm.org/doi/10.1016/j.future.2017.07.024
Clifton-Everest R, McDonell T, Chakravarty M, Keller G (2017). Streaming irregular arrays. ACM SIGPLAN Notices 52:10, pp. 174-185. Online publication date: 7-Sep-2017.
https://dl.acm.org/doi/10.1145/3156695.3122971
Henriksen T, Serup N, Elsman M, Henglein F, Oancea C (2017). Futhark: purely functional GPU-programming with nested parallelism and in-place array updates. ACM SIGPLAN Notices 52:6, pp. 556-571. Online publication date: 14-Jun-2017.
https://dl.acm.org/doi/10.1145/3140587.3062354
