Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/2364527.2364563acmconferencesArticle/Chapter ViewAbstractPublication PagesicfpConference Proceedingsconference-collections
research-article

Nested data-parallelism on the gpu

Published: 09 September 2012 Publication History

Abstract

Graphics processing units (GPUs) provide both memory bandwidth and arithmetic performance far greater than that available on CPUs but, because of their Single-Instruction-Multiple-Data (SIMD) architecture, they are hard to program. Most of the programs ported to GPUs thus far use traditional data-level parallelism, performing only operations that operate uniformly over vectors.
NESL is a first-order functional language that was designed to allow programmers to write irregular-parallel programs - such as parallel divide-and-conquer algorithms - for wide-vector parallel computers. This paper presents our port of the NESL implementation to work on GPUs and provides empirical evidence that nested data-parallelism (NDP) on GPUs significantly outperforms CPU-based implementations and matches or beats newer GPU languages that support only flat parallelism. While our performance does not match that of hand-tuned CUDA programs, we argue that the notational conciseness of NESL is worth the loss in performance. This work provides the first language implementation that directly supports NDP on a GPU.

References

[1]
Blelloch, G. and S. Chatterjee. VCODE: A data-parallel intermediate language. In FOMPC3, 1990, pp. 471--480.
[2]
Blelloch, G. and S. Chatterjee. CVL: A C vector language, 1993.
[3]
Blelloch, G. E., S. Chatterjee, J. C. Hardwick, J. Sipelstein, and M. Zagha. Implementation of a portable nested data-parallel language. JPDC, 21(1), 1994, pp. 4--14.
[4]
Barber, C. B., D. P. Dobkin, and H. Huhdanpaa. The quickhull algorithm for convex hulls. ACM TOMS, 22(4), 1996, pp. 469--483.
[5]
Bergstrom, L., M. Fluet, M. Rainey, J. Reppy, and A. Shaw. Lazy tree splitting. In ICFP '10. ACM, September 2010, pp. 93--104.
[6]
Barnes, J. and P. Hut. A hierarchical O(N log N) force calculation algorithm. Nature, 324, December 1986, pp. 446--449.
[7]
Blelloch, G. E. Programming parallel algorithms. CACM, 39(3), March 1996, pp. 85--97.
[8]
Burtscher, M. and K. Pingali. An efficient CUDA implementation of the tree-based Barnes Hut n-body algorithm. In GPU Computing Gems Emerald Edition, chapter 6, pp. 75--92. Elsevier Science Publishers, New York, NY, 2011.
[9]
Black, F. and M. Scholes. The pricing of options and corporate liabilities. JPE, 81(3), 1973, pp. 637--654.
[10]
Blelloch, G. E. and G. W. Sabot. Compiling collection-oriented languages onto massively parallel computers. JPDC, 8(2), 1990, pp. 119--134.
[11]
Cunningham, D., R. Bordawekar, and V. Saraswat. GPU programming in a high level language compiling X10 to CUDA. In X10 '11, San Jose, CA, May 2011. Available from http://x10-lang.org/.
[12]
Catanzaro, B., M. Garland, and K. Keutzer. Copperhead: compiling an embedded data parallel language. In PPoPP '11, San Antonio, TX, February 2011. ACM, pp. 47--56.
[13]
Chatterjee, S. Compiling nested data-parallel programs for shared-memory multiprocessors. ACM TOPLAS, 15(3), July 1993, pp. 400--462.
[14]
Chakravarty, M. M., G. Keller, S. Lee, T. L. McDonell, and V. Grover. Accelerating Haskell array codes with multicore GPUs. In DAMP '11, Austin, January 2011. ACM, pp. 3--14.
[15]
Chakravarty, M. M. T., G. Keller, R. Leshchinskiy, and W. Pfannenstiel. Nepal - nested data parallelism in Haskell. In Euro-Par '01, vol. 2150 of LNCS. Springer-Verlag, August 2001, pp. 524--534.
[16]
Chakravarty, M. M. T., R. Leshchinskiy, S. Peyton Jones, and G. Keller. Partial vectorisation of Haskell programs. In DAMP '08. ACM, January 2008, pp. 2--16. Available from http://clip.dia.fi.upm.es/Conferences/DAMP08/.
[17]
Dhanasekaran, B. and N. Rubin. A new method for GPU based irregular reductions and its application to k-means clustering. In GPGPU-4, Newport Beach, California, March 2011. ACM.
[18]
Ertl, M. A. Threaded code variations and optimizations. In EuroForth 2001, Schloss Dagstuhl, Germany, November 2001. pp. 49--55. Available from http://www.complang.tuwien.ac.at/papers/.
[19]
Gao, M., T.-T. Cao, A. Nanjappa, T.-S. Tan, and Z. Huang. A GPU Algorithm for Convex Hull. Technical Report TRA1/12, National University of Singapore, School of Computing, January 2012.
[20]
GHC. The Glasgow Haskell Compiler. Available from http://www.haskell.org/ghc.
[21]
Grelck, C. and S.-B. Scholz. SAC - A Functional Array Language for Efficient Multi-threaded Execution. IJPP, 34(4), August 2006, pp. 383--427.
[22]
Guo, J., J. Thiyagalingam, and S.-B. Scholz. Breaking the GPU programming barrier with the auto-parallelising SAC compiler. In DAMP '11, Austin, January 2011. ACM, pp. 15--24.
[23]
Hoberock, J. and N. Bell. Thrust: A productivity-oriented library for CUDA. In W. W. Hwu (ed.), GPU Computing Gems, Jade Edition, chapter 26, pp. 359--372. Morgan Kaufmann Publishers, October 2011.
[24]
Keller, G. Transformation-based Implementation of Nested Data Parallelism for Distributed Memory Machines. Ph.D. dissertation, Technische Universität Berlin, Berlin, Germany, 1999.
[25]
Khronos OpenCL Working Group. OpenCL 1.2 Specification, November 2011. Available from http://www.khronos.org/registry/cl/specs/opencl-1.2.pdf.
[26]
Larsen, B. Simple optimizations for an applicative array language for graphics processors. In DAMP '11, Austin, January 2011. ACM, pp. 25--34.
[27]
Leshchinskiy, R., M. M. T. Chakravarty, and G. Keller. Higher order flattening. In V. Alexandrov, D. van Albada, P. Sloot, and J. Dongarra (eds.), ICCS '06, number 3992 in LNCS. Springer-Verlag, May 2006, pp. 920--928.
[28]
Leshchinskiy, R. Higher-Order Nested Data Parallelism: Semantics and Implementation. Ph.D. dissertation, Technische Universität Berlin, Berlin, Germany, 2005.
[29]
Merrill, D., M. Garland, and A. Grimshaw. Scalable GPU graph traversal. In PPoPP '12, New Orleans, LA, February 2012. ACM, pp. 117--128.
[30]
Mendez-Lojo, M., M. Burtscher, and K. Pingali. A GPU implementation of inclusion-based points-to analysis. In PPoPP '12, New Orleans, LA, February 2012. ACM, pp. 107--116.
[31]
Mainland, G. and G. Morrisett. Nikola: Embedding compiled GPU functions in Haskell. In HASKELL '10, Baltimore, MD, September 2010. ACM, pp. 67--78.
[32]
NVIDIA. NVIDIA CUDA C Best Practices Guide, 2011.
[33]
NVIDIA. NVIDIA CUDA C Programming Guide, 2011. Available from http://developer.nvidia.com/category/zone/cuda-zone.
[34]
Parker, S. G., J. Bigler, A. Dietrich, H. Friedrich, J. Hoberock, D. Luebke, D. McAllister, M. McGuire, K. Morley, A. Robison, and M. Stich. OptiX: a general purpose ray tracing engine. ACM TOG, 29, July 2010.
[35]
Palmer, D. W., J. F. Prins, and S. Westfold. Work-efficient nested data-parallelism. In FoMPP5. IEEE Computer Society Press, 1995, pp. 186--193.
[36]
Proebsting, T. A. Optimizing an ANSI C interpreter with superoperators. In POPL '95, San Francisco, January 1995. ACM, pp. 322--332.
[37]
Sengupta, S., M. Harris, Y. Zhang, and J. D. Owens. Scan primitives for GPU computing. In GH '07, San Diego, CA, August 2007. Eurographics Association, pp. 97--106.
[38]
Yang, K., B. He, Q. Luo, P. V. Sander, and J. Shi. Stack-based parallel recursion on graphics processors. In PPoPP '09, Raleigh, NC, February 2009. ACM, pp. 299--300.

Cited By

View all
  • (2024)Fusing Gathers with Integer Linear ProgrammingProceedings of the 1st ACM SIGPLAN International Workshop on Functional Programming for Productivity and Performance10.1145/3677997.3678227(10-23)Online publication date: 28-Aug-2024
  • (2020)PFACC: An OpenACC‐like programming model for irregular nested parallelismSoftware: Practice and Experience10.1002/spe.286850:10(1877-1904)Online publication date: 9-Jul-2020
  • (2019)A functional approach to accelerating Monte Carlo based american option pricingProceedings of the 31st Symposium on Implementation and Application of Functional Languages10.1145/3412932.3412937(1-12)Online publication date: 25-Sep-2019
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
ICFP '12: Proceedings of the 17th ACM SIGPLAN international conference on Functional programming
September 2012
392 pages
ISBN:9781450310543
DOI:10.1145/2364527
  • cover image ACM SIGPLAN Notices
    ACM SIGPLAN Notices  Volume 47, Issue 9
    ICFP '12
    September 2012
    368 pages
    ISSN:0362-1340
    EISSN:1558-1160
    DOI:10.1145/2398856
    Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 September 2012

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. gpgpu
  2. gpu
  3. nesl
  4. nested data parallelism

Qualifiers

  • Research-article

Conference

ICFP'12
Sponsor:

Acceptance Rates

ICFP '12 Paper Acceptance Rate 32 of 88 submissions, 36%;
Overall Acceptance Rate 333 of 1,064 submissions, 31%

Upcoming Conference

ICFP '25
ACM SIGPLAN International Conference on Functional Programming
October 12 - 18, 2025
Singapore , Singapore

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)19
  • Downloads (Last 6 weeks)5
Reflects downloads up to 11 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Fusing Gathers with Integer Linear ProgrammingProceedings of the 1st ACM SIGPLAN International Workshop on Functional Programming for Productivity and Performance10.1145/3677997.3678227(10-23)Online publication date: 28-Aug-2024
  • (2020)PFACC: An OpenACC‐like programming model for irregular nested parallelismSoftware: Practice and Experience10.1002/spe.286850:10(1877-1904)Online publication date: 9-Jul-2020
  • (2019)A functional approach to accelerating Monte Carlo based american option pricingProceedings of the 31st Symposium on Implementation and Application of Functional Languages10.1145/3412932.3412937(1-12)Online publication date: 25-Sep-2019
  • (2019)Data-parallel flattening by expansionProceedings of the 6th ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming10.1145/3315454.3329955(14-24)Online publication date: 8-Jun-2019
  • (2019)Incremental flattening for nested data parallelismProceedings of the 24th Symposium on Principles and Practice of Parallel Programming10.1145/3293883.3295707(53-67)Online publication date: 16-Feb-2019
  • (2019)Order Analysis for Translating NESL Programs into Efficient GPU CodeNew Trends in Computer Technologies and Applications10.1007/978-981-13-9190-3_34(330-337)Online publication date: 11-Jul-2019
  • (2018)And Now for Something Completely Different: Running Lisp on GPUs2018 IEEE International Conference on Cluster Computing (CLUSTER)10.1109/CLUSTER.2018.00060(434-444)Online publication date: Sep-2018
  • (2018)Finding parallel functional pearlsFuture Generation Computer Systems10.1016/j.future.2017.07.02479:P2(669-686)Online publication date: 1-Feb-2018
  • (2017)Streaming irregular arraysACM SIGPLAN Notices10.1145/3156695.312297152:10(174-185)Online publication date: 7-Sep-2017
  • (2017)Futhark: purely functional GPU-programming with nested parallelism and in-place array updatesACM SIGPLAN Notices10.1145/3140587.306235452:6(556-571)Online publication date: 14-Jun-2017
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media