DOI:10.1145/2364474.2364484
research-article

Financial software on GPUs: between Haskell and Fortran

Published: 15 September 2012

Abstract

This paper presents a real-world pricing kernel for financial derivatives and evaluates the language and compiler tool chain that would allow expressive, hardware-neutral algorithm implementation and efficient execution on graphics-processing units (GPUs). The language issues refer to preserving algorithmic invariants, e.g., inherent parallelism made explicit by map-reduce-scan functional combinators. Efficient execution is achieved by manually applying a series of generally applicable compiler transformations that allow the generated OpenCL code to yield speedups as high as 70x and 540x on a commodity mobile and desktop GPU, respectively.
Apart from the concrete speedups attained, our contributions are twofold: First, from a language perspective, we illustrate that even state-of-the-art auto-parallelization techniques are incapable of discovering all the requisite data parallelism when rendering the functional code in Fortran-style imperative array processing form. Second, from a performance perspective, we study which compiler transformations are necessary to map the high-level functional code to hand-optimized OpenCL code for GPU execution. We discover a rich optimization space with nontrivial trade-offs and cost models. Memory reuse in map-reduce patterns, strength reduction, branch divergence optimization, and memory access coalescing exhibit significant impact individually. When combined, they enable essentially full utilization of all GPU cores.
Functional programming has played a crucial double role in our case study: capturing the naturally data-parallel structure of the pricing algorithm in a transparent, reusable and entirely hardware-independent fashion; and supporting the correctness of the subsequent compiler transformations to a hardware-oriented target language by a rich class of universally valid equational properties. Given the observed difficulty of automatically parallelizing imperative sequential code and the inherent labor of porting hardware-oriented and -optimized programs, our case study suggests that functional programming technology can facilitate high-level expression of leading-edge performant portable high-performance systems for massively parallel hardware architectures.
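
To make the phrase "map-reduce-scan functional combinators" concrete, the following is a minimal sketch in Python. It is not the paper's pricing kernel: the function names (`price_path`, `mean_payoff`), the European-call payoff, and the default values for `s0` and `strike` are hypothetical and chosen only for illustration. It shows the three combinators in a toy Monte Carlo setting: a scan along each path, a map across independent paths, and a reduce with an associative operator.

```python
from functools import reduce
from itertools import accumulate
import math


def price_path(increments, s0=100.0, strike=100.0):
    """Payoff of a toy European call along one simulated path.

    `increments` is a non-empty list of log-returns (an assumption of
    this sketch; the paper's kernel is more elaborate).
    """
    # scan: the running sum of log-returns gives the log-price at each step
    log_prices = accumulate(increments)
    # map: turn every log-price into a price
    prices = [s0 * math.exp(x) for x in log_prices]
    # payoff depends only on the final price in this toy contract
    return max(prices[-1] - strike, 0.0)


def mean_payoff(paths):
    """Average payoff over all paths."""
    # map: each path is priced independently, so this step is data parallel
    payoffs = map(price_path, paths)
    # reduce: addition is associative, so partial sums can be combined in any grouping
    total = reduce(lambda a, b: a + b, payoffs, 0.0)
    return total / len(paths)


if __name__ == "__main__":
    demo_paths = [[0.0, 0.0], [math.log(1.1), 0.0]]
    print(mean_payoff(demo_paths))
```

Because the outer map has no dependencies between paths and the reduce uses an associative operator, the parallel structure is visible in the code itself, which is the property the abstract says is hard to recover once the same computation is written as imperative loops over arrays.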

Published In

FHPC '12: Proceedings of the 1st ACM SIGPLAN workshop on Functional high-performance computing
September 2012
110 pages
ISBN:9781450315777
DOI:10.1145/2364474
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. autoparallelization
  2. functional language
  3. memory coalescing
  4. strength reduction
  5. tiling

Qualifiers

  • Research-article

Conference

ICFP'12

Acceptance Rates

Overall Acceptance Rate 18 of 25 submissions, 72%

Cited By

  • (2023) Reverse-Mode AD of Multi-Reduce and Scan in Futhark. Proceedings of the 35th Symposium on Implementation and Application of Functional Languages, pages 1-14. DOI:10.1145/3652561.3652575. Online publication date: 29-Aug-2023
  • (2021) Acceleration of lattice models for pricing portfolios of fixed-income derivatives. Proceedings of the 7th ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming, pages 27-38. DOI:10.1145/3460944.3464309. Online publication date: 17-Jun-2021
  • (2021) Dataset Sensitive Autotuning of Multi-versioned Code Based on Monotonic Properties. Trends in Functional Programming, pages 3-23. DOI:10.1007/978-3-030-83978-9_1. Online publication date: 23-Aug-2021
  • (2019) A functional approach to accelerating Monte Carlo based american option pricing. Proceedings of the 31st Symposium on Implementation and Application of Functional Languages, pages 1-12. DOI:10.1145/3412932.3412937. Online publication date: 25-Sep-2019
  • (2019) Incremental flattening for nested data parallelism. Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, pages 53-67. DOI:10.1145/3293883.3295707. Online publication date: 16-Feb-2019
  • (2018) Modular acceleration: tricky cases of functional high-performance computing. Proceedings of the 7th ACM SIGPLAN International Workshop on Functional High-Performance Computing, pages 10-21. DOI:10.1145/3264738.3264740. Online publication date: 17-Sep-2018
  • (2016) FinPar. ACM Transactions on Architecture and Code Optimization 13:2, pages 1-27. DOI:10.1145/2898354. Online publication date: 27-Jun-2016
  • (2016) GPU-Powered Shotgun Stochastic Search for Dirichlet Process Mixtures of Gaussian Graphical Models. Journal of Computational and Graphical Statistics 25:3, pages 762-788. DOI:10.1080/10618600.2015.1037883. Online publication date: 5-Aug-2016
  • (2016) A language for hierarchical data parallel design-space exploration on GPUs. Journal of Functional Programming 26. DOI:10.1017/S0956796816000046. Online publication date: 17-Mar-2016
  • (2015) Scalable conditional induction variables (CIV) analysis. Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization, pages 213-224. DOI:10.5555/2738600.2738627. Online publication date: 7-Feb-2015
