research-article

Open access

HPVM: heterogeneous parallel virtual machine

Authors: Maria Kotsifakou, Prakalp Srivastava, Matthew D. Sinclair, Rakesh Komuravelli, Sarita Adve

ACM SIGPLAN Notices, Volume 53, Issue 1

Pages 68 - 80

https://doi.org/10.1145/3200691.3178493

Published: 10 February 2018

Abstract

We propose a parallel program representation for heterogeneous systems, designed to enable performance portability across a wide range of popular parallel hardware, including GPUs, vector instruction sets, multicore CPUs, and potentially FPGAs. Our representation, which we call HPVM, is a hierarchical dataflow graph with shared memory and vector instructions. HPVM supports three important capabilities for programming heterogeneous systems: a compiler intermediate representation (IR), a virtual instruction set (ISA), and a basis for runtime scheduling; previous systems focus on only one of these capabilities. As a compiler IR, HPVM aims to enable effective code generation and optimization for heterogeneous systems. As a virtual ISA, it can be used to ship executable programs, in order to achieve both functional portability and performance portability across such systems. At runtime, HPVM enables flexible scheduling policies, both through the graph structure and through the ability to compile individual nodes in a program to any of the target devices on a system. We have implemented a prototype HPVM system that comprises the HPVM IR, defined as an extension of the LLVM compiler IR; compiler optimizations that operate directly on HPVM graphs; and code generators that translate the virtual ISA to NVIDIA GPUs, Intel's AVX vector units, and multicore x86-64 processors. Experimental results show that HPVM optimizations achieve significant performance improvements, that HPVM translators achieve performance competitive with manually developed OpenCL code for both GPUs and vector hardware, and that runtime scheduling policies can make use of both program and runtime information to exploit the flexible compilation capabilities. Overall, we conclude that the HPVM representation is a promising basis for achieving performance portability and for implementing parallelizing compilers for heterogeneous parallel systems.
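To make the idea of a hierarchical dataflow graph with per-node target selection more concrete, the following is a minimal Python sketch. It is not the HPVM IR or its API: the `Node` class, the `schedule` function, the device labels ("gpu", "avx", "cpu"), and the fixed preference order are all assumptions made for illustration. It only shows how a graph whose nodes may themselves contain graphs can be walked in dataflow order while each leaf is assigned to one of the devices available on a given system.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import List, Set, Tuple


@dataclass
class Node:
    """Hypothetical graph node; an illustration, not an HPVM data structure."""
    name: str
    # Devices this node could be compiled for (illustrative labels only).
    targets: Tuple[str, ...] = ("cpu",)
    # A non-empty child list makes this an internal node; otherwise it is a leaf.
    children: List["Node"] = field(default_factory=list)
    # Dataflow edges between children, as (producer, consumer) name pairs.
    edges: List[Tuple[str, str]] = field(default_factory=list)


def child_order(node: Node) -> List[str]:
    """Return child names in an order that respects the dataflow edges."""
    deps = {child.name: set() for child in node.children}
    for producer, consumer in node.edges:
        deps[consumer].add(producer)
    return list(TopologicalSorter(deps).static_order())


def schedule(node: Node, available: Set[str],
             preference: Tuple[str, ...] = ("gpu", "avx", "cpu")):
    """Flatten the hierarchy into (leaf name, chosen device) pairs."""
    if not node.children:
        # Toy policy: pick the first preferred device that the leaf supports
        # and that is present on this system.
        for device in preference:
            if device in node.targets and device in available:
                return [(node.name, device)]
        raise ValueError(f"no available target for {node.name}")
    by_name = {child.name: child for child in node.children}
    plan = []
    for name in child_order(node):
        plan.extend(schedule(by_name[name], available, preference))
    return plan


# A two-level graph: root -> pipeline -> (load -> blur -> edge_detect).
load = Node("load", targets=("cpu",))
blur = Node("blur", targets=("gpu", "avx", "cpu"))
edge = Node("edge_detect", targets=("gpu", "cpu"))
pipeline = Node("pipeline", children=[load, blur, edge],
                edges=[("load", "blur"), ("blur", "edge_detect")])
root = Node("root", children=[pipeline])

print(schedule(root, {"cpu", "avx"}))
print(schedule(root, {"cpu", "gpu"}))
```

The same graph yields different leaf-to-device assignments depending on which devices are passed in as available, which is the kind of flexibility the abstract attributes to compiling individual nodes for any target. A real scheduler would also use program and runtime information rather than a static preference list.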

Index Terms

Computer systems organization > Architectures > Other architectures > Heterogeneous (hybrid) systems

Published In

ACM SIGPLAN Notices Volume 53, Issue 1

PPoPP '18

January 2018

426 pages

ISSN:0362-1340

EISSN:1558-1160

DOI:10.1145/3200691

Editor:
Matthew Fluet
Rochester Institute of Technology

PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2018
442 pages
ISBN:9781450349826
DOI:10.1145/3178487
General Chair:
Andreas Krall
Vienna University of Technology, Austria
Program Chair:
Thomas R. Gross
ETH Zürich, Switzerland

Copyright © 2018 Owner/Author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

MARCO
National Science Foundation
DARPA
SRC STARNet C-FAR
