TAPAS: generating parallel accelerators from parallel programs

Authors: Steven Margerm, Amirali Sharifian, Arrvindh Shriraman, Gilles Pokam

MICRO-51: Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture

Pages 245 - 257

https://doi.org/10.1109/MICRO.2018.00028

Published: 20 October 2018

Abstract

High-level synthesis (HLS) tools generate accelerators from software programs to ease the task of building hardware. Unfortunately, current HLS tools have limited support for concurrency, which limits the speedup achievable with the generated accelerator. Current approaches target only fixed static patterns (e.g., pipelines, data-parallel kernels). This constrains the ability of software programmers to express concurrency. Moreover, the generated accelerator loses a key benefit of parallel hardware, dynamic asynchrony, and with it the potential to hide long latencies and cache misses.

We have developed TAPAS, an HLS toolchain for generating parallel accelerators from programs with dynamic parallelism. TAPAS is built on top of Tapir [22], [39], which embeds fork-join parallelism into the compiler's intermediate representation (IR). TAPAS leverages the compiler IR to identify parallelism and synthesizes the hardware logic. TAPAS provides first-class architecture support for spawning, coordinating, and synchronizing tasks during accelerator execution. We demonstrate that TAPAS can generate accelerators for concurrent programs with heterogeneous, nested, and recursive parallelism. Our evaluation on Intel-Altera DE1-SoC and Arria-10 boards demonstrates that TAPAS-generated accelerators achieve 20× the power efficiency of an Intel Xeon while maintaining comparable performance. We also show that TAPAS enables lightweight tasks that can be spawned in ≃10 cycles, which lets accelerators exploit available fine-grain parallelism. TAPAS is a complete HLS toolchain for synthesizing parallel programs to accelerators and is open-sourced.
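To make "dynamic, recursive fork-join parallelism" concrete, the following is a minimal sketch of the program shape the abstract refers to. It is not taken from the paper and is not TAPAS input: the names `fib_serial`, `fib_fork_join`, and `cutoff` are hypothetical, and Python threads stand in for the spawn and sync operations that a fork-join language would provide. The point is only that the number and nesting of tasks depend on the run-time argument, which a fixed static pattern such as a pipeline cannot express.

```python
import threading


def fib_serial(n):
    """Plain sequential Fibonacci, used at or below the spawn cutoff."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def fib_fork_join(n, cutoff=10):
    """Recursive fork-join Fibonacci (illustrative stand-in, not TAPAS code).

    Every call above the cutoff spawns one child task and runs the other
    branch itself, so the task tree is only known at run time.
    """
    # Small inputs (and always n < 2) are computed without spawning.
    if n <= max(cutoff, 1):
        return fib_serial(n)

    result = {}

    def child():
        # The child task may itself spawn further tasks (nested parallelism).
        result["left"] = fib_fork_join(n - 1, cutoff)

    task = threading.Thread(target=child)
    task.start()                            # "spawn": fork the child task
    right = fib_fork_join(n - 2, cutoff)    # parent continues with its own work
    task.join()                             # "sync": wait for the child
    return result["left"] + right


if __name__ == "__main__":
    print(fib_fork_join(20))
```

In CPython the threads here do not run the arithmetic in parallel, so the sketch shows only the spawn/sync structure, not a speedup. The `cutoff` parameter controls task granularity: lowering it creates many more, much smaller tasks, which is the regime where a low per-task spawn cost matters.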

References

[1]

Vivado Design Suite. https://www.xilinx.com/products/design-tools/vivado.html.

[2]

S N Agathos and V V Dimakopoulos. Compiler-Assisted OpenMP Runtime Organization for Embedded Multicores. TR-2016-1, University of Ioannina, 2016.

[3]

Ryo Asai and Andrey Vladimirov. Intel Cilk Plus for complex parallel algorithms - "Enormous Fast Fourier Transforms" (EFFT) library. Journal of Parallel Computing, 2015.

[4]

Jonathan Bachrach, Huy Vo, Brian Richards, Yunsup Lee, Andrew Waterman, Rimas Avizienis, John Wawrzynek, and Krste Asanovic. Chisel: Constructing hardware in a Scala embedded language. https://github.com/freechipsproject/chisel3.

[5]

David F Bacon, Rodric Rabbah, and Sunil Shukla. FPGA programming for the masses. Communications of the ACM, 56(4):56-63, 2013.

[6]

Lars Bauer, Artjom Grudnitsky, Muhammad Shafique, and Jörg Henkel. PATS - A Performance Aware Task Scheduler for Runtime Reconfigurable Processors. In Proc. of the FCCM, 2012.

[7]

Robert D Blumofe and Charles E Leiserson. Scheduling Multithreaded Computations by Work Stealing. Journal of the ACM, 1999.

[8]

Andrew Canis, Jongsok Choi, Blair Fort, Ruolong Lian, Qijing Huang, Nazanin Calagar, Marcel Gort, Jia Jun Qin, Mark Aldham, Tomasz Czajkowski, Stephen Brown, and Jason Anderson. From software to accelerators with LegUp high-level synthesis. In Proc. of CASES, 2013.

[9]

George Charitopoulos, Iosif Koidis, Kyprianos Papadimitriou, and Dionisios Pnevmatikatos. Run-time management of systems with partially reconfigurable FPGAs. Integration, the VLSI Journal, 57:34-44, 2017.

[10]

George Charitopoulos, Iosif Koidis, Kyprianos Papadimitriou, and Dionisios N Pnevmatikatos. Hardware Task Scheduling for Partially Reconfigurable FPGAs. In Proc. of ARC, 2015.

[11]

J. Choi, S. Brown, and J. Anderson. From software threads to parallel hardware in high-level synthesis for FPGAs. In Proc. of FPT, 2013.

[12]

Jongsok Choi, Stephen Dean Brown, and Jason Helge Anderson. From pthreads to multicore hardware systems in LegUp high-level synthesis for FPGAs. IEEE Trans. VLSI Syst., 25(10), 2017.

[13]

Jongsok Choi, Ruolong Lian, Stephen Dean Brown, and Jason Helge Anderson. A unified software approach to specify pipeline and spatial parallelism in FPGA hardware. In 27th IEEE International Conference on Application-specific Systems, Architectures and Processors, 2016.

[14]

Jongsok Choi, Kevin Nam, Andrew Canis, Jason Helge Anderson, Stephen Dean Brown, and Tomasz S Czajkowski. Impact of Cache Architecture and Interface on Performance and Area of FPGA-Based Processor/Parallel-Accelerator Systems. In Proc. of the FCCM, 2012.

[15]

Eric S Chung, James C Hoe, and Ken Mai. CoRAM: an in-fabric memory architecture for FPGA-based computing. In Proc. of the 19th FPGA, 2011.

[16]

J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang. High-level synthesis for FPGAs: From prototyping to deployment. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 30(4), 2011.

[17]

Mattias De Wael, Stefan Marr, and Tom Van Cutsem. Fork/join parallelism in the wild: Documenting patterns and anti-patterns in Java programs using the fork/join framework. In Proc. of the PPPJ, 2014.

[18]

A DeHon, J Adams, M deLorimier, N Kapre, Y Matsuda, H Naeimi, M Vanier, and M Wrighton. Design patterns for reconfigurable computing. In Proc. of the 12th FCCM, 2004.

[19]

R Domingo, R Salvador, H Fabelo, D Madroñal, S Ortega, R Lazcano, E Juárez, G Callicó, and C Sanz. High-level design using Intel FPGA OpenCL: A hyperspectral imaging spatial-spectral classifier. In Proc. of the Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC), 2017.

[20]

Yoav Etsion, Felipe Cabarcas, Alejandro Rico, Alex Ramirez, Rosa M Badia, Eduard Ayguade, Jesus Labarta, and Mateo Valero. Task Superscalar: An Out-of-Order Task Pipeline. In Proc. of the 43rd MICRO, 2010.

[21]

Antonio Filgueras, Eduard Gil, Carlos Álvarez, Daniel Jiménez-González, Xavier Martorell, Jan Langer, and Juanjo Noguera. Heterogeneous tasking on SMP/FPGA SoCs - The case of OmpSs and the Zynq. In Proc. of VLSI-SoC, 2013.

[22]

Matteo Frigo, Charles E Leiserson, and Keith H Randall. The Implementation of the Cilk-5 Multithreaded Language. In Proc. of the PLDI, 1998.

[23]

Anca Iordache, Guillaume Pierre, Peter Sanders, Jose Gabriel de F. Coutinho, and Mark Stillwell. High performance in the cloud with FPGA groups. In Proceedings of the 9th International Conference on Utility and Cloud Computing, 2016.

[24]

Mark C Jeffrey, Suvinay Subramanian, Cong Yan, Joel S Emer, and Daniel Sanchez. A scalable architecture for ordered parallelism. In Proc. of the 48th MICRO, 2015.

[25]

Lana Josipović, Radhika Ghosal, and Paolo Ienne. Dynamically scheduled high-level synthesis. In Proc. of the FPGA, 2018.

[26]

N Kapre and H Patel. Applying Models of Computation to OpenCL Pipes for FPGA Computing. Proc. of the 5th Intl. Workshop on OpenCL, 2017.

[27]

Sanjeev Kumar, Christopher J Hughes, and Anthony Nguyen. Carbon: architectural support for fine-grained parallelism on chip multiprocessors. In Proc. of the 34th ISCA, 2007.

[28]

I-Ting Angelina Lee, Charles E. Leiserson, Tao B. Schardl, Zhunping Zhang, and Jim Sukha. On-the-fly pipeline parallelism. ACM Transactions on Parallel Computing, Oct 2015.

[29]

Zhaoshi Li, Leibo Liu, Yangdong Deng, Shouyi Yin, Yao Wang, and Shaojun Wei. Aggressive Pipelining of Irregular Applications on Reconfigurable Hardware. In Proc. of ISCA, 2017.

[30]

Yi Lu, Thomas Marconi, Koen Bertels, and Georgi Gaydadjiev. A Communication Aware Online Task Scheduling Algorithm for FPGA-Based Partially Reconfigurable Systems. Proc. of the FCCM, 2010.

[31]

Divya Mahajan, Jongse Park, Emmanuel Amaro, Hardik Sharma, Amir Yazdanbakhsh, Joon Kyung Kim, and Hadi Esmaeilzadeh. TABLA: A unified template-based framework for accelerating statistical machine learning. In Proc. of the 22nd HPCA, 2016.

[32]

Marc S Orr, Bradford M Beckmann, Steven K Reinhardt, and David A Wood. Fine-grain task aggregation and coordination on GPUs. In Proc. of the 41st ISCA, 2014.

[33]

Raghu Prabhakar, David Koeplinger, Kevin J Brown, HyoukJoong Lee, Christopher De Sa, Christos Kozyrakis, and Kunle Olukotun. Generating Configurable Hardware from Parallel Patterns. In Proc. of the 21st ASPLOS, 2016.

[34]

Raghu Prabhakar, Yaqi Zhang, David Koeplinger, Matt Feldman, Tian Zhao, Stefan Hadjis, Ardavan Pedram, Christos Kozyrakis, and Kunle Olukotun. Plasticine: A reconfigurable architecture for parallel patterns. In Proc. of the 44th ISCA, 2017.

[35]

Andrew Putnam. FPGAs in the Datacenter - Combining the Worlds of Hardware and Software Development. ACM Great Lakes Symposium on VLSI, 2017.

[36]

Andrew Putnam, Dave Bennett, Eric Dellinger, Jeff Mason, and Prasanna Sundararajan. CHiMPS - a high-level compilation flow for hybrid CPU-FPGA architectures. In Proc. of the FPGA, page 261, 2008.

[37]

Andrew Putnam, Susan J Eggers, Dave Bennett, Eric Dellinger, Jeff Mason, Henry Styles, Prasanna Sundararajan, and Ralph Wittig. Performance and power of cache-based reconfigurable computing. Proc. of the ISCA, 2009.

[38]

Daniel Sanchez, Richard M Yoo, and Christos Kozyrakis. Flexible architectural support for fine-grain scheduling. In Proc. of the 15th ASPLOS, 2010.

[39]

Tao B Schardl, William S Moses, and Charles E Leiserson. Tapir - Embedding Fork-Join Parallelism into LLVM's Intermediate Representation. In Proc. of PPOPP, 2017.

[40]

Shreesha Srinath, Berkin Ilbeyi, Mingxing Tan, Gai Liu, Zhiru Zhang, and Christopher Batten. Architectural Specialization for Inter-Iteration Loop Dependence Patterns. In Proc. of the 47th MICRO, 2014.

[41]

Olivier Tardieu, Haichuan Wang, and Haibo Lin. A work-stealing scheduler for X10's task parallelism with suspension. In Proc. of the 17th PPOPP, 2012.

[42]

Hasitha Muthumala Waidyasooriya, Masanori Hariyama, and Kunio Uchiyama. FPGA-Oriented Parallel Programming. In Design of FPGA-Based Computing Systems with OpenCL. October 2017.

[43]

Christopher S Zakian, Timothy A K Zakian, Abhishek Kulkarni, Buddhika Chamith, and Ryan R Newton. Concurrent Cilk - Lazy Promotion from Tasks to Threads in C/C++. In Proc. of LCPC, 2015.

TAPAS: generating parallel accelerators from parallel programs
1. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types

Published In

MICRO-51: Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture

October 2018, 1015 pages

ISBN: 9781538662403

General Chairs: Mark Oskin (University of Washington), Koji Inoue (Kyushu University)

Sponsors: SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing

Publisher: IEEE Press

Qualifiers

Research-article

Conference

MICRO-51: The 51st Annual IEEE/ACM International Symposium on Microarchitecture

October 20 - 24, 2018

Fukuoka, Japan

Sponsor: SIGMICRO

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Bibliometrics

Total Citations: 10
Total Downloads: 92
Downloads (Last 12 months): 7
Downloads (Last 6 weeks): 2

Reflects downloads up to 09 Nov 2024

Cited By

Khatti M, Tian X, Sedigh Baroughi A, Raj Baranwal A, Chi Y, Guo L, Cong J, Fang Z (2024). PASTA: Programming and Automation Support for Scalable Task-Parallel HLS Programs on Modern Multi-Die FPGAs. ACM Transactions on Reconfigurable Technology and Systems, 17:3 (1-31). Online publication date: 5-Aug-2024
https://dl.acm.org/doi/10.1145/3676849
Ye H, Jun H, Chen D, Tsafrir D, Musuvathi M, Gupta R, Abu-Ghazaleh N (2024). HIDA: A Hierarchical Dataflow Compiler for High-Level Synthesis. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (215-230). Online publication date: 27-Apr-2024
https://dl.acm.org/doi/10.1145/3617232.3624850
Majumder K, Bondhugula U, Aamodt T, Swift M, Jerger N (2023). HIR: An MLIR-based Intermediate Representation for Hardware Accelerator Description. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 4 (189-201). Online publication date: 25-Mar-2023
https://dl.acm.org/doi/10.1145/3623278.3624767
Cheng L, Ruttenberg M, Jung D, Richmond D, Taylor M, Oskin M, Batten C, Aamodt T, Jerger N, Swift M (2023). Beyond Static Parallel Loops: Supporting Dynamic Task Parallelism on Manycore Architectures with Software-Managed Scratchpad Memories. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (46-58). Online publication date: 25-Mar-2023
https://dl.acm.org/doi/10.1145/3582016.3582020
Zacharopoulos G, Ejjeh A, Jing Y, Yang E, Jia T, Brumar I, Intan J, Huzaifa M, Adve S, Adve V, Wei G, Brooks D (2023). Trireme: Exploration of Hierarchical Multi-level Parallelism for Hardware Acceleration. ACM Transactions on Embedded Computing Systems, 22:3 (1-23). Online publication date: 20-Apr-2023
https://dl.acm.org/doi/10.1145/3580394
Han S, Jang M, Kang J, Aamodt T, Jerger N, Swift M (2023). ShakeFlow: Functional Hardware Description with Latency-Insensitive Interface Combinators. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (702-717). Online publication date: 27-Jan-2023
https://dl.acm.org/doi/10.1145/3575693.3575701
Vahdatniya P, Sharifian A, Hojabr R, Shriraman A, Kloeckner A, Moreira J (2022). mu-grind. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (346-358). Online publication date: 8-Oct-2022
https://dl.acm.org/doi/10.1145/3559009.3569671
Sozzo E, Conficconi D, Zeni A, Salaris M, Sciuto D, Santambrogio M (2022). Pushing the Level of Abstraction of Digital System Design: A Survey on How to Program FPGAs. ACM Computing Surveys, 55:5 (1-48). Online publication date: 3-Dec-2022
https://dl.acm.org/doi/10.1145/3532989
Dadu V, Nowatzki T, Falsafi B, Ferdman M, Lu S, Wenisch T (2022). TaskStream: accelerating task-parallel workloads by recovering program structure. Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (1-13). Online publication date: 28-Feb-2022
https://dl.acm.org/doi/10.1145/3503222.3507706
Baskaran S, Sampson J (2020). Decentralized Offload-based Execution on Memory-centric Compute Cores. Proceedings of the International Symposium on Memory Systems (61-76). Online publication date: 28-Sep-2020
https://dl.acm.org/doi/10.1145/3422575.3422778
