DOI: 10.1109/MICRO.2018.00028

TAPAS: generating parallel accelerators from parallel programs

Published: 20 October 2018

Abstract

High-level synthesis (HLS) tools generate accelerators from software programs to ease the task of building hardware. Unfortunately, current HLS tools have limited support for concurrency, which limits the speedup achievable with the generated accelerator. Current approaches only target fixed static patterns (e.g., pipelines, data-parallel kernels). This constrains the ability of software programmers to express concurrency. Moreover, the generated accelerator loses a key benefit of parallel hardware, dynamic asynchrony, and with it the potential to hide long latencies and cache misses.
We have developed TAPAS, an HLS toolchain for generating parallel accelerators from programs with dynamic parallelism. TAPAS is built on top of Tapir [22], [39], which embeds fork-join parallelism into the compiler's intermediate representation (IR). TAPAS leverages the compiler IR to identify parallelism and synthesizes the hardware logic. TAPAS provides first-class architecture support for spawning, coordinating and synchronizing tasks during accelerator execution. We demonstrate that TAPAS can generate accelerators for concurrent programs with heterogeneous, nested and recursive parallelism. Our evaluation on Intel-Altera DE1-SoC and Arria-10 boards demonstrates that TAPAS-generated accelerators achieve 20× the power efficiency of an Intel Xeon, while maintaining comparable performance. We also show that TAPAS enables lightweight tasks that can be spawned in ≃10 cycles and enables accelerators to exploit available fine-grain parallelism. TAPAS is a complete HLS toolchain for synthesizing parallel programs to accelerators and is open-sourced.
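
For readers unfamiliar with the fork-join model mentioned above, the following is a minimal illustrative sketch of recursive (dynamic) task parallelism. It is not TAPAS or Tapir code: Python threads stand in for spawned tasks, and the function name `fib` and the `max_spawn_depth` cutoff are assumptions made only for this illustration.

```python
import threading


def fib(n, depth=0, max_spawn_depth=3):
    """Recursive fork-join Fibonacci: spawn one child, continue, then sync."""
    if n < 2:
        return n
    if depth >= max_spawn_depth:
        # Past the cutoff, recurse serially so the thread count stays bounded.
        return (fib(n - 1, depth + 1, max_spawn_depth)
                + fib(n - 2, depth + 1, max_spawn_depth))

    result = {}

    def child():
        # The spawned task computes one branch of the recursion.
        result["x"] = fib(n - 1, depth + 1, max_spawn_depth)

    task = threading.Thread(target=child)
    task.start()                                # fork: spawn the child task
    y = fib(n - 2, depth + 1, max_spawn_depth)  # parent continues in parallel
    task.join()                                 # join: wait for the child
    return result["x"] + y


if __name__ == "__main__":
    print(fib(10))
```

The number of tasks here depends on the input and is only known at run time, which is the kind of dynamic parallelism that fixed pipeline or data-parallel templates cannot express directly.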

References

[1]
Vivado Design Suite. https://www.xilinx.com/products/design-tools/vivado.html.
[2]
S N Agathos and V V Dimakopoulos. Compiler-Assisted OpenMP Runtime Organization for Embedded Multicores. TR-2016-1, University of Ioannina, 2016.
[3]
Ryo Asai and Andrey Vladimirov. Intel Cilk Plus for complex parallel algorithms - "Enormous Fast Fourier Transforms" (EFFT) library. Journal of Parallel Computing, 2015.
[4]
Jonathan Bachrach, Huy Vo, Brian Richards, Yunsup Lee, Andrew Waterman, Rimas Avizienis, John Wawrzynek, and Krste Asanovic. Chisel: Constructing hardware in a Scala embedded language. https://github.com/freechipsproject/chisel3.
[5]
David F Bacon, Rodric Rabbah, and Sunil Shukla. FPGA programming for the masses. Communications of the ACM, 56(4):56-63, 2013.
[6]
Lars Bauer, Artjom Grudnitsky, Muhammad Shafique, and Jörg Henkel. PATS - A Performance Aware Task Scheduler for Runtime Reconfigurable Processors. In Proc. of the FCCM, 2012.
[7]
Robert D Blumofe and Charles E Leiserson. Scheduling Multithreaded Computations by Work Stealing. Journal of ACM, 1999.
[8]
Andrew Canis, Jongsok Choi, Blair Fort, Ruolong Lian, Qijing Huang, Nazanin Calagar, Marcel Gort, Jia Jun Qin, Mark Aldham, Tomasz Czajkowski, Stephen Brown, and Jason Anderson. From software to accelerators with LegUp high-level synthesis. In Proc. of CASES, 2013.
[9]
George Charitopoulos, Iosif Koidis, Kyprianos Papadimitriou, and Dionisios Pnevmatikatos. Run-time management of systems with partially reconfigurable FPGAs. Integration, the VLSI Journal, 57:34-44, 2017.
[10]
George Charitopoulos, Iosif Koidis, Kyprianos Papadimitriou, and Dionisios N Pnevmatikatos. Hardware Task Scheduling for Partially Reconfigurable FPGAs. In Proc. of ARC, 2015.
[11]
J. Choi, S. Brown, and J. Anderson. From software threads to parallel hardware in high-level synthesis for FPGAs. In Proc. of FPT, 2013.
[12]
Jongsok Choi, Stephen Dean Brown, and Jason Helge Anderson. From pthreads to multicore hardware systems in LegUp high-level synthesis for FPGAs. IEEE Trans. VLSI Syst., 25(10), 2017.
[13]
Jongsok Choi, Ruolong Lian, Stephen Dean Brown, and Jason Helge Anderson. A unified software approach to specify pipeline and spatial parallelism in FPGA hardware. In 27th IEEE International Conference on Application-specific Systems, Architectures and Processors, 2016.
[14]
Jongsok Choi, Kevin Nam, Andrew Canis, Jason Helge Anderson, Stephen Dean Brown, and Tomasz S Czajkowski. Impact of Cache Architecture and Interface on Performance and Area of FPGA-Based Processor/Parallel-Accelerator Systems. In Proc. of the FCCM, 2012.
[15]
Eric S Chung, James C Hoe, and Ken Mai. CoRAM: an in-fabric memory architecture for FPGA-based computing. In Proc. of the 19th FPGA, 2011.
[16]
J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang. High-level synthesis for FPGAs: From prototyping to deployment. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 30(4), 2011.
[17]
Mattias De Wael, Stefan Marr, and Tom Van Cutsem. Fork/join parallelism in the wild: Documenting patterns and anti-patterns in Java programs using the fork/join framework. In Proc. of the PPPJ, 2014.
[18]
A DeHon, J Adams, M deLorimier, N Kapre, Y Matsuda, H Naeimi, M Vanier, and M Wrighton. Design patterns for reconfigurable computing. In Proc. of the 12th FCCM, 2004.
[19]
R Domingo, R Salvador, H Fabelo, D Madroñal, S Ortega, R Lazcano, E Juárez, G Callicó, and C Sanz. High-level design using Intel FPGA OpenCL: A hyperspectral imaging spatial-spectral classifier. In Proc. of the Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC), 2017.
[20]
Yoav Etsion, Felipe Cabarcas, Alejandro Rico, Alex Ramirez, Rosa M Badia, Eduard Ayguade, Jesus Labarta, and Mateo Valero. Task Superscalar: An Out-of-Order Task Pipeline. In Proc. of the 43rd MICRO, 2010.
[21]
Antonio Filgueras, Eduard Gil, Carlos Álvarez, Daniel Jiménez-González, Xavier Martorell, Jan Langer, and Juanjo Noguera. Heterogeneous tasking on SMP/FPGA SoCs - The case of OmpSs and the Zynq. In Proc. of VLSI-SoC, 2013.
[22]
Matteo Frigo, Charles E Leiserson, and Keith H Randall. The Implementation of the Cilk-5 Multithreaded Language. In Proc. of the PLDI, 1998.
[23]
Anca Iordache, Guillaume Pierre, Peter Sanders, Jose Gabriel de F. Coutinho, and Mark Stillwell. High performance in the cloud with FPGA groups. In Proceedings of the 9th International Conference on Utility and Cloud Computing, 2016.
[24]
Mark C Jeffrey, Suvinay Subramanian, Cong Yan, Joel S Emer, and Daniel Sanchez. A scalable architecture for ordered parallelism. In Proc. of the 48th MICRO, 2015.
[25]
Lana Josipović, Radhika Ghosal, and Paolo Ienne. Dynamically scheduled high-level synthesis. In Proc. of the FPGA, 2018.
[26]
N Kapre and H Patel. Applying Models of Computation to OpenCL Pipes for FPGA Computing. Proc. of the 5th Intl. Workshop on OpenCL, 2017.
[27]
Sanjeev Kumar, Christopher J Hughes, and Anthony Nguyen. Carbon: architectural support for fine-grained parallelism on chip multiprocessors. In Proc. of the 34th ISCA, 2007.
[28]
I-Ting Angelina Lee, Charles E. Leiserson, Tao B. Schardl, Zhunping Zhang, and Jim Sukha. On-the-fly pipeline parallelism. ACM Transactions on Parallel Computing, Oct 2015.
[29]
Zhaoshi Li, Leibo Liu, Yangdong Deng, Shouyi Yin, Yao Wang, and Shaojun Wei. Aggressive Pipelining of Irregular Applications on Reconfigurable Hardware. In Proc. of ISCA, 2017.
[30]
Yi Lu, Thomas Marconi, Koen Bertels, and Georgi Gaydadjiev. A Communication Aware Online Task Scheduling Algorithm for FPGA-Based Partially Reconfigurable Systems. Proc. of the FCCM, 2010.
[31]
Divya Mahajan, Jongse Park, Emmanuel Amaro, Hardik Sharma, Amir Yazdanbakhsh, Joon Kyung Kim, and Hadi Esmaeilzadeh. TABLA: A unified template-based framework for accelerating statistical machine learning. In Proc. of the 22nd HPCA, 2016.
[32]
Marc S Orr, Bradford M Beckmann, Steven K Reinhardt, and David A Wood. Fine-grain task aggregation and coordination on GPUs. In Proc. of the 41st ISCA, 2014.
[33]
Raghu Prabhakar, David Koeplinger, Kevin J Brown, HyoukJoong Lee, Christopher De Sa, Christos Kozyrakis, and Kunle Olukotun. Generating Configurable Hardware from Parallel Patterns. In Proc. of the 21st ASPLOS, 2016.
[34]
Raghu Prabhakar, Yaqi Zhang, David Koeplinger, Matt Feldman, Tian Zhao, Stefan Hadjis, Ardavan Pedram, Christos Kozyrakis, and Kunle Olukotun. Plasticine: A reconfigurable architecture for parallel patterns. In Proc. of the 44th ISCA, 2017.
[35]
Andrew Putnam. FPGAs in the Datacenter - Combining the Worlds of Hardware and Software Development. ACM Great Lakes Symposium on VLSI, 2017.
[36]
Andrew Putnam, Dave Bennett, Eric Dellinger, Jeff Mason, and Prasanna Sundararajan. CHiMPS - a high-level compilation flow for hybrid CPU-FPGA architectures. In Proc. of the FPGA, page 261, 2008.
[37]
Andrew Putnam, Susan J Eggers, Dave Bennett, Eric Dellinger, Jeff Mason, Henry Styles, Prasanna Sundararajan, and Ralph Wittig. Performance and power of cache-based reconfigurable computing. Proc. of the ISCA, 2009.
[38]
Daniel Sanchez, Richard M Yoo, and Christos Kozyrakis. Flexible architectural support for fine-grain scheduling. In Proc. of the 15th ASPLOS, 2010.
[39]
Tao B Schardl, William S Moses, and Charles E Leiserson. Tapir - Embedding Fork-Join Parallelism into LLVM's Intermediate Representation. In Proc. of PPOPP, 2017.
[40]
Shreesha Srinath, Berkin Ilbeyi, Mingxing Tan, Gai Liu, Zhiru Zhang, and Christopher Batten. Architectural Specialization for Inter-Iteration Loop Dependence Patterns. In Proc. of the 47th MICRO, 2014.
[41]
Olivier Tardieu, Haichuan Wang, and Haibo Lin. A work-stealing scheduler for X10's task parallelism with suspension. Proc. of the 17th PPOPP, 2017.
[42]
Hasitha Muthumala Waidyasooriya, Masanori Hariyama, and Kunio Uchiyama. FPGA-Oriented Parallel Programming. In Design of FPGA-Based Computing Systems with OpenCL. October 2017.
[43]
Christopher S Zakian, Timothy A K Zakian, Abhishek Kulkarni, Buddhika Chamith, and Ryan R Newton. Concurrent Cilk - Lazy Promotion from Tasks to Threads in C/C++. In Proc. of LCPC, 2015.

Published In

MICRO-51: Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture
October 2018
1015 pages
ISBN: 9781538662403

Publisher

IEEE Press

Author Tags

  1. FPGA
  2. HLS
  3. LLVM
  4. TAPAS
  5. chisel
  6. cilk
  7. dynamic parallelism
  8. hardware accelerator
  9. high-level synthesis
  10. power efficiency

Qualifiers

  • Research-article

Conference

MICRO-51
Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)7
  • Downloads (Last 6 weeks)2
Reflects downloads up to 09 Nov 2024

Citations

Cited By

  • (2024) PASTA: Programming and Automation Support for Scalable Task-Parallel HLS Programs on Modern Multi-Die FPGAs. ACM Transactions on Reconfigurable Technology and Systems 17:3 (1-31). DOI: 10.1145/3676849. Online publication date: 5-Aug-2024
  • (2024) HIDA: A Hierarchical Dataflow Compiler for High-Level Synthesis. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (215-230). DOI: 10.1145/3617232.3624850. Online publication date: 27-Apr-2024
  • (2023) HIR: An MLIR-based Intermediate Representation for Hardware Accelerator Description. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 4 (189-201). DOI: 10.1145/3623278.3624767. Online publication date: 25-Mar-2023
  • (2023) Beyond Static Parallel Loops: Supporting Dynamic Task Parallelism on Manycore Architectures with Software-Managed Scratchpad Memories. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (46-58). DOI: 10.1145/3582016.3582020. Online publication date: 25-Mar-2023
  • (2023) Trireme: Exploration of Hierarchical Multi-level Parallelism for Hardware Acceleration. ACM Transactions on Embedded Computing Systems 22:3 (1-23). DOI: 10.1145/3580394. Online publication date: 20-Apr-2023
  • (2023) ShakeFlow: Functional Hardware Description with Latency-Insensitive Interface Combinators. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (702-717). DOI: 10.1145/3575693.3575701. Online publication date: 27-Jan-2023
  • (2022) mu-grind. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (346-358). DOI: 10.1145/3559009.3569671. Online publication date: 8-Oct-2022
  • (2022) Pushing the Level of Abstraction of Digital System Design: A Survey on How to Program FPGAs. ACM Computing Surveys 55:5 (1-48). DOI: 10.1145/3532989. Online publication date: 3-Dec-2022
  • (2022) TaskStream: accelerating task-parallel workloads by recovering program structure. Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (1-13). DOI: 10.1145/3503222.3507706. Online publication date: 28-Feb-2022
  • (2020) Decentralized Offload-based Execution on Memory-centric Compute Cores. Proceedings of the International Symposium on Memory Systems (61-76). DOI: 10.1145/3422575.3422778. Online publication date: 28-Sep-2020
