DOI: 10.1145/3623278.3624750
research-article
Open access

Manticore: Hardware-Accelerated RTL Simulation with Static Bulk-Synchronous Parallelism

Published: 07 February 2024

Abstract

The demise of Moore's Law and Dennard Scaling has revived interest in specialized computer architectures and accelerators. Verification and testing of this hardware depend heavily upon cycle-accurate simulation of register-transfer-level (RTL) designs. The fastest software RTL simulators can simulate designs at 1–1000 kHz, i.e., more than three orders of magnitude slower than hardware. Improved simulators can increase designers' productivity by speeding design iterations and permitting more exhaustive exploration.
One possibility is to exploit low-level parallelism, as RTL expresses considerable fine-grain concurrency. Unfortunately, state-of-the-art RTL simulators often perform best on a single core since modern processors cannot effectively exploit fine-grain parallelism.
This work presents Manticore: a parallel computer designed to accelerate RTL simulation. Manticore uses a static bulk-synchronous parallel (BSP) execution model to eliminate fine-grain synchronization overhead. It relies entirely on a compiler to schedule resources and communication, which is feasible since RTL code contains few divergent execution paths. With static scheduling, communication and synchronization no longer incur runtime overhead, making fine-grain parallelism practical. Moreover, static scheduling dramatically simplifies processor implementation, significantly increasing the number of cores that fit on a chip. Our 225-core FPGA implementation running at 475 MHz outperforms a state-of-the-art RTL simulator running on desktop and server computers in 8 out of 9 benchmarks.
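The static BSP idea described above can be illustrated with a toy software model. This is only a sketch under stated assumptions, not Manticore's instruction set, compiler, or hardware: the two "cores" (`core0`, `core1`), the register names (`count`, `acc`), and the `simulate` helper are all hypothetical. Each simulated cycle has a compute phase, in which every core runs a fixed straight-line program over the previous cycle's register values, followed by a communication phase that commits all results at once, so no locks or runtime synchronization are needed inside a cycle.

```python
from typing import Callable, Dict, List

State = Dict[str, int]
# Hypothetical representation: a core is a straight-line function from the
# previous cycle's registers to new values for the registers it owns.
CoreProgram = Callable[[State], State]


def core0(regs: State) -> State:
    # Owns `count`: an 8-bit free-running counter.
    return {"count": (regs["count"] + 1) & 0xFF}


def core1(regs: State) -> State:
    # Owns `acc`: adds the counter value from the previous cycle.
    return {"acc": (regs["acc"] + regs["count"]) & 0xFFFF}


def simulate(cores: List[CoreProgram], regs: State, cycles: int) -> State:
    for _ in range(cycles):
        # Compute phase: every core reads the same snapshot of `regs`,
        # so the cores could run in any order (or in parallel).
        updates = [core(regs) for core in cores]
        # Communication phase: commit all updates together, which plays
        # the role of the bulk-synchronous barrier between cycles.
        merged = dict(regs)
        for update in updates:
            merged.update(update)
        regs = merged
    return regs


final = simulate([core0, core1], {"count": 0, "acc": 0}, cycles=4)
print(final)
```

Because each core writes only the registers it owns and reads only the previous snapshot, the result does not depend on the order in which the cores run. That property is what allows the schedule to be fixed ahead of time in this toy model.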

Cited By

  • FireAxe: Partitioned FPGA-Accelerated Simulation of Large-Scale RTL Designs. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 501-515, 2024. DOI: 10.1109/ISCA59077.2024.00044. Online publication date: 29-Jun-2024.
  • Viper: Utilizing Hierarchical Program Structure to Accelerate Multi-Core Simulation. IEEE Access, 12:17669-17678, 2024. DOI: 10.1109/ACCESS.2024.3354069. Online publication date: 2024.

Published In

ASPLOS '23: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 4
March 2023
430 pages
ISBN: 9798400703942
DOI: 10.1145/3623278
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

ASPLOS '23

Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 588
  • Downloads (Last 6 weeks): 71
Reflects downloads up to 09 Nov 2024.