DOI: 10.1145/3613424.3614257
research-article
Open access

Accelerating RTL Simulation with Hardware-Software Co-Design

Published: 08 December 2023

Abstract

Fast simulation of digital circuits is crucial to building modern chips. But RTL (Register-Transfer-Level) simulators are slow, as they cannot exploit multicores well. Slow simulation lengthens chip design time and makes bugs more frequent.
We present ASH, a parallel architecture tailored to simulation workloads. ASH consists of a tightly codesigned hardware architecture and compiler for RTL simulation. ASH exploits two key opportunities. First, it performs dataflow execution of small tasks to leverage the fine-grained parallelism in simulation workloads. Second, it performs selective event-driven execution to run only the fraction of the design exercised each cycle, skipping ineffectual tasks. ASH hardware provides a novel combination of dataflow and speculative execution, and ASH’s compiler features several novel techniques to automatically leverage this hardware.
We evaluate ASH in simulation using large Verilog designs. An ASH chip with 256 simple cores is 1,485× faster (geometric mean) than 1-core Verilator, and it is 32× faster than parallel Verilator on a server CPU with 32 complex cores, while using 3× less area.
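
To make the idea of selective event-driven execution concrete, here is a minimal software sketch. It is not ASH's hardware or compiler: it runs sequentially, with no parallel dataflow or speculation, and the names `Task`, `simulate_cycle`, and `make_example` are hypothetical, chosen only for illustration. It assumes the design has been split into small tasks listed in dataflow (topological) order, and it runs a task only when one of its input signals changed, skipping the rest.

```python
class Task:
    """One small unit of simulated logic: reads input signals, writes one output."""

    def __init__(self, name, inputs, output, fn):
        self.name = name
        self.inputs = inputs    # names of the signals this task reads
        self.output = output    # name of the signal this task writes
        self.fn = fn            # pure function computing the output value


def simulate_cycle(tasks, signals, changed):
    """Run only tasks with a changed input; return how many tasks executed.

    Assumption: `tasks` is in topological (dataflow) order, and `changed`
    holds the names of signals whose values differ from the previous cycle.
    """
    changed = set(changed)
    executed = 0
    for task in tasks:
        if not changed.intersection(task.inputs):
            continue  # ineffectual task: inputs unchanged, so output is unchanged
        executed += 1
        new_value = task.fn(*(signals[s] for s in task.inputs))
        if new_value != signals[task.output]:
            signals[task.output] = new_value
            changed.add(task.output)  # wake up downstream consumers
    return executed


def make_example():
    """Hypothetical three-gate netlist: x = a & b, y = c | d, z = x ^ y."""
    tasks = [
        Task("and", ["a", "b"], "x", lambda a, b: a & b),
        Task("or", ["c", "d"], "y", lambda c, d: c | d),
        Task("xor", ["x", "y"], "z", lambda x, y: x ^ y),
    ]
    # Initial values are consistent with the gate equations above.
    signals = {"a": 1, "b": 1, "c": 0, "d": 0, "x": 1, "y": 0, "z": 1}
    return tasks, signals


if __name__ == "__main__":
    tasks, signals = make_example()
    signals["c"] = 1  # one primary input toggles this cycle
    ran = simulate_cycle(tasks, signals, {"c"})
    print(f"executed {ran} of {len(tasks)} tasks, z = {signals['z']}")
```

In this sketch, toggling `c` leaves the AND gate untouched, so only the tasks downstream of the change run. A full-cycle simulator would evaluate all three tasks every cycle regardless of activity.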


Cited By

  • (2024) FireAxe: Partitioned FPGA-Accelerated Simulation of Large-Scale RTL Designs. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 501-515. DOI: 10.1109/ISCA59077.2024.00044. Online publication date: 29-Jun-2024
  • (2024) Viper: Utilizing Hierarchical Program Structure to Accelerate Multi-Core Simulation. IEEE Access 12, 17669-17678. DOI: 10.1109/ACCESS.2024.3354069. Online publication date: 2024

      Published In

      MICRO '23: Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture
      October 2023
      1528 pages
      ISBN: 9798400703294
      DOI: 10.1145/3613424
      This work is licensed under a Creative Commons Attribution-ShareAlike International 4.0 License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. dataflow execution
      2. domain-specific architectures
      3. hardware acceleration
      4. register-transfer-level
      5. simulation
      6. speculative execution

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Funding Sources

      • DARPA

      Conference

      MICRO '23
      Acceptance Rates

      Overall Acceptance Rate 484 of 2,242 submissions, 22%

      Article Metrics

      • Downloads (last 12 months): 1,278
      • Downloads (last 6 weeks): 167
      Reflects downloads up to 16 Oct 2024
