
University of Massachusetts Amherst

ScholarWorks@UMass Amherst

Doctoral Dissertations
Dissertations and Theses

Summer August 2014

Parallel Multi-core Verilog HDL Simulation


Tariq B. Ahmad
University of Massachusetts Amherst

Follow this and additional works at: https://scholarworks.umass.edu/dissertations_2

Part of the Computer and Systems Architecture Commons, Digital Circuits Commons, Hardware
Systems Commons, and the VLSI and Circuits, Embedded and Hardware Systems Commons

Recommended Citation
Ahmad, Tariq B., "Parallel Multi-core Verilog HDL Simulation" (2014). Doctoral Dissertations. 45.
https://scholarworks.umass.edu/dissertations_2/45

This Open Access Dissertation is brought to you for free and open access by the Dissertations and Theses at
ScholarWorks@UMass Amherst. It has been accepted for inclusion in Doctoral Dissertations by an authorized
administrator of ScholarWorks@UMass Amherst. For more information, please contact
scholarworks@library.umass.edu.
PARALLEL MULTI-CORE VERILOG
HDL SIMULATION

A Dissertation Presented

by

TARIQ BASHIR AHMAD

Submitted to the Graduate School of the


University of Massachusetts Amherst in partial fulfillment
of the requirements for the degree of

DOCTOR OF PHILOSOPHY

May 2014

Electrical and Computer Engineering



© Copyright by Tariq Bashir Ahmad 2014

All Rights Reserved


PARALLEL MULTI-CORE VERILOG
HDL SIMULATION

A Dissertation Presented
by

TARIQ BASHIR AHMAD

Approved as to style and content by:

Maciej J. Ciesielski, Chair

Sandip Kundu, Member

Michael Zink, Member

Charles Weems, Member

Christopher V. Hollot, Department Head


Electrical and Computer Engineering
To all those who believe.
ACKNOWLEDGMENTS

I would like to thank Professor Maciej Ciesielski for helping me when I needed it
the most, and for his constant support and mentorship. I also want to thank all the
committee members. I must thank Professor C.M. Krishna as well for his help. I am
grateful to Dusung Kim for helping me start this project. I want to acknowledge my
friend Dr. Faisal M. Kashif for his constant support and mentorship. I am indebted
to Fulbright (United States Educational Foundation in Pakistan) for their efforts to
help me during my PhD. I cannot forget their favors, and I will always remember Dr.
Grace Clark and Rita Akhtar for what they did for me.

I must also mention that my technical life transformed when I was offered an
internship at Marvell, where I got to discover my technical weaknesses and how to
overcome them. I am greatly indebted to Awais Nemat and Guy Hutchison for their
constant feedback, willingness to help, and guidance. It was because of their help, Dr.
Faisal's help, and Fulbright's support that I was able to overcome a major obstacle in
my PhD in the fall of 2010. The way to this internship started at the house of Amer
Haider's parents in spring 2009. I must thank Amer, his mother Ayesha Haider, his
father Muzaffar Haider, and the Hidaya Foundation for being hospitable and becoming
the means to where I am today.

I must thank Ameen Ashraf for helping me get an internship at Apple Computer
in Summer 2011.

Last but not least, I want to thank again my parents, my family, and everyone
around me who has been a positive influence in my life.

ABSTRACT

PARALLEL MULTI-CORE VERILOG


HDL SIMULATION

MAY 2014

TARIQ BASHIR AHMAD

B.S., GIK INSTITUTE OF ENGINEERING


M.S., UNIVERSITY OF MASSACHUSETTS AMHERST

Ph.D., UNIVERSITY OF MASSACHUSETTS AMHERST

Directed by: Professor Maciej J. Ciesielski

In the era of multi-core computing, the push for creating true parallel applications
that can run on individual CPUs is on the rise. Application of parallel discrete event
simulation (PDES) to hardware design verification looks promising, given the complexity
of today's hardware designs. Unfortunately, the challenges imposed by lack of inherent
parallelism, suboptimal design partitioning, synchronization and communication
overhead, and load balancing render this approach largely ineffective. This thesis
presents three techniques for accelerating simulation at three levels of abstraction,
namely RTL, functional gate-level (zero-delay), and gate-level timing. We review
contemporary solutions and then propose new ways of speeding up simulation at the
three levels of abstraction. We demonstrate the effectiveness of the proposed
approaches on several industrial hardware designs.

TABLE OF CONTENTS

Page

ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii

CHAPTER

1. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1 Importance of Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1


1.2 Problems with Parallel Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.2.1 Design Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5


1.2.2 Communication and Synchronization between Partitions . . . . . . . . . 6
1.2.3 Applicability of Parallel Simulation to Large Designs . . . . . . . . . . . . 9

1.3 Parallel Simulation Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11


1.4 Formal Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

1.4.1 Equivalence Checking (EC) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13


1.4.2 Model Checking and Property Checking . . . . . . . . . . . . . . . . . . . . . . 14

1.5 Static Timing Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15


1.6 Why Gate-level Simulation? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2. PREVIOUS WORK ON PARALLEL SIMULATION . . . . . . . . . . . . . . 17

2.1 Factors Affecting the Performance of Parallel HDL Simulation . . . . . . . . . 18

2.1.1 Timing Granularity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18


2.1.2 Hardware Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1.3 Issues in Design Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.1.4 Time Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.2 Prediction-based Parallel Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22


2.3 Multi-level Temporal Parallel Event-Driven Simulation . . . . . . . . . . . . . . . 24
2.4 Differences between Distributed Simulation and MULTES . . . . . . . . . . . . . 27
2.5 Parallel Computer Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

2.5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5.2 New Trends in Computer Architecture . . . . . . . . . . . . . . . . . . . . . . . 31

2.5.2.1 Parallelism on Single Core Machine . . . . . . . . . . . . . . . . . . 31

2.5.3 Classification of Parallel Architectures . . . . . . . . . . . . . . . . . . . . . . . 33

2.5.3.1 Single-Instruction, Multiple-Data (SIMD) . . . . . . . . . . . . 33


2.5.3.2 Multiple-Instruction, Multiple-Data (MIMD) . . . . . . . . . 33
2.5.3.3 Symmetric Multiprocessing (SMP) . . . . . . . . . . . . . . . . . . 33
2.5.3.4 Non-uniform Memory Access (NUMA) . . . . . . . . . . . . . . . 34

2.5.4 Memory Organization in Multi-core Machines . . . . . . . . . . . . . . . . . 34

2.5.4.1 Distributed Memory Machines (DMM) . . . . . . . . . . . . . . . 35


2.5.4.2 Shared Memory Machines (SMM) . . . . . . . . . . . . . . . . . . . 35

2.5.5 Thread Level Parallelism (TLP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3. PARALLEL MULTI-CORE VERILOG HDL SIMULATION


BASED ON FUNCTIONAL PARTITIONING . . . . . . . . . . . . . . . . . 37

3.1 Predicting Input Stimulus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38


3.2 Preliminary Results of Predictor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3 Quantitative Overhead Measurement in Multi-Core Simulation
Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4 Prediction-based Multi-Core Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

3.4.1 Basic Idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44


3.4.2 Dealing with Mismatches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

3.5 Architecture of Prediction-based Gate-level Simulation . . . . . . . . . . . . . . . 49


3.6 Experiments on Real Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.7 Dealing with Resynthesized and Retimed Designs . . . . . . . . . . . . . . . . . . . . 52
3.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.9 Appendix A: Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.10 Appendix B: Simulation Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.11 Appendix C: Designs Unsuitable for Multi-core Simulation . . . . . . . . . . . . 66

4. EXTENDING PARALLEL MULTI-CORE VERILOG HDL
SIMULATION PERFORMANCE BASED ON DOMAIN
PARTITIONING USING VERILATOR AND OPENMP . . . . . . . 68

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.2 Simulator Internals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.3 Parallelizing using OpenMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.5 Dependencies in the Testbench . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

5. ACCELERATING RTL SIMULATION IN TEMPORAL


DOMAIN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

5.1.1 Issues with Co-Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79


5.1.2 Issues with Multi-Core Simulators . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

5.2 Temporal Parallel Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

5.2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.2.2 Integration with the current ASIC/FPGA design flow . . . . . . . . . . 82

5.3 Exploring Circuit Unrolling option for Parallel Simulation . . . . . . . . . . . . . 83


5.4 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

5.4.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.4.2 Simulation of Small Custom Design Circuit . . . . . . . . . . . . . . . . . . . 86
5.4.3 Simulation by varying the Unroll factor (F) . . . . . . . . . . . . . . . . . . . 86
5.4.4 Simulation by varying the number of cores . . . . . . . . . . . . . . . . . . . . 89

5.5 Multi-core Architecture of Temporal RTL Simulation . . . . . . . . . . . . . . 92

5.5.1 Load Balancing in the Multi-core Architecture . . . . . . . . . . . . . . . . 92


5.5.2 Simulation of industry standard design . . . . . . . . . . . . . . . . . . . . . . . 93

5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

6. ACCELERATING GATE-LEVEL TIMING SIMULATION . . . . . . . 97

6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

6.1.1 Issues with Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

6.2 Hybrid Approach to Gate-level Timing Simulation . . . . . . . . . . . . . . . . . . . 98

6.2.1 Basic Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

6.2.2 Design Partitioning for Gate level Simulation . . . . . . . . . . . . . . . . . 99
6.2.3 Integration with the existing ASIC/FPGA Design Flow . . . . . . . 104
6.2.4 Early Gate-level Timing Simulation . . . . . . . . . . . . . . . . . . . . . . . . . 105

6.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

6.3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106


6.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

6.4 Verification of Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109


6.5 New Gate-level Timing Simulation Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.6 Conclusion and Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

7. CONCLUSION AND FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . . . . 113

7.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113


7.2 Performance Gain by Opensource Simulation Software . . . . . . . . . . . . . . . 114
7.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

7.3.1 Future Work in Improving Gate-level Timing Simulation . . . . . . 115


7.3.2 Future Work in Accelerating Time Parallel RTL
Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.3.3 Future Work in Accelerating Multi-core RTL or Functional
Gate-level Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

8. PUBLICATIONS, SUPPORT AND


ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

8.1 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118


8.2 Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
8.3 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

LIST OF TABLES

Table Page

3.1 Accuracy of RTL predictor for gate-level timing . . . . . . . . . . . . . . . . . . . . . 40

3.2 Accuracy of functional gate-level for gate-level timing . . . . . . . . . . . . . . . . 41

3.3 Quantitative communication and synchronization overhead


measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.4 Accuracy of RTL predictor at the register boundary . . . . . . . . . . . . . . . . . . 47

3.5 Single core simulation performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

3.6 Multi-core simulation performance of AES-128 . . . . . . . . . . . . . . . . . . . . . . . 51

3.7 Multi-core simulation performance of JPEG encoder . . . . . . . . . . . . . . . . . . 51

3.8 Multi-core simulation performance of Triple DES . . . . . . . . . . . . . . . . . . . . 52

3.9 RTL prediction-based Multi-core functional GL simulation of


bi-partitioned designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

3.10 Simulation profile of AES-128 benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

3.11 Simulation profile of Triple DES benchmark . . . . . . . . . . . . . . . . . . . . . . . . . 55

3.12 Simulation profile of JPEG benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

3.13 Simulation profile of PCI benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

3.14 Simulation profile of VGA benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

3.15 Simulation profile of AC97 benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

3.16 Multi-core simulation performance of VGA (T1 = 612 min) . . . . . . . . . . . 66

3.17 Multi-core simulation performance of PCI (T1 = 17 min) . . . . . . . . . . . . . 67

3.18 Multi-core simulation performance of AC97 (T1 = 4 min) . . . . . . . . . . . . . 67

4.1 RTL simulation of AES-128 with 65000,00 vectors using Verilator and
OpenMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

4.2 Gate-level (zero-delay) simulation of AES-128 with 65000,00 vectors


using Verilator and OpenMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

4.3 RTL simulation of RCA-128 with 65000,00 vectors using Verilator


and OpenMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

4.4 Gate-level (zero-delay) simulation of RCA-128 with 65000,00 vectors


using Verilator and OpenMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

5.1 Performance comparison of iterative and unrolled circuits . . . . . . . . . . . . . 85

5.2 RTL simulation speedup for single-frame circuit . . . . . . . . . . . . . . . . . . . . . 87

5.3 RTL simulation speedup for circuit unrolled 2 times. . . . . . . . . . . . . . . . . . 88

5.4 RTL simulation speedup for circuit unrolled 4 times. . . . . . . . . . . . . . . . . . 88

5.5 Effect of varying number of cores on RTL simulation time . . . . . . . . . . . . . 90

5.6 Load Balancing on simple circuit by varying number of cores . . . . . . . . . . 94

5.7 AES-128 speedup with parallel simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

6.1 Design Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

6.2 Simulation speedup of AES-128 for variable number of blocks in SDF


annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

6.3 Speedup with hybrid gate-level timing simulation . . . . . . . . . . . . . . . . . . . 109

6.4 Accuracy of hybrid gate-level timing simulation at the register


boundary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

7.1 Classification of HDL designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

7.2 Speedup at various levels of abstraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

LIST OF FIGURES

Figure Page

1.1 Simulation in ASIC and FPGA design flow . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2 AES-128 simulation performance in Xilinx FPGA design flow . . . . . . . . . . . 4

1.3 CPU, Memory and Ethernet improvements over the decade . . . . . . . . . . . . 7

1.4 Communication between Design Partitions . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

1.5 Parallel Simulation and CPU performance . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.1 Predictor modeling in hardware design simulation flow . . . . . . . . . . . . . . . 23

2.2 Distributed parallel simulation using accurate prediction . . . . . . . . . . . . . . 24

2.3 NUMA hardware configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.1 Standalone simulation of a design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.2 Parallel multi-core simulation of a design . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.3 Parallel multi-core simulation in the ASIC design flow [25] . . . . . . . . . . . . 39

3.4 Setup for measuring communication and synchronization overhead . . . . . 42

3.5 Setup for measuring synchronization overhead . . . . . . . . . . . . . . . . . . . . . . . 42

3.6 Multi-core Simulation of RCA128 on 2 cores (with comm and synch


overhead) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

3.7 Multi-core Simulation of RCA128 on 2 cores (no comm overhead) . . . . . . 45

3.8 NUMA hardware configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

3.9 Gate-level simulation using accurate RTL prediction . . . . . . . . . . . . . . . . . . 47

3.10 Architecture of parallel GL simulation using accurate RTL
prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

3.11 Bi-partitioned (area-based) AES-128 multi-core simulation time . . . . . . . . 58

3.12 Bi-partitioned (area-based) AES-128 multi-core simulation CPU


utilization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

3.13 Tri-partitioned (instance-based) AES-128 multi-core simulation


time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

3.14 Tri-partitioned (instance-based) AES-128 multi-core simulation CPU


utilization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

3.15 Bi-partitioned (area-based) JPEG multi-core simulation time . . . . . . . . . . 61

3.16 Bi-partitioned (area-based) JPEG multi-core simulation CPU


utilization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

3.17 Bi-partitioned (instance-based) JPEG multi-core simulation time . . . . . . . 62

3.18 Bi-partitioned (instance-based) JPEG multi-core CPU utilization . . . . . . 63

3.19 Bi-partitioned (instance-based) Triple DES multi-core simulation


time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

3.20 Tri-partitioned (instance-based) VGA multi-core simulation time . . . . . . . 64

3.21 Oct-partitioned (instance-based) pci multi-core simulation time . . . . . . . . 64

3.22 Oct-partitioned (instance-based) ac97 multi-core simulation time . . . . . . . 65

3.23 Multi-core simulation performance of AES-128 . . . . . . . . . . . . . . . . . . . . . . . 65

3.24 Multi-core simulation performance of JPEG . . . . . . . . . . . . . . . . . . . . . . . . . 66

4.1 HDL simulator internals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

4.2 Extending Verilator for parallel programming . . . . . . . . . . . . . . . . . . . . . . . . 72

4.3 Speedup of RCA-128 with Verilator using OpenMP . . . . . . . . . . . . . . . . . . 74

4.4 Speedup of AES-128 with Verilator using OpenMP . . . . . . . . . . . . . . . . . . . 75

4.5 Performance comparison of Verilator and VCS at RTL . . . . . . . . . . . . . . . . 76

4.6 Performance comparison of Verilator and VCS at functional
gate-level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

4.7 Multi-core performance comparison of Verilator and VCS at RTL and


functional gate-level for AES-128 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

5.1 Temporal Parallel Simulation (TPS) concept . . . . . . . . . . . . . . . . . . . . . . . . 81

5.2 Temporal RTL simulation setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

5.3 simple circuit for RTL simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

5.4 simple circuit unrolled twice for RTL simulation . . . . . . . . . . . . . . . . . . . . . 84

5.5 RTL acceleration setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

5.6 RTL simulation speedup as a function of number of slices for different


unroll factors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

5.7 RTL simulation speedup as a function of number of frames for


different slices. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

5.8 Parallel RTL simulation across multiple CPU cores . . . . . . . . . . . . . . . . . . . 90

5.9 RTL simulation speedup as a function of the number of cores . . . . . . . . . . 91

5.10 RTL simulation speedup as a function of the number of cores for


different unroll factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

5.11 Multi-core architecture of temporal RTL simulation . . . . . . . . . . . . . . . . . . 93

5.12 Temporal RTL simulation on four cores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

5.13 Temporal RTL simulation on two cores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

5.14 AES-128 design in CBC mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

5.15 AES-128 simulation configuration on two cores . . . . . . . . . . . . . . . . . . . . . . 95

6.1 Drop down in simulation performance with level of abstraction +


debugging enabled . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

6.2 Gate-level timing simulation with full SDF back-annotation . . . . . . . . . . . 99

6.3 Hybrid Gate-level timing simulation with partial SDF
back-annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

6.4 Static Timing Analysis (STA) of VGA controller design . . . . . . . . . . . . . . 101

6.5 Static Timing Analysis (STA) of AES-128 controller design . . . . . . . . . . . 102

6.6 Automated partitioning and simulation flow for hybrid gate-level


timing simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

6.7 Sample timing constraint file (tfile) for AES-128 design . . . . . . . . . . . . . . 103

6.8 Proposed flow for hybrid gate-level timing simulation . . . . . . . . . . . . . . . . 104

6.9 Early timing simulation using RTL with estimate of peripheral


timing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

6.10 Instance hierarchy of AES-128 design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

6.11 Full SDF-Annotated Signal versus Selective SDF-Annotated Signal


when one block in STA (aes_sbox4) . . . . . . . . . . . . . . . . . . . . . . . . 108

6.12 Full SDF-Annotated Signal versus Selective SDF-Annotated Signal


when two blocks in STA (aes_sbox4 and aes_sbox5) . . . . . . . . . . . . . . 108

6.13 Full SDF-Annotated Signal versus Selective SDF-Annotated Signal


when majority of the blocks are in STA . . . . . . . . . . . . . . . . . . . . . . . . . 109

6.14 Verification flow for hybrid gate-level timing simulation . . . . . . . . . . . . . . 110

6.15 Traditional simulation flow in ASIC/FPGA design . . . . . . . . . . . . . . . . . . 111

6.16 Proposed flow of early simulation in ASIC/FPGA design . . . . . . . . . . . . . 112

CHAPTER 1

INTRODUCTION

As design size and complexity increase, so does the need to verify the design quickly
while meeting the given coverage goals. This, combined with reduced design cycles of
three to six months, makes verification a lot more challenging. Today, verification takes
60-75% of the design cycle time, and on average the ratio of verification to design
engineers is 3:1 [10], [33]. This work addresses the issue of simulation performance,
which is very much needed today as designs continue to become more complex. We
particularly look at HDL (Hardware Description Language) simulation performance at
three levels of abstraction: RTL, functional gate-level (zero-delay), and gate-level
timing. The techniques for improving simulation performance at these levels of
abstraction are described in the remainder of this document. It is expected that
following the proposed techniques at each level of abstraction will tremendously reduce
the hardware design and verification time.

This chapter discusses the simulation-based and formal verification-based techniques
that are used to verify hardware designs. In particular, it addresses the challenges
faced by parallel hardware simulation as it continues to gain importance with the
pervasiveness of multi-core computing.

1.1 Importance of Simulation


Computer simulation is used extensively to support the modeling of systems that are
to be implemented in hardware, or to mimic complex phenomena that are otherwise
difficult to reproduce, e.g., traffic patterns at a busy airport or the testing of a new
internet protocol. With time and advancements in technology, humans want to build
ever larger and more complex systems. The conventional methods of modeling and
simulation on computers with a single processing unit (CPU) cannot cope with the
memory and execution time requirements of today's complex systems. To accommodate
this demand, the use of distributed and parallel computing is a must. Distributed
computing in the form of clusters of workstations, multiprocessors, and multi-cores
has become widespread due to its cost-effective nature [39].

Hardware systems are typically modeled as discrete-time systems: the state of such a
system can change, and be observed, at discrete time instants. In event-driven
simulation, events occur and change the state of the system at these discrete time
instants.

Distributed simulation consists of the execution of a single program on multiple CPUs,
which communicate and synchronize with each other using standard communication
interfaces. The simulation process for a portion of the system on one computer is
referred to as a logical process (LP). Logical processes (LPs) maintain state
information, an event queue, and a local time reference, and communicate via a
standard communication interface. A change in the state of an LP is communicated to
the affected LPs via time-stamped messages. A synchronization algorithm assures the
correct order of simulation among the LPs [39]. A special case of distributed
simulation, in which the simulation is distributed to individual cores on a single chip,
is referred to as multi-core simulation.
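The LP mechanics described above (a local clock, an event queue, and time-stamped
messages crossing a partition boundary) can be sketched in a few lines of Python. The
class and event format here are purely illustrative, not taken from any particular
simulator:

```python
import heapq

# Illustrative sketch of two logical processes (LPs), each with its own
# event queue and local time reference, exchanging time-stamped messages.

class LogicalProcess:
    def __init__(self, name):
        self.name = name
        self.local_time = 0      # local simulation clock
        self.event_queue = []    # min-heap ordered by timestamp
        self.state = 0

    def schedule(self, timestamp, value):
        heapq.heappush(self.event_queue, (timestamp, value))

    def step(self):
        """Consume the earliest event, advancing the local clock."""
        timestamp, value = heapq.heappop(self.event_queue)
        assert timestamp >= self.local_time  # causality must hold
        self.local_time = timestamp
        self.state += value
        return timestamp, value

# LP "a" processes an event, then notifies LP "b" with a time-stamped
# message, modelling a signal crossing a partition boundary with a
# 2-unit link delay.
a, b = LogicalProcess("a"), LogicalProcess("b")
a.schedule(5, 1)
t, v = a.step()
b.schedule(t + 2, v)
b.step()
print(a.local_time, b.local_time, b.state)  # prints: 5 7 1
```

A real PDES engine adds a synchronization algorithm (conservative or optimistic) on
top of this skeleton to guarantee that no LP consumes a message out of timestamp
order.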

Hardware Description Language (HDL) simulation remains an extremely popular


method of design verification because of its ease of use, inexpensive computing
platform, and 100% signal controllability and observability [28]. Figure 1.1
illustrates the use of simulation in a typical Application Specific Integrated Circuit
(ASIC) design flow. Synthesis refers to converting the RTL model of the design into a
technology-dependent gate-level netlist. Layout means the physical placement of the
gates and the wiring between them.

Figure 1.1. Simulation in ASIC and FPGA design flow

Field Programmable Gate Array (FPGA) design flow is similar but may have
additional steps like translation and technology mapping. Translation refers to merging
different netlists (RTL, intellectual property (IP), schematic) into one gate-level
netlist. Technology mapping refers to mapping the translated gate-level netlist onto
FPGA physical resources. Placement and routing (P&R) means connecting the physical
resources in the mapped netlist and extracting timing. The time gap between
the two extremes of the RTL and P&R simulations is as large as 45x. It is worth
noting that simulation is needed after every phase in the FPGA design flow. Figure
1.2 shows the time required at different simulation phases of the AES-128 FPGA design
flow, performed using a traditional event-driven Verilog simulator.

Figure 1.2. AES-128 simulation performance in Xilinx FPGA design flow

As designs are getting larger, reducing the simulation time has become
necessary. Parallel simulation attempts to address this challenge. So far, the speedup

offered by parallel simulation for real world applications has been difficult to achieve

for several reasons:

1. Lack of inherent parallelism in the design itself;

2. Difficulty in design partitioning;

3. Communication overhead;

4. Synchronization overhead; and

5. Load balancing.

The remainder of this chapter reviews some of these issues and draws conclusions
regarding research directions to remedy these problems.

1.2 Problems with Parallel Simulation

In the wake of multi-core technology, which promises faster communication be-

tween the processor cores and greater processing speeds of processor cores, parallel
multi-core simulation should result in speedup that is linear in the number of pro-

cessor cores. Unfortunately, this is not the case. This is due to the problems of lack
of inherent parallelism, design partitioning and load balancing, communication and
synchronization overheads. In this section, we discuss these problems in detail.

1.2.1 Design Partitioning

Design Partitioning is an important aspect of distributed parallel simulation as it

strongly affects the communication between the partitions and event synchronization.

Various partitioning algorithms have been proposed. The partitioning could be static

or dynamic. Static partitioning methods partition the design without considering
its effect on simulation. For example, they could partition an HDL design using metrics

like the number of instances, estimated number of gates, number of modules, etc.

The advantage of such a partitioning scheme is that it is quick and easy to generate.

The obvious disadvantage is that the resulting partitions could be unbalanced as the

workload requirements are not known prior to simulation. The idea of pre-simulation
has been proposed but it adds an extra processing step, unless it can be done as part

of a complete simulation-based flow.

Dynamic partitioning uses simulation statistics as heuristics to partition the de-

sign. One could simulate the entire design for a few clock cycles to partition the design.
One can also combine static and dynamic partitioning to achieve optimal partitioning.
Note that coming up with perfectly balanced partitions is a known NP-hard problem
[15]. Given this objective, minimizing communication and synchronization overhead
may pose conflicting requirements [15].
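As an illustration of the static approach, the sketch below assigns modules to partitions using only a static metric (estimated gate count) and the greedy largest-first heuristic for the NP-hard balanced-partitioning problem; the module names and gate counts are invented for the example:

```python
import heapq

def static_partition(modules, k):
    """Greedy static partitioning sketch: place each HDL module,
    largest estimated gate count first, into the currently lightest
    of k partitions (the classic LPT heuristic)."""
    parts = [(0, i, []) for i in range(k)]      # (load, index, members)
    heapq.heapify(parts)
    for name, gates in sorted(modules.items(), key=lambda m: -m[1]):
        load, i, members = heapq.heappop(parts)
        members.append(name)
        heapq.heappush(parts, (load + gates, i, members))
    return sorted(parts, key=lambda p: p[1])

modules = {"alu": 1200, "fpu": 3000, "decoder": 800, "cache_ctrl": 2500}
for load, i, members in static_partition(modules, 2):
    print(f"partition {i}: {members} (~{load} gates)")
```

Such a partition is quick to compute but, as noted above, says nothing about run-time event activity, so the resulting simulation load can still be unbalanced.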

1.2.2 Communication and Synchronization between Partitions

Minimizing communication and synchronization among partitions is another big

challenge after partitioning the design for multi-core simulation. Communication


overhead is defined as the time spent in exchanging data among partitions. Both

data bandwidth and frequency of communication among partitions impact communi-


cation overhead. Synchronization overhead is defined as the time spent by each local
simulation for guaranteeing no causality violation or the time needed to coordinate
all local simulations. It can get worse as the number of partitions increases. Usually
simulators perform profiling to identify places with synchronization issues. Note

that synchronization requires communication among partitions. Therefore, it can be


treated as a particular case of communication overhead.

Once the design is partitioned, the partitions can be simulated independently

but only if there is no dependency between them. This is hardly the case in real

world designs where partitions need to exchange data in time. If the frequency of

this communication dominates the simulation of individual partitions, then speedup

over standard simulation cannot be achieved, and often speed degradation happens.

Before the advent of multi-core processors, multi-cluster and multi-processor archi-

tectures were used for distributed parallel simulation [16]. The communication and

synchronization overhead between partitions was understandable. In multi-core pro-


cessor architecture with shared or distributed memory architectures, this communi-
cation and synchronization overhead should be reduced as individual processor cores

run faster than the previous generation and exchange data through shared memory
rather than through long interconnects. The problem is that communication and
synchronization overhead has not decreased with the advancements in technology.
In fact, it has become a bottleneck in distributed parallel simulation that must be
overcome to get a reasonable speedup. This is the main theme of the proposal and
will be addressed at length in the following chapters.

Figure 1.3. CPU, Memory and Ethernet improvements over the decade

Figure 1.3 shows performance improvement in CPU, Memory and Ethernet tech-

nologies. It shows that performance varies from one technology generation to the

other. While CPU has achieved the largest speedup over the decade, Memory and
Ethernet latencies have not kept the same pace. This is the main reason why speedup
of parallel simulation has not been significant compared to the CPU speedup. Figure

1.4 shows a bi-partitioned design, where each partition is simulated on an individual


CPU core. While the two simulations can run faster with faster CPU and inter-
connect technology, there is a significant performance gap unless the interconnect is
made faster and the frequency of communication between the partitions is reduced.
Equation 1.1 shows the formula for speedup:

speedup = T1 / (Tpar + Tcomm)    (1.1)

where T1 is the simulation time on a single processor, Tpar is the simulation time

on parallel processor configuration and Tcomm is the communication time spent in


exchanging data between parallel processors during parallel simulation. Tpar is given
by the famous Amdahl’s law, Equation 1.2, where P represents the portion of work
that speeds up by a factor S. Equation 1.1 states that if Tcomm does not improve
at the same rate as Tpar, speedup is going to be limited. Figure 1.5 shows four curves

to illustrate this fact. It shows that the gap between CPU performance and commu-
nication overhead is maximum if there is no improvement in interconnect technology.

The gap between CPU performance and communication overhead decreases when the
interconnect and the frequency of communication improve. This clearly shows that

communication and synchronization between parallel simulations will remain the bot-

tleneck unless there is a significant improvement in the interconnect and frequency of

communication.

Tpar = T1 / ((1 − P) + P/S)    (1.2)
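Equations 1.1 and 1.2 are easy to evaluate numerically; the sketch below uses made-up numbers (a 1000 s single-core run, 90% of which parallelizes over 8 cores) to show how a fixed communication time erodes the Amdahl-limited speedup:

```python
def t_par(t1, p, s):
    """Amdahl's law (Equation 1.2): parallel runtime when a fraction p
    of the work speeds up by a factor s."""
    return t1 * ((1 - p) + p / s)

def speedup(t1, p, s, t_comm):
    """Equation 1.1: overall speedup including communication time."""
    return t1 / (t_par(t1, p, s) + t_comm)

# Illustrative numbers only, not measurements from the dissertation.
t1 = 1000.0
ideal = speedup(t1, p=0.9, s=8, t_comm=0.0)        # communication-free
with_comm = speedup(t1, p=0.9, s=8, t_comm=300.0)  # heavy messaging
print(f"no communication overhead: {ideal:.2f}x")
print(f"with 300 s communication:  {with_comm:.2f}x")
```

With these numbers the communication-free speedup is about 4.7x, while 300 s of communication cuts it to under 2x, which is exactly the gap Figure 1.5 depicts.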

Recently, a new method has been proposed [27] for parallelizing simulations by
eliminating inter-simulation communication. This is done by predicting input stimu-
lus to individual partitions using a predictor (typically available from simulation at

a higher level of abstraction). This work deals with reducing the frequency of
communication between parallel simulations using the accurate prediction model proposed
in [27], applied to gate-level simulation. This approach exploits the inherent design
hierarchy to overcome the partitioning problem. Communication overhead between
local simulations is avoided by using an accurate prediction model at each local simulation.

Figure 1.4. Communication between Design Partitions

It has already been shown that if the prediction is 100% accurate, the communication

overhead is entirely eliminated.

1.2.3 Applicability of Parallel Simulation to Large Designs

It is a common misunderstanding that large designs are more suitable for parallel

simulation. This may not be entirely true. Usually large designs have portions
of code that use cross-module references to improve signal observability [16]. Such
portions of code cannot be run in parallel. Furthermore, it is impossible to partition
the testbench, as the testbench is often reactive [16]. Modern testbench environments
incorporate C/C++ reference models, Programming Language Interface (PLI) calls,
Tool Command Language (TCL) scripts, etc. This makes it impossible to build an
environment for parallel simulation because of serial dependencies. Nevertheless, if the
design is too large to fit into a single computer’s memory, parallel simulation can
be useful by running simulation on many networked computers. This was certainly
true when the designs did not exceed 32-bit memory space. Now when the 64-bit

9
CPU and Parallel Simulation Performance
16
CPU speedup
14 Parallel Sim speedup
Parallel Sim with improved latency
12 Parallel Sim with improved latency and synch

10
speedup

0
−10 years −5 years today +5 years
years

Figure 1.5. Parallel Simulation and CPU performance

computers are prevalent, some people see the need of parallel simulation diminishing

[16].

Another trap that researchers have fallen into is the design itself. It is easy to cook

up designs that are best for parallel simulation [16]. In those designs, the speedup
obtained could be illusory. Such designs are not practical and often far from
industrial designs and practices. The testbench also affects simulation performance;
testbenches with unconstrained stimulus create a uniform workload, which tends to increase
the performance of parallel simulation. In real life, unconstrained stimulus does not
apply, as the majority of input patterns could be illegal (never produced by the actual
design). Zhu et al. [42] have shown that parallel simulation using the original testbench runs

slower than the single-processor simulation because the testbench exercised constrained
deterministic patterns. When they modified the testbench to exercise unconstrained
patterns, speedup was possible. However, there are cases where unconstrained random
stimulus is suitable, such as random test patterns produced by automatic test
pattern generation (ATPG).

As a result of open-source efforts, designs along with the testbench environments
used by industry are available at Opencores [32]. Furthermore,
there are compiled open-source simulators like Icarus Verilog [34] and CVer [18] for
HDL simulation. The only downside of using open-source simulators is that they
are not as fast as commercial Verilog simulators like VCS [40] and NCVerilog [30].
Hence, when reporting parallel simulation speedup using open-source simulators, a
performance comparison with commercial simulators should be provided.

1.3 Parallel Simulation Applications

As mentioned earlier, achieving parallel simulation speedup is a challenging task.

However, there are still applications that are well suited to parallel simulation.
The following is a brief list of such applications:

1. Simulation of manufacturing tests generated by ATPG tools. These tests use

unconstrained stimulus and hence are good for parallel simulation.

2. Use of inexpensive computers to simulate large designs. As computing cost
has gone down significantly, distributed parallel computing reduces the wait
time on a single computer.

3. Simulation with full waveform dumping. If the design requires full waveform
dumping, partitioning the design can distribute the I/O activity. This increases
simulation performance as simulation and dumping are done in parallel.

4. Simulation of symmetric designs. Designs such as routers or symmetric multi-
processors (SMP) have similar workload within each block and little communi-
cation between blocks which make them ideal for parallel simulation.

Chang and Browy [16] have shown simulation speedup on various register transfer
level (RTL) and gate-level designs, which are all good candidates for parallel simu-
lation. However, they have not mentioned how they achieved this speedup or what
partitioning strategy was used. In particular, RTL speedup could be misleading as

RTL evolves during the design cycle. Furthermore, the testbench for RTL also changes
on a daily or weekly basis as part of regression runs. This is achieved by changing the
random seed to the testbench, which creates different tests for each run. They also

have not mentioned whether the gate-level simulation is functional (zero-delay) or

timing simulation. Zhu et al. [42] have shown that graphic processor units (GPU)

are suitable for parallel functional (zero-delay) simulation because of large number of

processing pipelines and parallelism within each pipeline. In general, GPU is based

upon the single program multiple data (SPMD) architecture. Another important

point is that setting up a parallel simulation environment takes considerable effort. If
the throughput of the design is large, the imposed parallel simulation overhead can
dominate the simulation and actually cause speed degradation. Parallel simulation is
useful when it takes days or weeks to simulate the design on a single processor. As a
metric to predict whether parallel simulation can provide speedup over single-processor
simulation, Chang and Browy [16] propose cycles/second measured in terms of
wall-clock time; they suggest that a single-processor simulation that runs slower than
100 cycles/second is a good candidate for parallel simulation.

1.4 Formal Verification

An alternative to simulation is formal verification and static timing analysis

(STA). Some of these techniques use simulation internally to enhance their efficiency.
Formal verification techniques verify a design without stimulus. This gives formal
verification a huge advantage, as it completely eliminates the need for a testbench.


Formal verification can be divided into equivalence checking and model or property
checking. Both are briefly reviewed in the remainder of this section.

1.4.1 Equivalence Checking (EC)

Equivalence checking (EC) determines whether two design implementations are

equivalent. For example, equivalence checking is used to determine whether RTL

and synthesized gate-level netlists are functionally identical. It is not feasible to

perform equivalence checking by simulation as it would mean simulating the whole

input space. Sometimes, the user can guide the equivalence checker tool by identifying
equivalent nodes (cut points) in the two designs to prune the input search space.
ABC from UC Berkeley, Synopsys Formality, and Cadence Conformal are equivalence
checking tools used in industry and academia.

There are two approaches to perform EC. The first approach searches for an
input pattern or patterns that would distinguish the two designs. This is called

Satisfiability (SAT) approach. According to this approach, two designs expressed in
terms of conjunctive normal form (CNF) formulas F1 and F2 are equivalent if F1 ≠
F2 is unsatisfiable. If this is not true, a counterexample trace is produced that can
help debug the problem. The counterexample trace can be simulated to see if the
help debug the problem. The counterexample trace can be simulated to see if the
counterexample was an unintended boundary condition or a real bug.
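The SAT formulation can be illustrated with a brute-force miter check. A real tool would hand the miter F1 ≠ F2 to a SAT solver rather than enumerate patterns, so this only works for tiny designs; the two toy implementations below are invented for the example:

```python
from itertools import product

def equivalent(f1, f2, n_inputs):
    """Miter-style equivalence check sketch: f1 and f2 are equivalent
    iff no input pattern makes f1 != f2 (the miter is unsatisfiable).
    The search here is exhaustive enumeration, not a real SAT solver."""
    for bits in product([0, 1], repeat=n_inputs):
        if f1(*bits) != f2(*bits):
            return False, bits       # counterexample trace
    return True, None

# Two implementations of the same function: a XOR b.
rtl  = lambda a, b: a ^ b
gate = lambda a, b: (a | b) & ~(a & b) & 1   # OR-AND-NAND netlist form

ok, cex = equivalent(rtl, gate, 2)
print("equivalent" if ok else f"counterexample: {cex}")
```

For an inequivalent pair, the returned input pattern is exactly the counterexample trace that can then be simulated to classify the mismatch.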

The other EC approach converts the designs into a canonical representation,
such as the Reduced Ordered Binary Decision Diagram (ROBDD), and checks for
their equivalence. The ROBDDs of two equivalent designs must be identical.

Applications of equivalence checking are not restricted to checking between RTL
and the post-synthesis gate-level netlist; it also applies to Engineering Change
Order (ECO) and pre- and post-scan netlists. It should be noted that as the design
gets large, equivalence checking techniques suffer from the memory explosion problem.
Therefore, reduction of the design size is often necessary because of memory
capacity issues.

1.4.2 Model Checking and Property Checking

Model (or property) checking takes a design and proves or disproves a set of prop-

erties given as specification of the design. If two designs are sequential and mapping

between their states is not known, then it is not possible to perform equivalence

checking. Model checking checks the entire state space, either constrained or uncon-

strained to determine the validity of the properties. Design is transformed into finite

state machine (FSM) and property checking determines if there is a state or sequence

of states that violates the property or it is unreachable from an initial state. The

design is usually given in terms of RTL description. As with equivalence checking,

model checking suffers from capacity issues and cannot model the whole design. A
typical practice in the industry is to use model checking on specific RTL blocks in
a design. Another limitation of model checking is the issue of completeness of prop-
erties. It is hard to determine if a certain set of properties completely specifies the
design intent. There are no good or complete coverage metrics for property checking
either. On the other hand, for designs whose properties can be specified exactly,
such as arithmetic blocks (e.g., multipliers, adders), model checking cannot prove or
disprove a property beyond a certain bit-width. It should be noted that model checking

is not used for property checking on the gate-level netlist because of capacity issues.
Contrary to simulation, model checking cannot guarantee that the design will work
when fabricated, as it cannot be done at the chip level.

1.5 Static Timing Analysis

Static Timing Analysis (STA) is a static technique to verify timing of the design.
STA analyzes a design given timing library associated with the design. It then reports
the slowest critical path in the design, which determines the maximum frequency of

the design. While STA technology has improved a lot over the years and it is quite

mature at present, it suffers from the pitfalls of manual constraints. A designer can
inadvertently add a false path or a multi-cycle path that is never exercised by the
design, or miss such a path. Further, STA does not work for asynchronous interfaces.

Hence, to validate the constraints or the results of STA, simulation is necessary.

1.6 Why Gate-level Simulation?

It is clear from the above description that simulation has its own special place in

the design hierarchy and it is not going away in the near future. As the design gets

refined into lower levels of abstraction, such as gate-level and layout level, functional
(zero-delay) and timing simulations can validate the results of STA or equivalence

checking. Moreover, neither STA nor equivalence checking can find bugs due to X

(unknown signal) propagation. Even though RTL regression is run on a daily basis,
industry uses gate-level simulation before sign-off.

Gate-level simulation is necessary after RTL synthesis to validate the result of


synthesis. At this stage, gate-level simulation can be functional (zero-delay) or unit-
delay, where all gate-level cells are assumed to have a delay value of one timescale unit.
Later, gate-level timing simulation can be performed in the pre-layout or post-layout

stage using standard delay format (SDF) back annotation. Gate-level simulations are
considered a must for verifying timing-critical paths of asynchronous designs, which are
skipped by the STA tool. Further, gate-level simulation is used to verify the constraints
of static verification tools such as STA and equivalence checking. These constraints
are added manually, and the quality of results from static tools is only as good as the
constraints. Gate-level simulation is also used to verify the power-up, power-down,
and reset sequences of the full chip. It is also used to estimate the dynamic power drawn
by the chip. Finally, gate-level simulation is used after an Engineering Change Order
(ECO) to verify the changes. There is a tool named Bugscope (by NextOp, now part
of Atrenta) that takes RTL as input and outputs a set of properties that can be used
by model checking to verify the design. Internally, the tool uses simulation to generate
properties of the design.
design. Internally, the tool uses simulation to generate properties of the design.

CHAPTER 2

PREVIOUS WORK ON PARALLEL SIMULATION

Event-driven HDL simulation is a dominant technique used for functional and


timing simulation [28]. However, traditional event-driven simulation suffers from
very low performance because of its inherently sequential nature and a need for event

synchronization. To address this issue, distributed parallel HDL simulation has been

proposed to alleviate the low performance of traditional event-driven HDL simulation

[27] [11] [12]. Chapter 1 discussed challenges in parallel HDL simulation. In this

chapter, we present a literature survey on parallel simulation, especially parallel HDL

simulation and the associated hardware on which the simulation is run. Next, a

recently proposed, competing parallel simulation technique known as time parallel

HDL simulation is presented and compared against the spatially distributed parallel

HDL simulation.

The literature on parallel simulation is rich. Most of the known work concerns

traditional parallel simulation, which is based on physical partitioning of the design

into modules, distributed to individual simulators. We refer to this approach as spatial


parallelism, since the simulation relies on physical partitioning of the design in spatial
domain. This simulation concept has been known since late 1980s as Parallel Discrete
Event Simulation (PDES) [21].

2.1 Factors Affecting the Performance of Parallel HDL Simulation

Bailey et al. [13] list five factors that affect the performance of parallel HDL
simulation: timing granularity, design structure, target architecture, partitioning,
and synchronization algorithm. We discuss them briefly here and elaborate on the
current hardware and software trends.

2.1.1 Timing Granularity

Timing granularity (also known as timing resolution) and design structure are
design-dependent factors over which simulation has no control. Increasing timing

resolution can increase the amount of processing, which in turn decreases simulation

performance. In general, simulation performance varies dramatically from one design

structure to another. Figure 1.1 shows design structures at various levels of abstraction.
A design structure at a higher level of abstraction, e.g., C++, simulates faster
than one at a lower level, e.g., gate level.

2.1.2 Hardware Architecture

Architecture of the target platform or execution machine also impacts parallel sim-

ulation performance. Here we discuss various computer hardware and software trends
that exploit parallelism. A detailed discussion on parallel computer architecture is

presented as an appendix in this chapter.

• Multi-Cluster is a computer system composed of several workstations forming


a cluster and communicating over the network.

• Multi-Processor is a system that contains two or more processing units (CPUs)


on different chips, connected through (typically long) inter-chip interconnects.

• Multi-core is a computer system with two or more CPUs on the same chip,
sharing memory resources and connected through short intra-chip interconnects.

• Multitasking/Multiprocessing is a method in which multiple tasks or pro-


cesses run on a CPU. It is the responsibility of the Operating System (OS) to
switch between the tasks to give an impression of multitasking. In the case of
a computer with a single CPU core, only one task runs at any point of time,
meaning that the CPU is actively executing instructions for that task. Multi-
tasking solves this problem by scheduling which task may run at any given time

and when another waiting task takes a turn. When running on a multi-core

system, multitasking OS can truly execute multiple tasks concurrently. The


multiple computing engines work independently on different tasks.

• Multi-threading extends the idea of multitasking into applications, so that

one can subdivide specific operation within a single application into individual

threads. All the threads can run in parallel. The OS divides processing time

not only among different applications, but also among each thread within the

application.

• Pipelining sequences the execution of multiple instructions like cars on an


assembly line. The execution of each instruction is divided into several steps

which are performed by dedicated hardware units. Pipelining is similar to an

assembly line, in which each stage focuses on one unit of work. The result of
each stage passes to the next stage until the final stage. To apply the pipelining
strategy to an application that will run on a multi-core CPU, the algorithm is
divided into steps that require roughly the same amount of work, and runs each
step on a separate core. The algorithm can process multiple sets of data or the
data that streams continuously.

2.1.3 Issues in Design Partitioning

As mentioned earlier, assigning LPs to different CPUs, to make the simulation

load uniformly balanced among the LPs, is a known NP-hard problem. Given this
objective, minimizing communication and synchronization overhead may pose a con-

flicting requirement. As a result, heuristic-based partitioning algorithms have been


proposed that provide near-optimal results. The major difficulty in partitioning is that
the simulation load of an LP is determined at run time. Hence, workload requirements
are not known prior to simulation. The idea of pre-simulation has been proposed

in which simulation is run for a short time interval or even full simulation is run to

profile the simulation. However, it adds an extra processing step, unless it can be
done as part of a complete simulation-based flow. Such a case is shown in Figure

1.1, where simulation at a higher level of abstraction can act as pre-simulation for

simulation at a lower level of abstraction. This is one of the major points of the

proposed approach, which shall be explained further in the next section. Another

problem is the granularity of LP, which relates to the number of atomic operations

that are assigned to a given LP. Assigning one atomic operation per LP can result in

high communication overhead, while assigning one LP per processor can result in an

unnecessarily blocked computation.

2.1.4 Time Synchronization

Chamberlain [15] mentioned four types of synchronization algorithms to synchro-


nize simulation time among LPs:

• Oblivious algorithm evaluates all LPs at each time step, regardless of the
event activity. This eliminates the event queue at each LP. Correct scheduling can
ensure the correctness of the simulation.

• Synchronous algorithm constrains the simulation time of each LP to be the
same. All LPs must synchronize to find next simulation time step depending
on the event activity.

• Conservative algorithm is an asynchronous algorithm which permits different

simulation times among LPs. It processes messages in non-decreasing order to


preserve causality at all times. This condition is enforced by advancing local
simulation time to the smallest time stamp received from any neighboring LP.

• Optimistic algorithm is also known as Time Warp algorithm [24]. In this


approach, events are immediately processed at LPs until an event (a straggler
event) with a time stamp earlier than the local simulated time arrives. This causes
the LP to roll back to a previous time so that the straggler event can be processed.

The state must be saved at all LPs to allow roll-back.
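The conservative rule of advancing only as far as the smallest time promised by any neighbor can be sketched as follows. This is a simplification with an explicit lookahead parameter, not the exact formulation of the cited algorithms; all names and trace values are illustrative:

```python
import heapq

def conservative_step(lp_queue, lp_clock, neighbor_times, lookahead):
    """One conservative-synchronization step for a single LP (sketch).
    The LP may safely process any event whose timestamp does not exceed
    the smallest time reported by any neighbor plus the lookahead, so
    causality can never be violated and no rollback is needed."""
    safe_until = min(neighbor_times) + lookahead
    processed = []
    while lp_queue and lp_queue[0][0] <= safe_until:
        ts, ev = heapq.heappop(lp_queue)
        lp_clock = max(lp_clock, ts)
        processed.append((ts, ev))
    return lp_clock, processed

queue = [(3, "a"), (7, "b"), (12, "c")]
heapq.heapify(queue)
# Neighbors have only advanced to times 6 and 9; lookahead of 2 units.
clock, done = conservative_step(queue, 0, [6, 9], lookahead=2)
print(done)    # events up to time 8 are safe; (12, "c") must wait
```

An optimistic (Time Warp) LP would instead process (12, "c") immediately and roll back from saved state if a straggler with an earlier timestamp later arrived.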

The conservative and optimistic approaches differ in the way modules of the parti-

tioned design communicate during simulation to synchronize data. Their performance

varies with the design and partition strategy. Several variations of these methods

have been offered, differing in the way they handle inter-simulation synchronization.

Gafni [22] uses the state-saving concept and a rollback mechanism by restoring the saved
state. Time Warp [24] (the optimistic approach) was able to reduce message-passing
overhead by using shared memory. Fujimoto [20] and Nicol [31] improved the con-
servative method by introducing the concept of lookahead. Chatterjee [17] proposed

the parallel event-driven gate level simulation using general purpose GPUs (Graphic
Processing Units). However, it could only handle zero-delay (functional) gate-level
simulation, but not the gate-level timing simulation. Zhu et al. [42] developed a
distributed algorithm for GPUs that can handle arbitrary delays, but still suffers
from heavy synchronization and communication overhead inherent to all distributed

simulation techniques. In addition, these methods do not scale and are often based
on manual partitioning.

It should be emphasized that the difficulty of spatial partitioning lies not only in
solving the inter-module communication and synchronization problem, but mostly in

design partitioning that will minimize this communication. The success of traditional
spatially distributed simulation then strongly depends on such ideal partitioning,
which itself is a known intractable problem and cannot be successfully applied to
complex industrial designs. To facilitate this partitioning, some researchers, e.g., Li

et al. [29], propose partitioning based on design hierarchy. In this approach, the

design is partitioned along the boundary of the module, a basic unit of code in HDL.
While it addresses the communication problem to a certain degree, it still does not

resolve the synchronization problem.

2.2 Prediction-based Parallel Simulation

The key idea of the prediction-based approach, originally proposed in [27], is to

predict input stimulus and apply it to each module instead of the actual input. The

predicted input and output stimulus could be obtained from the simulation of design

model at a higher abstraction level (such as RTL) than the one being simulated (such
as gate-level). Figure 2.1 shows how higher level simulation can act as predictor for

lower level simulation in hardware design simulation. The base of the arrow shows
the predictor simulation and the tip of the arrow shows the target simulation.

Figure 1.4 (in Chapter 1) shows a design consisting of two module partitions con-
nected in such a fashion that their inputs depend upon each other. The predicted
input values obtained by running higher level simulation are stored in local memory
and applied to the input ports of a local module assigned to a given LP. Then, the
actual output values at the output ports of that module are compared on-the-fly with

Algorithmic simulation in C/C++ → behavioral simulation in HDL (Verilog, VHDL) → functional gate-level simulation → gate-level timing simulation (SDF annotation)
Figure 2.1. Predictor modeling in hardware design simulation flow

the predicted output values, also stored in local memory. This is illustrated in
Figure 2.2, which shows two sub-modules being simulated in parallel. Each sub-module
uses predicted inputs by default, while its actual outputs are compared against the
predicted outputs (stored earlier in local memory). A multiplexer at each sub-module
selects between the predicted inputs and the actual inputs. While each sub-module can
access its actual inputs from the other sub-module, doing so incurs synchronization
and communication overhead, which is the major bottleneck in parallel discrete event
simulation (PDES). The main goal of this approach is to minimize this overhead as
much as possible.

Figure 2.2. Distributed parallel simulation using accurate prediction

As long as the prediction of the input stimulus is correct, the remote memory access
that imposes communication and synchronization between local simulations is
completely eliminated. In this arrangement, only local memory access for fetching the
prediction data is needed. This phase of simulation is called the prediction phase.
Only when the prediction fails are the actual input values, coming from the other
local simulation, used; this phase of simulation is called the actual phase.

When prediction fails, each local simulation must roll back to the nearest
checkpoint. This is made possible by periodically saving the design state at selected
checkpoints during the prediction phase. When parallel simulation enters the actual
phase, it should try to return to the prediction phase as soon as possible to attain
maximum speed-up. This is done by continuously comparing the actual outputs of all
local simulations with their predicted outputs and counting the number of matches
on-the-fly. Once the number of matches exceeds a predetermined threshold, the
simulation switches back to the prediction phase. We instrument this approach for
functional gate-level (zero-delay) simulation. Another challenge addressed in this
thesis is to minimize the time spent in the actual phase, which depends upon the
accuracy of the predictor.
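The prediction/actual phase protocol described above can be sketched as follows.
This is an illustrative model, not the dissertation's implementation; the module
function, stimulus arrays, match threshold, and checkpoint period are all
hypothetical.

```python
# Illustrative sketch of prediction/actual phase switching in one LP.
# All names and parameters here are hypothetical.

MATCH_THRESHOLD = 3        # matches required before re-entering prediction phase
CHECKPOINT_PERIOD = 4      # save design state every 4 cycles

def simulate_partition(predicted_in, predicted_out, actual_in, module):
    """Run one logical process (LP), preferring locally stored predictions."""
    phase = "prediction"
    matches = 0
    checkpoint = 0                     # last cycle at which state was saved
    results = []
    cycle = 0
    while cycle < len(predicted_in):
        stim = predicted_in[cycle] if phase == "prediction" else actual_in[cycle]
        out = module(stim)             # evaluate the local module
        if out == predicted_out[cycle]:
            matches += 1
            if phase == "actual" and matches >= MATCH_THRESHOLD:
                phase = "prediction"   # predictions are trustworthy again
        elif phase == "prediction":
            cycle = checkpoint         # roll back to the nearest checkpoint
            del results[checkpoint:]   # discard results simulated since then
            phase = "actual"           # fetch real inputs from the remote LP
            matches = 0
            continue
        else:
            matches = 0                # actual phase; mismatches are expected
        results.append(out)
        if cycle % CHECKPOINT_PERIOD == 0:
            checkpoint = cycle         # periodic checkpointing
        cycle += 1
    return results
```

With, say, `module = lambda x: x + 1` and a single wrong prediction at cycle 5,
the LP rolls back to the checkpoint at cycle 4, re-simulates in the actual phase,
and returns to the prediction phase after three consecutive matches.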

2.3 Multi-level Temporal Parallel Event-Driven Simulation

In contrast to the parallel discrete event HDL simulation described above, which
partitions the design in the spatial domain, there has been some interesting work on
parallel discrete event HDL simulation in the time domain [26] [19]. This approach,
called multi-level temporal parallel event-driven simulation (MULTES) [26] [19],
parallelizes simulation by dividing the entire simulation run into independent time
intervals. It accomplishes this by dividing the simulation time into a number of time
intervals related to the number of processors. Each interval, also referred to as a
slice, is then simulated on a different LP. The key requirement for this technique to
work is finding the initial state of each slice, which must match the final state of
the previous slice. For example, the initial state of slice i must match the final
state of slice i − 1 for each slice i. MULTES terms this requirement the horizontal
state matching problem. The initial state of a slice cannot normally be obtained
without knowing the final state of the previous slice. MULTES overcomes this problem
by running a reference simulation at a higher level of abstraction and saving the
values of all the state elements in the design. However, as the target simulation is
at a lower level of abstraction and may involve timing, the initial state obtained
from the reference simulation may not be the correct one in time. In summary,
MULTES consists of two simulation steps:

1. Fast reference simulation, which runs at a higher level of abstraction, such
as RTL or functional (zero-delay) gate-level, and saves the design state.

2. Target simulation, which runs in parallel at a lower level of abstraction,
such as functional (zero-delay) gate-level or gate-level timing.

For timing simulation, the design state (all flip-flops in the design) is restored
using the reference simulation, which can be RTL or functional (zero-delay)
gate-level. This state saving is known as checkpointing. If the design is a
single-clock design and there is no timing violation, then the reference and target
simulations are cycle-consistent: the two simulations produce the same results in the
same clock cycles. In such a case, restoring state from the reference simulation
leads to a correct target simulation. However, depending upon the position of the
checkpoint, there can be a mismatch between the parallel target simulation and the
golden target simulation at the beginning of a target slice. MULTES solves this
problem by overlapping consecutive target slices. For example, slice n − 1 and
slice n are allowed to share simulation time. Since the mismatch occurs at the end of
slice period n − 1 and the beginning of slice period n, the shared period is
discarded from slice n; the correct simulation for this period is generated by
slice n − 1.
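The slice construction described above can be sketched as follows. This is a
simplified, hypothetical model of MULTES's overlap scheme: each slice except the
first re-simulates a shared prefix whose results are then discarded.

```python
def make_slices(total_cycles, n_slices, overlap):
    """Split [0, total_cycles) into slices for independent simulation.
    Each slice (except the first) starts `overlap` cycles before its
    nominal begin; that shared prefix is re-simulated and then discarded,
    since the correct output for it is produced by the previous slice."""
    size = total_cycles // n_slices
    slices = []
    for i in range(n_slices):
        begin = i * size                          # nominal slice start
        end = total_cycles if i == n_slices - 1 else begin + size
        sim_begin = max(0, begin - overlap)       # warm-up period, discarded
        slices.append((sim_begin, begin, end))
    return slices
```

For a 100-cycle run on four LPs with a 5-cycle overlap, `make_slices(100, 4, 5)`
yields `[(0, 0, 25), (20, 25, 50), (45, 50, 75), (70, 75, 100)]`: slice 1
simulates cycles 20 to 50 but reports only 25 to 50.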

An important feature of MULTES is that it handles designs with multiple
asynchronous clocks. It attempts to solve the problem of clock domain crossings (CDC)
in multi-clock designs, in which a data or control signal is sent from one clock
domain to another. The issue in CDC designs is that gate-level timing simulation is
not 100% cycle-consistent with the reference simulation, even if there are no timing
violations. Since simple state saving and restoring could cause a mismatch between
the parallel target simulation and the golden target simulation, MULTES proposes
abstract delay annotation (ADA) to deal with CDC. In ADA, the CDC path delay obtained
from the SDF file is copied from the gate-level model to the reference simulation.
When the CDC path delay is annotated to the reference simulation in addition to the
target simulation, both simulations become cycle-consistent.

An important issue addressed in this method is handling the testbench. While the
state of the Design Under Test (DUT) can be stored during reference simulation, the
state of the testbench cannot be stored likewise, because a testbench does not
usually contain memory elements and may use software constructs whose state cannot be
saved. Similarly, the state of Intellectual Property (IP) blocks in the design cannot
be saved and restored with checkpointing. To handle this issue, MULTES uses a
testbench forwarding technique. Rather than saving the state of the testbench, the
testbench is simulated from the beginning up to the starting point of each slice (its
initial state). This is accomplished by saving the outputs of the DUT (which are
inputs to the testbench) during reference simulation, essentially creating a dummy
DUT. The testbench is simulated with the dummy DUT from the beginning to the starting
point of each slice; at that point in time, the dummy DUT is replaced by the actual
DUT, whose state is restored from the data stored at the checkpoint. This is done for
each slice independently.
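Testbench forwarding can be sketched as below. All of the step functions and the
state encoding are hypothetical; the point is only the control flow: replay recorded
DUT outputs (the dummy DUT) up to the slice start, then restore the real DUT's
checkpointed state and simulate.

```python
def run_slice(testbench_step, dut_step, recorded_out, restore_state,
              slice_start, slice_end):
    """Simulate one slice using testbench forwarding (hypothetical model)."""
    stimulus = 0                       # assumed initial testbench stimulus
    # Forwarding phase: drive the testbench with the DUT outputs recorded
    # during reference simulation -- no real DUT evaluation is needed.
    for cycle in range(slice_start):
        stimulus = testbench_step(stimulus, recorded_out[cycle])
    # Replace the dummy DUT with the actual DUT, restoring its state
    # from the checkpoint taken during reference simulation.
    dut_state = restore_state(slice_start)
    outputs = []
    for cycle in range(slice_start, slice_end):
        dut_state, out = dut_step(dut_state, stimulus)
        stimulus = testbench_step(stimulus, out)
        outputs.append(out)
    return outputs
```

For example, with a toy DUT that accumulates its stimulus
(`dut_step = lambda st, s: (st + s, st + s)`) and a testbench that feeds back the
last output plus one, a slice started at cycle 2 reproduces exactly the outputs the
full-length simulation would have produced.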

2.4 Differences between Distributed Simulation and MULTES

MULTES [26] [19] offers an interesting alternative technique for parallel
simulation. There are similarities and fundamental differences between MULTES and
PDES [27] [35] [36] [8] for HDL simulation, which we discuss briefly in this section.

MULTES divides the simulation time into multiple time slices, each of which is
simulated independently. PDES, on the other hand, uses spatial partitioning to divide
the design into multiple partitions which are simulated independently. Both MULTES
and PDES use a model at a higher level of abstraction for reference simulation. For
example, both use RTL for parallel functional (zero-delay) gate-level simulation.
Note: from now on, we will use the term functional gate-level simulation to mean
functional (zero-delay) gate-level simulation.

State matching in MULTES is a challenging problem. If the design is transformed
radically (e.g., by re-timing and re-synthesis) between reference simulation and
target simulation, restoring the state of the target simulation is difficult to
impossible [26] [19]. PDES does not suffer from the state matching problem, as each
partition is simulated from the beginning of simulation time.

MULTES cannot overcome the resource limitations imposed by a large design: each
parallel slice simulation must simulate the whole design, regardless of whether the
slice period is large or small. PDES partitions the design and distributes the
partitions to individual simulators; hence, the entire simulation load is divided
into smaller loads, one per partition. MULTES performs checkpointing periodically,
while in PDES the reference simulation is stored at the partition boundary for the
entire simulation time, which increases the amount of dump data on the hard disk for
PDES. Note that MULTES also performs data dumping for testbench forwarding, besides
periodic checkpointing. In this work, we will try to eliminate this dumping for PDES,
so that the reference simulation (RTL) is co-simulated with the target simulation
(functional gate-level).

We should emphasize that MULTES is not suited for multi-core architectures because
of the uniform memory requirements of each slice: for large designs, it does not
scale well with the multi-core architecture. PDES scales well because it partitions
the design, and hence the memory requirements of each partition are lower than those
of the original design.

Finally, MULTES uses a complex tool chain and set of techniques, including: PLI for
checkpointing, data dumping, and restoring; Synopsys Formality or a similar tool for
state matching; the ABC tool for assisting state matching by detecting signal
correspondence; the Cadence Encounter tool for finding clock domain crossings; and
LEX and YACC for parsing the SDF file for abstract delay annotation (ADA). Further,
some of the steps in MULTES (such as ADA) are not fully automated and require manual
effort. In contrast, PDES applied to parallel HDL simulation has no such complex tool
chain dependency and integrates seamlessly into the ASIC or FPGA design flow. PDES
has its own challenges, which are addressed in the next chapter.

2.5 Parallel Computer Architecture


2.5.1 Introduction

Parallel, or high-performance, computing is not a new concept. It has long been
used in scientific and engineering communities, where large simulations are run on
clusters of computers. The computation to be performed is partitioned into several
workloads which are simulated independently and in parallel on many machines. The
workloads should be independent of each other, which requires the original
computation to be suitable for parallelism; scientific simulations generally are.
Parallel programming is becoming mainstream because of advances in computer
hardware [37].

Today, hardware manufacturers are integrating more and more CPUs on a single
processor chip; the resulting chip is called a multi-core processor. Multi-core
systems come in various configurations: a single multi-core processor, a
shared-memory system consisting of several multi-core processors, or a cluster of
multi-core processors connected via a network. It was predicted that by the year 2015
a typical Intel processor would have dozens to hundreds of cores, with some cores
dedicated to, say, graphics, encryption, networking, or DSP. This type of multi-core
system is called a heterogeneous multi-core [37].

Having many cores available potentially increases the performance of user
applications, as each application can use more hardware resources. Additionally, the
operating system (OS) should support mapping user applications to the available
hardware cores so the applications can run in parallel. This is known as true
multiprocessing, where each application process gets mapped to a separate processor
core. There is also a need to increase the performance of a single application by
running it on multiple cores. This area is full of challenges, as there is no
automatic conversion of a sequential program into a parallel one. As hardware
advancements continue, there is a dire need to convert existing sequential software
to take advantage of the available compute power; otherwise, much of that compute
power will remain unused [37].

The process of parallel programming starts with formulating an algorithm for a
particular problem. The algorithm is then decomposed into several pieces called
tasks, which are expected to run independently on multiple cores. Dividing an
algorithm into appropriate tasks is often manual and is one of the main challenges
faced by a programmer. The tasks are then assigned to one or more threads in a
parallel programming language, e.g., pthreads or OpenMP in C/C++; this step is called
scheduling. The later assignment of threads to cores is called mapping. The tasks of
an algorithm can be independent or dependent; in the latter case, tasks need to
follow a certain order due to dependencies and may not execute concurrently. Tasks
may also need to communicate with each other, and hence synchronization between the
tasks is necessary, so that tasks are not writing to the same memory location
simultaneously, or reading it before a pending write takes place. Synchronization
depends heavily on the memory organization of the underlying hardware. Thus, it is
imperative to know the underlying hardware configuration for successful parallel
programming.

Shared memory and distributed memory are the two main memory organizations in
multi-core machines. Shared memory allows uniform global access by all processor
cores; information exchange between cores is done through shared memory locations.
This sharing must be synchronized: a core must not read from a memory location where
a write is pending, and cores must not write simultaneously to one memory location.
In distributed memory machines, each processor core has a private memory which can
only be accessed by the core attached to it; information exchange between cores is
done through explicit communication such as message passing. Another form of
synchronization, barrier synchronization, is available for both shared and
distributed memory machines. In barrier synchronization, all processes on all cores
have to wait at a barrier point until every other process has reached that barrier;
only then can they continue execution past the barrier.
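Barrier synchronization can be illustrated with Python's `threading.Barrier`
(standing in for, e.g., `pthread_barrier_wait` or `MPI_Barrier`); the three-worker
setup below is purely hypothetical.

```python
import threading

N_WORKERS = 3
barrier = threading.Barrier(N_WORKERS)   # all workers must arrive
order = []
order_lock = threading.Lock()

def worker(wid):
    # Compute phase: each worker runs independently.
    local = wid * wid
    barrier.wait()          # nobody proceeds until every worker arrives here
    # Post-barrier phase: all compute phases are guaranteed to have finished.
    with order_lock:
        order.append(local)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(order))   # [0, 1, 4]
```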

The performance of a parallel application is measured by its parallel execution
time, which is the maximum of the compute times over all the cores, plus the time
spent on communication and synchronization. This time should be smaller than the
sequential execution time of the application on a single core; otherwise
parallelization is not worthwhile. Speedup is the ratio of the sequential execution
time to the parallel execution time. Efficiency is the ratio of speedup to the number
of cores.
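These three quantities can be computed directly. The sketch below is minimal and
the timing numbers in the usage lines are made up; speedup is the sequential time
divided by the parallel time, and efficiency is speedup divided by the core count.

```python
def parallel_time(per_core_times, comm_sync_time):
    """Parallel execution time: the slowest core plus the time spent on
    communication and synchronization."""
    return max(per_core_times) + comm_sync_time

def speedup(t_sequential, t_parallel):
    """Speedup: sequential execution time over parallel execution time."""
    return t_sequential / t_parallel

def efficiency(t_sequential, t_parallel, n_cores):
    """Efficiency: speedup over the number of cores (1.0 is ideal)."""
    return speedup(t_sequential, t_parallel) / n_cores

t_par = parallel_time([10.0, 12.0, 11.0], 2.0)   # 14.0 seconds on 3 cores
print(speedup(28.0, t_par))                       # 2.0
print(efficiency(28.0, t_par, 3))                 # about 0.67: far from ideal
```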

2.5.2 New Trends in Computer Architecture


Parallel execution of an application strongly depends upon the architecture of the
underlying machine, e.g., the number of available cores, the memory organization,
etc. We discuss how parallelism can be exploited, from single-core machines to
multi-core machines [37].

2.5.2.1 Parallelism on Single Core Machine


• Bit-level Parallelism There are various ways of exploiting parallelism on a
single-core machine. One is to use wider bit widths, i.e., switching from 32 bits
to 64 bits, as 64-bit machines have become pervasive in the last several years.
64-bit computing refers to the datapath width, integer size, and memory address
width being 64 bits. This has also improved the precision of floating-point
numbers.

• Pipelining Before pipelining was introduced, computing was single-cycle: the
CPU could process only one instruction at a time, after which it could process
the next. Pipelining splits instruction processing temporally into multiple
stages, like an assembly line, so that the instruction fetch, instruction decode,
instruction execution, and write-back stages can operate in parallel on multiple
instructions. This staging allows overlapping of instructions: e.g., while one
instruction i1 is in the decode stage, another instruction i2 can enter the fetch
stage. In the next clock cycle, i1 enters the execution stage, i2 enters the
decode stage, and a new instruction i3 enters the fetch stage, and so on.

• Parallelism by many Execution Units There are two ways of achieving this
parallelism: dynamic and static. A dynamic, or superscalar, architecture allows
multiple instructions to be issued simultaneously during a clock cycle by
exploiting the fact that there is more than one functional unit inside a single
CPU core, such as ALUs (arithmetic logic units), FPUs (floating-point units),
load/store units, etc. Superscalar designs rely on hardware to determine which
instructions from a sequential program can be launched simultaneously.

VLIW (very long instruction word) architectures rely on the compiler (software)
to determine which instructions may be executed in parallel; these instructions
are then launched together. In a VLIW processor, each instruction specifies
several independent operations (hence "very long instruction word") that are
executed in parallel by the hardware. The maximum number of operations in a VLIW
instruction is equal to the number of execution units available in the processor.

• Thread- or Process-level Parallelism On a single-core machine, thread- or
process-level parallelism is used to give an application (in the case of
multithreading) or multiple applications (in the case of processes) the illusion
that there are multiple CPUs. In reality the machine has a single CPU core; the
OS time-slices threads or processes so quickly that they appear to run
independently. With multi-core CPUs, this illusion has become a reality.
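The instruction overlap described in the Pipelining bullet above can be tabulated
with a small sketch: an ideal four-stage pipeline with no hazards, purely
illustrative.

```python
STAGES = ["IF", "ID", "EX", "WB"]   # fetch, decode, execute, write-back

def pipeline_schedule(instructions):
    """Return, for every clock cycle, which stage each instruction occupies
    in an ideal (hazard-free) pipeline."""
    total_cycles = len(instructions) + len(STAGES) - 1
    schedule = []
    for cycle in range(total_cycles):
        row = {}
        for i, instr in enumerate(instructions):
            stage_index = cycle - i        # instruction i enters at cycle i
            if 0 <= stage_index < len(STAGES):
                row[instr] = STAGES[stage_index]
        schedule.append(row)
    return schedule

for cycle, row in enumerate(pipeline_schedule(["i1", "i2", "i3"])):
    print(cycle, row)
```

At cycle 2 the schedule shows i1 in EX, i2 in ID, and i3 in IF, exactly the
overlap described in the text.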

2.5.3 Classification of Parallel Architectures

We discuss the classification most relevant to parallel programming.

2.5.3.1 Single-Instruction, Multiple-Data (SIMD)

In SIMD, multiple processing elements execute the same instruction on different
data sets. Each processing element has private access to (shared or distributed)
data memory, but there is a single program memory from which a single instruction is
fetched and dispatched to the multiple processing elements.

2.5.3.2 Multiple-Instruction, Multiple-Data (MIMD)


MIMD is similar to SIMD except that each processing element has a separate program
or instruction memory (shared or distributed). At each step, each processing element
loads a separate instruction and its data, executes the instruction, and writes the
result back to data memory. Hence, processing elements work asynchronously with
respect to each other.

2.5.3.3 Symmetric Multiprocessing (SMP)

SMP consists of one or more processing elements with access to a common memory. A
program is parallelized by having its execution take different paths on the various
processing elements: the program starts running on one processing element and, as
soon as a parallelizable part of the program is encountered, the execution is split
across multiple processing elements. In the parallel portion, each processing element
runs the same program but with a different data set. SMP faces serious scalability
challenges as the number of cores grows.

2.5.3.4 Non-uniform Memory Access (NUMA)


NUMA addresses the scalability issue of SMP by adding local memory to groups of
cores. Multiple cores are coupled together with a local memory, as shown in
Figure 2.3: the cost of access to local memory is lower than the cost of access to
remote memory. This architecture allows scaling to many cores.

Figure 2.3. NUMA hardware configuration

2.5.4 Memory Organization in Multi-core Machines

There are two views of memory that need to be considered: the physical memory view
and the programmer's memory view. Physically, there are computers with shared memory,
such as multiprocessors, and computers with distributed memory, such as
multicomputers. From the programmer's view, memory organization can be distinguished
between shared memory machines (SMM) and distributed memory machines (DMM). Note that
the programmer's view need not be consistent with the actual physical memory: for
example, the programmer can treat the memory as shared while the physical memory is
distributed.

2.5.4.1 Distributed Memory Machines (DMM)

A DMM consists of a number of processing elements (also known as nodes) and an
interconnection network connecting them. A node is an independent entity consisting
of a processing element, local memory, and possibly I/O. The local memory is private
to each node. When a node needs data from some other node, an explicit message
passing protocol, e.g., the Message Passing Interface (MPI), is used to fetch that
data from the other node. A Direct Memory Access (DMA) controller can be used to
offload this communication from the processing element. An example of a DMM is a
cluster of computers on a local area network (LAN).

2.5.4.2 Shared Memory Machines (SMM)

An SMM is a computer with a single physical, or global, memory. It typically
consists of several processing units connected to the global memory via an
interconnection network. Since the memory is global, no explicit communication
between processing nodes is required to share data. However, synchronization becomes
necessary, as multiple processing elements can end up reading from or writing to the
same memory location. The parallel programming model for shared memory machines is
based on the execution of multiple threads. A thread is an independent flow of
execution which shares data with other threads through the global memory. It is the
job of the operating system (OS) to map a thread to a processor core.
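In Python, this thread-based shared-memory model looks as follows (a sketch; the
lock serializes the read-modify-write on the shared location so the final count is
deterministic regardless of how the OS maps threads to cores):

```python
import threading

counter = 0                        # global memory shared by all threads
counter_lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        with counter_lock:         # synchronized read-modify-write
            counter += 1

threads = [threading.Thread(target=add_many, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)   # always 40000 with the lock; unpredictable without it
```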

2.5.5 Thread Level Parallelism (TLP)


TLP means running multiple applications so as to use the processing resources of a
multi-core machine efficiently. Each such application can be called a thread, and
this is true multithreading, as each thread gets mapped to a separate processing
core. TLP can also occur within an application, where parts of the application become
threads and execute on multiple cores. Another trend is hyperthreading, where the
hardware presents the OS with the illusion of additional cores in order to use the
processing elements more effectively.

CHAPTER 3

PARALLEL MULTI-CORE VERILOG HDL SIMULATION


BASED ON FUNCTIONAL PARTITIONING

Parallel multi-core Verilog HDL simulation based on functional partitioning is
performed by running each partition (also called a sub-design) on a separate logical
processor (LP). Functional partitioning divides the functionality of the original
design into sub-functionalities which are then executed on different LPs. Figure 3.1
shows a design in a traditional event-driven simulation environment, while Figure 3.2
shows the same design in a parallel multi-core simulation environment. Note that
Figure 3.2 shows an ideal case where the two partitions are completely independent
(Partition1 can be simulated without Partition2 and vice versa) and hence can be
simulated separately. This is not the case for most simulations (because of
dependencies between the partitions), an issue addressed later in this chapter.

[Figure: a TestBench driving a DUT composed of Partition1 and Partition2.]

Figure 3.1. Standalone simulation of a design

In this work, we use the parallel multi-core HDL simulation technique based on the
concept of accurate prediction [27] [35] [36] [8]. We use the approach of Li et
[Figure: Partition1 running on CPU1 and Partition2 running on CPU2.]

Figure 3.2. Parallel multi-core simulation of a design

al. [29] to partition the design along hierarchy boundaries, but add a higher-level
predictor model to reduce the synchronization and communication overhead between the
modules.

3.1 Predicting Input Stimulus

It is clear that prediction accuracy is one of the most critical factors in this
approach, as explained in [27] [35] [36] [8]. Nearly 100% prediction accuracy will
give almost linear speed-up even as the number of processor cores increases (within
certain bounds). Hence, we must find a way to obtain accurate prediction data. As
discussed before, the proposed idea is to obtain this data from the results of an
earlier simulation using a higher-level design model. Such a model is typically
available as part of the design refinement from a higher level of abstraction to a
lower one. It is important to realize that the closer the two abstraction levels are
(for the predictor/reference and actual/target simulations), the more accurate the
actual simulation is going to be. For example, prediction data for parallel
functional gate-level simulation can be obtained from register transfer level (RTL)
simulation, and prediction data for parallel gate-level timing simulation can be
obtained from gate-level zero-delay simulation. Both scenarios are depicted in
Figure 3.3. Simulation at a higher level of abstraction can be performed at least 10×
faster than simulation at the lower level. We argue that accurate prediction data can
be obtained by fast simulation using a model at a higher level of abstraction. Also,
as this fast simulation is already an integral part of the design flow, as shown in
Figure 3.3, obtaining the prediction data does not incur any additional simulation
overhead.

Figure 3.3. Parallel multi-core simulation in the ASIC design flow [25]

3.2 Preliminary Results of Predictor

To evaluate the predictor idea, several preliminary experiments were performed.
First, how accurate is a higher-level model (such as RTL) compared to a lower-level
model (such as zero-delay gate-level or gate-level timing)? To answer this, a lower
bound on the prediction accuracy is measured by comparing the values of the registers
in the design during RTL and gate-level timing simulation. Here, register values
saved during RTL simulation serve as prediction data for the gate-level timing
simulation. Table 3.1 shows preliminary experimental results of predictor modeling.
Design registers are chosen for two reasons. First, it is possible that a register
value may not propagate to the module output during simulation; hence, RTL and
functional gate-level simulations can be identical at the module boundary but
inconsistent on register outputs due to unknown signals (X) in the RTL or gate-level
design. Second, the focus is on register values because, at present, the proposed
partitioning strategy for parallel gate-level timing simulation is restricted to the
flip-flop boundary. Of course, not all registers will appear at the partition
boundary. That is why the last column represents just a lower bound on the prediction
accuracy; the actual prediction accuracy is always higher than this lower bound. Even
this lower bound already shows high prediction accuracy (>98% on average) for this
choice of predictor data (RTL).

Table 3.1. Accuracy of RTL predictor for gate-level timing

Design Name    | A: Total registers (A = B + C) | B: RTL vs gate-level timing matches | C: RTL vs gate-level timing mismatches | Lower bound on prediction accuracy
VGA Controller | 1611 | 1584 | 27 | 98.3 %
AC-97          | 2219 | 2156 | 63 | 97.1 %
PCI            | 1773 | 1739 | 34 | 98 %
AES-128        |  530 |  530 |  0 | 100 %

Table 3.2 shows the results of a second predictor-modeling experiment. Here, the
contents of the design registers during functional gate-level simulation and
gate-level timing simulation are compared: the register values saved during
functional gate-level simulation serve as prediction data for the gate-level timing
simulation. Note that moving from RTL to functional gate-level improves the accuracy
of the predictor (>99% on average). In general, the closer the reference and target
simulations are in the abstraction hierarchy, the more accurate the prediction data.

Table 3.2. Accuracy of functional gate-level predictor for gate-level timing

Design Name    | A: Total registers (A = B + C) | B: gate-level 0-delay vs gate-level timing matches | C: gate-level 0-delay vs gate-level timing mismatches | Lower bound on prediction accuracy
VGA Controller | 1611 | 1584 | 27 | 98.3 %
AC-97          | 2219 | 2213 |  6 | 99.7 %
PCI            | 1773 | 1773 |  0 | 100 %
AES-128        |  530 |  530 |  0 | 100 %
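The last column of Tables 3.1 and 3.2 is simply (A − C)/A expressed as a
percentage; a quick check against the table rows above:

```python
def prediction_lower_bound(total_registers, mismatches):
    """Lower bound on prediction accuracy, in percent (Tables 3.1 / 3.2)."""
    return 100.0 * (total_registers - mismatches) / total_registers

# VGA Controller row: A = 1611 registers, C = 27 mismatches
print(round(prediction_lower_bound(1611, 27), 1))   # 98.3
# AES-128 row: no mismatches at all
print(round(prediction_lower_bound(530, 0), 1))     # 100.0
```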

3.3 Quantitative Overhead Measurement in a Multi-Core Simulation Environment

In addition to design partitioning, a big challenge in multi-core simulation is to
minimize communication and synchronization among partitions. Synchronization overhead
is defined as the time spent during simulation to guarantee that there is no
violation of causality among local simulations. It can cause performance degradation
even when the event activities in the partitions have little or no dependency, and it
increases as the number of partitions increases. Communication overhead is defined as
the time spent exchanging data among partitions; both the data bandwidth and the
frequency of communication among partitions affect it. To illustrate the minimization
of these overheads, we explicitly measure the following on a synthetic RTL design:

1. communication + synchronization overhead, and

2. synchronization overhead only.

The base design consists of a 128-bit Ripple Carry Adder (RCA) block and a
testbench feeding stimulus to the adder. To create two or more partitions, the adder
block is instantiated multiple times and chained as shown in Figure 3.4. Figure 3.5
shows the synchronization-overhead measurement setup, in which partitions do not
exchange data with each other; instead, data is generated locally by a predictor (to
be explained in the next section) in each partition. Both single-core and multi-core
versions of the Synopsys VCS simulator were used for these measurements, on a
quad-core Intel machine with 8 GB RAM in a Non-uniform Memory Access (NUMA)
architecture. As shown in Table 3.3, a straightforward application of multi-core
simulation does exploit design-level parallelism to a certain degree, but the speedup
is not high (1.36 and 1.46 for 2 and 3 cores, respectively).

Figure 3.4. Setup for measuring communication and synchronization overhead

Figure 3.5. Setup for measuring synchronization overhead

Table 3.3. Quantitative communication and synchronization overhead measurement

# CPU cores used | # Partitions | Single-core sim time t_sc (sec) | Multi-core sim time, synch+comm t_mc_com+syn (sec) | Multi-core sim time, synch only t_mc_syn (sec) | Speedup t_sc / t_mc_com+syn | Speedup t_mc_com+syn / t_mc_syn
2 | 2 |  52 |  38 | 20 | 1.36 | 1.90
3 | 3 |  85 |  58 | 21 | 1.46 | 2.76
4 | 4 |  72 |  77 | 26 | 0.93 | 2.96
6 | 6 | 104 | 114 | 37 | 0.91 | 3.08
8 | 8 | 142 | 150 | 42 | 0.94 | 3.57
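The two speedup columns of Table 3.3 are derived from the three measured times.
For instance, for the 6-core row:

```python
def table_speedups(t_sc, t_mc_comsyn, t_mc_syn):
    """Recompute the two speedup columns of Table 3.3 from measured times:
    overall speedup over single-core simulation, and the further gain
    obtained by removing the communication overhead."""
    overall = round(t_sc / t_mc_comsyn, 2)        # t_sc / t_mc_com+syn
    comm_gain = round(t_mc_comsyn / t_mc_syn, 2)  # t_mc_com+syn / t_mc_syn
    return overall, comm_gain

# Row for 6 cores: t_sc = 104 s, t_mc_com+syn = 114 s, t_mc_syn = 37 s
print(table_speedups(104, 114, 37))   # (0.91, 3.08)
```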

As the number of partitions is increased, the communication + synchronization overhead
dominates the design-level parallelism and speed degradation takes place (0.93, 0.91
and 0.94 for 4, 6 and 8 partitions, respectively). To see the effect of the
synchronization overhead alone, the communication overhead was eliminated and the
simulation was run using the configuration shown in Figure 3.5. This experiment
demonstrates that such a configuration significantly improves the performance of
multi-core simulation up to a certain number of cores. Specifically, for 2 and 3 cores
the speedup approaches the number of cores. As the number of partitions, n, increases,
synchronization overhead starts limiting the speedup from approaching the theoretical
limit of n. Therefore, for large designs, it is better to group multiple partitions to
limit the synchronization overhead. Figure 3.6 shows the speedup improvement from multi-core

simulation of the RCA128 adder on two cores. The green portion of the plot represents
the degree of parallelism between the two cores. Ideally, we want to increase this
degree of parallelism as much as possible; hence we eliminate the communication
overhead as illustrated in Figure 3.5. The result of removing the communication
overhead is shown in Figure 3.7: the degree of parallelism has almost doubled,
resulting in a speedup that approaches n = 2.

Hence, as expected, we conclude that minimizing (or removing) communication
overhead is beneficial for the performance of multi-core simulation. Synchronization
overhead can be greatly reduced by choosing the right number of partitions. In the
next section, we propose a generic method to minimize communication overhead and
boost the performance of multi-core simulations.

Figure 3.6. Multi-core Simulation of RCA128 on 2 cores (with comm and synch
overhead)

3.4 Prediction-based Multi-Core Simulation


3.4.1 Basic Idea
In principle, parallel HDL simulation with multi-core technology looks more promising
than the original parallel HDL simulation distributed among networked PCs or
multi-processors. In multi-core distributed parallel simulation, inter-module
communication can be accomplished by a straightforward memory read/write.
However, for a large number of cores, this quickly increases the global communication

Figure 3.7. Multi-core Simulation of RCA128 on 2 cores (no comm overhead)

and synchronization overhead between the partitioned modules. NUMA architecture

in particular poses serious problems to parallel event-driven HDL simulation, due to

its sensitivity to the partitioning overhead, caused by non-uniform memory access

cost.

Figure 3.8 shows a conceptual configuration of NUMA, where local memory access
is much faster than remote memory access. For example, memory access from CPU
core 4 to remote memory is much slower than to its local memory. This causes severe
performance degradation in parallel simulation, where extensive communication
and synchronization take place between a large number of local simulations. The
situation becomes worse as the number of processor cores and the number of
partitioned modules for local simulation increase.

In our work we use the approach of [16] [39] to partition the gate-level design
along module boundaries, but add a local (in-partition) higher-level predictor
model to reduce the communication overhead between the partitions. This is based
on a recently proposed technique using accurate stimulus prediction [27] [35] [36] [8].
The key idea of this approach is to predict the input stimulus for each partition and
apply it locally instead of the actual input coming from the other partition. The
predicted input stimulus is obtained by simulating the design at a higher level of
abstraction (such as RTL) than the one being simulated (such as the functional gate
level).

Figure 3.8. NUMA hardware configuration

During the reference simulation, such as RTL, all inputs and output responses of each
partition are stored (dumped) on disk to serve as input stimulus for the actual
gate-level simulation. Note that modern simulators allow a parallel dumping option on
multi-core machines; therefore, parallel dumping does not affect the performance of
RTL simulation and this dumping overhead can be ignored. The other aspect is the
disk space to store (dump) the stimulus, which is ample on current computing
machines. During the gate-level simulation the input stimulus is obtained from the
RTL predictor instead of from the other partitions. Table 3.4 shows the accuracy of
the RTL stimulus as a predictor at the register boundary. A cycle-by-cycle comparison
is done between the RTL and functional gate-level simulations at the clock boundary
for all registers in the design. The Cadence Comparescan tool was used to compare register

values at the clock cycle boundary. The high accuracy of the RTL prediction shows
that it can act as a good signal predictor for gate-level simulation.

Figure 3.9 shows the simulator architecture for two partitions. In this
configuration each gate-level module uses predicted inputs from RTL by default, while
its actual outputs are compared against the predicted RTL outputs. A multiplexer
at each module selects between the predicted inputs and the actual inputs. As long as
the prediction is correct, the remote memory access that imposes communication and
synchronization between local simulations is eliminated. Only when the prediction
fails are the actual input values, coming from the other local simulation, used in
the simulation.
Table 3.4. Accuracy of RTL predictor at the register boundary

Design Name | A: Total # of registers | B: # of RTL vs functional GL register matches | Lower bound on prediction accuracy
VGA Controller | 1611 | 1611 | 100 %
AC-97 | 2219 | 2219 | 100 %
PCI | 1773 | 1773 | 100 %
AES-128 | 530 | 530 | 100 %

Figure 3.9. Gate-level simulation using accurate RTL prediction

3.4.2 Dealing with Mismatches
According to Kim et al. [27], when a mismatch happens each local simulation
must roll back to the nearest checkpoint: a design state saved periodically during
simulation while predicted inputs are being used. When parallel simulation enters
the actual phase (predicted inputs are no longer used), it will try to return to
the prediction phase as soon as possible to attain maximum speedup. However, this
approach has not been confirmed experimentally. We found that checkpointing the
design state during gate-level simulation is very costly in terms of time and space, as it

involves dumping of vast amounts of simulation data to the disk. Moreover, simulation
rollback impedes the performance of parallel gate-level simulation. If rollback happens
frequently due to mismatches, performance advantage of prediction-based simulation

is lost. Therefore, in our work we emphasize and concentrate on prediction accuracy

and make a best effort to achieve that. If a mismatch occurs, simulation is paused and

switched back to the original gate-level simulation configuration (with its unavoidable

synchronization and communication overhead) by disconnecting the RTL predictor

and rolling back to the last good state provided by RTL. Note that the RTL state is

already saved (dumped) during the reference simulation. Then, the original gate-level

simulation is run to the point where mismatch occurred, to determine and debug the

cause of mismatch. After fixing the gate-level netlist the simulation is restarted in

the predictive mode. We already described how to quantify the accuracy of RTL
prediction by running Comparescan against all RTL and gate-level design registers.
Another approach is to run functional equivalence checking between the RTL and gate-
level designs at the partition boundary and apply prediction to only those signals
that exist in both the RTL and gate-level netlists. Note that functional equivalence
checking is typically performed earlier in the design cycle, so there is no additional
overhead introduced by this process. If RTL and gate-level designs are identical at the

partition boundary, the communication between the partitions, as shown in Figure 3.9, can
be eliminated using the RTL predictor. Thus, the two simulations can run independently.

3.5 Architecture of Prediction-based Gate-level Simulation


The effect of running the RTL model of the entire design in every partition to act as
predictor for local gate-level simulation, as described in [27], has not been
quantitatively measured in practice. We ran a series of experiments and found that this
approach is prohibitively expensive both in terms of memory and instrumentation.

Instead, we propose running only the required portion of the RTL design in
every partition (the portion of RTL that provides stimulus to a given partition and
compares the response of that partition). Note that this stimulus and response for
each partition is already saved during the original RTL simulation. Figure 3.10 shows
the architecture of local simulation for a gate-level design partitioned into four blocks.

3.6 Experiments on Real Designs


We measured the performance of gate-level simulation of three Opencores [32]
designs: AES-128, JPEG and 3DES. Table 3.5 shows the simulation performance on a
single-core simulator. The designs were synthesized with Synopsys Design Compiler
using a TSMC 65nm standard cell library. Single-core and multi-core versions of the
Synopsys VCS simulator were used to simulate all gate-level designs on an octa-core
Intel CPU with NUMA architecture. Two partitioning schemes were explored. The first
is static partitioning based on the area of the synthesized logic: module instances,
weighted by their synthesized area, are grouped to form two or more partitions. The
second partitioning scheme is a dynamic one based on RTL simulation profiling. In
this scheme, RTL simulation of the design is run with the profiling option to find the
most time-consuming module instances; these module instances then become partitions
in the gate-level simulation. One could also run a short gate-level simulation with
the profiling option to find the time-consuming module instances. It turned out that
static partitioning hardly improved simulation performance and hence was not used
for further experiments. Tables 3.6, 3.7 and 3.8 show the performance improvements
of AES-128, JPEG and 3DES with parallel simulation.

Figure 3.10. Architecture of parallel GL simulation using accurate RTL prediction
Tables 3.6, 3.7 and 3.8 show that prediction-based parallel gate-level simulation
improves on the performance of the original parallel gate-level simulation by removing
the communication overhead between the partitions. These tables echo our findings,
presented in Section 3.3, that it is worth removing communication overhead, and that
the synchronization overhead increases with the number of partitions. These results
also show the right number of partitions (3 for AES, 2 for JPEG, and 3 for 3DES) as a
point beyond which the synchronization overhead reduces speedup from approaching
Table 3.5. Single core simulation performance

Design Name | Synthesized area in NAND2 | Single-core GL sim time T1 (min)
AES-128 | 18400 | 160
JPEG | 968788 | 183
3DES | 96650 | 254
VGA | 144189 | 612
PCI | 20709 | 17
AC97 | 53140 | 4

Table 3.6. Multi-core simulation performance of AES-128

AES-128 (# of partitions) | Partitioning scheme | MC sim T2 (min) | MC sim pred T3 (min) | MC sim speedup (T1/T2) | MC pred sim speedup (T2/T3)
2 | area-based | 192 | 149 | 0.83 | 1.28
2 | instance-based | 165 | 102 | 0.96 | 1.61
3 | instance-based | 125 | 50 | 1.28 | 2.51
4 | instance-based | 142 | 62 | 1.12 | 2.29
6 | instance-based | 144 | 69 | 1.11 | 2.08
8 | instance-based | 139 | 75 | 1.15 | 1.85

Table 3.7. Multi-core simulation performance of JPEG encoder

JPEG encoder (# of partitions) | Partitioning scheme | MC sim T2 (min) | MC sim pred T3 (min) | MC sim speedup (T1/T2) | MC pred sim speedup (T2/T3)
2 | area-based | 180 | 160 | 1.01 | 1.14
2 | instance-based | 110 | 57 | 1.63 | 1.93
3 | instance-based | 74 | 28 | 2.47 | 2.64
4 | instance-based | 69 | 25 | 2.65 | 2.76
5 | instance-based | 70 | 26 | 2.61 | 2.69

Table 3.8. Multi-core simulation performance of Triple DES

3-DES (# of partitions) | Partitioning scheme | MC sim T2 (min) | MC sim pred T3 (min) | MC sim speedup (T1/T2) | MC pred sim speedup (T2/T3)
2 | instance-based | 248 | 227 | 1.02 | 1.10
3 | instance-based | 254 | 124 | 1.0 | 2.04

the theoretical limit, the number of CPU cores. The JPEG encoder is a design with
relatively little communication between partitions to begin with. In this case,
removing communication overhead improves simulation performance only slightly;
nevertheless, the speedup approaches the number of cores for n = 2.

3.7 Dealing with Resynthesized and Retimed Designs


Code changes, synthesis, and various optimizations can transform the gate-level netlist
to the point that the RTL and gate-level netlists may not be 100% pin-compatible at the
block- or module-level boundary. To account for this fact, we assume that RTL prediction
can only be used for 50% - 80% of the gate-level signals at the partition boundary;
for those signals, RTL can act as a signal predictor. To find out which RTL signals
can be used as predictors for gate-level simulation, equivalence checking can be used.
We used the Synopsys Formality equivalence checking tool for this purpose. Note that
functional equivalence checking is typically performed earlier in the design cycle,
so no additional overhead is introduced by this process. Also, as mentioned in
Section 3.4, one can run the Cadence Comparescan tool to find equivalent pins between
the RTL and gate-level netlists. Table 3.9 shows the performance of the benchmarks
with RTL prediction used for 50% and 80% of the signals during gate-level simulation.

Table 3.9. RTL prediction-based Multi-core functional GL simulation of bi-partitioned
designs

Design Name | Partitioning scheme | MC sim T2 (min) | MC sim 50% pred T3 (min) | MC sim 80% pred T4 (min) | 50% pred speedup (T2/T3) | 80% pred speedup (T2/T4)
AES-128 | instance-based | 165 | 125 | 110 | 1.32 | 1.50
JPEG | instance-based | 110 | 94 | 73 | 1.17 | 1.50
3-DES | instance-based | 254 | 175 | 148 | 1.45 | 1.71

3.8 Conclusion

With the increased presence of multi-core processors, most high-performance
workstations and PCs have adopted the advanced NUMA memory architecture. We
conducted a series of experiments showing that a straightforward application of
multi-core simulation on such an architecture does not bring the expected improvement
in simulation performance. This is mostly due to the communication and synchronization
activity performed by the simulators. To this end we presented a solution that greatly
reduces communication and synchronization overhead in distributed event-driven
functional gate-level simulation on multi-core NUMA machines. It is achieved by
performing simulation with a highly accurate stimulus prediction that comes from a
higher-level (in this case, RTL) model. Apart from eliminating the communication
overhead between partitions using the predictor, choosing a small number of partitions
also reduces the synchronization overhead. The proposed technique is generic and works
independently of the partitioning scheme. Further, the performance cost of dumping can
be ignored, as new simulators have the option of parallel dumping on multi-core machines.

3.9 Appendix A: Profiling

In this section, we show the simulation profiles of the Opencores [32] benchmarks.
The profiling shows which benchmarks are good candidates for multi-core simulation and
which are not. We used the Cadence Incisive 13.1 simulator for profiling the
benchmarks. The following tables show the simulation profiles of the benchmarks.
Tables 3.10, 3.11 and 3.12 show that these benchmarks have good inherent parallelism,
marked by low testbench activity and high design activity. The tables also show which
modules are most active; these are ideal candidates for multi-core simulation. For
example, from Table 3.10, aes_sbox can be simulated on one CPU core and
aes_key_expand_128 on the other.

On the other hand, Tables 3.13, 3.14 and 3.15 show designs with low inherent
parallelism, marked by high testbench activity and low design activity. These designs
are not good candidates for multi-core simulation, and multi-core simulation of such
benchmarks can result in speed degradation, as will be shown in Appendix B.


Table 3.10. Simulation profile of AES-128 benchmark

Most Active Module | % Activity
aes_sbox | 24.2
aes_key_expand_128 | 12.4
testbench | 3.9
aes_rcon | 2.8
simulation overhead | 9.8

3.10 Appendix B: Simulation Plots

This section shows simulation plots of the benchmarks, confirming the results of
Appendix A. The plots of AES, triple DES and JPEG show parallel activity which is
exploited by the multi-core simulator; the other benchmarks show little parallel
activity. The conventions for interpreting the various segments of these graphs
are as follows [38]:

• Any information about the master partition (which contains the testbench) starts
with M. Any information related to slave partitions (design partitions other than
the testbench) starts with P or S.

Table 3.11. Simulation profile of Triple DES benchmark

Most Active Module | % Activity
key_sel | 21.3
des3 | 12.5
crp | 11.7
des | 7.7
testbench | 10.5
simulation overhead | 7.4
sbox1 | 3.7
sbox2 | 3.6
sbox3 | 3.4
sbox4 | 3.3
sbox5 | 3.6
sbox6 | 3.4
sbox7 | 3.6
sbox8 | 4

Table 3.12. Simulation profile of JPEG benchmark

Most Active Module | % Activity
y_huff | 18.1
cr_huff | 17.9
cb_huff | 17.7
y_dct | 8.5
cb_dct | 7.7
testbench | 4.3
simulation overhead | 1.4
cr_dct | 7.6
ff_checker | 6.6
fifo_out | 5.9
RGB2YCBCR | 1.2

Table 3.13. Simulation profile of PCI benchmark

Most Active Module | % Activity
testbench | 62.8
simulation overhead | 4.8
pci_target32_sm | 3.5
pci_out_reg | 2.9
pci_target32_interface | 2.4
pci_unsupported | 2.2
pci_bridge32 | 2
WB_MASTER_BEHAVIORAL | 2
pci_pci_decoder | 1.8

Table 3.14. Simulation profile of VGA benchmark

Most Active Module | % Activity
testbench | 36.8
simulation overhead | 25.5
vga_fifo | 13.8
vga_col_proc | 7.5
vga_fifo_dc | 4
vga_pgen | 3.2
vga_wb_master | 2.7

Table 3.15. Simulation profile of AC97 benchmark

Most Active Module | % Activity
testbench | 48.2
simulation overhead | 23
ac97_soc | 8.3
ac97_rst | 4.4
ac97_codec_sout | 1.6
ac97_codec_sim | 1.3

• The M1 segment in the leftmost column accumulates the time spent by the master
process executing its events. This time does not run in parallel with the slave
processes, but runs sequentially by itself. It should be small relative to the S1
times.

• The M2 segment in the leftmost column accumulates the time spent by the master
process waiting for all slaves to communicate their synchronized value changes for
the delta. This time should be as large as possible.

• The M3 segment in the leftmost column accumulates the time spent by the master
process propagating the value changes received during the M2 segment. This time,
like M1, does not run in parallel with the slave processes and should be as small
as possible.

• The M4 segment in the leftmost column accumulates the time spent by the master
process sending updated port signal values and next-time information to each of the
slave processes. This time should be as small as possible.

• The S1 segments in the slave columns accumulate the time spent by the slave
processes executing their respective events. These times have the potential of
running in parallel with all the other S1 slave times and should be large relative
to the M1 and S3 times.

• The S2 segments in the slave columns accumulate the time spent by the slave
processes sending updated port signal values and next-time information to the
master process. These times should be as small as possible.

• The S3 segments in the slave columns accumulate the time spent by the slave
processes waiting for the master to send its updated port signal values. These
times should be as small as possible.

Figure 3.11 shows that the parallel activity in the slave partitions is not uniform
and the simulation performance is low. It takes 192 minutes to simulate AES-128,
which is worse than the single-core simulation time of 160 minutes. Figure 3.12 shows
the CPU utilization during this simulation: approximately 130% out of 200%, which is
not that high. Ideally this ratio should be close to 200% for a bi-partitioned design
running on two CPU cores.

Figure 3.11. Bi-partitioned (area-based) AES-128 multi-core simulation time

Figure 3.13 shows another simulation of the same design, where partitioning is done
based on the number of module instances and the number of partitions is increased
from two to three. It shows that the parallel simulation activity in all slave
partitions is uniform and the simulation performance is much better than in the
earlier case. It takes 125 minutes to simulate AES-128 with this partitioning on a
multi-core simulator. Hence the speedup is 160/125 = 1.28. Figure 3.14 shows the
CPU utilization for this partitioning during simulation: close to 180% out of 200%
for 2 CPUs.

Figure 3.12. Bi-partitioned (area-based) AES-128 multi-core simulation CPU
utilization

Figure 3.15 shows the simulation performance of the JPEG design for area-based
partitioning. It shows that the parallel activity in the slave partitions is very
unbalanced. As a result, the simulation time turns out to be 180 minutes, which is
worse than the single-core simulation time of 167 minutes. Figure 3.16 shows the CPU
utilization for this partitioning: the simulation is utilizing only half (100% out of
200%) of the resources. Ideally the CPU utilization should be close to 200%.

Figure 3.17 shows the simulation performance of JPEG for instance-based
partitioning. It shows that the parallel simulation activity inside the slave
partitions is relatively well balanced. The simulation time is 93 minutes. Hence,
the speedup compared to single-core simulation is 167/93 = 1.79, which is quite
significant. Figure 3.18 shows the CPU utilization for this partitioning: close to
165% out of 200%, which is quite significant.

Figure 3.13. Tri-partitioned (instance-based) AES-128 multi-core simulation time

Figure 3.14. Tri-partitioned (instance-based) AES-128 multi-core simulation CPU
utilization

Figure 3.15. Bi-partitioned (area-based) JPEG multi-core simulation time

Figure 3.16. Bi-partitioned (area-based) JPEG multi-core simulation CPU utilization

Figure 3.17. Bi-partitioned (instance-based) JPEG multi-core simulation time

It is also shown that for CPU-bound applications like AES and JPEG, the speedup
does not increase linearly with the number of cores. This is due to the synchronization
overhead, which increases with the number of partitions. As a result, speedup saturation
is evident in Figures 3.23 and 3.24. This confirms our experimental results tabulated
in Section 3.3.

Figure 3.18. Bi-partitioned (instance-based) JPEG multi-core CPU utilization

Figure 3.19. Bi-partitioned (instance-based) Triple DES multi-core simulation time

Figure 3.20. Tri-partitioned (instance-based) VGA multi-core simulation time

Figure 3.21. Oct-partitioned (instance-based) pci multi-core simulation time

Figure 3.22. Oct-partitioned (instance-based) ac97 multi-core simulation time

Figure 3.23. Multi-core simulation performance of AES-128

Figure 3.24. Multi-core simulation performance of JPEG

3.11 Appendix C: Designs Unsuitable for Multi-core Simulation

In the previous appendices, we mentioned that designs with low design activity
(less computation and more input/output), like VGA, PCI and AC97, lack inherent
parallelism. This makes them unsuitable for multi-core simulation. We tabulate their
multi-core simulation results in this section for completeness of the discussion on
multi-core simulation. Tables 3.16, 3.17 and 3.18 show the simulation degradation
under multi-core simulation.

Table 3.16. Multi-core simulation performance of VGA (T1 = 612 min)

VGA (# of partitions) | Partitioning scheme | MC sim T2 (min) | Speedup (T1/T2)
2 | instance-based | 1643 | 0.37
3 | instance-based | 1591 | 0.38
4 | instance-based | 1796 | 0.34

Table 3.17. Multi-core simulation performance of PCI (T1 = 17 min)

PCI (# of partitions) | Partitioning scheme | MC sim T2 (min) | Speedup (T1/T2)
2 | instance-based | 83 | 0.2
4 | instance-based | 79 | 0.2
6 | instance-based | 99 | 0.17
8 | instance-based | 100 | 0.17

Table 3.18. Multi-core simulation performance of AC97 (T1 = 4 min)

AC97 (# of partitions) | Partitioning scheme | MC sim T2 (min) | Speedup (T1/T2)
2 | instance-based | 48 | 0.08
4 | instance-based | 49 | 0.08
6 | instance-based | 47 | 0.08
8 | instance-based | 65 | 0.06

CHAPTER 4

EXTENDING PARALLEL MULTI-CORE VERILOG HDL SIMULATION PERFORMANCE
BASED ON DOMAIN PARTITIONING USING VERILATOR AND OPENMP

4.1 Introduction
In the previous chapter, we used the Synopsys VCS multi-core simulator [40] to
improve the performance of functional gate-level (zero-delay) simulation. We observed
some speedup for designs having inherent parallelism. We also concluded that
communication, synchronization and design partitioning were barriers to speedup and
scalability. It should be restated that VCS multi-core [40] partitions the design
across multiple CPU cores and allows only this type of partitioning, known as
functional partitioning [14]. In this type of partitioning, the focus is on the
computation that needs to be performed rather than the data that is input to the
computation. The original computation is partitioned into different sub-computations
that are performed in parallel.

In contrast, the partitioning scheme that relies on partitioning the data is called
domain partitioning [14]. In this chapter, we shall explore this type of partitioning.

4.2 Simulator Internals


Commercial simulators like Synopsys VCS [40] and Cadence Incisive [1] are
proprietary and do not allow the end user to look into the simulator's inner workings.
Tweaking commercial simulators from the inside is almost impossible. Nevertheless, a
simulator simulates a design in three stages [4]:

1. Compilation;

2. Elaboration; and

3. Execution.

During the compilation stage, the HDL design is subjected to macro preprocessing
and syntax error checking. After successful completion of preprocessing and error
checking, the design is parsed into an internal form, convenient for the next stage
of processing but not visible to the user.

In the elaboration stage, the internal parsed representation of the HDL source
is expanded starting from the root, or top-level, module. The hierarchy of the HDL
design is traversed and instantiations of submodules are replaced by the actual
modules, recursively, all the way down to the primitive level. After optimizations
such as dead-code elimination and constant propagation, if any, the design is ready
for the next stage.

In the execution stage, the design, still invisible to the user, is passed to a
code generator that emits code in C/C++ or a similar language, which can be turned
into an executable by a compiler such as the GNU C/C++ compiler [3]. Figure 4.1
depicts the inner workings of a simulator.

The Synopsys VCS [40] simulator internally converts the HDL design into C/C++
code and then compiles it using the GNU C/C++ compiler. This can be verified by
simulating a design and examining the simulation log, which can be redirected to a
file or read directly from the screen. The existence of the csrc directory, created
whenever a VCS simulation is run, also proves the point. The user can even build the
simulation executable by entering the csrc directory and running the command make
product. However, tweaking the C/C++ code generated by VCS is difficult because of
its cryptic nature and external library dependencies that are not visible to the user.

Figure 4.1. HDL simulator internals

In order to overcome the aforementioned difficulties in tweaking simulator
internals, we chose the open-source simulator Verilator [41], which translates Verilog
HDL into C/C++ code and then compiles that C/C++ code to generate the simulation
executable. Verilator has gained a lot of popularity and is being used across the EDA
industry by major companies. Besides being open source and free, it is extremely fast
compared to commercial simulators. Details about Verilator's performance, pros and
cons can be found at [41].

4.3 Parallelizing using OpenMP


Open Multi-Processing (OpenMP) [7] is an application programming interface
(API) for parallel programming of shared-memory machines using C/C++ or Fortran.
It is relatively easy to perform parallel programming with OpenMP, as its syntax is
simple and only a few changes are required to convert a serial program into a
parallel one. Its major competitors are:

1. POSIX threads (Pthreads), which require fully manual effort for parallel
programming.

2. Message Passing Interface (MPI), which is primarily used for distributed-memory
systems.

Our goal is to perform parallel HDL simulation by domain partitioning using


OpenMP. Figure 4.2 shows how to extend HDL simulation by adding parallelization.

Figure 4.2. Extending Verilator for parallel programming

4.4 Results

It turns out that the single-core simulation performance of Verilator is much better
than that of commercial simulators like Synopsys VCS. This performance can be
further improved by adding parallelization using OpenMP. The combination of the two
yields a very fast parallel HDL simulation flow capable of handling RTL and
functional gate-level (zero-delay) designs. Tables 4.1, 4.2, 4.3 and 4.4 show the
performance of AES-128 and RCA-128 RTL and functional gate-level simulations,
respectively. Figures 4.4 and 4.3 compare the speedups of RTL and GL0 simulation for
the RCA-128 and AES-128 designs.

Table 4.1. RTL simulation of AES-128 with 65000,00 vectors using Verilator and
OpenMP

Number of threads | Wall clock time (sec) | Speedup
1 | 367 | 1
2 | 186 | 1.97
3 | 126 | 2.91
4 | 128 | 2.86
6 | 117 | 3.13
8 | 111 | 3.30

Table 4.2. Gate-level (zero-delay) simulation of AES-128 with 65000,00 vectors using
Verilator and OpenMP

Number of threads | Wall clock time (min) | Speedup
1 | 126 | 1
2 | 63 | 2
3 | 43 | 2.93
4 | 46 | 2.73
6 | 41 | 3.07
8 | 38 | 3.31

4.5 Dependencies in the Testbench


There are designs where the testbench cannot be partitioned as shown in the previous
section. Such a testbench is reactive: the state of the testbench depends upon

Table 4.3. RTL simulation of RCA-128 with 65000,00 vectors using Verilator and
OpenMP

Number of threads | Wall clock time (sec) | Speedup
1 | 19 | 1
2 | 23 | 0.82
3 | 16 | 1.18
4 | 14 | 1.35
6 | 12 | 1.58
8 | 9 | 2.11

Table 4.4. Gate-level (zero-delay) simulation of RCA-128 with 65000,00 vectors
using Verilator and OpenMP

Number of threads | Wall clock time (min) | Speedup
1 | 11 | 1
2 | 6 | 1.83
3 | 3 | 3.66
4 | 4 | 2.75
6 | 3 | 3.66
8 | 2.8 | 3.92

[Plot: RTL and GL0 speedup vs. number of threads (1-8) for RCA-128.]

Figure 4.3. Speedup of RCA-128 with Verilator using OpenMP

[Plot: RTL and GL0 speedup vs. number of threads (1-8) for AES-128.]

Figure 4.4. Speedup of AES-128 with Verilator using OpenMP

the state of the DUT. We experimented with such a design to see how its performance
degrades when simulated in parallel. We took the AES-128 design and configured it
such that one of its outputs feeds back into one of its inputs. This creates a
dependency: one cannot encrypt two plaintexts in parallel because the second
plaintext needs the output of the first. It was observed that, despite the
dependencies, the performance was no worse than that of a single-threaded simulation.
Hence, in the presence of dependencies, OpenMP still keeps performance comparable to
single-threaded simulation. Note that this is not the case with functional
partitioning, where dependencies cause performance degradation worse than running a
single-core simulation.
Figures 4.5 and 4.6 compare the single-core simulation performance of Verilator
and VCS at RTL and functional gate level. These figures show that Verilator beats
VCS by a huge margin and appears to be the best vehicle for parallel simulation. We
also extended Verilator to make it multi-core using OpenMP. Figure 4.7 compares the
multi-core performance of Verilator and VCS for the AES-128 design, showing that
Verilator performs much better than VCS in multi-core simulation as well.

[Bar chart: single-core simulation time (minutes) of Verilator vs. VCS for the
AES-128 and RCA-128 RTL designs.]

Figure 4.5. Performance comparison of Verilator and VCS at RTL

[Bar chart: single-core GL0 simulation time (minutes) of Verilator vs. VCS for the AES-128 and RCA-128 gate-level designs]

Figure 4.6. Performance comparison of Verilator and VCS at functional gate-level

[Bar chart: multi-core simulation time (minutes) of Verilator vs. VCS for AES-128 at RTL and at GL0]

Figure 4.7. Multi-core performance comparison of Verilator and VCS at RTL and
functional gate-level for AES-128

CHAPTER 5

ACCELERATING RTL SIMULATION IN TEMPORAL DOMAIN

Simulation of the Register transfer level (RTL) model is one of the first and manda-
tory steps of the design verification flow. Such a simulation needs to be repeated often
due to the changing nature of the design in its early development stages and after

consecutive bug fixing. Despite its relatively high level of abstraction, RTL simulation is a very time-consuming process, often requiring nightly or week-long regression runs.

In this chapter, we propose an original approach to accelerating RTL simulation that

leverages parallelism offered by multi-core machines. However, in contrast to tradi-

tional parallel distributed RTL simulation which distributes simulation to individual

processors, the proposed method accelerates RTL simulation in temporal domain by

dividing the entire simulation run into independent simulation slices, each to be run

on a separate processor core. It is combined with fast simulation in C/C++ or a higher-level language that provides the required initial state for each independent simulation slice. This chapter describes the basic idea of the method and provides experimental results showing its effectiveness in improving RTL simulation performance.

RTL simulation is used to verify the functionality of the RTL design. Because the design is at an early stage of the design flow, the RTL description may keep changing, to accommodate enhancements or to fix bugs caught during RTL simulation. Hence, RTL simulation is mandatory, and it is done as exhaustively as possible using directed and constrained-random simulation. RTL regressions are run on a nightly or weekly

basis to keep RTL in a bug-free state. Depending upon the size and complexity of
the design, RTL regression may take a few hours to several weeks to run. It should
be noted that RTL simulation is much faster than gate-level functional (zero-delay)
and gate-level timing simulations. Even then, designers want to simulate RTL faster,
leveraging multi-core machines. In this chapter, we discuss the idea of accelerating
RTL simulation and propose a few approaches that can potentially improve RTL
simulation.

5.1 Introduction
5.1.1 Issues with Co-Simulation

An approach of using design model at a higher level of abstraction for simulation

of a design model at a lower level of abstraction has been already used in industry

[28]. However, its application is limited to the selected portions of the design. For

example, instead of simulating an entire design at the gate level, parts of the design are simulated at the gate level while the rest is simulated at RTL. This co-simulation approach is faster than pure gate-level simulation, but slower than pure RTL simulation. Also, this approach does not parallelize the entire gate-level or RTL simulation. Such methods are applicable to processor designs and to designs that rely on higher-level models, such as an Instruction Set Architecture (ISA). Some designs, such as SoCs, may not have such architectural models, which makes the problem even more difficult.

5.1.2 Issues with Multi-Core Simulators


Recently, commercial EDA tool vendors have introduced multi-core simulators that run on multi-core machines. Unfortunately, these simulators have had limited success because of their high cost, the communication and synchronization overhead mentioned earlier, and their inability to support the Verilog PLI (Programming Language Interface) and new SystemVerilog testbench features.

5.2 Temporal Parallel Simulation


5.2.1 Preliminaries

RTL simulation performance can be improved if the dependencies in RTL simulation are removed. We discuss two types of dependencies:

1. Time dependency: Before simulating the entire RTL design at a particular time
t, the design must be simulated at all times from 0 to t − 1.

2. Spatial dependency: At a particular time t, one component of RTL design de-

pends upon the value from another component of the RTL design.

In this work, we concentrate on removing the time dependency in the simulation of a design. Temporal parallel simulation (TPS) targets the time dependency, while PDES targets the spatial dependency in a design. In TPS, simulation time intervals are made independent by pre-computing the initial state of each time interval. This allows TPS to achieve full parallelism by avoiding the communication and synchronization overhead inherent in PDES.

To provide a correct initial state of each time interval (slice) for parallel RTL simulation, we follow a two-step approach [27] proposed earlier for gate-level simulations.

1. Reference Simulation: the simulation that provides the initial state of each time slice in TPS. Normally, this simulation is much faster.

2. Target Simulation: the simulation of a time slice that uses the initial state provided by the reference simulation. Normally, this simulation is slower than the reference simulation.

The basic idea of TPS is illustrated in Figure 5.1. It shows a fast reference simulation providing the initial state of each slice for the target simulation run. MULTES [27] applied this idea to speed up gate-level timing simulation by using fast RTL simulation as the reference. The initial states were obtained from checkpoints saved during the reference simulation and then restored for the gate-level target simulation. It was speculated [27] that this idea could be used for RTL simulation as well, but the difficulty was to find a suitable higher-level design model, such as ESL (Electronic System Level), that could serve as the reference for RTL simulation. The difficulty comes mostly from solving the state-matching problem between the ESL and RTL models, which makes that approach impractical. Instead, in this work we compute the initial states for the RTL simulation slices, using a higher-level model such as a C/C++ or SystemC simulation, "on the fly" as they are needed by the RTL simulation. This approach has the additional advantage that it avoids saving and restoring the initial states, which would add time and space overhead to the process.

Figure 5.1. Temporal Parallel Simulation (TPS) concept

The number of target simulations that can be run in parallel is determined by the
number of CPU cores available. The theoretical performance of TPS, measured in
total simulation time Ttps can be expressed by Equation 5.1 where


    Ttps = Σ_{i=1}^{n} ( Tref(i) + Ttarget(i) )        (5.1)

• Tref(i) denotes the time to run the reference simulation that provides the initial state for the target simulation of the ith time slice.

• Ttarget(i) denotes the target simulation time for the ith time slice.
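As a small operational reading of Equation 5.1 (illustrative numbers only; the tables in Section 5.4 report the worst-case core time, which is the quantity modeled here):

```cpp
#include <vector>
#include <algorithm>

// Per-slice timings (seconds): reference (C) time and target (RTL) time.
struct SliceTiming { double t_ref; double t_target; };

// Equation 5.1: the cost of slice i is Tref(i) + Ttarget(i).
double slice_cost(const SliceTiming& s) { return s.t_ref + s.t_target; }

// With one slice per core the slices run concurrently, so the wall-clock
// time of the parallel run is bounded by the slowest slice.
double tps_wall_clock(const std::vector<SliceTiming>& slices) {
    double worst = 0.0;
    for (const auto& s : slices) worst = std::max(worst, slice_cost(s));
    return worst;
}

double speedup(double t_traditional, const std::vector<SliceTiming>& slices) {
    return t_traditional / tps_wall_clock(slices);
}
```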

5.2.2 Integration with the current ASIC/FPGA design flow


We should mention that the concept of reference simulation is compatible with the standard ASIC and FPGA design flow, where a design is successively refined from a higher level of abstraction to a lower one. Thus, any simulation at a lower level of abstraction (target simulation) can be performed in parallel using a higher level of abstraction (reference simulation), as proposed in Figure 5.1. In this work, we use C/C++ as the reference simulation to enable parallel RTL target simulation.

We assume that a SystemC, C/C++, or other higher-level model of the design is already available, as many designs are first modeled in C/C++ in the early design phase. Furthermore, there are open-source tools, such as Verilator [41], that can convert an RTL description into an equivalent C/C++ description. Once the C/C++ model of the design is available, there is no need to translate the Verilog testbench into a C/C++ testbench. A C/C++ model can be invoked directly from RTL via the PLI, which is standard practice in the industry [28], as shown in Figure 5.2. Figure 5.2 shows how the testbench invokes the C/C++ model to obtain the initial state of any slice in time.
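The C side of such a setup can be sketched as follows. The function and state names are hypothetical, and the SystemVerilog DPI is used as the call mechanism for brevity (the same C function could equally be wrapped behind a classic PLI task):

```cpp
#include <cstdint>

// Hypothetical architectural state of the toy design in Figure 5.3.
struct RefState {
    uint64_t cycle;  // simulation cycle the state corresponds to
    uint8_t  k;      // flip-flop output
};

// One cycle of the C reference model: the next k is the current output f,
// modeled here as a made-up combinational function of k and the input.
static uint8_t next_k(uint8_t k, uint8_t in) { return (k ^ in) & 1; }

// Exported with C linkage so a DPI-C import
// (import "DPI-C" function ...) or a PLI call can reach it.
extern "C" void ref_state_at(uint64_t target_cycle, uint8_t in,
                             RefState* out) {
    uint8_t k = 0;                        // known reset state
    for (uint64_t c = 0; c < target_cycle; ++c)
        k = next_k(k, in);                // fast, untimed C evaluation
    out->cycle = target_cycle;
    out->k = k;
}
```

The testbench calls this once at the start of a slice and deposits k into the design before releasing the clock.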

Figure 5.2. Temporal RTL simulation setup

5.3 Exploring Circuit Unrolling Option for Parallel Simulation
In addition to parallelizing simulation by dividing it into a number of simulation

slices, we also investigated another direction in speeding up RTL simulation. Namely,

we considered replacing iterative simulation of a single frame by simulation of a fixed

number of frames F of the circuit, forming a larger combinational circuit.

Figure 5.3 [4] shows a circuit whose output f at a given time depends on the value

of output k of the flip-flop. Initially the value of k is 0. The value of f determines

the value k of the flip-flop in the 2nd clock cycle. This value of k in turn determines
the new value of f in the 2nd clock cycle, which then determines the new value of

k for the 3rd clock cycle, and so on. Hence, to determine the value of f in the nth

clock cycle, the value of k needs to be known in the (n − 1)st clock cycle. Sequential
simulation over n clock cycles naturally resolves this problem.

Figure 5.4 [4] shows the circuit in Figure 5.3 unrolled twice. Note the absence of
the flip-flop. The value of j in the first clock cycle provides signal k for the second
cycle, etc. The two circuits are described differently at RTL but they produce identical
values of f in every clock cycle. Note that there is no clock in the unrolled circuit in

83
Figure 5.4, which makes the simulation faster. The verification engineer must create
a virtual clock in the testbench to make sure that input signals are applied at the
appropriate time.
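The iterative-versus-unrolled distinction can be sketched in C++ with a toy single-bit model of the circuit in Figure 5.3 (the combinational function `frame_f` is made up, and the input is held constant for brevity; real testbenches vary the stimulus each cycle):

```cpp
#include <cstdint>

// Toy combinational logic of one frame: output f from state k and input x.
static inline uint8_t frame_f(uint8_t k, uint8_t x) { return (k | x) & 1; }

// Iterative simulation: one frame per loop iteration, with a clocked
// flip-flop capturing f as the next k.
uint8_t simulate_iterative(uint64_t cycles, uint8_t x) {
    uint8_t k = 0, f = 0;
    for (uint64_t c = 0; c < cycles; ++c) {
        f = frame_f(k, x);
        k = f;                      // clocked state update
    }
    return f;
}

// Unrolled-by-2 simulation: two frames per iteration; the flip-flop
// between them becomes a plain wire (j feeds k of the next frame).
uint8_t simulate_unrolled2(uint64_t cycles, uint8_t x) {
    uint8_t k = 0, f = 0;
    for (uint64_t c = 0; c < cycles; c += 2) {
        uint8_t j = frame_f(k, x);  // frame 1 output, no clock needed
        f = frame_f(j, x);          // frame 2 consumes j directly
        k = f;                      // state entering the next pair
    }
    return f;
}
```

Both versions produce the same per-cycle outputs; the unrolled version simply halves the number of loop iterations and clocked updates.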

Figure 5.3. Simple circuit for RTL simulation

Figure 5.4. Simple circuit unrolled twice for RTL simulation

Extending this idea further, the circuit can be unrolled over several time frames F. Unrolling the circuit offers some advantages in simulation, as it replaces the sequential circuit with a combinational one, which can be simulated faster. Furthermore, several cycles of the original circuit can be simulated simultaneously. While the time needed to simulate each set of F time frames will be longer than for a single frame, the number of simulation cycles needed to simulate the design over some simulation time ts will be reduced to ts/F. We experimented with this idea by observing the effect of unrolling the circuit on the simulation speed. Table 5.1 compares the simulation performance of the circuits shown in Figures 5.3 and 5.4 on a single-core machine. It shows that the circuit unrolled twice is 1.2× faster than the original circuit. Results of unrolling over a larger number of frames F will be presented in the next section, together with an analysis of the effect of the simulation slice size on the speedup.

Table 5.1. Performance comparison of iterative and unrolled circuits

# of clock cycles (billions) | Iterative circuit T1 (sec) | Unrolled 2x circuit T2 (sec)
1 | 12 | 10
2 | 24 | 20
3 | 36 | 30
4 | 48 | 40
5 | 60 | 50

5.4 Experiments and Results


5.4.1 Setup

We will now combine the idea of unrolling the circuit over a fixed number of time
frames, F with the parallel simulation scheme described in Section 5.2 and observe
their effect on simulation speedup. We simulated the circuit in Figure 5.3 for unroll factors F = 2, 4, 6, 8, 10, and 12 on a quad-core Intel machine with 8GB RAM. In our experiments we used the Cadence Incisive Verilog simulator; the reference simulation in C is invoked using the Verilog PLI.

5.4.2 Simulation of Small Custom Design Circuit

In the first set of experiments we used the example circuit in Figure 5.3. The circuit was simulated on two CPU cores, using the simulation configuration shown in Figure 5.5. Core 1 simulates RTL for the "odd" slices: 0-i, 2i-3i, etc., where i is a sufficiently large number of clock cycles, while core 2 performs simulation for the "even" slices: i-2i, 3i-4i, etc. The first slice starts with a known initial state and is directly subjected to RTL simulation (for time TRTL). At the same time, core 2 starts simulating the second slice (i to 2i), starting from the initial state at time i. This initial state is provided by fast C reference simulation (Tc). To simulate the next slice (2i to 3i) on the first core, additional processing is needed to provide it with the required initial state. It is composed of two components: i) fast "testbench forwarding" (Tf) to bring the testbench to a state where it is ready to feed the design with the correct stimulus; and ii) the actual C simulation (Tc). While the C simulation time Tc remains constant, the testbench forwarding time Tf increases linearly with the number of time slices, as it must always execute the testbench from the beginning. This makes the number of slices per core an important factor: ideally, we want to keep the sum Tf + Tc much smaller than TRTL to gain speedup over traditional RTL simulation. Figure 5.5 also shows comparators that check that the reference value from the C/C++ simulation matches the actual value from the RTL simulation.
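The slices-per-core tradeoff can be sketched numerically with a toy cost model (illustrative constants, not the measured values from Tables 5.2-5.4): Tf grows linearly with the slice index, so adding slices eventually hurts the later slices more than the narrower RTL portions help.

```cpp
#include <vector>

// Illustrative cost for one core simulating its share of `total_slices`
// slices of a run that takes t_rtl_total seconds as one sequential RTL run.
// t_fwd_per_slice: testbench-forwarding cost per already-elapsed slice;
// t_c_per_slice:   C reference simulation cost for one slice.
double core_time(int total_slices, int core_index, int num_cores,
                 double t_rtl_total, double t_fwd_per_slice,
                 double t_c_per_slice) {
    double slice_rtl = t_rtl_total / total_slices;  // RTL time per slice
    double total = 0.0;
    // Core `core_index` owns slices core_index, core_index+num_cores, ...
    for (int s = core_index; s < total_slices; s += num_cores) {
        double t_fwd = t_fwd_per_slice * s;           // restarts from time 0
        double t_c = (s == 0) ? 0.0 : t_c_per_slice;  // slice 0 needs no C run
        total += t_fwd + t_c + slice_rtl;
    }
    return total;
}

// Wall-clock time of the parallel run = slowest core.
double parallel_time(int total_slices, int num_cores, double t_rtl_total,
                     double t_fwd_per_slice, double t_c_per_slice) {
    double worst = 0.0;
    for (int c = 0; c < num_cores; ++c) {
        double t = core_time(total_slices, c, num_cores, t_rtl_total,
                             t_fwd_per_slice, t_c_per_slice);
        if (t > worst) worst = t;
    }
    return worst;
}
```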

5.4.3 Simulation by varying the Unroll factor (F)

Tables 5.2, 5.3, and 5.4 show that, as the number of frames per simulation cycle (unroll factor F) increases, the simulation speedup improves further, approaching 2 when F = 12 and the number of slices is 4. Note that these tables show the worst-case time reported from the two cores.

Figure 5.6 summarizes these results in a plot for 1 billion clock cycles for a 2-core
machine. Specifically, it shows a family of speedup plots for unroll factors ranging

[Figure: two timelines over slices 0-10; Machine 1 alternates RTL slices (0-2, 4-6, 8-10) with C reference runs, while Machine 2 runs a C reference first and then alternates RTL slices (2-4, 6-8) with C runs]

Figure 5.5. RTL acceleration setup

from 1 to 12, as a function of the total number of slices. Note that for the F = 1 (single-frame) plot, the greatest speedup occurs at 2 slices (one per core) and then drops as the number of slices increases. This is dictated by the added overhead introduced by switching between C and RTL, and by the lower slice granularity in this iterative (single-frame) case. At the same time, the speedup improves locally (around 4 slices) for the cases where the frames are unrolled several times, offsetting this overhead. Figure 5.7 shows the relationship between the speedup and the number of frames F as a family of plots.

Table 5.2. RTL simulation speedup for single-frame circuit

# of clock cycles (billions) | Traditional RTL sim time T0 (sec) | # of slices | Forwarding time Tf (sec) | C sim time Tc (sec) | RTL sim time Trtl (sec) | Speedup T0/(Tf+Tc+Trtl)
1 | 764 | 2 | 0 | 104 | 465 | 1.64
1 | 764 | 4 | 0,47 | 49 | 600 | 1.27
1 | 764 | 10 | 0,19,35,53,75 | 19 | 694 | 1.10
2 | 1492 | 2 | 0 | 197 | 933 | 1.60
2 | 1492 | 4 | 0,92 | 99 | 1200 | 1.24
2 | 1492 | 8 | 0,46,93,140 | 50 | 1242 | 1.20

Table 5.3. RTL simulation speedup for circuit unrolled 2 times.

# of clock cycles (billions) | Traditional RTL sim time T0 (sec) | # of slices | Forwarding time Tf (sec) | C sim time Tc (sec) | RTL sim time Trtl (sec) | Speedup T0/(Tf+Tc+Trtl)
1 | 764 | 2 | 0 | 367 | 328 | 1.09
1 | 764 | 4 | 0,170 | 47 | 409 | 1.22
1 | 764 | 10 | 0,18,38,58,110 | 30 | 446 | 1.09
2 | 1492 | 2 | 0 | 720 | 644 | 1.09
2 | 1492 | 4 | 0,268 | 152 | 836 | 1.18
2 | 1492 | 8 | 0,47,95,227 | 73 | 919 | 1.09

Table 5.4. RTL simulation speedup for circuit unrolled 4 times.

# of clock cycles (billions) | Traditional RTL sim time T0 (sec) | # of slices | Forwarding time Tf (sec) | C sim time Tc (sec) | RTL sim time Trtl (sec) | Speedup T0/(Tf+Tc+Trtl)
1 | 764 | 2 | 0 | 345 | 302 | 1.18
1 | 764 | 4 | 0,150 | 72 | 362 | 1.30
1 | 764 | 10 | 0,16,37,56,107 | 30 | 301 | 1.14
2 | 1492 | 2 | 0 | 650 | 603 | 1.19
2 | 1492 | 4 | 0,245 | 145 | 742 | 1.31
2 | 1492 | 8 | 0,48,98,228 | 72 | 827 | 1.17

Figure 5.6. RTL simulation speedup as a function of number of slices for different
unroll factors.

Figure 5.7. RTL simulation speedup as a function of number of frames for different
slices.

5.4.4 Simulation by varying the number of cores

In this experiment, we vary the number of cores to see their impact on simulation performance. In this configuration, the original simulation time is divided evenly among the cores, so there are as many slices as cores. For example, if the number of cores is 4, the simulation is divided into 4 slices that run simultaneously. This is shown in Figure 5.8. Clearly, the speedup is determined by core 4, which has the slowest run time among all the cores because it spends the most time in testbench forwarding. This issue is addressed in the next section.

Table 5.5 shows the speedup in RTL simulation as a function of the number of cores for the simulation configuration shown in Figure 5.8. Figure 5.9 shows the speedup plot for Table 5.5: the speedup saturates at around 10 cores, so increasing the core count to 12 and beyond is not useful for this design. Figure 5.10 shows the speedup against the number of cores when the circuit is unrolled by 4, 6, and 8 time frames.

Figure 5.8. Parallel RTL simulation across multiple CPU cores

Table 5.5. Effect of varying number of cores on RTL simulation time

# of CPU cores | # of clock cycles (billions) | Traditional RTL sim time T1 (sec) | Parallel RTL sim time T2 (sec) | Speedup T1/T2
2 | 1 | 764 | 435 | 1.75
2 | 2 | 1492 | 988 | 1.51
4 | 1 | 764 | 270 | 2.82
4 | 2 | 1492 | 532 | 2.80
6 | 1 | 764 | 211 | 3.62
6 | 2 | 1492 | 421 | 3.54
8 | 1 | 764 | 187 | 4.08
8 | 2 | 1492 | 375 | 3.97
10 | 1 | 764 | 164 | 4.65
10 | 2 | 1492 | 328 | 4.54
12 | 1 | 764 | 154 | 4.96
12 | 2 | 1492 | 309 | 4.82
16 | 1 | 764 | 150 | 5.09
16 | 2 | 1492 | 292 | 5.10

Figure 5.9. RTL simulation speedup as a function of the number of cores

Figure 5.10. RTL simulation speedup as a function of the number of cores for
different unroll factors

5.5 Multi-core Architecture of Temporal RTL Simulation
We propose an architecture for temporal RTL simulation that exploits the multi-core architecture of the underlying hardware. The basic setup is shown in Figure 5.11. In this architecture, the Electronic System Level (ESL) simulation runs as an independent thread on a CPU core. This thread simulates the design at the ESL level, checkpoints the state, and spawns RTL simulation of a slice on a free CPU core. At the end of each time-slice simulation, the ESL thread checks for horizontal state matching: whether the ESL state at the beginning of slice i+1 matches the RTL state at the end of slice i. If the states match for every time slice i, the ESL is known to be accurately predicting the initial state of each slice i+1. This mode of simulation is called "Prediction Mode", in which the ESL simulation correctly predicts the initial state of each time slice. If, on the other hand, horizontal state matching fails for a slice i+1, the simulation result of slice i+1 is discarded, and slice i+1 is re-simulated using the state from the previous slice i rather than from the ESL. This mode of simulation is called "Actual Mode". The actual mode imposes a re-simulation overhead, but it affects only the slice(s) that experience a state mismatch, not the rest of the simulation. In traditional simulation, the whole simulation needs to be restarted if there is a mismatch or discrepancy.
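The coordinator logic can be sketched sequentially as follows (hypothetical state and step functions, with a mismatch deliberately injected into the ESL model; the real architecture runs the RTL slices on separate cores):

```cpp
#include <cstdint>
#include <vector>

using State = uint64_t;  // stand-in for the full design state

// Hypothetical models: the authoritative RTL step, and a fast ESL step
// that is deliberately wrong at one injected cycle.
State rtl_step(State s) {
    return s * 6364136223846793005ULL + 1442695040888963407ULL;
}
State esl_step(State s, uint64_t cycle) {
    State next = rtl_step(s);
    return (cycle == 5) ? next ^ 1 : next;  // inject a mismatch at cycle 5
}

// Simulate `slices` slices of `width` cycles each. Returns the number of
// slices that had to be re-simulated in "Actual Mode".
int run_tps(int slices, int width, State* final_state) {
    std::vector<State> esl_start(slices + 1), rtl_end(slices);
    // ESL thread: precompute predicted start states of every slice.
    State s = 0;
    for (int i = 0; i < slices; ++i) {
        esl_start[i] = s;
        for (int c = 0; c < width; ++c)
            s = esl_step(s, (uint64_t)i * width + c);
    }
    // RTL slices (conceptually parallel): simulate from predicted states.
    for (int i = 0; i < slices; ++i) {
        State r = esl_start[i];
        for (int c = 0; c < width; ++c) r = rtl_step(r);
        rtl_end[i] = r;
    }
    // Horizontal state matching: re-simulate slice i+1 when the ESL
    // prediction disagrees with the RTL end state of slice i.
    int resims = 0;
    for (int i = 0; i + 1 < slices; ++i) {
        if (rtl_end[i] != esl_start[i + 1]) {  // prediction failed
            State r = rtl_end[i];              // Actual Mode: use RTL state
            for (int c = 0; c < width; ++c) r = rtl_step(r);
            rtl_end[i + 1] = r;
            ++resims;
        }
    }
    *final_state = rtl_end[slices - 1];
    return resims;
}
```

Only the slice whose predicted start state was wrong is redone; the final state still matches a fully sequential run.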

Figure 5.11. Multi-core architecture of temporal RTL simulation

5.5.1 Load Balancing in the Multi-core Architecture

The proposed architecture also provides load balancing. The widths of the time slices simulated on the individual cores need not be identical, and the average number of cores that are busy at any time can be controlled by the ESL thread. As soon as a core is free, it is selected by the ESL thread to simulate the next time slice and is provided with the corresponding initial state. Figures 5.12 and 5.13 illustrate load balancing for the simple circuit shown in Figure 5.3. Figure 5.12 shows simulation of a design on four cores; Tref represents the time to provide the initial state for a time slice to be simulated at RTL. Figure 5.13 shows simulation of the same design on two cores. Note that the width of the RTL time slice in Figure 5.13 is twice the width of the RTL time slice in Figure 5.12. It turns out that the two-core configuration simulates the design faster than the four-core configuration, because the four-core configuration is not load-balanced: core 4 should be assigned the smallest amount of simulation, as it takes the longest time Tref to provide it with its initial state. The two-core configuration does not have this issue. Table 5.6 compares the simulation results. We used Cadence Incisive simulator 13.1 for RTL simulation on a quad-core Intel CPU with 8GB RAM. From this experiment, we conclude that simulating a design on a large number of cores does not necessarily lead to speedup; proper load balancing is necessary to achieve the best possible speedup.

5.5.2 Simulation of an Industry-Standard Design

In the second set of experiments, we applied our parallel RTL simulation methodology to the AES-128 design [32]. Figure 5.14 shows the design configuration used in this experiment.

Figure 5.12. Temporal RTL simulation on four cores

Figure 5.13. Temporal RTL simulation on two cores

Table 5.6. Load Balancing on simple circuit by varying number of cores

# of CPU cores | # of clock cycles (billions) | Traditional RTL sim time T1 (sec) | Parallel RTL sim time T2 (sec) | Speedup T1/T2
2 | 1 | 764 | 435 | 1.75
2 | 2 | 1492 | 988 | 1.51
4 | 1 | 764 | 570 | 1.34
4 | 2 | 1492 | 1280 | 1.16

Figure 5.14. AES-128 design in CBC mode

The 128-bit input vectors are: plain text (PT), key, and initialization vector (IV). The output vector is the 128-bit cipher text (CT). As can be seen, the design is similar in structure to the simple circuit shown in Figure 5.3. To accelerate cipher-text computation, we used a C model of the design together with the RTL to parallelize the computation across multiple cores. In this experiment we used a two-core machine, and the simulation run was partitioned into 5 slices (three on the first core and two on the second), as this offered the best overall simulation performance. Figure 5.15 shows this configuration. The results shown in Table 5.7 indicate that the simulation performance was capped at about 1.7× speedup on the 2-core CPU.

Figure 5.15. AES-128 simulation configuration on two cores

Table 5.7. AES-128 speedup with parallel simulation

# of CPU cores | # of time slices | # of plain texts (millions) | Traditional RTL sim time T1 (sec) | Parallel RTL sim time T2 (sec) | Speedup T1/T2
2 | 5 | 0.1 | 5 | 5 | 1.00
2 | 5 | 1 | 52 | 33 | 1.57
2 | 5 | 10 | 517 | 340 | 1.52
2 | 5 | 100 | 4200 | 2700 | 1.55

5.6 Conclusion
This chapter presented an approach to accelerating RTL simulation targeting multi-core CPUs. It presented a new technique for accelerating RTL simulation based on temporal partitioning of the simulation, using a higher-level model (C/C++) to provide the initial states for the individual simulation slices. We showed that simulation can be accelerated by making intelligent choices in terms of the number of slices and the number of CPU cores, and by unrolling the circuit by a number of time frames per simulation cycle. To the best of our knowledge, this is the first attempt to accelerate RTL simulation using temporal partitioning with a higher-level model (C) targeting multi-core machines.

CHAPTER 6

ACCELERATING GATE-LEVEL TIMING SIMULATION

6.1 Introduction
Traditional dynamic simulation with back-annotation in standard delay format (SDF) cannot be reliably performed on large designs. The large size of SDF files makes event-driven timing simulation extremely slow, as it has to process an excessive number of events. In order to accelerate gate-level timing simulation, we propose a fast prediction-based gate-level timing simulation that combines static timing analysis (STA) at the block level with dynamic timing simulation at the I/O interfaces. We demonstrate that the proposed timing simulation can be done earlier in the design cycle, in parallel with synthesis.

6.1.1 Issues with Simulation

As already mentioned in Chapter 1, the dominant technique used for functional and timing simulation is event-driven HDL simulation [28]. However, event-driven simulation suffers from very low performance because of its inherently sequential nature and the heavy event activity in gate-level simulation. As the design gets refined into lower levels of abstraction, and as more debugging features are added to the design, simulation time increases significantly. Figure 6.1 shows the simulation performance of the AES-128 design [32] at various levels of abstraction with debugging features enabled. As the level of abstraction goes down to the gate or layout level and debugging features are enabled, simulation performance drops significantly. This is due to the large number of events at the gate or layout level, timing checks, and disk accesses to dump simulation data.

Figure 6.1. Drop in simulation performance with level of abstraction + debugging enabled

This work addresses the issue of improving the performance of event-driven gate-level timing simulation by using static timing analysis (STA) as a "timing predictor" at the block level [9]. We propose an automatic partitioning scheme that partitions the gate-level netlist into blocks for SDF annotation and STA. We also propose a new design/verification flow in which timing simulation can be done early in the design cycle using cycle-accurate RTL.

6.2 Hybrid Approach to Gate-level Timing Simulation


6.2.1 Basic Concept
We present a new approach to improving the performance of gate-level timing simulation [9]. The basic idea is to use static timing analysis (STA) as a timing predictor at the block level: the worst-case delay reported by STA, instead of the actual cell delays, is used to annotate block-level timing during simulation. This idea is illustrated in Figures 6.2 and 6.3. Figure 6.2 shows gate-level timing simulation of a design consisting of two blocks, with timing simulation accomplished via SDF back-annotation applied to the entire design. However, for large designs, such SDF back-annotation will negatively impact the performance of gate-level timing simulation.

To improve the performance of gate-level timing simulation, we propose a hybrid approach, shown in Figure 6.3, where only gate-level block2 is SDF back-annotated. Gate-level block1 is analyzed by the STA tool, which reports the maximum delay inside the block. Only this value is back-annotated during simulation, as dsta at the output of block1. This type of timing annotation is termed selective SDF annotation. Note that STA can be performed on gate-level block1 as part of the whole design, or separately if the input/output (I/O) delays are modeled appropriately.

Essentially, block1 is simulated in functional (zero-delay) mode, i.e., without SDF back-annotation, while block2 is simulated with SDF back-annotation. In the case of multiple blocks, the proposed STA-based timing prediction approach can be used for the majority of the blocks to speed up gate-level timing simulation. Designers typically know the timing-critical blocks in a design, where selective SDF back-annotation can be used to quickly verify design timing.
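The difference between the two annotation styles can be sketched with a toy timing model (purely illustrative numbers and structure, not an actual simulator): under full SDF annotation, the output arrival time of block1 accumulates per-gate delays along its path, whereas under the hybrid scheme block1 is evaluated in zero-delay mode and its output is simply shifted by the single STA-reported worst-case delay dsta.

```cpp
#include <vector>

// One gate on a path through block1: its SDF-annotated delay in ps.
struct Gate { double delay_ps; };

// Full SDF annotation: every gate on the path contributes a timed event,
// and the block output arrives after the sum of the path delays.
double block1_output_full_sdf(const std::vector<Gate>& path, double t_in) {
    double t = t_in;
    for (const Gate& g : path) t += g.delay_ps;  // one timed event per gate
    return t;
}

// Hybrid scheme: block1 is simulated in zero-delay mode and its output
// is annotated with the single worst-case delay d_sta reported by STA.
double block1_output_hybrid(double t_in, double d_sta) {
    return t_in + d_sta;  // exactly one timed event, at the block boundary
}
```

Since dsta is the worst-case delay over all paths in the block, the hybrid arrival time is never earlier than the full-SDF arrival for any path, which is what makes the prediction conservative for setup-style checks.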

Figure 6.2. Gate-level timing simulation with full SDF back-annotation

6.2.2 Design Partitioning for Gate level Simulation

Figure 6.3. Hybrid Gate-level timing simulation with partial SDF back-annotation

Partitioning of a gate-level netlist into blocks for SDF annotation and STA is a challenging problem, as the verification engineer may not have the insight to identify timing-critical blocks. Furthermore, partitioning schemes are often done manually. This may

cause a problem when dealing with huge gate-level netlists. Often the gate-level netlist is flattened and the hierarchy is not preserved. We propose a partitioning scheme, based on STA, that is fully automated and works for flat or hierarchical gate-level netlists. This is one of the most important contributions of this chapter.

The main goal of STA is to calculate the slowest (critical) path in the design. One can choose to report not only the most timing-critical path but also the next most critical path, and so on. The STA report then lists these timing-critical path(s) and the associated module instances. See Figures 6.4 and 6.5 for the most timing-critical paths in the VGA and AES-128 designs [32]. Since these paths are timing-critical, one would always want to run SDF back-annotated timing simulation on these module instances to make sure that their timing conforms to the STA results. In brief, one can include all the module instances that lie on the timing-critical path(s) for SDF back-annotation. We call this group of instances Block2, as shown in Figure 6.3. All the other module instances can be considered not timing-critical; these module instances are simulated in functional (zero-delay) mode. This group of instances is called Block1. However, one needs to run STA on Block1 to find its worst-case delay dsta, as shown in Figure 6.3. All of this can be automated in a flow, as shown in Figure 6.6. A sample timing constraint file (tfile) is shown in Figure 6.7 for the AES-128 design [32].
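The instance-extraction step of the flow in Figure 6.6 can be sketched as follows (a simplified model; real STA reports differ per tool, so a production script would first parse the tool's report syntax): every instance on a reported critical path goes into Block2, and every remaining instance falls into Block1.

```cpp
#include <set>
#include <string>
#include <vector>

// Given the instance names appearing on STA-reported critical paths and
// the full instance list of the netlist, split the design into:
//   block2 - instances to simulate with SDF back-annotation
//   block1 - instances simulated in functional (zero-delay) mode,
//            covered by the STA-reported worst-case delay d_sta
struct Partition {
    std::set<std::string> block1;
    std::set<std::string> block2;
};

Partition partition_by_sta(const std::vector<std::string>& critical_path_insts,
                           const std::vector<std::string>& all_insts) {
    Partition p;
    p.block2.insert(critical_path_insts.begin(), critical_path_insts.end());
    for (const std::string& inst : all_insts)
        if (!p.block2.count(inst))
            p.block1.insert(inst);  // not on any reported critical path
    return p;
}
```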

Figure 6.4. Static Timing Analysis (STA) of VGA controller design

Figure 6.5. Static Timing Analysis (STA) of AES-128 controller design

Figure 6.6. Automated partitioning and simulation flow for hybrid gate-level timing
simulation

Figure 6.7. Sample timing constraint file (tfile) for AES-128 design

6.2.3 Integration with the existing ASIC/FPGA Design Flow
Figure 6.8 shows the flow for this approach. The key idea is to capture the peripheral timing of each block via static timing analysis and various estimates derived from time budgeting. As the majority of the design blocks are simulated in functional (zero-delay) mode, except at the block periphery, this should result in a significant speedup compared to simulation with full SDF back-annotation.

To further improve the performance of gate-level timing simulation, the majority of the gate-level blocks can be replaced by their cycle-accurate RTL blocks, with peripheral timing captured via time budgeting or other estimates, as explained next.

Figure 6.8. Proposed flow for hybrid gate-level timing simulation

6.2.4 Early Gate-level Timing Simulation

The concept of early gate-level timing simulation is shown in Figure 6.9, where gate-level Block1 is replaced by its equivalent RTL. Block1 is now simulated at RTL instead of with its gate-level model. The key idea is to perform timing simulation using estimated timing dest early in the design cycle, when not all blocks have been synthesized. The estimated timing can come from time budgeting or from a tool like Synopsys DC Explorer [23]. This is in contrast to the conventional approach, where gate-level simulation is performed later in the design flow, after the synthesis or place-and-route step, when all the detailed delay data is available. Major simulator vendors have already embraced the idea of early timing simulation based on estimated delays, realizing that performing gate-level timing simulation late in the design cycle is prohibitively slow. Verification engineers get around this problem by performing gate-level timing simulation of only the timing-critical blocks with a few test vectors. However, they are not able to perform full-chip timing simulation with a large number of test vectors, which often leaves certain timing bugs undetected. Synopsys has recently announced a new product, DC Explorer [23], based on this same idea of early design exploration: it can perform early synthesis, timing, and other estimates with enough accuracy to start the simulation process early in the design flow. DC Explorer is rapidly gaining adoption in the industry.

Figure 6.9. Early timing simulation using RTL with estimate of peripheral timing

6.3 Experiments
6.3.1 Experimental Setup

We tested the proposed approach by measuring the performance of gate-level timing simulation of several OpenCores designs [32], namely the AES-128, 3-DES, VGA controller, and JPEG encoder designs. We used Cadence Incisive Unified Simulator 13.1 on a quad-core Intel CPU with 8GB RAM. The designs were synthesized with Synopsys Design Compiler using a TSMC 65nm standard cell library. All of these designs except the VGA controller are single-clock designs. Table 6.1 shows essential statistics for these designs.

Table 6.1. Design Statistics

Design      Synthesized Area
Name        (NAND2 equivalents)
AES-128          18400
3-DES            96650
VGA             144189
JPEG            968788

6.3.2 Results

First, we show simulation results for the AES-128 design. We start with SDF annotation of the majority of blocks (to accommodate many timing-critical paths) and then gradually decrease the number of SDF-annotated blocks to one (to accommodate only the worst-case timing path). The module hierarchy of AES-128 is shown in Figure 6.10. Table 6.2 shows the results: a significant speedup over fully SDF-annotated timing simulation can be attained.
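The choice of which blocks keep SDF annotation can be driven by STA slack. The following sketch assumes a hypothetical per-block worst-slack report and a user-chosen margin; neither the block names nor the threshold come from an actual tool flow.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-block worst slack from an STA report: (block, slack in ps).
// Blocks whose worst slack falls below the chosen margin keep full SDF
// annotation; the remaining blocks run 0-delay, with their peripheral
// timing predicted by STA (#dsta).
std::vector<std::string> blocks_to_annotate(
        const std::vector<std::pair<std::string, long>>& worst_slack_ps,
        long margin_ps) {
    std::vector<std::string> keep;
    for (const auto& b : worst_slack_ps)
        if (b.second < margin_ps)      // timing-critical: keep detailed delays
            keep.push_back(b.first);
    return keep;
}
```

Lowering the margin shrinks the annotated set toward the single worst-case path, which is exactly the sweep performed in Table 6.2.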
Figure 6.10. Instance hierarchy of AES-128 design

Table 6.2. Simulation speedup of AES-128 for variable number of blocks in SDF annotation

# of module      Module instances                  Full SDF       Selective SDF
instances in     in 0-delay                        annotated      annotated
SDF annotation                                     timing sim     timing sim     Speedup
(out of 17)                                        T1 (min)       T2 (min)       (T1/T2)
16               test.u0.us00                      172            115            1.49
16               test.u0.u0                        172            84             2.04
15               test.u0.us00 to test.u0.us01      172            110            1.56
13               test.u0.us00 to test.u0.us03      172            100            1.72
9                test.u0.us00 to test.u0.us13      172            77             2.23
7                test.u0.us00 to test.u0.us23      172            56             3.07
1                test.u0.us00 to test.u0.us33      172            37             4.64

The waveforms in Figure 6.11 illustrate the difference between full SDF annotation and selective SDF annotation when only one block (aes_sbox4) is in STA. They show that the signal from selective SDF annotation is delayed more than the SDF-annotated signal, due to the STA delay, but contains no glitches; it therefore has fewer events to process, which makes the simulation faster. Both signals match at the clock cycle boundary. Similarly, Figures 6.12 and 6.13 show the same effect when two blocks (aes_sbox4 and aes_sbox5) and the majority of the aes_sbox blocks, respectively, are in STA.
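The event-count reduction behind this speedup can be illustrated with a small sketch. Assuming a time-sorted transition list, keeping only the last transition per clock cycle models how the STA-delayed, glitch-free signal carries far fewer events than its fully annotated counterpart.

```cpp
#include <vector>

// A value change on a monitored signal (time in picoseconds).
struct Transition { long time_ps; int value; };

// In full-SDF simulation a combinational output may glitch several times
// before settling; the STA-predicted signal carries only the final value
// of each cycle. Given a time-sorted transition list, this keeps just the
// last transition per clock cycle, mimicking that event reduction.
std::vector<Transition> last_per_cycle(const std::vector<Transition>& evts,
                                       long clk_period_ps) {
    std::vector<Transition> out;
    for (const Transition& e : evts) {
        if (!out.empty() &&
            out.back().time_ps / clk_period_ps == e.time_ps / clk_period_ps)
            out.back() = e;            // same cycle: the glitch is dropped
        else
            out.push_back(e);
    }
    return out;
}
```

The filtered stream still agrees with the glitchy one at every cycle boundary, which is the property the simulations are compared on.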

Figure 6.11. Full SDF-Annotated Signal versus Selective SDF-Annotated Signal when one block is in STA (aes_sbox4)

Figure 6.12. Full SDF-Annotated Signal versus Selective SDF-Annotated Signal when two blocks are in STA (aes_sbox4 and aes_sbox5)

In the next set of experiments, all designs were divided into two gate-level blocks, Block1 and Block2, as shown in Figure 6.3. Block2 contains the module instances from the most timing-critical path; here, only one timing-critical path is considered.

Figure 6.13. Full SDF-Annotated Signal versus Selective SDF-Annotated Signal when the majority of the blocks are in STA

The approach has the additional advantage that it validates the result of STA, which depends on manually entered constraints. If the simulation shown in Figure 6.9 exhibits a timing failure, it helps debug the STA constraints. Once the constraints are corrected, STA is run again to provide the new #dsta value. This STA-to-simulation cycle is repeated until all timing failures are debugged and removed from the simulation.

Table 6.3 shows the speedup obtained using our hybrid gate-level timing simulation over full SDF back-annotated gate-level timing simulation.

Table 6.3. Speedup with hybrid gate-level timing simulation

Design     Full SDF annotated     Hybrid timing     Speedup
Name       timing sim T1 (min)    sim T2 (min)      (T1/T2)
AES-128          172                   37             4.64
3-DES            196                   51             3.92
VGA              812                  232             3.50
JPEG             273                   79             3.45

6.4 Verification of Simulation Results


In order to verify the timing correctness of the approach, we propose the following dumping-based flow, shown in Figure 6.14. Note that this is an optional step, used only to verify the proposed simulation approach. In practice, a verification engineer can skip this step to reduce verification time.

Figure 6.14. Verification flow for hybrid gate-level timing simulation

While the testbench can verify the functional correctness of the two simulations, the proposed verification scheme helps verify their timing correctness. For both simulations to be timing correct, the monitored signals from the two simulations should match at the clock cycle boundary. Unfortunately, dumping, as shown in Table 6.4, can drastically reduce simulation performance. Further, the amount of dumping can quickly fill the disk. Therefore, it is recommended that dumping be done for a small time interval rather than for the entire simulation. We used small simulation intervals to verify the timing correctness of the output signals of the designs. The Cadence Comparescan tool was used to compare the dumped signals; it reported the signals to match at the clock cycle boundary. Table 6.4 compares full SDF gate-level timing simulation and the proposed hybrid gate-level timing simulation for all the flip-flops/registers in the VGA and AES-128 designs. The fact that the register values match at the clock cycle boundary throughout the simulation confirms the accuracy of our approach.
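The cycle-boundary check itself is simple to state. A minimal sketch of the comparison follows; it is our own stand-in for what Comparescan reported, not its actual algorithm, and the register names and value vectors are hypothetical.

```cpp
#include <map>
#include <string>
#include <vector>

// Register values sampled at successive clock-cycle boundaries, per signal.
using Dump = std::map<std::string, std::vector<int>>;

// Two runs are accepted as timing-equivalent here if every monitored
// register holds the same value at every cycle boundary, regardless of
// what happened between the edges.
bool match_at_cycle_boundary(const Dump& full_sdf, const Dump& hybrid) {
    if (full_sdf.size() != hybrid.size()) return false;
    for (const auto& kv : full_sdf) {
        auto it = hybrid.find(kv.first);
        if (it == hybrid.end() || it->second != kv.second) return false;
    }
    return true;
}
```

Sampling only at cycle boundaries is what makes intra-cycle glitches, present in the full-SDF run but absent from the hybrid run, irrelevant to the comparison.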

Table 6.4. Accuracy of hybrid gate-level timing simulation at the register boundary

Design     A: Total # of      B: # of regs matching,      Bound on hybrid
Name       regs in design     full SDF vs selective       prediction
                              SDF timing sim              accuracy (B/A)
VGA            1611                 1611                     100 %
AES             530                  530                     100 %

6.5 New Gate-level Timing Simulation Flow


We also propose a design/verification flow in which gate-level timing simulation is performed early in the design cycle, using estimates from time budgeting and/or STA. Tools like Synopsys DC Explorer [23] can provide timing estimates for running gate-level timing simulation. As already mentioned, performing gate-level timing simulation late in the design cycle is prohibitively slow and may force design changes back in the RTL or require an ECO. Furthermore, the ability to perform long full-chip timing simulation in a short amount of time is much welcomed by industry. Figures 6.15 and 6.16 show the traditional and the new simulation flow, respectively. The obvious advantage of the new flow is rapid gate-level timing simulation early in the design cycle, so that timing checks are validated and bugs are caught early on.

Figure 6.15. Traditional simulation flow in ASIC/FPGA design

Figure 6.16. Proposed flow of early simulation in ASIC/FPGA design

6.6 Conclusion and Future Directions

Today, system-on-chip (SoC) designs have become widespread. These designs integrate multiple hardware cores working at different frequencies. Timing simulation of such multi-clock-domain designs is critical. Traditional dynamic simulation with SDF back-annotation cannot be done on such large designs. In addition, event-driven timing simulation is extremely slow, suffers from capacity issues because of large SDF files (exceeding 10GB for small SoC designs), and is generally performed late in the ASIC design cycle, after synthesis or layout.

This chapter proposed an approach to hybrid gate-level timing simulation [9] that makes use of STA and selective SDF back-annotation to accelerate gate-level timing simulation. In this approach, STA acts as a timing predictor for blocks that are run without SDF back-annotation. The approach also validates the result of STA, which depends on manually entered constraints. The proposed approach can be applied to multi-clock-domain designs with clock domain crossings (CDC).

CHAPTER 7

CONCLUSION AND FUTURE WORK

7.1 Conclusion
In the previous chapters, we described three techniques for accelerating HDL simulation at three levels of abstraction, namely:

1. Register transfer level (RTL);

2. Functional gate-level (zero-delay); and

3. Gate-level timing.

We also classified designs into three categories:

1. Designs with inherent parallelism;

2. Designs with lack of inherent parallelism; and

3. Arithmetic circuits.

Common design elements/operations in a hardware design are:

1. Arithmetic and logic operations;

2. Matrix operations;

3. Input/Output (I/O) operations;

4. Filtering/DSP; and

5. Network dataflow operations.

From the above categorization, it is clear that the designs chosen in our experiments encompass almost all operations and design categories. Table 7.1 categorizes the chosen designs into the above categories.

Table 7.1. Classification of HDL designs

               Designs with            Designs which lack
               inherent parallelism    inherent parallelism
ALU ops        AES, 3-DES              RCA, AC97, SHA1
Matrix ops     AES
I/O ops        PCI, VGA
DSP ops        JPEG
Network ops

Together, application of the proposed verification techniques at the three levels of abstraction can bring huge improvements in verification time. Table 7.2 shows the performance improvement at each level of abstraction.

Table 7.2. Speedup at various levels of abstraction

              AES      3-DES    JPEG     RCA-128   VGA
              design   design   design   design    design
RTL           3.3×                       2.11×
GL 0-Delay    3.3×     2.04×    2.76×    3.92×
GL Timing     4.64×    3.92×    3.45×              3.5×

7.2 Performance Gain by Opensource Simulation Software


We explored the opensource simulation software Verilator [41], which translates a Verilog HDL design into C/C++ and then compiles the design into an executable. Being able to use Verilator for performance gain was one of the most challenging and rewarding parts of the thesis. Not only were we able to run faster single-core simulation compared to Synopsys VCS [40], but we were also able to add multi-core simulation capability to Verilator by using OpenMP [7]. To the best of our knowledge, this way of parallelization has not been explored before. We were able to increase the performance of both RTL and gate-level simulations using Verilator with OpenMP. It is worth mentioning that running cost-free simulation software, i.e., Verilator with OpenMP, on a Linux platform like Red Hat [5] or CentOS [2] offers a huge financial advantage for researchers and companies with limited budgets. This work is a contribution toward opensource software, as this thesis has benefitted from opensource simulation software and opencores designs [32].
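The SPMD scheme can be sketched as follows. This is a simplified stand-in, not Verilator's actual API: simulate_chunk represents running a separate instance of the compiled model over one slice of the test-vector stream, and the OpenMP pragma distributes the chunks across cores.

```cpp
#include <vector>

// Stand-in for running one instance of the compiled (Verilator-generated)
// model over a slice of the test-vector stream; returns a failure count.
// A negative vector plays the role of a checker miscompare here.
int simulate_chunk(const std::vector<int>& vectors) {
    int failures = 0;
    for (int v : vectors)
        if (v < 0) ++failures;
    return failures;
}

// SPMD partitioning: split the vectors into independent chunks, run each
// chunk in parallel (the OpenMP pragma is a no-op when compiled without
// OpenMP support), and merge the per-chunk results afterwards.
int run_spmd(const std::vector<int>& all_vectors, int num_chunks) {
    std::vector<int> fails(num_chunks, 0);
    const int n = static_cast<int>(all_vectors.size());
    #pragma omp parallel for
    for (int c = 0; c < num_chunks; ++c) {
        const int lo = c * n / num_chunks;
        const int hi = (c + 1) * n / num_chunks;
        fails[c] = simulate_chunk(std::vector<int>(all_vectors.begin() + lo,
                                                   all_vectors.begin() + hi));
    }
    int total = 0;
    for (int f : fails) total += f;    // merge step
    return total;
}
```

Because the chunks share no state, no inter-thread synchronization is needed during simulation, which is what makes this scheme scale better than design partitioning.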

7.3 Future Work


The work can be extended at all three levels of abstraction addressed in this thesis: time-parallel RTL simulation, gate-level timing simulation, and multi-core RTL or functional gate-level (zero-delay) simulation.

7.3.1 Future Work in Improving Gate-level Timing Simulation

We outline a few directions of research.

The first challenge in timing closure in modern process technologies is the variation between different Process, Voltage, Temperature (PVT) corners. Traditionally, it was sufficient to verify the 8 timing corners (Vmin/Vmax, Pfast/Pslow, Thigh/Tlow). Today, due to increased variation, the number of timing corners has grown. For example, the variations between different layers (transistors, M1/M2, higher metal, etc.) are not correlated, and combinations of fast/slow metal versus fast/slow transistors need to be analyzed. This can be addressed using statistical static timing analysis (SSTA), such as Synopsys PrimeTime VX [6].

This work focused only on setup time violations, as we were dealing with pre-layout verification. There can be hold time violations in any block within a chip, regardless of whether it has critical ("long") timing paths. Running simulations with a reduced SDF file means that hold violations may not be detected. To fix potential hold time violations, it is recommended to start with the proposed hybrid methodology at the post-layout stage and fix any hold violations found there. In the next step, increase the number of blocks in SDF annotation gradually; if hold violations exist, fix them, and then add more blocks to the SDF annotation until all hold violations are fixed. In the worst case, all blocks may end up SDF annotated, but the probability of this happening should be insignificant.

We showed that using a reduced SDF file significantly reduces simulation times. It would be interesting and challenging to devise an algorithm that identifies a subset of the SDF sufficient to detect all setup and hold time violations, at all timing corners. This is very much needed today.

The verification of timing results can also be improved. We ran Cadence CompareScan for some time to compare and verify values across all the registers in a design on a cycle-by-cycle basis. However, how much CompareScan comparison is needed before the results can be declared verified is not known. It is worth investigating how much effort is required to match the full SDF-annotated and hybrid simulation runs.

7.3.2 Future Work in Accelerating Time Parallel RTL Simulation

We demonstrated that the time-parallel simulation approach does not scale as the number of cores increases. The question arises: is it possible to change the architecture or revamp the scheme to make it scalable with the number of CPU cores? What are the potential barriers to scalability, and how can they be overcome?

Another interesting idea would be to compare the horizontal state matching approaches between ESL-RTL and RTL-GL0 to find out the similarities and differences. This may lead to restructuring or redefining horizontal state matching.

7.3.3 Future Work in Accelerating Multi-core RTL or Functional Gate-level Simulation

We explored both partitioning the design across multiple cores using the VCS multi-core simulator, and partitioning the simulation across test vectors, a single program multiple data (SPMD) approach, using Verilator. It turns out that the SPMD approach is more scalable than design partitioning. Future work in this direction could combine design partitioning with the SPMD approach using Verilator. This has the potential to be the best performance-driven simulation approach if properly instrumented.

CHAPTER 8

PUBLICATIONS, SUPPORT AND ACKNOWLEDGMENTS

8.1 Publications

1. T. B. Ahmad and M. Ciesielski, "Fast STA Prediction-based Gate-level Timing Simulation," Design Automation and Test in Europe (DATE 2014) (accepted).

2. T. B. Ahmad and M. Ciesielski, "An Approach to Multi-core Functional Gate-level Simulation Minimizing Synchronization and Communication Overheads," Microprocessor Test and Verification Conference (MTVCON 2013).

3. T. B. Ahmad, N. Kim, B. Min, A. Kalia, M. Ciesielski, and S. Yang, "Scalable Parallel Event-driven HDL Simulation for Multi-Cores," Synthesis, Modeling, Analysis and Simulation Methods and Applications to Circuit Design (SMACD 2012).

4. T. B. Ahmad, M. Ciesielski, D. Kim, and S. Yang, "Application of Parallel Distributed Event Driven Simulation for Accelerating Hardware Verification," Advances in Distributed and Parallel Computing (ADPC 2012).

5. T. B. Ahmad, N. Kim, B. Min, A. Kalia, M. Ciesielski, and S. Yang, "Scalable Parallel Event-driven HDL Simulation for Multi-Cores," Work in Progress (WIP), Design Automation Conference (DAC 2012).

6. M. Basith, T. Ahmad, A. Rossi, and M. Ciesielski, "Algebraic Approach to Arithmetic Design Verification," Formal Methods in Computer Aided Design (FMCAD 2011).

8.2 Support

This work has been supported by funding from the National Science Foundation (NSF), award no. CCF-1017530.

8.3 Acknowledgements
I would like to acknowledge Wilson Snyder from Cavium Networks for creating a tool like Verilator [41]. I also want to thank Hristo Iliev from the HPC team at the Computing and Communications Center of RWTH Aachen University, Germany, for helping me overcome difficulties in parallel programming using OpenMP. I also want to thank Guy Hutchison, my mentor and manager at Marvell Semiconductor, for being available throughout to listen and discuss ideas related to my research and opensource hardware.

Lastly, I want to acknowledge my advisor, Professor Maciej Ciesielski, for being an excellent advisor. He is far beyond any of his colleagues. His energy, passion, work ethic, and presentation are all excellent. He has always been open to new ideas, meeting new people, expanding skills, etc., which is why he is so good. I wish I could become like him one day. Salut à Professor Ciesielski.

BIBLIOGRAPHY

[1] Cadence Incisive simulator. http://www.cadence.com/products/sd/enterprise_simulator/pages/default.aspx.

[2] Community Enterprise Linux (CentOS). http://www.centos.org.

[3] GCC, the GNU Compiler Collection. http://gcc.gnu.org/.

[4] HDL simulation internals. http://iverilog.wikia.com/wiki/Simulation.

[5] Red Hat Linux. http://www.redhat.com.

[6] Synopsys PrimeTime VX. http://www.synopsys.com/Tools/Implementation/SignOff/Pages/PrimeTime.aspx.

[7] OpenMP 4.0 application program interface. http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf, 2013.

[8] Ahmad, Tariq B., and Ciesielski, Maciej. An approach to multi-core functional
gate-level simulation minimizing synchronization and communication overheads.
In Microprocessor Test and Verification Conference (MTVCON) (2013).

[9] Ahmad, Tariq B., and Ciesielski, Maciej. Fast sta prediction-based gate-level
timing simulation. In Design and Test Europe (DATE) (2014).

[10] Anderson, T., and Bhagat, R. Tackling functional verification for virtual com-
ponents. In ISD Magazine (2000).

[11] Avery Design Automation. SimCluster datasheet. http://www.averydesign.com.

[12] Axiom Design Automation. MP-Sim datasheet. http://www.axiomda.com.

[13] Bailey, Mary L., Jr., Jack V. Briner, and Chamberlain, Roger D. Parallel logic
simulation of vlsi systems. ACM Comput. Surv. 26, 3 (1994), 255–294.

[14] Barney, Blaise, et al. Introduction to parallel computing. Lawrence Livermore National Laboratory 6, 13 (2010), 10.

[15] Chamberlain, Roger D. Parallel logic simulation of vlsi systems. In DAC (1995),
pp. 139–143.

[16] Chang, Kai-Hui, and Browy, Chris. Parallel logic simulation: Myth or reality?
IEEE Computer 45, 4 (2012), 67–73.

[17] Chatterjee, Debapriya, DeOrio, Andrew, and Bertacco, Valeria. Event-driven gate-level simulation with gp-gpus. In Proceedings of the 46th Annual Design Automation Conference (New York, NY, USA, 2009), DAC '09, ACM, pp. 557–562.

[18] Cver, GNU. http://sourceforge.net/projects/gplcver/.

[19] Kim, Dusung, Ciesielski, Maciej, and Yang, Seiyang. Multes: Multi-level temporal parallel event-driven simulation. In IEEE Trans. on CAD of Integrated Circuits and Systems (2013), pp. 845–857.

[20] Fujimoto, Richard. Time warp on a shared memory multiprocessor. In ICPP (3) (1989), pp. 242–249.

[21] Fujimoto, Richard. Parallel discrete event simulation. Commun. ACM 33, 10
(1990), 30–53.

[22] Gafni, A. Rollback mechanisms for optimistic distributed simulation systems. SCS Multiconference on Distributed Computing (1988), 61–67.

[23] Synopsys Inc. Synopsys DC Explorer. http://www.synopsys.com/tools/implementation/rtlsynthesis/dcexplorer/Pages/default.aspx.

[24] Jefferson, David R. Virtual time. ACM Trans. Program. Lang. Syst. 7, 3 (July
1985), 404–425.

[25] Kim, Dusung. MULTES: Multi-level Temporal-parallel Event-driven Simulation. PhD thesis, University of Massachusetts Amherst, 2012.

[26] Kim, Dusung, Ciesielski, Maciej J., Shim, Kyuho, and Yang, Seiyang. Temporal
parallel simulation: A fast gate-level hdl simulation using higher level models.
In DATE (2011), pp. 1584–1589.

[27] Kim, Dusung, Ciesielski, Maciej J., and Yang, Seiyang. A new distributed
event-driven gate-level hdl simulation by accurate prediction. In DATE (2011),
pp. 547–550.

[28] Lam, William K. Hardware Design Verification: Simulation and Formal Method-
Based Approaches. Prentice Hall, 2005.

[29] Li, Lijun, and Tropper, Carl. A design-driven partitioning algorithm for dis-
tributed verilog simulation. In PADS (2007), pp. 211–218.

[30] NC-Verilog, Cadence. http://www.cadence.com.

[31] Nicol, David M. Principles of conservative parallel simulation. In Proceedings of
the 28th conference on Winter simulation (Washington, DC, USA, 1996), WSC
’96, IEEE Computer Society, pp. 128–135.

[32] OPENCORES. http://www.opencores.org.

[33] Rashinkar, P., and Singh, L. New soc verification techniques. In IP/SOC 2001
(2001).

[34] Williams, Stephen. Icarus Verilog. http://iverilog.icarus.com/.

[35] Ahmad, Tariq B., Kim, Namdo, Min, Byeong, Kalia, Apurva, Ciesielski, Maciej, and Yang, Seiyang. Scalable parallel event-driven hdl simulation for multi-cores. In Synthesis, Modeling, Analysis and Simulation Methods and Applications to Circuit Design (SMACD) (2012), pp. 217–220.

[36] Ahmad, Tariq B., Kim, Dusung, Ciesielski, Maciej, and Yang, Seiyang. Application of parallel distributed event driven simulation for accelerating hardware verification. In Advances in Distributed and Parallel Computing (ADPC) (2012).

[37] Rauber, Thomas, and Rünger, Gudula. Parallel Programming for Multicore and Cluster Systems. Springer-Verlag, 2010.

[38] Tompkins, Joe, and Joshi, Prathamesh. Improving Functional Gate Level Sim-
ulation Performance A Case Study. Synopsys User Group Boston (2011).

[39] Tropper, Carl. Guest editor's introduction: Parallel discrete-event simulation applications. J. Parallel Distrib. Comput. 62, 3 (2002), 327–335.

[40] VCS, Synopsys. http://www.synopsys.com.

[41] Snyder, Wilson, Wasson, Paul, and Galbi, Duane. Verilator. http://www.veripool.org/wiki/verilator, 2007.

[42] Zhu, Yuhao, Wang, Bo D., and Deng, Yangdong. Massively parallel logic simu-
lation with gpus. ACM Trans. Design Autom. Electr. Syst. 16, 3 (2011), 29.

