Parallel Multi-Core Verilog HDL Simulation
Recommended Citation
Ahmad, Tariq B., "Parallel Multi-core Verilog HDL Simulation" (2014). Doctoral Dissertations. 45.
https://scholarworks.umass.edu/dissertations_2/45
PARALLEL MULTI-CORE VERILOG
HDL SIMULATION
A Dissertation Presented
by
DOCTOR OF PHILOSOPHY
May 2014
I would like to thank Professor Maciej Ciesielski for helping me when I needed it most, and for his constant support and mentorship. I also want to thank all the committee members. I must thank Professor C.M. Krishna as well for his help. I am grateful to Dusung Kim for helping me start this project. I want to acknowledge my friend Dr. Faisal M. Kashif for his constant support and mentorship. I am indebted to Fulbright (United States Educational Foundation in Pakistan) for their efforts to help me during my PhD. I cannot forget their favors, and I will always remember Dr. Grace Clark and Rita Akhtar for what they did for me.
I must also mention that my technical life transformed when I was offered an internship; it came with obstacles, and I was helped to overcome those. I am greatly indebted to Awais Nemat and Guy Hutchison for their constant feedback, willingness to help, and guidance. It was because of their help, Dr. Faisal's help, and Fulbright's support that I was able to overcome a major obstacle in my PhD in the fall of 2010. The way to this internship started at the house of Amer Haider's parents in spring 2009. I must thank Amer, his mother Ayesha Haider, his father Muzaffar Haider, and the Hidaya Foundation for being hospitable and for becoming the means to where I am today.
I must thank Ameen Ashraf for helping me get an internship at Apple Computer in Summer 2011.
Last but not least, I want to thank again my parents, my family, and everyone around me who has been a positive influence in my life.
ABSTRACT
In the era of multi-core computing, the push for creating truly parallel applications that can run on individual CPUs is on the rise. Application of parallel discrete event simulation (PDES) to hardware design verification looks promising, given the com-
This thesis presents three techniques for accelerating simulation at three levels of abstraction.
TABLE OF CONTENTS
Page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
CHAPTER
1. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2.1.4 Time Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5.2 New Trends in Computer Architecture . . . . . . . . . . . . . . . . . . . . . . . 31
4. EXTENDING PARALLEL MULTI-CORE VERILOG HDL
SIMULATION PERFORMANCE BASED ON DOMAIN
PARTITIONING USING VERILATOR AND OPENMP . . . . . . . 68
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.2 Simulator Internals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.3 Parallelizing using OpenMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.5 Dependencies in the Testbench . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.2.2 Integration with the current ASIC/FPGA design flow . . . . . . . . . . 82
5.4.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.4.2 Simulation of Small Custom Design Circuit . . . . . . . . . . . . . . . . . . . 86
5.4.3 Simulation by varying the Unroll factor (F) . . . . . . . . . . . . . . . . . . . 86
5.4.4 Simulation by varying the number of cores . . . . . . . . . . . . . . . . . . . . 89
5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.2.2 Design Partitioning for Gate level Simulation . . . . . . . . . . . . . . . . . 99
6.2.3 Integration with the existing ASIC/FPGA Design Flow . . . . . . . 104
6.2.4 Early Gate-level Timing Simulation . . . . . . . . . . . . . . . . . . . . . . . . . 105
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
LIST OF TABLES
Table Page
3.18 Multi-core simulation performance of AC97 (T1 = 4 min) . . . . . . . . . . . . . 67
4.1 RTL simulation of AES-128 with 6,500,000 vectors using Verilator and OpenMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
LIST OF FIGURES
Figure Page
3.10 Architecture of parallel GL simulation using accurate RTL
prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.6 Performance comparison of Verilator and VCS at functional
gate-level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.3 Hybrid Gate-level timing simulation with partial SDF
back-annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.7 Sample timing constraint file (tfile) for AES-128 design . . . . . . . . . . . . . . 103
CHAPTER 1
INTRODUCTION
As design size and complexity increase, so does the need to verify the design quickly with the given coverage goals. This, along with a reduced design cycle of three to six months, makes verification a lot more challenging. Today, verification takes 60-75% of the design cycle time, and on average the ratio of verification to design engineers is 3:1 [10], [33]. This work addresses the issue of simulation performance, which is very much needed today as designs continue to become more complex. We particularly target three levels of abstraction: RTL, functional gate-level (zero-delay), and gate-level timing. The techniques for improving simulation performance at each level are presented in the body of this document. It is expected that following the proposed techniques at each level of abstraction will tremendously reduce hardware design and verification time.
This chapter discusses simulation and formal verification based techniques that are used to verify hardware designs. In particular, it addresses the challenges faced by parallel hardware simulation as it continues to gain importance with the pervasiveness of multi-core computing.
Simulation makes it possible to study systems that are expensive, dangerous, or difficult to reproduce, e.g., the traffic pattern at a busy airport, testing a new internet protocol, etc. With time and advancements in technology, humans want to build even larger and more complex systems. The conventional methods of modeling and simulation on computers with a single processing unit (CPU) cannot cope with the memory and execution time requirements of today's complex systems. To accommodate this demand, the use of distributed and parallel computing is a must. Distributed computing in the form of clusters of workstations, multiprocessors, and multi-cores has become widespread due to its cost-effective nature [39].
Hardware systems are typically modeled as discrete-time systems. The state of such systems can change and be observed at discrete time instants. In event-driven simulation, events occur and change the state of the system at discrete time instants. A distributed simulation consists of sub-simulations which communicate and synchronize with each other using standard communication interfaces; each sub-simulation is referred to as a logical process (LP). Logical processes (LPs) maintain state information, an event queue, and a local time reference, and communicate via standard communication interfaces to advance the simulation among the LPs [39]. A special case of distributed simulation is when all the LPs run on the cores of a single multi-core machine.
Figure 1.1. ASIC design flow: Start, algorithm development in C/C++, RTL translation by HDL, functional simulation, synthesis, post-synthesis functional and timing simulations, layout, post-layout functional and timing simulations, End.
Synthesis converts the RTL description into a technology-dependent gate-level netlist. Layout means physical placement of the gates and wiring between them.
The Field Programmable Gate Array (FPGA) design flow is similar, but may have additional steps like translation and technology mapping. Translation refers to merging different netlists (RTL, intellectual property (IP), schematic) into one gate-level netlist. Technology mapping refers to mapping the translated gate-level netlist onto FPGA physical resources. Placement and routing (P&R) means connecting the physical resources in the mapped netlist and extracting timing. The time gap between the two extremes of the RTL and P&R simulations is as large as 45x. It is worth noting that simulation is needed after every phase in the FPGA design flow. Figure 1.2 shows the time required at different simulation phases of the AES-128 FPGA design flow.
Figure 1.2. Simulation time at each level of abstraction (RTL, Post-Syn, Post-Trans, Post-Map, Post-PAR) in the AES-128 FPGA design flow.
As designs get larger, reducing the simulation time has become necessary. Parallel simulation attempts to address this challenge. So far, the speedup offered by parallel simulation for real-world applications has been difficult to achieve, because of:
1. Lack of inherent parallelism;
2. Design partitioning;
3. Communication overhead;
4. Synchronization overhead;
5. Load balancing.
The remainder of this chapter reviews some of these issues and draws conclusions regarding research directions to remedy these problems.
1.2 Problems with Parallel Simulation
Given the fast interconnect between the processor cores and the greater processing speeds of the cores, parallel multi-core simulation should result in speedup that is linear in the number of processor cores. Unfortunately, this is not the case, due to the problems of lack of inherent parallelism, design partitioning and load balancing, and communication and synchronization overheads. In this section, we discuss these problems in detail.
1.2.1 Design Partitioning
The way the design is partitioned strongly affects the communication between the partitions and event synchronization. Various partitioning algorithms have been proposed. The partitioning could be static or dynamic. Static partitioning is performed before simulation, without knowing its effect on simulation. For example, it could partition an HDL design using metrics like the number of instances, estimated number of gates, number of modules, etc. The advantage of such a partitioning scheme is that it is quick and easy to generate. The obvious disadvantage is that the resulting partitions could be unbalanced, as the workload requirements are not known prior to simulation. The idea of pre-simulation has been proposed, but it adds an extra processing step, unless it can be done as part of a complete simulation-based flow. One could simulate the entire design for a few clock cycles to partition the design. One can also combine static and dynamic partitioning to achieve optimal partitioning. Note that coming up with perfectly balanced partitions is a known NP-hard problem [15]. Given this objective, minimizing communication and synchronization overhead may pose conflicting requirements [15].
1.2.2 Communication and Synchronization between Partitions
Commercial simulators do profiling to identify places where there are synchronization issues. Note that partitions can be simulated in parallel, but only if there is no dependency between them. This is hardly the case in real-world designs, where partitions need to exchange data in time. If the frequency of communication between partitions is high, speedup over standard simulation cannot be achieved, and often speed degradation happens. Traditionally, networked computer architectures were used for distributed parallel simulation [16]. The communication and synchronization overhead was significant. Today, processor cores run faster than the previous generation and exchange data through shared memory rather than through long interconnects. The problem is that communication and synchronization overhead has not decreased with the advancements in technology. In fact, it has become a bottleneck in distributed parallel simulation that must be
Figure 1.3. CPU, Memory and Ethernet improvements over the decade (performance improvement factors for CPU, Memory, Memory Latency, Ethernet, and Ethernet Latency).
overcome to get a reasonable speedup. This is one of the main themes of this work.
Figure 1.3 shows the performance improvement in CPU, Memory, and Ethernet technologies. It shows that performance varies from one technology generation to the other. While the CPU has achieved the largest speedup over the decade, Memory and Ethernet latencies have not kept the same pace. This is the main reason why the speedup of parallel simulation has not been significant compared to the CPU speedup.
speedup = T1 / (Tpar + Tcomm)    (1.1)
where T1 is the simulation time on a single processor, Tpar is the parallel simulation time, and Tcomm is the communication and synchronization overhead. The projection of CPU and parallel simulation performance, shown later in this chapter, illustrates this fact. It shows that the gap between CPU performance and communication overhead is largest if there is no improvement in interconnect technology. The gap between CPU performance and communication overhead decreases when the interconnect and the frequency of communication improve. This clearly shows that communication and synchronization between parallel simulations will remain the bottleneck unless interconnect latency improves or the simulations reduce their frequency of communication.
Tpar = T1 / ((1 - P) + P / S)    (1.2)
where, following Amdahl's law, P is the fraction of the simulation that can be parallelized and S is the speedup of the parallelized fraction.
Recently, a new method has been proposed [27] for parallelizing simulations by eliminating inter-simulation communication. This is done by predicting the input stimulus to individual partitions using a predictor (typically available from simulation at a higher level of abstraction). This work deals with reducing the frequency of communication between parallel simulations using the accurate prediction model proposed in [27], applied to gate-level simulation. This approach exploits the inherent design hierarchy to overcome the partitioning problem. Communication overhead between local simulations is avoided by using an accurate prediction model at each local simulation.
Figure 1.4. A design consisting of two module partitions (Partition 1 and Partition 2) whose inputs depend on each other.
It has already been shown that if the prediction is 100% accurate, the communication between local simulations can be completely eliminated.
It is a common misunderstanding that large designs are more suitable for parallel simulation. This may not be entirely true. Usually large designs have portions of code that use cross-module references to improve signal observability [16]. Such code, along with Tool Command Language (TCL) scripts, etc., makes it impossible to build an environment for parallel simulation because of serial dependencies. Nevertheless, if the design is too large to fit into a single computer's memory, parallel simulation can be useful by running simulation on many networked computers. This was certainly true when designs did not exceed the 32-bit memory space. Now, when 64-bit
Figure: CPU and Parallel Simulation Performance, a projection of speedup (from 10 years ago to 5 years ahead) for the CPU, parallel simulation, parallel simulation with improved latency, and parallel simulation with improved latency and synchronization.
computers are prevalent, some people see the need for parallel simulation diminishing [16].
Another trap that researchers have fallen into is the design itself. It is easy to cook up designs that are best for parallel simulation [16]. In those designs, the speedup obtained could be illusory. Such designs are not practical and are often far from industrial designs and practices. The testbench also affects simulation performance; a testbench with unconstrained stimulus creates a uniform workload, which tends to increase the performance of parallel simulation. In real life, unconstrained stimulus does not apply, as the majority of input patterns could be illegal (never produced by the actual design). Zhu et al. [42] have shown that parallel simulation using the original testbench runs
slower than the single-processor simulation, because the testbench exercised constrained, deterministic patterns. When they modified the testbench to exercise unconstrained patterns, speedup was possible. However, there are cases where unconstrained random stimulus is suitable, such as random test pattern generation for automatic test pattern generation (ATPG).
As a result of open-source efforts, some designs are available at Opencores [32] that come with testbench environments of the kind used by industry. Furthermore, there are compiled open-source simulators like Icarus Verilog [34] and CVer [18] for HDL simulation. The only downside of using open-source simulators is that they are not as fast as commercial Verilog simulators like VCS [40] and NCVerilog [30]. Hence, when reporting parallel simulation speedup obtained with open-source simulators, the slower single-processor baseline must be kept in mind. However, there are still applications that are well suited to parallel simulation.
2. Simulation on many networked computers. As the cost of computers has gone down significantly, distributed parallel computing reduces the wait time on a single computer.
3. Simulation with full waveform dumping. If the design requires full waveform dumping, partitioning the design can distribute the I/O activity. This increases simulation performance, as simulation and dumping are done in parallel.
4. Simulation of symmetric designs. Designs such as routers or symmetric multi-
processors (SMP) have similar workload within each block and little communi-
cation between blocks which make them ideal for parallel simulation.
Chang and Browy [16] have shown simulation speedup on various register transfer level (RTL) and gate-level designs, which are all good candidates for parallel simulation. However, they have not mentioned how they achieved this speedup or what partitioning strategy was used. In particular, RTL speedup could be misleading, as the RTL evolves during the design cycle. Furthermore, the testbench for RTL also changes on a daily or weekly basis as part of the regression run. This is achieved by changing the random seed of the testbench, which creates different tests for each run. They also showed speedup for gate-level timing simulation. Zhu et al. [42] have shown that graphics processing units (GPUs) are suitable for parallel functional (zero-delay) simulation because of the large number of processing pipelines and the parallelism within each pipeline. In general, the GPU is based upon the single program multiple data (SPMD) architecture. Another important factor is the throughput of the design: when the throughput of the design is large, the overhead imposed by parallel simulation can dominate the simulation and can actually cause speed degradation. Parallel simulation is useful when it takes days or weeks to simulate the design on a single processor. The metric Chang and Browy propose to predict whether parallel simulation can provide speedup over single-processor simulation is cycles/second measured in terms of wall-clock time. Chang and Browy [16] suggest that a single-processor simulation which is slower than 100 cycles/second is a good candidate for parallel simulation.
1.4 Formal Verification
Apart from simulation, designs are verified using formal techniques such as equivalence checking (EC), model (property) checking, and static timing analysis (STA). Some of these techniques use simulation internally to enhance their efficiency. Formal verification techniques verify a design without stimulus. This gives formal techniques the ability to cover the entire input space. Sometimes the user can guide the equivalence checker by identifying equivalent nodes (cut points) in the two designs to prune the input search space. ABC from UC Berkeley, Synopsys Formality, and Cadence Conformal are well-known equivalence checking tools.
There are two approaches to perform EC. The first approach searches for an input pattern or patterns that would distinguish the two designs; if no such pattern exists, the designs are equivalent.
The other EC approach compares the designs by converting them into a canonical representation, such as the Reduced Ordered Binary Decision Diagram (ROBDD), and checking for equivalence. The ROBDDs of two equivalent designs must be identical.
Equivalence checking applies not only between the RTL and the post-synthesis gate-level netlist, but also to Engineering Change Order (ECO) and pre- and post-scan netlists. It should be noted that as the design gets large, equivalence checking techniques suffer from the memory explosion problem. Therefore, reduction of the design size is often necessary because of memory capacity issues.
Model (or property) checking takes a design and proves or disproves a set of properties given as the specification of the design. If two designs are sequential and the mapping between their states is not known, then it is not possible to perform equivalence checking. Model checking checks the entire state space, either constrained or unconstrained, to determine the validity of the properties. The design is transformed into a finite state machine (FSM), and property checking determines if there is a state or sequence of states that violates a property or is unreachable from an initial state. Model checking suffers from capacity issues and cannot model the whole design. A typical practice in the industry is to use model checking on specific RTL blocks in a design. Another limitation of model checking is the issue of completeness of properties. It is hard to determine if a certain set of properties completely specifies the design intent. There are no good or complete coverage metrics for property checking either. On the other hand, even for designs whose properties can be specified exactly, such as arithmetic blocks (e.g., multiplier, adder, etc.), model checking cannot prove or disprove a property beyond a certain bit-width. It should be noted that model checking
is not used for property checking on the gate-level netlist because of capacity issues. Contrary to simulation, model checking cannot guarantee that the design will work when fabricated, as it cannot be applied at the chip level.
Static Timing Analysis (STA) is a static technique to verify the timing of a design. STA analyzes a design given the timing library associated with it. It then reports the slowest critical path in the design, which determines the maximum frequency of the design. While STA technology has improved a lot over the years and is quite mature, it can report a false critical path in the design or miss such a path. Further, STA does not work for asynchronous interfaces.
It is clear from the above description that simulation has its own special place in
the design hierarchy and it is not going away in the near future. As the design gets
refined into lower levels of abstraction, such as gate-level and layout level, functional
(zero-delay) and timing simulations can validate the results of STA or equivalence
checking. Moreover, neither STA nor equivalence checking can find bugs due to X
(unknown signal) propagation. Even though RTL regression is run on a daily basis,
industry uses gate-level simulation before sign-off.
Gate-level timing simulation is run at this stage using standard delay format (SDF) back-annotation. Gate-level simulations are considered a must for verifying timing-critical paths of asynchronous designs, which are skipped by STA tools. Further, gate-level simulation is used to verify the constraints of static verification tools such as STA and equivalence checking. These constraints are added manually, and the quality of results from static tools is only as good as the constraints. Gate-level simulation is also used to verify the power-up, power-down, and reset sequences of the full chip. It is also used to estimate the dynamic power drawn by the chip. Finally, gate-level simulation is used after an Engineering Change Order (ECO) to verify the changes. There is a tool named BugScope (by NextOp, now part of Atrenta) that takes RTL as input and outputs a set of properties that can be used by model checking to verify the design. Internally, the tool uses simulation to generate properties of the design.
CHAPTER 2
synchronization. To address this issue, distributed parallel HDL simulation has been proposed [27], [11], [12]. Chapter 1 discussed challenges in parallel HDL simulation. In this chapter, we discuss the factors that affect the performance of parallel HDL simulation and the associated hardware on which the simulation is run. Next, a multi-level temporally distributed parallel HDL simulation is presented and compared against the spatially distributed parallel HDL simulation.
The literature on parallel simulation is rich. Most of the known work concerns spatial partitioning of the design.
2.1 Factors Affecting the Performance of Parallel HDL Sim-
ulation
Bailey et al. [13] list five factors that affect the performance of parallel HDL simulation: timing granularity, design structure, target architecture, partitioning, and synchronization algorithm. We discuss them briefly here and elaborate on the current hardware and software trends.
Timing granularity (also known as timing resolution) and design structure are design-dependent factors over which the simulator has no control. Increasing the timing resolution can increase the amount of processing, which in turn decreases simulation performance. The available parallelism also varies from one design structure to another. Figure 1.1 shows design structure at various levels of abstraction. The design structure at a higher level of abstraction, e.g., C++, simulates faster than the same design at a lower level of abstraction.
The architecture of the target platform or execution machine also impacts parallel simulation performance. Here we discuss various computer hardware and software trends that exploit parallelism. A detailed discussion of parallel computer architecture is deferred to Section 2.5.
• Multi-core is a computer system with two or more CPUs on the same chip, sharing memory resources and connected through short intra-chip interconnects.
• Multithreading: on a single core, the OS runs each task for a time slice, and then another waiting task takes a turn. When running on a multi-core CPU, one can subdivide a specific operation within a single application into individual threads. All the threads can run in parallel. The OS divides processing time not only among different applications, but also among the threads within each application.
• Pipelining resembles an assembly line, in which each stage focuses on one unit of work. The result of each stage passes to the next stage until the final stage. To apply the pipelining strategy to an application that will run on a multi-core CPU, the algorithm is divided into steps that require roughly the same amount of work, and each step runs on a separate core. The algorithm can process multiple sets of data, or data that streams continuously.
2.1.3 Issues in Design Partitioning
Partitioning the design into LPs, with the workload uniformly balanced among the LPs, is a known NP-hard problem. Given this objective, minimizing communication and synchronization overhead may pose a conflicting requirement. One remedy is pre-simulation, in which simulation is run for a short time interval (or even a full simulation is run) to profile the simulation. However, it adds an extra processing step, unless it can be done as part of a complete simulation-based flow. Such a case is shown in Figure 1.1, where simulation at a higher level of abstraction can act as pre-simulation for simulation at a lower level of abstraction. This is one of the major points of the proposed approach, which shall be explained further in the next section. Another problem is the granularity of an LP, which relates to the number of atomic operations assigned to a given LP. Assigning one atomic operation per LP can result in high communication overhead, while assigning one LP per processor can result in an unbalanced workload.
2.1.4 Time Synchronization
• Oblivious algorithm evaluates all LPs at each time step, regardless of the event activity. This eliminates the event queue at each LP. Correct scheduling can ensure the correctness of the simulation.
• Synchronous algorithm constrains the simulation time of each LP to be the same. All LPs must synchronize to find the next simulation time step, depending on the event activity.
• Asynchronous algorithm lets each LP advance its local simulated time independently. A causality problem occurs when an event with a time stamp earlier than the local simulated time (a straggler event) arrives. This causes the simulation to block or roll back, depending on whether a conservative or an optimistic approach is used.
The conservative and optimistic approaches differ in the way modules of the partitioned design are synchronized, and their performance varies with the design and partition strategy. Several variations of these methods have been offered, differing in the way they handle inter-simulation synchronization. Gafni [22] uses the state-saving concept and a rollback mechanism that restores the saved state. Time Warp [24] (an optimistic approach) was able to reduce message-passing overhead by using shared memory. Fujimoto [20] and Nicol [31] improved the conservative method by introducing the concept of lookahead. Chatterjee [17] proposed parallel event-driven gate-level simulation using general-purpose GPUs (Graphics Processing Units). However, it could only handle zero-delay (functional) gate-level simulation, not gate-level timing simulation. Zhu et al. [42] developed a distributed algorithm for GPUs that can handle arbitrary delays, but it still suffers from the heavy synchronization and communication overhead inherent to all distributed
simulation techniques. In addition, these methods do not scale and are often based on manual partitioning.
It should be emphasized that the difficulty of spatial partitioning lies not only in solving the inter-module communication and synchronization problem, but mostly in finding the design partitioning that will minimize this communication. The success of traditional spatially distributed simulation strongly depends on such ideal partitioning, which itself is a known intractable problem and cannot be successfully applied to complex industrial designs. To facilitate this partitioning, some researchers, e.g., Li et al. [29], propose partitioning based on design hierarchy. In this approach, the design is partitioned along the boundary of the module, a basic unit of code in HDL. While it addresses the communication problem to a certain degree, it still does not eliminate the synchronization overhead. An alternative is to predict the input stimulus and apply it to each module instead of the actual input. The predicted input and output stimulus can be obtained from the simulation of the design model at a higher abstraction level (such as RTL) than the one being simulated (such as gate level). Figure 2.1 shows how a higher-level simulation can act as a predictor for a lower-level simulation in hardware design simulation. The base of the arrow shows the predictor simulation and the tip of the arrow shows the target simulation.
Figure 1.4 (in Chapter 1) shows a design consisting of two module partitions con-
nected in such a fashion that their inputs depend upon each other. The predicted
input values obtained by running higher level simulation are stored in local memory
and applied to the input ports of a local module assigned to a given LP. Then, the
actual output values at the output ports of that module are compared on-the-fly with
Figure 2.1. Each simulation level acts as a predictor for the level below it: algorithmic simulation in C/C++, behavioral simulation in HDL (Verilog, VHDL), functional gate-level simulation, gate-level timing simulation (SDF annotation).
the predicted output values, also stored in a local memory. This is illustrated in Figure 2.2, which shows two sub-modules being simulated in parallel. Each sub-module uses predicted inputs by default, while its actual outputs are compared against the predicted outputs; a selector at each input port selects between the predicted inputs and the actual inputs. While both sub-modules can access their actual inputs from the other sub-module, there is an associated synchronization and communication overhead, which is the major bottleneck in parallel discrete event simulation (PDES). The main goal of this approach is to minimize this overhead as much as possible.
Figure 2.2. Distributed parallel simulation using accurate prediction
As long as the prediction of the input stimulus is correct, the remote memory access that imposes communication and synchronization between local simulations is completely eliminated. In this arrangement, only local memory access for fetching the prediction data is needed. This phase of simulation is called the prediction phase.
Only when the prediction fails, are the actual input values, coming from the other
local simulation, used for simulation; this phase of simulation is called the actual
phase.
When prediction fails, each local simulation must roll back to the nearest check-
point. This is possible by periodically saving design state during the prediction phase
at selected checkpoints. When parallel simulation enters the actual phase, it should
try to return to the prediction phase as soon as possible to attain maximum speed-up.
This is done by continuously comparing the actual outputs of all local simulations
with their predicted outputs and counting the number of matches on-the-fly. Af-
ter the number of matches exceeds a predetermined value, the simulation is switched
back to the prediction phase. We will apply this approach to functional gate-level
(zero-delay) simulation. Another challenge to be addressed in this thesis is to
minimize the time spent in the actual phase, which depends upon the accuracy
of the predictor.
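The phase-switching logic described above can be sketched as follows. This is a minimal sketch with hypothetical names (`PhaseController`, `match_threshold`); the real mechanism lives inside the event-driven simulator, and the rollback to a checkpoint on mismatch is omitted here.

```python
# Sketch of the prediction/actual phase control described above.
PREDICTION, ACTUAL = "prediction", "actual"

class PhaseController:
    def __init__(self, match_threshold):
        # Number of consecutive output matches required to re-enter
        # the prediction phase after a mismatch.
        self.match_threshold = match_threshold
        self.phase = PREDICTION
        self.matches = 0

    def step(self, predicted_out, actual_out):
        """Compare outputs for one cycle and update the phase."""
        if predicted_out == actual_out:
            self.matches += 1
            # Enough consecutive matches: switch back to the prediction phase.
            if self.phase == ACTUAL and self.matches >= self.match_threshold:
                self.phase = PREDICTION
        else:
            # Mismatch: fall back to actual inputs from the other partition.
            self.matches = 0
            self.phase = ACTUAL
        return self.phase
```

As long as outputs keep matching, the controller stays in the prediction phase and no remote access is needed.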
2.3 Multi-level Temporal Parallel Event-Driven Simulation
In contrast to the parallel discrete event HDL simulation described above, which
partitions the design in spatial domain, there has been some interesting work on
parallel discrete event HDL simulation in the time domain [26] [19]. This approach,
called MULTES, divides the total simulation time into multiple slices, and each slice
is then simulated in a different LP. The key requirement for this technique to work is
finding the initial state of each slice. The initial state of each slice must match the
final state of the previous slice. For example, the initial state of slice i must match
the final state of slice i − 1 for each slice i. MULTES terms this requirement the
horizontal state matching problem. The initial state of each slice cannot be obtained
without knowing the final state of the last slice. MULTES overcomes the problem of
finding the initial state by running a reference simulation at a higher level of abstrac-
tion and saving the values of all the state elements in the design. However, as the
target simulation is at a lower level of abstraction and may involve timing, the initial
state obtained from the reference simulation may not be the correct one in time. In
summary, for timing simulation, the design state (all flip-flops in the design) is re-
stored using the reference simulation, which could be RTL or functional (zero-delay)
gate-level. This
state saving is known as checkpointing. If the design is a single clock design and there
is no timing violation, then reference and target simulations are cycle-consistent.
This means that the two simulations produce the same result within the required
number of clock cycles. In such a case, restoring state using reference simulation
will lead to correct target simulation. However, depending upon the position of
checkpointing, there could be mismatch between parallel target simulation and golden
target simulation at the beginning of the target slice. MULTES solves this problem
by providing an overlap between consecutive target slices. For example, slice n − 1
and slice n are allowed to share simulation time. Since the mismatch occurs at the
end of slice period n − 1 and the beginning of slice period n, the overlapping period
is discarded from slice n. The correct simulation for this period is generated by
slice n − 1.
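The overlap-and-discard arrangement can be illustrated with a small helper. This is only a sketch: the function name, fixed slice length, and overlap count are illustrative, whereas the actual MULTES slicing is driven by checkpoints rather than fixed cycle counts.

```python
# Sketch of MULTES-style time slicing with overlap.
def make_slices(total_cycles, num_slices, overlap):
    """Return (start, end, keep_from) tuples for each slice.

    Each slice i > 0 starts `overlap` cycles early, but its results in the
    overlap region are discarded; the correct simulation of that region is
    produced by slice i - 1.
    """
    length = total_cycles // num_slices
    slices = []
    for i in range(num_slices):
        keep_from = i * length               # first cycle whose results are kept
        start = max(0, keep_from - overlap)  # simulate overlap cycles early
        end = min(total_cycles, (i + 1) * length)
        slices.append((start, end, keep_from))
    return slices
```

The kept regions tile the whole simulation with no gaps, while the discarded overlap absorbs the mismatch at each slice boundary.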
MULTES also handles designs with asynchronous clocks. It attempts to solve the
problem of clock domain crossings (CDC) in multi-clock designs, in which a data or
control signal is sent from one clock domain to the other. The issue in CDC designs
is that gate-level timing simulation
is not 100% cycle-consistent with reference simulation, even if there are no timing
violations. Since simple state saving and restoring could cause mismatch between
parallel target simulation and golden target simulation, MULTES proposes abstract
delay annotation (ADA) to deal with CDC. In ADA, CDC path delay, obtained from
SDF, is copied from the gate-level to the reference simulation. When CDC path delay
is annotated in the reference simulation, the reference becomes cycle-consistent with
the gate-level timing simulation. A separate problem is the testbench: testbenches
usually contain memory elements and may have software constructs which cannot be
saved. Similarly, the state of Intellectual Property (IP) blocks in the design cannot be
saved and restored with checkpointing. To handle this issue, MULTES uses a test-
bench forwarding technique: rather than saving the state of the testbench, the test-
bench is simulated from the beginning to the starting point of each slice (its initial
state). This is accomplished by saving the output of the DUT (which is the input
to the testbench) during reference simulation. This essentially creates a dummy
DUT. The testbench is simulated with the dummy DUT from the beginning to the
starting point of each slice. At this point in time, the dummy DUT is replaced by
the actual DUT and the state of the DUT is restored from the data stored at the
checkpoint. This is done for each slice independently.
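Testbench forwarding can be sketched as follows. All interfaces here (`step`, `next_stimulus`, `restore`) are hypothetical stand-ins for the PLI-based mechanism described in the text.

```python
# Sketch of MULTES testbench forwarding: the testbench is driven by recorded
# DUT outputs (a "dummy DUT") up to the slice start, then the real DUT,
# restored from the checkpoint, takes over.

def run_slice(testbench, real_dut, recorded_outputs, slice_start, slice_end):
    # Phase 1: fast-forward the testbench using the dummy DUT (replay).
    for cycle in range(slice_start):
        testbench.step(recorded_outputs[cycle])
    # Phase 2: restore the DUT state saved at the checkpoint and simulate.
    real_dut.restore(slice_start)
    results = []
    for cycle in range(slice_start, slice_end):
        stimulus = testbench.next_stimulus()
        out = real_dut.step(stimulus)
        testbench.step(out)              # the testbench now sees the real DUT
        results.append(out)
    return results
```

Each slice can call `run_slice` independently, which is what makes the slices parallelizable.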
MULTES [26] [19] offers an interesting alternative technique for parallel simula-
tion. There are similarities and fundamental differences between MULTES and PDES
[27] [35] [36] [8] for HDL simulation. We discuss them briefly in this section.
MULTES divides the simulation time into multiple time slices and simulates the
entire design for each time slice, whereas PDES techniques
divide the design into multiple partitions which are simulated independently. Both
MULTES and PDES use model at higher level of abstraction for reference simulation.
For example, both MULTES and PDES use RTL for parallel functional (zero-delay)
gate-level simulation. Note: from now on, we will use the term functional gate-level
simulation to mean functional (zero-delay) gate-level simulation.
MULTES, on the other hand, must solve the state matching problem [19]. PDES
does not suffer from the state matching problem, as each partition is simulated from
the beginning of the simulation time.
MULTES cannot overcome the size limitations of a large design: each parallel slice
simulation simulates the whole design, regardless of whether the slice period is large
or small. PDES partitions the design and distributes the
partitions to individual simulators. Hence, the entire simulation load is divided into
smaller loads distributed to each partition. MULTES performs checkpointing peri-
odically, while in PDES the reference simulation is stored at the partition boundary
for the entire simulation time. This will increase the amount of dump data on the
hard disk for PDES. Note that MULTES also performs data dumping for testbench
forwarding besides periodic checkpointing. In this work, we will try to eliminate this
dumping for PDES, so that the reference simulation (RTL) is co-simulated with the
target gate-level simulation.
We should emphasize that MULTES is not well suited for multi-core architectures
because of the uniform memory requirements of each slice. For large designs, it does
not scale well with the multi-core architecture. PDES scales well with the multi-core
architecture as it partitions the design, and hence the memory requirements of each
partition are correspondingly smaller.
Finally, MULTES uses a complex tool chain and techniques, including: PLI for
checkpointing; data dumping and restoring; Synopsys Formality or a similar tool for
state matching; ABC tool for assisting state matching to detect signal correspondence;
Cadence Encounter tool for finding clock domain crossings; and LEX and YACC for
parsing SDF file for abstract delay annotation (ADA). Further, some of the steps in
MULTES (such as ADA) are not fully automated and require manual effort. In con-
trast, PDES when applied to parallel HDL simulation does not have such a complex
tool chain dependency and it integrates seamlessly into the ASIC or FPGA design
flow. PDES has its own challenges that are addressed in the next chapter.
Parallel or high performance computing is not a new concept. The concept has
been widely known in scientific and engineering communities where large simulations
are done on a cluster of computers. The simulation computation to be performed is
partitioned into several workloads which are simulated independently and in parallel
on many machines. The simulation workloads should be independent of each other
thus requiring the original simulation computation to be suitable for parallelism.
Today, hardware manufacturers are integrating more and more CPUs on a single
processor chip; the resulting chip is called a multi-core processor. It was predicted
that by the year 2015 a typical Intel processor would have dozens to hundreds of
cores, where some of the cores would be dedicated to, say, graphics, encryption,
networking, DSP, etc. This type of multi-core system is called a heterogeneous
multi-core [37].
There is also a need to increase the performance of a single application by running it
on multiple cores. This area is full of challenges, as there is no automatic conversion
of a
sequential program into a parallel program. As hardware advancements continue to
take place, there is a dire need to convert the existing sequential software programs
to take advantage of the existing compute power. If this is not done, much of the
compute power available is going to remain unused [37].
The decomposition of the computation into appropriate tasks is often manual and is
one of the main challenges faced by a programmer. The tasks are then assigned to
one or more threads in a step called scheduling; the subsequent assignment of threads
to cores is called mapping. The tasks may need to follow a certain order due to
dependencies and may not execute concurrently. Tasks may also need to communicate
with each other, and hence synchronization between the tasks is necessary so that
tasks are not writing to the same memory location simultaneously.
Shared memory and distributed memory are two main memory organizations in
multi-core machines. Shared memory allows uniform global access to all processor
cores. Information exchange between the cores is done through sharing memory
location. This sharing must be done in a synchronized manner: in the case of a read,
a core must not read from a memory location where a write is pending. Similarly,
there should not be simultaneous writes by different cores to one memory location.
For distributed
memory machines, each processor core has private memory which can only be accessed
by the core attached to it. Information exchange between cores is done through
explicit communication such as message passing. Another form of synchronization
is called barrier synchronization which is available for both shared and distributed
memory machines. In barrier synchronization, all processes on all cores have to wait
at a barrier point until all other processes have reached that barrier. Only when all
processes have reached the barrier can they continue execution past it.
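The barrier concept can be demonstrated with Python's standard `threading.Barrier`. The dissertation's simulators use their own synchronization; this sketch only illustrates the semantics just described.

```python
# Minimal illustration of barrier synchronization: no worker proceeds past
# barrier.wait() until every worker has arrived.
import threading

NUM_WORKERS = 4
barrier = threading.Barrier(NUM_WORKERS)
order = []
lock = threading.Lock()

def worker(wid):
    with lock:
        order.append(("compute", wid))   # per-core work before the barrier
    barrier.wait()                       # block until all workers arrive
    with lock:
        order.append(("after", wid))     # runs only after every "compute"

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Whatever the thread interleaving, every "compute" entry in `order` precedes every "after" entry.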
The benefit of parallelization is determined by the parallel execution time, which is
the maximum of the compute times over all the cores plus the time for communica-
tion and synchronization. This time should be smaller than the sequential execution
time of the application on a single core, otherwise parallelization is not worthwhile.
Speedup is the ratio of the sequential execution time to the parallel execution time.
It also depends on the underlying machine, e.g., the number of available cores, the
memory organization, etc. Below we discuss how parallelism can be exploited, from
single-core machines to multi-core machines [37].
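These definitions can be turned into a quick sanity check (a sketch; the timing numbers are invented for illustration):

```python
# Speedup as defined above: sequential time divided by parallel time,
# where the parallel time is the slowest core's compute time plus
# communication and synchronization overhead.

def parallel_time(core_times, comm_sync_overhead):
    return max(core_times) + comm_sync_overhead

def speedup(sequential_time, core_times, comm_sync_overhead):
    return sequential_time / parallel_time(core_times, comm_sync_overhead)

# 100 s of sequential work split evenly over 4 cores, with 10 s of overhead:
s = speedup(100.0, [25.0, 25.0, 25.0, 25.0], 10.0)   # well below the ideal 4x
# If the overhead exceeds the savings, parallelization is not worthwhile:
bad = speedup(100.0, [60.0, 40.0], 45.0)             # < 1
```

The second case shows the point made above: when communication and synchronization dominate, the "parallel" run is slower than the sequential one.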
address width to be 64 bits. This has also led to increased accuracy of floating point
numbers.
• Parallelism by Pipelining While one instruction i1 is in the instruction
decode stage, another instruction i2 can enter the instruction fetch stage. In
the next clock cycle, i1 enters the execution stage, whereas i2 enters the decode
stage and a new instruction i3 enters the instruction fetch stage, etc.
• Parallelism by many Execution Units There are two ways of achieving
this, both of which take advantage of the fact that there is more than one
functional unit inside a single CPU core, such as ALUs (arithmetic logic units),
FPUs (floating point units), load/store units, etc. The superscalar approach
relies on hardware to determine which instructions can be issued in parallel.
• Thread or Process level Parallelism In a single core machine, thread- or
process-level parallelism is used to give the illusion to an application (in the
case of multithreading) or to multiple applications (in the case of processes)
that there are multiple CPUs. In fact, this is not the case, as the machine has
a single CPU core. The OS time-slices threads or processes so quickly that
they seem to be running independently. This illusion has become a reality
with multi-core CPUs.
data set. Each processing element has private access to (shared or distributed) data
memory, but there is a single program memory from which a single instruction is
fetched and dispatched to all processing elements. In the MIMD model, in contrast,
each processing element loads a separate instruction and data, executes it, and writes
the result back to the data memory. Hence, processing elements work asynchronously
with each other.
SMP consists of one or more processing elements with access to a common memory.
A program is parallelized by having it take different execution paths on the various
processing elements. The program starts running on one processing element and, as
soon as a part of the program that can be parallelized is encountered, the execution
gets split across multiple processing elements. In the parallel portion, each processing
element works on the same program but with a different data set. SMP faces serious
challenges in terms of scalability to many cores.
In a NUMA (non-uniform memory access) machine, multiple cores are coupled
together using local memories, as shown in Figure 2.3, and the cost of access to
local memory is lower than the cost of access to remote memory. This architecture
allows scalability to many cores.
There are two views of memory that need to be considered: the physical memory
view and the programmer's memory view. From the physical point of view, there are
computers with shared physical memory (multiprocessors) and computers with dis-
tributed memory (multicomputers). From the programmer's point of view, memory
organization can be distinguished between shared memory machines (SMM) and
distributed memory machines (DMM). Note that the programmer's view need not
be consistent with the actual physical memory view. For example, a programmer
can treat the memory as shared while the physical memory is distributed.
A distributed memory machine consists of nodes, each consisting of a processing
element, local memory, and possibly I/O. The local memory is private to each node.
When a node needs data from some other node, an explicit message passing protocol,
e.g., the message passing interface (MPI), is used to fetch that data from the other
node. A Direct Memory Access (DMA) controller can be used to offload this
communication from the processing element. An example of a DMM is a cluster of
computers. A shared memory machine typically consists of several processing units
connected to a global memory via an interconnection network. No explicit commu-
nication between processing nodes is required to share data. However, due to the
global nature of the memory, accesses to shared locations must be synchronized. A
thread is an independent flow of execution which shares data with other threads
using global memory. It is the job of the operating system (OS) to map a thread to
a processor core.
Running multiple independent applications can utilize a multi-core machine effi-
ciently. Each such application can be called a thread, and this is true multithreading,
as each thread gets mapped to a separate processing core. TLP can also happen at
an application level, where parts of an application become threads and execute on
multiple cores. Another trend is hyperthreading, where the hardware gives the OS
an illusion that there are multiple cores available, in order to use the processing
elements more effectively.
CHAPTER 3

Parallel multi-core simulation partitions the functionality of the original design into
sub-functionalities which are then executed on different
LPs. Figure 3.1 shows a design in traditional event-driven simulation environment,
while Figure 3.2 shows the same design in parallel multi-core simulation environment.
Note that it shows an ideal case where the two partitions are completely independent
(Partition1 can be simulated without Partition2 and vice-versa) and hence can be
simulated separately. This may not be the case for most of the simulations (because
of dependencies between the partitions) and this issue will be addressed later in this
chapter.
In this work, we use the parallel multi-core HDL simulation technique based on
the concept of accurate prediction [27] [35] [36] [8]. We use the approach of Li et
al. [29] to partition the design along the hierarchy boundary, but add a higher level
predictor model to reduce the communication overhead between the partitions.
It is clear that prediction accuracy is one of the most critical factors in this approach,
as explained in [27] [35] [36] [8]. Nearly 100% prediction accuracy will give
almost linear speed-up even when the number of processor cores increases (within
certain bounds). Hence, we must find a way to obtain accurate prediction data. As
discussed before, the proposed idea is to obtain this data from the results of earlier
simulation, using higher level design model. Such a model is typically available as
part of the design refinement from higher level of abstraction to a lower level of
abstraction. It is important to realize that the closer the two abstraction levels are
(for the predictor/reference and actual/target simulations), the more accurate the
actual simulation is going to be. For example, prediction data for parallel functional
gate-level simulation can be obtained from register transfer level (RTL) simulation;
and the prediction data for parallel gate-level timing simulation can be obtained
from gate-level zero-delay simulation. Both these scenarios are depicted in Figure
3.3. Simulation at a higher level of abstraction can be performed at least 10× faster
than the one at the lower level of abstraction. We argue that an accurate prediction
data can be obtained by fast simulation using simulation model at a higher level of
abstraction. Also, as this fast simulation at a higher level of abstraction is already an
integral part of the design flow, as shown in Figure 3.3, obtaining the prediction data
does not incur any additional simulation overhead.
Figure 3.3. Parallel multi-core simulation in the ASIC design flow [25]
The register values saved during RTL simulation serve as prediction data for the
gate-level timing
simulation. Table 3.4 shows preliminary experimental results of predictor modeling.
Design registers are chosen for two reasons. First it is possible that a register value
may not propagate to the module output during simulation. Hence, it is possible
that RTL and functional gate-level simulations are identical at the module boundary
but inconsistent on register outputs due to unknown signals (X) in RTL or gate-level
design. Secondly, the focus was on register values because at present the proposed
partitioning strategy for parallel gate-level timing simulation is restricted to the
flip-flop boundary. Of course, not all registers will appear at the partition boundary.
That is why the last column represents just a lower bound on the prediction accuracy;
the actual prediction accuracy is always higher than this lower bound. Such a lower
bound already shows high prediction accuracy (>98% on average) for this choice of
predictor.
Table 3.2 shows another experimental result of predictor modeling. Here the
content of design registers during the functional gate-level simulation and gate-level
timing simulation are compared. The register values saved during functional gate-
level simulation serve as prediction data for the gate-level timing simulation. Note
that moving from RTL to functional gate-level improves the accuracy of the predictor
(>99% on average). In general, the closer the reference and target simulations in the
design hierarchy, the more accurate the prediction data would be.
Synchronization overhead is defined as the time spent during simulation to guarantee
that there is no causality violation between the partitions; it can cause speed degra-
dation even when event activities in partitions have no or little dependencies. Further,
both data bandwidth and frequency of communication among partitions impact com-
munication overhead. To illustrate the minimization of these overheads, we explicitly
measure them on a synthetic RTL design.
The base design consists of a 128-bit Ripple Carry Adder (RCA) block and a
testbench feeding stimulus to the adder. To create two or more partitions, the adder
block is instantiated as many times as needed and chained as shown in Figure 3.4. Figure 3.5
shows synchronization overhead measurement setup where partitions don’t exchange
data with each other, and instead data is locally generated using a predictor (to
be explained in the next section) in each partition. Both single-core and multi-core
versions of the Synopsys VCS simulator were used for these measurements on a quad-
core Intel machine with 8GB RAM in a Non-uniform Memory Access (NUMA) archi-
tecture.
Table 3.3. Quantitative communication and synchronization overhead measurement
With a larger number of partitions, the synchronization and communication over-
head dominates the design-level parallelism and speed degradation takes place (0.93,
0.91 and 0.94 for 4, 6 and 8 partitions, respectively). To see the effect of the synchroniza-
tion overhead only, the communication overhead was eliminated and the simulation
was done using the configuration shown in Figure 3.5. This experiment demonstrates
that parallel simulation scales up to a certain number of cores. Specifically, for 2
and 3 cores the speedup is close to the ideal, but beyond that point the synchro-
nization overhead starts limiting the speedup from approaching the theoretical limit
of n. Therefore, for large designs, it is better to group multiple partitions to limit
the synchronization overhead. Figure 3.6 shows the profile of the two-core
simulation of the RCA128 adder. The green portion in the plot represents
the degree of parallelism in the two cores. Ideally, we want to increase this degree
of parallelism as much as possible. Hence we eliminate communication overhead, as
shown in Figure 3.7.

Figure 3.6. Multi-core Simulation of RCA128 on 2 cores (with comm and synch
overhead)

Figure 3.7. Multi-core Simulation of RCA128 on 2 cores (no comm overhead)

The comparison shows that the synchronization overhead can be greatly reduced by
choosing the right number of partitions.
Figure 3.8 shows a conceptual configuration of NUMA, where local memory access
is much faster than the remote memory access. For example, memory access of CPU
core 4 to remote memory is much slower than to its local memory. This causes se-
vere performance degradation in parallel simulation, where extensive communication
and synchronization takes place between a large number of local simulations. This
situation becomes worse when the number of processor cores and the number of the
partitioned local modules for local simulation increase.
In our work we use the approach of [16] [39] to partition the gate-level design
along the module boundary, but add a local (in the partition) higher level predictor
model to reduce the communication overhead between the partitions. This is based
on a recently proposed technique using accurate stimulus prediction [27] [35] [36] [8].
The key idea of this approach is to predict input stimulus for each partition and apply
it locally instead of the actual input coming from the other partition. The predicted
input stimulus is obtained by simulating the design at a higher level of abstraction
(such as RTL) than the one being simulated (such as the functional gate-level).
During the reference simulation (e.g., RTL), all inputs and output responses of each
partition are stored (dumped) on a disk to serve as input stimulus for the actual
gate-level simulation. Note that modern simulators allow a parallel dumping option
on multi-core machines. Therefore, parallel dumping does not affect the performance
of RTL simulation, and this dumping overhead can be ignored. The other aspect is
the disk space needed to store (dump) the stimulus, which is ample on current com-
puting machines. During the gate-level simulation, the input stimulus is obtained
from the RTL predictor instead of from the other partitions. Table 3.4 shows the
accuracy of
RTL stimulus as predictor at the register boundary. A cycle by cycle comparison is
done between the RTL and functional gate-level simulations at the clock boundary for
all registers in the design. Cadence Comparescan tool was used to compare register
values at the clock cycle boundary. The high accuracy of the RTL prediction shows
that it can act as a good signal predictor for gate-level simulation.
Figure 3.9 shows simulator architecture configuration for two partitions. In this
configuration each gate-level module uses predicted inputs from RTL by default, while
their actual outputs are compared against the predicted RTL outputs. A multiplexer
at each module selects between the predicted inputs and actual inputs. As long as
the prediction is correct, remote memory access that imposes communication and
synchronization between local simulations is eliminated. Only when the prediction
fails, are the actual input values, coming from the other local simulation, used in
simulation.
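The per-cycle input selection just described (predicted stimulus by default, actual inputs on mismatch) can be sketched as follows. The names `select_input` and `fetch_remote` are hypothetical stand-ins for the inter-partition channel.

```python
# Sketch of the per-partition input multiplexer: each cycle the partition
# consumes the locally dumped predicted stimulus unless the controller has
# detected a mismatch, in which case the actual value is fetched remotely.

def select_input(cycle, predicted_trace, fetch_remote, use_prediction):
    """Return the input value for this cycle.

    predicted_trace: locally dumped reference (e.g. RTL) values.
    fetch_remote:    callable modeling the costly remote access.
    use_prediction:  True while the simulation is in the prediction phase.
    """
    if use_prediction:
        return predicted_trace[cycle]    # local memory access only
    return fetch_remote(cycle)           # communication + synchronization cost

remote_calls = []
def fetch_remote(cycle):
    remote_calls.append(cycle)           # track how often we pay the overhead
    return cycle * 2

trace = [0, 2, 4, 6]
vals = [select_input(c, trace, fetch_remote, use_prediction=(c != 2))
        for c in range(4)]
```

With an accurate predictor the remote path is exercised rarely, which is the source of the speed-up.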
Table 3.4. Accuracy of RTL predictor at the register boundary
3.4.2 Dealing with Mismatches
According to Kim et al. [27], when mismatch happens each local simulation
must roll back to the nearest checkpoint: a design state saved periodically during
simulation when predicted inputs are being used. When parallel simulation enters
the actual phase (predicted inputs are no longer used), it will try to return to
the prediction phase as soon as possible to attain maximum speed-up. However, this
approach has not been confirmed experimentally. We found that checkpointing of the
design state during gate-level simulation is very costly in terms of time and space as it
involves dumping of vast amounts of simulation data to the disk. Moreover, simulation
rollback impedes the performance of parallel gate-level simulation. If rollback happens
frequently due to mismatches, the performance advantage of prediction-based simu-
lation is lost. Instead, we rely on a very high prediction accuracy and make a best
effort to achieve that. If a mismatch occurs, simulation is paused and switched back
to the original gate-level simulation configuration (with its unavoidable communica-
tion and synchronization overhead), rolling back to the last good state provided by
RTL. Note that the RTL state is
already saved (dumped) during the reference simulation. Then, the original gate-level
simulation is run to the point where mismatch occurred, to determine and debug the
cause of mismatch. After fixing the gate-level netlist the simulation is restarted in
the predictive mode. We already described how to quantify the accuracy of RTL
prediction by running Comparescan against all RTL and gate-level design registers.
Another approach is to run Functional Equivalence Checking between RTL and gate-
level design at the partition boundary and apply prediction to only those signals
that exist in both the RTL and gate-level netlists. Note that functional equivalence
checking is typically performed earlier in the design cycle, so there is no additional
overhead introduced by this process. If RTL and gate-level designs are identical at the
partition boundary, communication between the partitions, as shown in Figure 4, can
be eliminated using RTL predictor. Thus, the two simulations can run independently.
Rather than running the entire RTL design in every partition, we propose running
only the required portion of it (the portion of RTL that provides stimulus to a given
partition and compares the response of the partition). Note that this stimulus and
response for
each partition is already saved during original RTL simulation. Figure 3.10 shows
the architecture of local simulation for a gate-level design partitioned into four blocks.
The proposed technique was evaluated on three designs: AES-128, JPEG and 3DES.
Table 3.5 shows simulation performance on a
single-core simulator. The designs are synthesized with Synopsys Design Compiler
using TSMC 65nm standard cell library. Single-core and multi-core versions of Synop-
sys VCS simulator were used to simulate all gate-level designs on an octa-core Intel
CPU with a NUMA architecture. Two partitioning schemes were explored. The first is static
partitioning based on the area of the synthesized logic. Module instances weighted
in terms of their synthesized area are grouped to form two or more partitions. The
second partitioning scheme is a dynamic one, based on RTL simulation profiling. In
this scheme, RTL simulation of the design is run with the profiling option to find the
most time consuming module instances. These module instances then become par-
titions in the gate-level simulation.

Figure 3.10. Architecture of parallel GL simulation using accurate RTL prediction

One could also run a short gate-level simulation with the profiling option to find the
time-consuming module instances. It turned out
that static partitioning hardly improved simulation performance and hence was not
used for more experiments. Tables 3.6, 3.7 and 3.8 show performance improvements
of AES-128, JPEG and 3DES with parallel simulation.
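The profile-driven partitioning described above can be sketched with a simple grouping helper. This is only a sketch: the dissertation does not specify the grouping heuristic, so a greedy longest-processing-time assignment is assumed here, and the profile numbers and the `aes_rcon` instance are illustrative.

```python
# Sketch of dynamic, profile-driven partitioning: module instances are
# sorted by profiled activity and greedily assigned to the currently
# lightest partition.

def partition_by_profile(profile, num_partitions):
    """profile: {instance_name: activity_percent} -> list of instance sets."""
    loads = [0.0] * num_partitions
    parts = [set() for _ in range(num_partitions)]
    # Greedy longest-processing-time assignment.
    for name, activity in sorted(profile.items(), key=lambda kv: -kv[1]):
        i = loads.index(min(loads))      # lightest partition so far
        parts[i].add(name)
        loads[i] += activity
    return parts

profile = {"aes_sbox": 40.0, "aes_key_expand_128": 35.0,
           "aes_rcon": 15.0, "testbench": 10.0}
parts = partition_by_profile(profile, 2)
```

Balancing the activity between partitions keeps the per-core simulation loads comparable, which matters because the parallel time is bounded by the slowest partition.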
Tables 3.6, 3.7 and 3.8 show that prediction based parallel gate-level simulation
improves the performance of original parallel gate-level simulation by removing com-
munication overhead between the partitions. These tables echo our findings, presented
in Section 3, that it is worth removing communication overhead; and that the syn-
chronization overhead increases with the number of partitions. These results also
show the right number of partitions (3 for AES, 2 for JPEG, and 3 for 3DES) as a
point beyond which the synchronization overhead prevents the speedup from approach-
ing the theoretical limit, the number of CPU cores.

Table 3.5. Single core simulation performance

Table 3.8. Multi-core simulation performance of Triple DES

The JPEG encoder is one such design, with little communication between partitions
to begin with; in this case, removing communication overhead brings only a limited
additional benefit. Synthesis optimizations may also change the design to a point
that the RTL and gate-level netlist may not be 100% pin compatible at the block
or module level boundary. To account for this fact, we assume that RTL prediction
can only be used for 50% - 80% of the gate-level signals at the partition boundary.
For those 50% - 80% signals RTL can act as a signal predictor. To find out which
RTL signals can be used as predictor for gate-level simulation, Equivalence Checking
can be used. We used Synopsys Formality equivalence checking tool for this purpose.
Note that functional equivalence checking is typically performed earlier in the design
cycle, so no additional overhead is introduced by this process. Also, as mentioned
in Section 4, one can run Cadence Comparescan tool to find equivalent pins between
RTL and gate-level netlist. Table 3.9 shows the performance of benchmarks with
RTL prediction used for 50% and 80% signals during gate-level simulation.
Table 3.9. RTL prediction-based Multi-core functional GL simulation of bi-
partitioned designs
3.8 Conclusion
With the increased presence of multi-core processors, most high-performance work-
stations and PCs have adopted NUMA advanced memory architecture. We conducted
a series of experiments showing that a straightforward application of multi-core sim-
ulation on such an architecture does not bring the expected improvement in simula-
tion performance. We then augmented the parallel multi-core simulation with a
highly accurate stimulus prediction that comes from a higher level (in this case,
RTL) model. Apart from eliminating the communication overhead between parti-
tions using the predictor, choosing a small number of partitions also reduces the
synchronization overhead. The proposed technique is generic and works indepen-
dently of the partitioning scheme. Further, the performance cost of dumping can be
ignored, as new simulators have the option of parallel dumping on multi-core
machines.
The following tables show the simulation profiles of the benchmarks.
Tables 3.10, 3.11 and 3.12 show that these benchmarks have good inherent paral-
lelism marked by low testbench activity and high design activity. The tables also
show the modules which are most active. These are ideal candidates for multi-core
simulation. For example, from Table 3.10, aes_sbox can be simulated on one CPU
core and aes_key_expand_128 can be simulated on the other CPU core.
On the other hand, Tables 3.13, 3.14 and 3.15 show designs with low inherent
parallelism marked by high testbench activity and low design activity. These designs
are not good candidates for multi-core simulation; multi-core simulation of such
designs may even degrade performance.
• Any information about the master partition (that contains the testbench) starts
with M. Any information related to slave partitions (design partitions other than
the testbench) starts with P or S.
Table 3.11. Simulation profile of Triple DES benchmark
Module                 Simulation time (%)
y_huff                 18.1
cr_huff                17.9
cb_huff                17.7
y_dct                  8.5
cb_dct                 7.7
cr_dct                 7.6
ff_checker             6.6
fifo_out               5.9
testbench              4.3
simulation overhead    1.4
RGB2YCBCR              1.2
Table 3.13. Simulation profile of PCI benchmark

Module                    Simulation time (%)
testbench                 62.8
simulation overhead       4.8
pci_target32_sm           3.5
pci_out_reg               2.9
pci_target32_interface    2.4
pci_unsupported           2.2
pci_bridge32              2
WB_MASTER_BEHAVIORAL      2
pci_pci_decoder           1.8

Table 3.14. Simulation profile of VGA benchmark

Module                    Simulation time (%)
testbench                 36.8
simulation overhead       25.5
vga_fifo                  13.8
vga_col_proc              7.5
vga_fifo_dc               4
vga_pgen                  3.2
vga_wb_master             2.7

Table 3.15. Simulation profile of AC97 benchmark

Module                    Simulation time (%)
testbench                 48.2
simulation overhead       23
ac97_soc                  8.3
ac97_rst                  4.4
ac97_codec_sout           1.6
ac97_codec_sim            1.3
• The M1 segment in the leftmost column accumulates the time spent by the master process executing its events. This time does not run in parallel with the slave processes, but runs sequentially by itself. This time should be small relative to the S1 times.

• The M2 segment in the leftmost column accumulates the time spent by the master process waiting for all slaves to communicate their synchronized value changes for the delta. This time should be as large as possible.

• The M3 segment in the leftmost column accumulates the time spent by the master process propagating value changes received during the M2 segment. This time, like M1, also does not run in parallel with the slave processes. This time should be small.

• The M4 segment in the leftmost column accumulates the time spent by the master process sending updated port signal values and next time information to the slaves.

• The S1 segments in the slave columns accumulate the time spent by the slave processes executing their respective events. These times have the potential of running in parallel with all the other S1 slave times. These times should be as large as possible relative to the other segments.

• The S2 segments in the slave columns accumulate the time spent by the slave processes sending updated port signal values and next time information to the master.

• The S3 segments in the slave columns accumulate the time spent by the slave processes waiting for the master to send its updated port signal values. These times should be as small as possible.
Figure 3.11 shows that the parallel activity in the slave partitions is not uniform and the simulation performance is low. It takes 192 minutes to simulate AES-128, which is worse than the single-core simulation time of 160 minutes. Figure 3.12 shows the CPU utilization during this simulation: approximately 130% out of a possible 200%, which is not high. Ideally, this ratio should be close to 200% for a bi-partitioned design running on two CPU cores.
Figure 3.13 shows another simulation of the same design, where partitioning is done based on the number of module instances, and the number of partitions is increased from two to three. It shows that the parallel simulation activity in all slave partitions is uniform and the simulation performance is much better than in the earlier case. It takes 125 minutes.

Figure 3.12. Bi-partitioned (area-based) AES-128 multi-core simulation CPU utilization

Hence the speedup is 160/125 = 1.28. Figure 3.14 shows the CPU utilization of this simulation for 2 CPUs.
Figure 3.15 shows the simulation performance of the JPEG design for area-based partitioning. It shows that the parallel activity in the slave partitions is very unbalanced. As a result, the simulation time turns out to be 180 minutes, which is worse than the single-core simulation time of 167 minutes. Figure 3.16 shows the CPU utilization for this partitioning. It shows that the simulation is utilizing only half (100% out of 200%) of the resources. Ideally, the CPU utilization should be close to 200%.
Figure 3.17 shows the simulation performance of JPEG for instance-based partitioning. It shows that the parallel simulation activity inside the slave partitions is relatively well balanced. The simulation time is 93 minutes. Hence, the speedup compared to single-core simulation is 167/93 = 1.79, which is quite significant. Figure 3.18 shows the CPU utilization for this partitioning: close to 165% out of 200%, which is quite significant.

Figure 3.13. Tri-partitioned (instance-based) AES-128 multi-core simulation time

Figure 3.15. Bi-partitioned (area-based) JPEG multi-core simulation time
It is also shown that for CPU-bound applications like AES and JPEG, speedup
does not increase linearly with the number of cores. This is due to synchronization
overhead that increases with the number of partitions. As a result, speedup saturation
is evident in Figures 3.23 and 3.24. This confirms our experimental results tabulated
in Section 3.3.
Figure 3.18. Bi-partitioned (instance-based) JPEG multi-core CPU utilization
Figure 3.20. Tri-partitioned (instance-based) VGA multi-core simulation time
Figure 3.22. Oct-partitioned (instance-based) ac97 multi-core simulation time
Figure 3.24. Multi-core simulation performance of JPEG
I/O-bound designs (less computation and more input/output) like VGA, PCI and AC97 lack inherent parallelism. This makes them unsuitable for multi-core simulation. We tabulate their multi-core simulation results in this section for completeness of the discussion on multi-core simulation. Tables 3.16, 3.17 and 3.18 show the performance degradation under multi-core simulation.
Table 3.17. Multi-core simulation performance of PCI (T1 = 17 min)

# of CPU cores   Partitioning     T2 (min)   Speedup T1/T2
2                instance-based   83         0.2
4                instance-based   79         0.2
6                instance-based   99         0.17
8                instance-based   100        0.17

# of CPU cores   Partitioning     T2 (min)   Speedup T1/T2
2                instance-based   48         0.08
4                instance-based   49         0.08
6                instance-based   47         0.08
8                instance-based   65         0.06
CHAPTER 4
4.1 Introduction
In the previous chapter, we used the Synopsys VCS multi-core simulator [40] to improve simulation speedup for designs having inherent parallelism. We also concluded that communication, synchronization and design partitioning were barriers to speedup and scalability. It needs to be restated that the VCS multi-core simulator [40] partitions the design across multiple CPU cores and allows only this type of partitioning. In this type of partitioning, the focus is on the computation that needs to be performed rather than the data that is input to the computation. The original computation is partitioned into different sub-computations that are performed in parallel.
In contrast, the partitioning scheme that relies on partitioning the data is called domain partitioning [14]. In this chapter, we shall explore this type of partitioning.
1. Compilation;
2. Elaboration; and
3. Execution.
In the elaboration stage, the internal parsed representation of the HDL source
is expanded starting from the root or top level module. The hierarchy of the HDL
design is traversed and instantiations of the submodules are replaced by the actual
modules all the way to the primitive level. This means that all submodules that have
instantiations are expanded as well, until the primitive level is reached. If there are no optimizations, like dead code elimination or constant propagation, the design is ready for the execution stage.
In the execution stage, the design, still invisible to the user, is passed to a code generator that generates code in C/C++ or a similar language, which can be turned into an executable form by a compiler such as the GNU C/C++ compiler [3]. Figure 4.1 describes this flow.
Synopsys VCS [40] simulator internally converts HDL design into C/C++ code
and then compiles the design using GNU C/C++ compiler. This can be verified
by simulating the design and looking at the simulation log which can be redirected
to a file during simulation or examined directly from the screen. The existence of
csrc directory as a result of simulation also proves the point. This directory is created whenever a VCS simulation is run. The user can also create the simulation executable by entering the csrc directory and running the command make product. However, tweaking the C/C++ code generated by VCS is difficult because of its cryptic nature and external library dependencies, which are not visible to the user.

Figure 4.1. HDL simulator internals
Verilator [41] similarly converts the HDL design into C/C++ code and then compiles the C/C++ code to generate a simulation executable. Verilator has gained a lot of popularity and is being used across the EDA industry by major companies. Besides being open source and free, it is extremely fast compared to commercial simulators. Details about Verilator performance, pros and cons can be found at [41].
1. POSIX threads (Pthreads), which requires full manual effort for parallel programming.

2. Message Passing Interface (MPI), which is primarily used for distributed memory systems.

3. OpenMP, whose syntax is easy and which requires only a few changes to convert a serial program into a parallel one.
4.4 Results
It turns out that single core simulation performance of Verilator is much better
than that of commercial simulators like Synopsys VCS. This performance can be further improved by adding parallelization using OpenMP.

Figure 4.2. Extending Verilator for parallel programming

The combination of the
two created the best parallel HDL simulator capable of handling RTL and functional
gate-level (zero-delay) designs. Tables 4.1, 4.2, 4.3 and 4.4 show performance of AES-
128 and RCA-128 RTL and functional gate-level simulations respectively. Figures 4.4
and 4.3 compare the speedup of RTL and GL0 simulation for RCA-128 and AES-128
designs.
Table 4.1. RTL simulation of AES-128 with 65000,00 vectors using Verilator and
OpenMP
Table 4.2. Gate-level (zero-delay) simulation of AES-128 with 65000,00 vectors using
Verilator and OpenMP
Table 4.3. RTL simulation of RCA-128 with 65000,00 vectors using Verilator and
OpenMP
Figures 4.3 and 4.4. RTL and GL0 simulation speedup as a function of the number of threads (1 to 8)
the state of the DUT. We experimented with such a design to see how its performance degrades when simulated in parallel. We took the AES-128 design and configured it such that one of its outputs feeds back into one of the inputs. This causes a dependency, as one cannot encrypt two plain texts in parallel because the second plain text needs the output of the first one. It was observed that despite the dependencies, the performance of the design was not worse than a single-threaded simulation. Hence, in the worst case, performance degrades to that of a single-threaded simulation. Note that this is not the case with functional partitioning, where dependencies cause performance degradation that is worse than running a single-core simulation.
Figures 4.5 and 4.6 show a comparison of the single-core simulation performance of Verilator and VCS at RTL and functional gate-level. These figures show that Verilator beats VCS by a huge margin and seems to be the best way to perform parallel simulation. Also, we extended the capability of Verilator to make it multi-core using OpenMP. Figure 4.7 compares the multi-core performance of Verilator and VCS for
AES-128 design. This clearly shows Verilator performs much better than VCS in
multi-core simulation as well.
Figure 4.5. Single-core RTL simulation time (minutes) of Verilator and VCS for the AES-128 and RCA-128 designs

Figure 4.6. Single-core gate-level (GL0) simulation time (minutes) of Verilator and VCS for the AES-128 and RCA-128 designs
Figure 4.7. Multi-core performance comparison of Verilator and VCS at RTL and
functional gate-level for AES-128
CHAPTER 5
Simulation of the Register transfer level (RTL) model is one of the first and manda-
tory steps of the design verification flow. Such a simulation needs to be repeated often
due to the changing nature of the design in its early development stages and after
consecutive bug fixing. Despite its relatively high level of abstraction, RTL simulation is a very time-consuming process, often requiring nightly or week-long regression runs. We accelerate it by dividing the entire simulation run into independent simulation slices, each to be run in parallel from an initial state provided by a faster higher-level model of the slice. This chapter describes the basic idea of the method and provides some experimental results.
RTL simulation is used to verify the functionality of RTL design. As the design is
at an early stage in the design flow, RTL description may keep changing to accommo-
date more enhancements or as a result of bugs caught during RTL simulation. Hence,
RTL simulation is a must and it is done as exhaustively as possible using directed
and constrained random simulation. RTL regressions are run on a nightly or weekly basis to keep the RTL in a bug-free state.
the design, RTL regression may take a few hours to several weeks to run. It should
be noted that RTL simulation is much faster than gate-level functional (zero-delay)
and gate-level timing simulations. Even then, designers want to simulate RTL faster,
leveraging multi-core machines. In this chapter, we discuss the idea of accelerating
RTL simulation and propose a few approaches that can potentially improve RTL
simulation.
5.1 Introduction
5.1.1 Issues with Co-Simulation
The idea of speeding up simulation of a design model at a lower level of abstraction has already been used in industry [28]. However, its application is limited to selected portions of the design. For
example, instead of simulating an entire design at the gate-level, parts of the design
are simulated at the gate-level, while rest is simulated at RTL. This co-simulation
approach works faster than pure gate-level simulation, but slower than pure RTL
simulation. Also, this approach does not parallelize the entire gate-level or RTL
simulation. Such methods are applicable to processor designs, and to the designs
that rely on higher level models, such as Instruction Set Architecture (ISA). Some
designs, such as SoC, may not have such architectural models, which makes this approach inapplicable to them. There are also commercial parallel simulators that run on multi-core machines. Unfortunately, these simulators have had limited success because of high cost, the communication and synchronization overhead mentioned earlier, and the inability to support the Verilog PLI (Programming Language Interface) and new SystemVerilog testbench features.
1. Time dependency: Before simulating the entire RTL design at a particular time t, the design must be simulated at all times from 0 to t − 1.

2. Spatial dependency: The value of one component of the RTL design depends upon the value from another component of the RTL design.

These two dependencies define two ways of parallelizing the simulation of a design. Temporal parallel simulation (TPS) exploits time dependency while PDES exploits spatial dependency in a design. In TPS, simulation time intervals are made independent by pre-computing the initial state of each time interval. This allows TPS to avoid the communication and synchronization overhead inherent in PDES.
To provide a correct initial state of each time interval (slice) for parallel RTL
simulation, we follow a two-step approach [27][TCAD] proposed earlier for gate level
simulations.
1. Reference Simulation: Simulation that provides initial state of each time slice
in TPS. Normally, this simulation is much faster.
2. Target Simulation: Simulation of a time slice that uses initial state provided
by reference simulation. Normally, this simulation is slower compared to the
reference simulation.
The basic idea of TPS is illustrated in Figure 5.1. It shows fast reference simu-
lation to provide the initial state of each slice for target simulation run. MULTES
[27][TCAD] applied this idea to speed up gate-level timing simulation by using fast
RTL simulation as reference. The initial states were obtained from checkpoints saved
during reference simulation and then restored for gate-level target simulation. It was
speculated [27] that this idea could be used for RTL simulation as well, but the diffi-
culty was to find a suitable higher-level design model such as ESL (Electronic System
Level), that could be used as reference for RTL simulation. The difficulty comes
mostly from solving the state-matching problem between the ESL and RTL models, making this approach impractical. Instead, in this work we compute the initial states for the RTL simulation slices using a higher-level model, such as a C/C++ or SystemC simulation, "on the fly" as they are needed by the RTL simulation. This approach has the additional advantage that it avoids saving and restoring the initial states, which is costly.
The number of target simulations that can be run in parallel is determined by the number of CPU cores available. The theoretical performance of TPS, measured in total simulation time Ttps, can be expressed by Equation 5.1:

$T_{tps} = \sum_{i=1}^{n} \left( T_{ref}(i) + T_{target}(i) \right)$   (5.1)

where

• Tref(i) denotes the time to run the reference simulation that provides the initial state for target simulation of the i-th time slice.

• Ttarget(i) denotes the target simulation time for the i-th time slice.
the standard ASIC and FPGA design flow where design is successively refined from
work, we use C/C++ as reference simulation to enable parallel RTL target simulation.
We assume SystemC, C/C++ or any higher level model of the design is already
available, as many designs are first simulated in C/C++ in the early design phase.
Furthermore, there are open-source tools, such as Verilator [41], that can convert
RTL description into equivalent C/C++ description. Once the C/C++ model for the
design is available, there is no need to translate the Verilog testbench into C/C++
testbench. A C/C++ model can be invoked directly from RTL via PLI, which is a
standard practice in the industry [28], as shown in Figure 5.2. Figure 5.2 shows how
testbench can invoke C/C++ model to obtain the initial state of any slice in time.
Figure 5.2. Temporal RTL simulation setup
Figure 5.3 [4] shows a circuit whose output f at a given time depends on the value of the flip-flop in that clock cycle. The value of f in the 1st clock cycle determines the value k of the flip-flop in the 2nd clock cycle. This value of k in turn determines the new value of f in the 2nd clock cycle, which then determines the new value of k for the 3rd clock cycle, and so on. Hence, to determine the value of f in the nth clock cycle, the value of k needs to be known in the (n − 1)st clock cycle. Sequential simulation over n clock cycles naturally resolves this problem.
Figure 5.4 [4] shows the circuit in Figure 5.3 unrolled twice. Note the absence of
the flip-flop. The value of j in the first clock cycle provides signal k for the second
cycle, etc. The two circuits are described differently at RTL but they produce identical
values of f in every clock cycle. Note that there is no clock in the unrolled circuit in
Figure 5.4, which makes the simulation faster. The verification engineer must create
a virtual clock in the testbench to make sure that input signals are applied at the
appropriate time.
Extending this idea further, the circuit can be unrolled for several time frames, F .
Unrolling the circuit offers some advantages in simulation, as it replaces the sequential
circuit by a combinational one, which can be simulated faster. Furthermore several
cycles of the original circuits can be simulated simultaneously. While the time needed
to simulate each set of F time frames will be longer than for a single frame, the number
of simulation cycles needed to simulate the design over some simulation time ts will be
reduced to ts/F . We experimented with this idea by observing the effect of unrolling
the circuit on the simulation speed. Table 5.1 compares the simulation performance
of the circuits shown in Figures 5.3 and 5.4 on a single-core machine. It shows that
circuit unrolled twice is 1.2× faster than the original circuit. Results of unrolling
over larger number of frames F will be presented in the next section, together with
analyzing the effect of size of the simulation slices on the simulation speedup.
Table 5.1. Single-core simulation performance of the iterative circuit vs. the circuit unrolled twice

# of clock cycles (billions)   Iterative circuit T1 (sec)   Unrolled 2x circuit T2 (sec)
1                              12                           10
2                              24                           20
3                              36                           30
4                              48                           40
5                              60                           50
We will now combine the idea of unrolling the circuit over a fixed number of time
frames, F with the parallel simulation scheme described in Section 5.2 and observe
their effect on simulation speedup. We simulated the circuit in Figure 5.3 for an
5.4.2 Simulation of Small Custom Design Circuit
In the first set of experiments we used the example circuit in Figure 5.3. The circuit was simulated on two CPU cores, using the simulation configuration shown
in Figure 5.5. Core 1 simulates RTL for ”odd” slices: 0 − i, 2i − 3i, etc., where
i is a sufficiently large number of clock cycles, while core 2 performs simulation for
”even” slices: i − 2i, 3i − 4i, etc. The first slice starts with a known initial state and is
directly subjected to RTL simulation (for time TRT L ). At the same time, core 2 starts
simulating the second slice (i to 2i) starting at the initial state at time i. This initial
state is provided by fast C reference simulation (Tc ). To simulate next slice (2i - 3i)
at the first core, additional processing is needed to provide it with the required initial
state. It is composed of two components: i) fast ”testbench forwarding” (Tf ) to bring
the testbench to a state where it is ready to feed the design with correct stimulus; and
ii) the actual C simulation (Tc ). While the C simulation time Tc remains constant,
the testbench forwarding time Tf increases linearly with the number of time slices as
it must always execute the testbench from the beginning. This makes the number
of slices per core an important factor. Ideally, we want to keep the sum Tf + Tc
much smaller than TRTL to gain speedup over traditional RTL simulation. Figure 5.5 also shows comparators to make sure that the reference value from the C/C++ simulation matches the corresponding RTL value.
Tables 5.2, 5.3 and 5.4 show that, as the number of frames per simulation cycle
(unroll factor F ) increases, simulation speedup improves further. It approaches 2 for
the case when F =12 and when the number of slices is 4. Note that these tables show
the worst case time reported from the two cores.
Figure 5.5. Parallel RTL simulation on two cores: each core alternates fast C reference simulation and RTL target simulation

Figure 5.6 summarizes these results in a plot for 1 billion clock cycles on a 2-core machine. Specifically, it shows a family of speedup plots for unroll factors ranging from 1 to 12, as a function of the total number of slices. Note that in the plot for F = 1 (single frame) the greatest speedup is for 2 slices (one per core) and then drops, due to the overhead of switching between C and RTL and the lower slice granularity in this iterative (single frame) case. At the same time, the speedup improves locally (around 4 slices) for the cases when the frames are unrolled several times, offsetting this overhead.
Figure 5.7 shows the relationship between the speedup and the number of frames F
as a family of plots.
Table 5.3. RTL simulation speedup for circuit unrolled 2 times.
Figure 5.6. RTL simulation speedup as a function of number of slices for different
unroll factors.
Figure 5.7. RTL simulation speedup as a function of number of frames for different
slices.
In this experiment, we vary the number of cores to see their impact on simulation speedup. The number of slices is set equal to the number of cores, so there are as many slices as cores. For example, if the number of cores is 4, the simulation is divided into 4 slices that are run simultaneously. This is shown in Figure 5.8. Clearly, the speedup is determined by core 4, which has the slowest run time among all the cores because it spends the most time in testbench forwarding.
Table 5.5 shows the speedup in RTL simulation as a function of the number of cores for the simulation configuration shown in Figure 5.8. Figure 5.9 shows the speedup plot for Table 5.5. It shows that the speedup factor saturates around 10 cores. Thus, increasing cores beyond 12 is not useful for this design. Figure 5.10 shows speedup against the number of cores when the circuit is unrolled by a factor of 4, 6 and 8 time frames.
Figure 5.8. Parallel RTL simulation across multiple CPU cores
Table 5.5. RTL simulation speedup as a function of the number of cores

# of CPU cores   # of clock cycles (billions)   Traditional RTL sim time T1 (sec)   Parallel RTL sim time T2 (sec)   Speedup T1/T2
Figure 5.9. RTL simulation speedup as a function of the number of cores
Figure 5.10. RTL simulation speedup as a function of the number of cores for
different unroll factors
5.5 Multi-core Architecture of Temporal RTL Simulation
We propose an architecture of temporal RTL simulation that exploits multi-core
architecture of the underlying hardware. The basic setup is shown in Figure 5.11.
The new architecture shows that Electronic System Level (ESL) simulation runs as
an independent thread on a CPU core. This thread simulates the design at ESL level,
checkpoints the state and spawns RTL simulation of a slice on a free CPU core. At
the end of each time slice simulation, the ESL thread checks for horizontal state matching (whether the beginning ESL state for slice i+1 matches the ending RTL state of slice i). If the states match between slice i and slice i+1 for every time slice i, the ESL is known to be accurately predicting the initial state of each slice. This mode of the simulation is called "Prediction Mode", where the ESL simulation correctly predicts the initial state of each time slice. If, on the other hand, horizontal state matching fails for a slice i+1, the simulation result of slice i+1 is discarded and slice i+1 is re-simulated using the state from the previous slice i rather than from the ESL. This mode of the simulation is called the "Actual Mode". The actual mode imposes re-simulation overhead, but it affects only the slice(s) which experience a state mismatch, while not affecting the rest of the simulation.
Figure 5.11. Multi-core architecture of temporal RTL simulation
Figure 5.12 shows temporal RTL simulation on four cores. Tref represents the time to provide the initial state for a time slice to be simulated at RTL. Figure 5.13 shows simulation of the same design on two cores. Note that the width of the RTL time slice in Figure 5.13 is twice the width of the RTL time slice in Figure 5.12. It turns out that the two-core configuration simulates the design faster than the four-core configuration. This is because the four-core configuration keeps the last core idle for a significant amount of time, as it takes the longest time Tref to provide it with its initial state. On the other hand, the 2-core configuration does not have this issue. Table 5.6 compares the simulation results. We used the Cadence Incisive 13.1 simulator for RTL simulation on a quad-core Intel CPU with 8GB RAM. From this experiment, we conclude that simulating a design on a large number of cores does not necessarily lead to speedup.
Figure 5.12. Temporal RTL simulation on four cores
Table 5.6. Temporal RTL simulation on two and four cores

# of CPU cores   # of clock cycles (billions)   Traditional RTL sim time T1 (sec)   Parallel RTL sim time T2 (sec)   Speedup T1/T2
2                1                              764                                 435                              1.75
2                2                              1492                                988                              1.51
4                1                              764                                 570                              1.34
4                2                              1492                                1280                             1.16
Figure 5.14. AES-128 design in CBC mode
The 128-bit input vectors are: plain text (PT), key and initialization vector (IV).
The output vector is 128-bit cipher text (CT). As can be seen, the design is similar
in structure to the simple circuit shown in Figure 5.3. To accelerate cipher text computation, we used a C model of the design together with the RTL to parallelize the simulation on a 2-core machine, and the simulation run was partitioned into 5 slices (three on the first core and two on the second), as this offered the best overall simulation performance. Figure 5.15 shows this configuration. The results shown in Table 5.7 indicate that the simulation performance was capped at about 1.7x speedup on the 2-core CPU.
Table 5.7. AES-128 speedup with parallel simulation

# of CPU cores   # of time slices   # of plain texts (millions)   Traditional RTL sim time T1 (sec)   Parallel RTL sim time T2 (sec)   Speedup T1/T2
2                5                  0.1                           5                                   5                                1.00
2                5                  1                             52                                  33                               1.57
2                5                  10                            517                                 340                              1.52
2                5                  100                           4200                                2700                             1.55
5.6 Conclusion
This chapter presented an approach to accelerating RTL simulation targeting multi-core CPUs. It presented a new technique for accelerating RTL simulation based on temporal partitioning of the simulation, using a higher-level model (C/C++) to provide the initial states for the individual simulation slices. We showed that simulation speedup can be controlled by varying the number of slices, the number of CPU cores, and by unrolling the circuit by a number of time frames per simulation cycle. To the best of our knowledge, this is the first attempt at RTL simulation acceleration using temporal partitioning with a higher-level model.
CHAPTER 6
6.1 Introduction
Traditional dynamic simulation with back-annotation in standard delay format (SDF) cannot be reliably performed on large designs. The large size of SDF files makes event-driven timing simulation extremely slow, as it has to process an excessive number of timing events. We propose a fast prediction-based gate-level timing simulation that combines static timing analysis (STA) at the block level with dynamic timing simulation at the I/O interfaces. We demonstrate that the proposed timing simulation can be done earlier in the design cycle.
Gate-level timing simulation suffers from very low performance because of its inherently sequential nature and heavy event activity. As the design gets refined into
lower levels of abstraction, and as more debugging features are added into the design,
simulation time increases significantly. Figure 6.1 shows the simulation performance
of AES-128 design [32] at various levels of abstraction with debugging features en-
abled. As the level of abstraction goes down to gate or layout level and debugging
features are enabled, simulation performance drops down significantly. This is due
to a large number of events at the gate-level or layout level, timing checks and disk
access to dump simulation data.
Figure 6.1. Drop in simulation performance with lower levels of abstraction and debugging enabled
timing simulation using static timing analysis (STA) as ”timing predictor” at the
block level [9]. We propose an automatic partitioning scheme that partitions the
gate-level netlist into blocks for SDF annotation and STA. We also propose a new
design/verification flow where timing simulation can be done early in the design cycle
entire design. However, for large designs, such SDF back-annotation will negatively
impact the performance of gate-level timing simulation.
Gate-level block1 is analyzed by the STA tool, which reports the maximum delay inside the block. Only this value is back-annotated during simulation, as dsta, at the output of block1. This type of timing annotation is termed selective SDF annotation.
Note that STA can be performed on the gate-level block1 as part of the whole design
multiple blocks, the proposed STA based timing prediction approach can be used for
know the timing critical blocks in a design where selective SDF back-annotation can
Figure 6.3. Hybrid gate-level timing simulation with partial SDF back-annotation

Partitioning of the gate-level netlist into blocks for SDF annotation and STA is a challenging problem, as the verification engineer may not have the insight to identify timing-critical blocks. Furthermore, partitioning schemes are often applied manually, which may cause a problem when dealing with huge gate-level netlists. Often the gate-level netlist is flattened and its hierarchy is not preserved. We propose a partitioning scheme based upon STA that is fully automated and works for flat or hierarchical gate-level netlists.
The main goal of STA is to calculate the slowest (critical) path in the design. One can choose to report not only the most timing-critical path but also the next most timing-critical path, and so on. The STA report then lists these timing-critical path(s) and the associated module instances. See Figures 6.4 and 6.5 for the most timing-critical paths in the VGA and AES-128 designs [32]. Since these paths are timing critical, one would want to simulate them with back-annotated timing to make sure that their timing conforms to the STA results. In brief, one can include all the module instances that are in the timing-critical path(s) for SDF back-annotation. We call this group of instances Block2, as shown in Figure 6.3. All the other module instances can be considered not timing critical. These module instances shall be simulated in functional (zero-delay) mode. This group of instances is called Block1.
However, one needs to run STA on Block1 to find out their worst case delay dsta as
shown in Figure 6.3. All of this can be automated in a flow as shown in Figure 6.6.
Sample timing constraint file tf ile is shown in Figure 6.7 for AES-128 design [32].
Figure 6.4. Static Timing Analysis (STA) of VGA controller design
Figure 6.5. Static Timing Analysis (STA) of AES-128 controller design
Figure 6.6. Automated partitioning and simulation flow for hybrid gate-level timing
simulation
Figure 6.7. Sample timing constraint file (tfile) for AES-128 design
6.2.3 Integration with the existing ASIC/FPGA Design Flow
Figure 6.8 shows the flow for this approach. The key idea is to capture the peripheral timing of each block via static timing analysis and various estimates derived from time budgeting. As the majority of the design blocks are simulated in functional (zero-delay) mode, except at the block periphery, this should result in a significant speedup compared to simulation with full SDF back-annotation.
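The idea of timing only at the periphery can be illustrated with a minimal sketch. The block function, the delay value, and the interface below are hypothetical: a zero-delay functional model computes its outputs instantly, and timing awareness is added only at the boundary by delaying the outputs with the STA-derived worst-case delay dsta.

```python
DSTA = 7  # ns; worst-case output delay reported by STA (assumed value)

def functional_block(a, b):
    """Zero-delay functional model of Block1 (hypothetical logic)."""
    return a ^ b

def timed_output(a, b, t_inputs):
    """Peripheral timing only: the value is computed with zero delay,
    but its arrival time is pushed out by the STA worst-case delay."""
    return functional_block(a, b), t_inputs + DSTA

value, t_arrival = timed_output(1, 0, t_inputs=0)
print(value, t_arrival)  # 1 7
```

Internally the block processes no delay events at all; only the boundary signal carries timing, which is where the speedup comes from.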
6.2.4 Early Gate-level Timing Simulation
The concept of early gate-level timing simulation is shown in Figure 6.9, where gate-level Block1 is replaced by its equivalent RTL. Block1 is now simulated at RTL instead of its gate-level model. The key idea is to perform timing simulation using an estimated timing dest early in the design cycle, before all blocks have been synthesized. The estimated timing can come from time budgeting or from a tool like Synopsys DC Explorer [23]. This is in contrast to the conventional approach, where gate-level simulation is performed later in the design flow, after the synthesis or place-and-route step, when all the detailed delay data is available. Major simulator vendors have already embraced the idea of early timing simulation based on estimated delays, realizing that performing gate-level timing simulation late in the design cycle is prohibitively slow. Verification engineers get around this problem by performing gate-level timing simulation of only the timing-critical blocks with a few test vectors. However, they are not able to perform full-chip timing simulation with a large number of test vectors, which often leaves certain timing bugs undetected. Synopsys has recently announced a new product called DC Explorer [23] that is based on the same idea of early design exploration. It can produce early synthesis, timing, and other estimates with enough accuracy for designers to start the simulation process early in the design flow.
Figure 6.9. Early timing simulation using RTL with estimate of peripheral timing
6.3 Experiments
6.3.1 Experimental Setup
We performed experiments on the AES-128, 3-DES, VGA controller, and JPEG encoder designs. We used the Cadence Incisive Unified Simulator 13.1 on a quad-core Intel CPU with 8 GB RAM. The designs were synthesized with Synopsys Design Compiler using the TSMC 65nm standard cell library. All these designs except the VGA controller are single-clock designs. Table 6.1 shows essential statistics for these designs.
Table 6.1. Essential statistics for the designs

Design Name    Synthesized Area (NAND2 equivalents)
AES-128        18400
3-DES          96650
VGA            144189
JPEG           968788
6.3.2 Results
First, we show simulation results for the AES-128 design. We start with SDF annotation of the majority of blocks (to accommodate many timing-critical paths) and then gradually decrease the number of SDF-annotated blocks to one (to accommodate only the worst-case timing path). The module hierarchy for AES-128 is shown in Figure 6.10. Table 6.2 shows the results: a significant speedup over fully SDF-annotated timing simulation can be attained.
Figure 6.10. Instance hierarchy of AES-128 design

Table 6.2. Simulation speedup of AES-128 for variable number of blocks in SDF annotation

The waveforms in Figure 6.11 illustrate the difference between full SDF annotation and selective SDF annotation when only one block (aes_sbox4) is in STA. The signal from selective SDF annotation is delayed more than the fully SDF-annotated signal, due to the STA delay, but it contains no glitches; it therefore has fewer events to process during simulation, which makes the simulation faster. Both signals match at the clock cycle boundary. Similarly, Figures 6.12 and 6.13 show the same effect when two blocks (aes_sbox4 and aes_sbox5), and when the majority of the aes_sbox blocks, are in STA.
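The event-count argument can be made concrete with a small sketch. The waveforms below are hypothetical (time, value) transition lists within one clock cycle: the fully SDF-annotated signal glitches before settling, while the selective (STA-predicted) signal makes a single transition at t = dsta, yet both agree when sampled at the clock edge.

```python
CLOCK_PERIOD = 10  # ns, assumed

# Hypothetical (time, value) transitions within one clock cycle.
full_sdf  = [(0, 0), (2, 1), (3, 0), (4, 1)]  # glitches, then settles to 1
selective = [(0, 0), (7, 1)]                  # single edge at dsta = 7 ns

def value_at(waveform, t):
    """Value of the signal at time t (last transition at or before t)."""
    v = waveform[0][1]
    for time, val in waveform:
        if time <= t:
            v = val
    return v

# Fewer transitions mean fewer events for the simulator to process.
events_full = len(full_sdf) - 1   # 3 events
events_sel  = len(selective) - 1  # 1 event

# Both signals carry the same value at the clock-cycle boundary.
assert value_at(full_sdf, CLOCK_PERIOD) == value_at(selective, CLOCK_PERIOD)
print(events_full, events_sel)  # 3 1
```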
In the next set of experiments, all designs were divided into two gate-level blocks, Block1 and Block2, as shown in Figure 6.3. Block2 contains the module instances from the most timing-critical path; here, only one timing-critical path is considered. The approach has the additional advantage that it validates the result of STA, which depends on manual constraint entry. If the simulation shown in Figure 6.9 exhibits a timing failure, this helps debug the STA constraints. Once the constraints are corrected, STA is run again to provide the new #dsta value. This STA-to-simulation cycle is repeated until all timing failures are debugged and removed from the simulation.

Figure 6.13. Full SDF-annotated signal versus selective SDF-annotated signal when the majority of the blocks are in STA
Table 6.3 shows the speedup obtained using our hybrid gate-level timing simulation. The signal dumping and comparison described next were performed only to verify the proposed simulation approach; in practice, a verification engineer can skip this step to reduce verification time.
While the testbench can verify the functional correctness of the two simulations, the proposed verification scheme helps verify their timing correctness. For both simulations to be timing correct, the monitored signals from the two simulations should match at the clock cycle boundary. Unfortunately, dumping, as shown in Table 6.4, can drastically reduce simulation performance. Furthermore, the amount of dumped data can quickly fill the disk. Therefore, it is recommended that dumping be done for a small time interval rather than for the entire simulation. We used small simulation intervals to verify the timing correctness of the output signals of the designs. The Cadence Comparescan tool was used to compare the dumped signals; it reported the signals to be matching at the clock cycle boundary. Table 6.4 shows the comparison between full SDF gate-level timing simulation and the proposed hybrid gate-level timing simulation for all the flip-flops/registers in the VGA and AES-128 designs. The fact that the register values match at the clock cycle boundary during the entire simulation confirms the accuracy of our approach.
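The clock-cycle-boundary check can be sketched as follows, in the spirit of what a waveform-comparison tool such as Comparescan does; the trace format and values are hypothetical. Each dumped register trace is sampled at every clock edge, and the two simulations must agree at each sample point, even though they differ mid-cycle.

```python
def sample(trace, t):
    """Value of a signal at time t (last transition at or before t)."""
    v = trace[0][1]
    for time, val in trace:
        if time <= t:
            v = val
    return v

def match_at_clock_edges(trace_a, trace_b, period, end_time):
    """True if both traces agree at every clock-cycle boundary."""
    return all(sample(trace_a, t) == sample(trace_b, t)
               for t in range(period, end_time + 1, period))

# Hypothetical register traces: the full-SDF one glitches mid-cycle,
# the hybrid one has a single STA-delayed edge per cycle.
full_sdf = [(0, 0), (2, 1), (3, 0), (4, 1), (12, 0)]
hybrid   = [(0, 0), (7, 1), (17, 0)]

print(match_at_clock_edges(full_sdf, hybrid, period=10, end_time=30))  # True
```

Only the boundary samples need to be stored and compared, which is why dumping over a short interval suffices for the check.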
Table 6.4. Accuracy of hybrid gate-level timing simulation at the register boundary
In the proposed flow, gate-level timing simulation is performed early in the design cycle, using estimates from time budgeting and/or STA. Tools like Synopsys DC Explorer [23] can provide timing estimates for running gate-level timing simulation. As already mentioned, performing gate-level timing simulation late in the design cycle is prohibitively slow and may force design changes back in the RTL or may require an ECO. Furthermore, the ability to perform long full-chip timing simulation in a short amount of time is much welcomed by the industry. Figures 6.15 and 6.16 show the traditional and the new simulation flow, respectively. The obvious advantage of the new flow is rapid gate-level timing simulation early in the design cycle, so that timing checks are validated and bugs are caught early on.
Figure 6.16. Proposed flow of early simulation in ASIC/FPGA design
This chapter proposed an approach to hybrid gate-level timing simulation [9] that makes use of STA and selective SDF back-annotation to accelerate gate-level timing simulation. In this approach, STA acts as a timing predictor for the blocks that are run without SDF back-annotation. The approach also validates the result of STA, which depends on manual constraint entry. The proposed approach can be applied
CHAPTER 7
7.1 Conclusion
In the previous chapters, we described three techniques for accelerating HDL simulation at three levels of abstraction, namely RTL, functional gate-level, and gate-level timing. The designs used in our experiments perform the following categories of operations:

2. Matrix operations;
3. Arithmetic circuits;
4. Filtering/DSP; and
5. Network dataflow operations.

From the above categorization, it is clear that the designs chosen in our experiments encompass almost all common operations and design categories. Table 7.1 categorizes the chosen designs into the above categories.
abstraction can bring huge improvements in verification time. Table 7.2 shows the
compared to Synopsys VCS [40], but we were able to add a multi-core simulation capability to Verilator by using OpenMP [7]. To the best of our knowledge, this way of parallelization has not been explored before. We were able to increase the performance of both RTL and gate-level simulations using Verilator with OpenMP. It is worth mentioning that running cost-free simulation software, i.e., Verilator with OpenMP, on a Linux platform such as Red Hat [5] or CentOS [2] offers a huge financial advantage for researchers and companies with limited budgets. This work is a contribution towards open-source software, as this thesis has benefited from open-source simulation tools. We address all three levels, i.e., time-parallel RTL, gate-level timing
The first challenge in timing closure in modern process technologies is process variation. Today, due to increased variation, the number of timing corners has grown. For example, the variations between different layers (transistors, M1/M2, higher metal, etc.) are not correlated, and combinations of fast/slow metal versus fast/slow transistors need to be analyzed. This can be addressed using statistical static timing analysis (SSTA) tools such as Synopsys PrimeTime VX [6].
This work focused only on setup time violations, as we were dealing with pre-layout verification. There can be hold time violations in any block within a chip, regardless of whether it has critical ("long") timing paths. Running simulations with a reduced SDF file means that hold violations may not be detected. To catch potential hold time violations, it is recommended to start with the proposed hybrid methodology at the post-layout stage and fix any hold violations found. In the next step, gradually increase the number of SDF-annotated blocks; if new hold violations appear, fix them, and continue adding blocks to the SDF annotation until all hold violations are fixed. In the worst case, it is possible that all blocks end up SDF annotated, but the probability of this happening is likely insignificant.
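The incremental hold-closure loop above can be sketched as control flow. Every step here is a hypothetical placeholder: run_hybrid_simulation and fix_hold_violations stand in for the real simulation and ECO/constraint-fixing tool steps, and the stub simply pretends violations persist until three blocks are annotated.

```python
def run_hybrid_simulation(annotated):
    """Stub: pretend hold violations appear until 3 blocks are annotated."""
    return [] if len(annotated) >= 3 else ["hold_violation"]

def fix_hold_violations(violations):
    pass  # placeholder for the actual ECO / constraint fixes

def hold_closure(blocks):
    """Grow the SDF-annotated set until the simulation is hold-clean."""
    annotated = blocks[:1]  # start with the most timing-critical block
    while True:
        violations = run_hybrid_simulation(annotated)
        if violations:
            fix_hold_violations(violations)
        if not violations or len(annotated) == len(blocks):
            return annotated  # hold-clean, or worst case: fully annotated
        annotated = blocks[:len(annotated) + 1]  # annotate one more block

print(len(hold_closure(["b1", "b2", "b3", "b4"])))  # 3
```

The loop terminates either when a simulation pass reports no hold violations or, in the worst case, when every block has been annotated.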
We showed that by using a reduced SDF file, the simulation times are significantly reduced. Extending the approach to detect timing violations at all timing corners is very much needed today. We ran Cadence CompareScan for some time to compare and verify values across all the registers; however, how much simulation is needed before the results can be declared verified is not known. It is worth investigating how much effort is needed to perform the matching of the full SDF-annotated and hybrid simulations.
We demonstrated that as one increases the number of cores, the time-parallel simulation approach does not scale. The question arises: is it possible to change the architecture or revamp the scheme to make it scalable with the number of CPU cores? What are the potential barriers to scalability, and how can they be overcome?

Another interesting idea would be to compare the horizontal state matching approaches between ESL-RTL and RTL-GL0 to find out their similarities and differences. This may lead to restructuring or redefining horizontal state matching.
7.3.3 Future Work in Accelerating Multi-core RTL or Functional Gate-level Simulation

We explored both partitioning the design across multiple cores using the VCS multi-core simulator, and partitioning the test vectors across cores using Verilator, which is a single program multiple data (SPMD) approach. It turns out that the SPMD approach is more scalable than design partitioning. Future work in this direction could combine design partitioning with the SPMD approach using Verilator. This has the potential to be the best performance-driven simulation approach if properly instrumented.
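The SPMD style of parallel simulation can be sketched as follows: the same compiled model runs on disjoint chunks of the test-vector set, one worker per core, and the per-chunk results are merged afterwards. Here simulate_one is a hypothetical stand-in for invoking a Verilator-compiled model on one test vector.

```python
from multiprocessing import Pool

def simulate_one(vector):
    """Hypothetical stand-in for running the compiled model on one vector."""
    return vector % 2  # pretend 'response' of the design

def spmd_simulate(vectors, cores=4):
    """Same program, multiple data: each worker simulates its own vectors."""
    with Pool(cores) as pool:
        return pool.map(simulate_one, vectors)

if __name__ == "__main__":
    print(spmd_simulate(range(8), cores=2))  # [0, 1, 0, 1, 0, 1, 0, 1]
```

Because the workers share no simulation state, this scheme avoids the synchronization and communication overheads that limit design partitioning, which is why it scales better with the number of cores.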
CHAPTER 8
8.1 Publications
6. M. Basith, T. Ahmad, A. Rossi, and M. Ciesielski, "Algebraic Approach to Arithmetic Design Verification," Formal Methods in Computer Aided Design (FMCAD 2011).
8.2 Support
This work has been supported by funding from the National Science Foundation (NSF), award no. CCF-1017530.
8.3 Acknowledgements
I would like to acknowledge Wilson Snyder of Cavium Networks for creating a tool like Verilator [41]. I also want to thank Hristo Iliev from the HPC team for being available throughout to listen and discuss ideas related to my research and open-source hardware.

Professor Ciesielski has been an excellent advisor; he is far beyond any of his colleagues. His energy, passion, work ethic, and presentation skills are all outstanding. He has always been open to new ideas, meeting new people, and expanding his skills, which is why he is so good. I wish I could become like him one day. Salut, Professor Ciesielski.
BIBLIOGRAPHY
[8] Ahmad, Tariq B., and Ciesielski, Maciej. An approach to multi-core functional
gate-level simulation minimizing synchronization and communication overheads.
In Microprocessor Test and Verification Conference (MTVCON) (2013).
[9] Ahmad, Tariq B., and Ciesielski, Maciej. Fast sta prediction-based gate-level
timing simulation. In Design and Test Europe (DATE) (2014).
[10] Anderson, T., and Bhagat, R. Tackling functional verification for virtual com-
ponents. In ISD Magazine (2000).
[13] Bailey, Mary L., Briner, Jack V., Jr., and Chamberlain, Roger D. Parallel logic simulation of VLSI systems. ACM Comput. Surv. 26, 3 (1994), 255–294.
[15] Chamberlain, Roger D. Parallel logic simulation of VLSI systems. In DAC (1995), pp. 139–143.
[16] Chang, Kai-Hui, and Browy, Chris. Parallel logic simulation: Myth or reality?
IEEE Computer 45, 4 (2012), 67–73.
[19] Kim, Dusung, Ciesielski, Maciej, and Yang, Seiyang. MULTES: Multi-level temporal parallel event-driven simulation. IEEE Trans. on CAD of Integrated Circuits and Systems (2013), pp. 845–857.
[21] Fujimoto, Richard. Parallel discrete event simulation. Commun. ACM 33, 10
(1990), 30–53.
[24] Jefferson, David R. Virtual time. ACM Trans. Program. Lang. Syst. 7, 3 (July
1985), 404–425.
[26] Kim, Dusung, Ciesielski, Maciej J., Shim, Kyuho, and Yang, Seiyang. Temporal parallel simulation: A fast gate-level HDL simulation using higher level models. In DATE (2011), pp. 1584–1589.
[27] Kim, Dusung, Ciesielski, Maciej J., and Yang, Seiyang. A new distributed event-driven gate-level HDL simulation by accurate prediction. In DATE (2011), pp. 547–550.
[28] Lam, William K. Hardware Design Verification: Simulation and Formal Method-
Based Approaches. Prentice Hall, 2005.
[29] Li, Lijun, and Tropper, Carl. A design-driven partitioning algorithm for dis-
tributed verilog simulation. In PADS (2007), pp. 211–218.
[31] Nicol, David M. Principles of conservative parallel simulation. In Proceedings of
the 28th conference on Winter simulation (Washington, DC, USA, 1996), WSC
’96, IEEE Computer Society, pp. 128–135.
[33] Rashinkar, P., and Singh, L. New soc verification techniques. In IP/SOC 2001
(2001).
[35] Ahmad, Tariq B., Kim, Namdo, Min, Byeong, Kalia, Apurva, Ciesielski, Maciej, and Yang, Seiyang. Scalable parallel event-driven HDL simulation for multi-cores. In Synthesis, Modeling, Analysis and Simulation Methods and Applications to Circuit Design (SMACD) (2012), pp. 217–220.
[36] Ahmad, Tariq B., Kim, Dusung, Ciesielski, Maciej, and Yang, Seiyang. Application of parallel distributed event-driven simulation for accelerating hardware verification. In Advances in Distributed and Parallel Computing (ADPC) (2012).
[37] Rauber, Thomas, and Rünger, Gudula. Parallel Programming for Multicore and Cluster Systems. Springer-Verlag, 2010.
[38] Tompkins, Joe, and Joshi, Prathamesh. Improving Functional Gate Level Simulation Performance: A Case Study. Synopsys User Group Boston (2011).
[41] Wilson Snyder, Paul Wasson, and Galbi, Duane. Verilator. http://www.
veripool.org/wiki/verilator, 2007.
[42] Zhu, Yuhao, Wang, Bo D., and Deng, Yangdong. Massively parallel logic simulation with GPUs. ACM Trans. Design Autom. Electr. Syst. 16, 3 (2011), 29.
122