SystemVerilog - Coding
Q1. FIFO depth, given read and write rates for a burst of x writes
Q2. a=0; b=0; c=1; #1 a=c; #1 b=a; (Give waveforms)
Q3. a<=0; b<=0; c<=1; #1 a<=c; #1 b<=a; (Give waveforms)
Q4. a=0; b=0; c=1; a=#1 c; b=#1 a; (Give waveforms)
Q5. a<=0; b<=0; c<=1; a<=#1 c; b<=#1 a; (Give waveforms)
Q6. You have an incoming bit stream that you cannot store. You receive a new bit at every clock edge; find modulo 5 of the
accumulated number each time. E.g., if the bitstream is 10111, you find the modulo of 1, then 10, then 101, and so on.
Fibonacci series
Questions on C++, Perl, System Verilog.
Computer Architecture Concepts, Memory Consistency and Cache Coherency, cache configuration.
difference between non-blocking and blocking assignment
How to verify asynchronous fifo?
How to implement a greedy snake game? What data structures would you use to implement the snake?
In a certain protocol, why is the ready signal inout instead of out?
About the refresh in DDR2
FSM
System Verilog, Verilog, C, Perl, (also questions about OOP)
Bit operation
Asked to write SystemVerilog constraints for a variety of random stimulus needs
What is verification about? What are the components of design verification? What is coverage? What are the coverage types?
setup and hold time
Aptitude-based questions (apples and oranges)
Perl scripting and programming-based questions
Write code for a UVC mimicking a memory. Reactive sequences in UVM
Explain how an out-of-order processor works? How do you implement register renaming? Difference
between an architectural and physical register file
Verilog code writing, simple hardware design question using muxes and counter that was approached from
different levels of abstraction.
Entirely computer architecture questions, including cache coherency protocols, cache organizations
What is the scope of a static variable? Given multiple scenarios (static variables across files, in recursion, etc.)
Describe what a virtual function does?
What are some ways for error testing/handling in software?
Computer Architecture stuff: OOO, memory dependencies, Pipelining, Fetch stage, Branch Prediction
System Verilog: coverage and assertion writing
Digital Logic: Implement AND and OR using 2:1 mux
Asked to rate myself in C++, System Verilog
C program to sort array. Binary search vs Linear Search. Time complexity.
How to verify many design scenarios.
Difference between union and struct (C++).
VIPT cache.
What is an isolation cell?
FIFO Depth, SV assertions, Multi-threading and OOP concepts
Random number generations, assertions, constraint
Bug scenarios: http://hwinterview.com/index.php/2016/11/01/bug-scenarios/
What are the goals of a verification engineer?
Develop a testplan to define the what, where and how of testing methodology.
Design a reusable and scalable testbench environment to verify the module.
Work with the designer to ensure that the design meets all the specifications through coverage analysis.
Start debugging with the mindset that the testbench is incorrect. Once that is ruled out, then the designer
can be involved in the debug effort.
Automate the checking process.
Gain a good grasp of the design specifications. The verification engineer should not completely trust the
designer to determine if the design has been documented correctly.
Once a reasonable understanding is obtained, suggest a re-design or re-evaluation if a particular piece of
logic constantly sees issues. The verification engineer should not be afraid to push for changes, keeping
verification schedules in mind.
What is a testplan?
A testplan is probably the most crucial aspect of the design verification flow. In general, it involves defining the following aspects:
Engineers involved in designing and verifying the module
Module features to be verified based on the design specification
Environment used for testing (unit/system/emulation)
Schedule
Description of how to go about the task of thoroughly verifying the module
Benefits of a good testplan
A good testplan sets the ground work for focusing on the important features to be verified.
It also provides a framework for evaluating the progress of verification through functional coverage.
Furthermore, it provides a good opportunity to hash out any misunderstandings on features and interfaces. Thus, a
review can be held by all the stakeholders involved in the module (design, block verification, system verification,
architecture) to clearly define the methodology for testing.
What should be in a testplan?
Testbench description: A brief overview (maybe a diagram) of the testbench components used, such as
scoreboards/checkers and agents. A description of the testbench files is also beneficial for anyone new to the
module to grasp the testbench intent.
Features: The testplan should list the feature specifications and map them to specific coverpoints. It is also crucial
to focus on the interfaces of the block, as these are the usual spots to uncover bugs. Another aspect is to provide the
scenarios of how the end product will be used by the system.
How to test? The how of testing should cover the following:
o High risk areas
o Scope of what should be covered in the future
o Assumptions
o Test fail criteria
1. The testbench must generate proper input stimulus to activate a design error.
2. The testbench must generate proper input stimulus to propagate all effects resulting from the design error to an output port.
3. The testbench must contain a monitor that can detect the design error that was first activated and then propagated to a point where it can be
detected.
Circular Buffer Implementation
module dualPortRam (
  input  wire        clk,
  // Port A
  input  wire [9:0]  addressA,
  input  wire [15:0] dataInA,
  input  wire        writeEnableA,
  output reg  [15:0] dataOutA,
  // Port B
  input  wire [9:0]  addressB,
  input  wire [15:0] dataInB,
  input  wire        writeEnableB,
  output reg  [15:0] dataOutB
);
  reg [15:0] myMemory [0:1023];       // 1K x 16 storage shared by both ports
  always @(posedge clk)               // Port A: write-first read/write port
    if (writeEnableA) begin
      myMemory[addressA] <= dataInA;
      dataOutA <= dataInA;
    end
    else
      dataOutA <= myMemory[addressA];
  always @(posedge clk)               // Port B: write-first read/write port
    if (writeEnableB) begin
      myMemory[addressB] <= dataInB;
      dataOutB <= dataInB;
    end
    else
      dataOutB <= myMemory[addressB];
endmodule
Some advantages of using binary pointers over Gray code pointers:
The technique of sampling a multi-bit value into a holding register and using synchronized handshaking control signals to pass the multi-bit value into a new clock domain can be
used for passing ANY arbitrary multi-bit value across clock domains. This technique can be used to pass FIFO pointers or any multi-bit value.
Each synchronized Gray code pointer requires 2n flip-flops (2 per pointer bit). The sampled multi-bit register requires 2n+4 flip-flops (1 per holding register bit in each clock
domain, 2 flip-flops to synchronize a ready bit and 2 flip-flops to synchronize an acknowledge bit). There is no appreciable difference in the chance that either pointer style would
experience metastability.
The sampled multi-bit binary register allows arbitrary pointer changes. Gray code pointers can only increment and decrement.
The sampled multi-bit register technique permits arbitrary FIFO depths; whereas, a Gray code pointer requires power-of-2 FIFO depths. If a design required a FIFO depth of at
least 132 words, using a standard Gray code pointer would employ a FIFO depth of 256 words. Since most instantiated dual-port RAM blocks are power-of-2 words deep, this may
not be an issue.
Using binary pointers makes it easy to calculate almost-empty and almost-full status bits using simple binary arithmetic between the pointer values.
One small disadvantage to using binary pointers over Gray code pointers is:
Sampling and holding a binary FIFO pointer and then handshaking it across a clock boundary can delay the capture of new samples by at least two clock edges from the receiving
clock domain and another two clock edges from the sending clock domain. This latency is generally not a problem, but it will typically add more pessimism to the assertion of full
and empty and might require additional FIFO depth to compensate for the added pessimism. Since most FIFOs are typically specified with excess depth, it is not likely that extra
depth will actually be required.
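As a brief sketch of the Gray-code pointer style discussed above (the module and signal names are illustrative, not taken from the original notes), the write pointer can be kept in binary for arithmetic and converted to Gray code before being synchronized into the read clock domain; only the Gray-coded register crosses the boundary, so at most one bit changes per increment:

module wptr_gray #(parameter ADDR = 4) (
  input  logic          wclk, wrst_n, winc,
  output logic [ADDR:0] wptr_gray_q                              // Gray-coded pointer, safe to synchronize
);
  logic [ADDR:0] wptr_bin_q, wptr_bin_next, wptr_gray_next;

  assign wptr_bin_next  = wptr_bin_q + (winc ? 1'b1 : 1'b0);     // binary arithmetic stays simple
  assign wptr_gray_next = (wptr_bin_next >> 1) ^ wptr_bin_next;  // binary-to-Gray conversion

  always_ff @(posedge wclk or negedge wrst_n)
    if (!wrst_n) begin
      wptr_bin_q  <= '0;
      wptr_gray_q <= '0;
    end
    else begin
      wptr_bin_q  <= wptr_bin_next;
      wptr_gray_q <= wptr_gray_next;
    end
endmodule

The Gray-coded output is then passed through an ordinary two-flop synchronizer in the other clock domain.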
ASSERTIONS
An assertion is a statement about a design's intended behavior.
- If a property that is being checked in a simulation does not behave the way we expect it to, the assertion fails.
- If a property that is forbidden from happening in a design does happen during simulation, the assertion fails.
- It helps capture the designer's interpretation of the specification. - It describes a property of the design. - An assertion does not help in designing any entity, but
it checks the behavior of the design.
assert property (@(posedge clk) $rose(req) |-> ##[1:3] $rose(ack)); In this example, when there is a positive edge on the Request (req) signal, then make sure that
between 1 and 3 clock cycles later, there is a positive edge on the acknowledge (ack) signal. Here the designer knows that the acknowledge signal should go high within 1 to 3 cycles
after the Request signal has gone high at the positive edge.
Immediate assertions use the keyword assert (not assert property), are placed in procedural code, and are executed as procedural statements.
- Based on simulation event semantics. - The test expression is evaluated just like any other Verilog expression within a procedural block. These are not temporal in
nature and are evaluated immediately. - Have to be placed in a procedural block. - Used only with dynamic simulation.
A sample immediate assertion is shown below:
always_comb
begin
a_ia: assert (a && b);
end
Concurrent assertions use the keywords assert property, are placed outside of a procedural block, and are executed once per sample cycle at the end of the
cycle. The sample cycle is typically a posedge clk, and sampling takes place at the end of the clock cycle, just before everything changes on the next posedge clk.
- Based on clock cycles. - The test expression is evaluated at clock edges based on the sampled values of the variables involved. - Sampling of variables is done in
the Observed region of the scheduler. - Can be placed in a procedural block, a module, an interface or a program definition. - Can be used with both static
and dynamic verification tools.
A sample concurrent assertion: a_cc: assert property (@(posedge clk) not (a && b)); This example shows the result of concurrent assertion a_cc. All
successes are shown with an up arrow and all failures are shown with a down arrow. The key concept in this example is that the property is verified on every
positive edge of the clock, irrespective of whether or not signal a and signal b change.
Embedded concurrent assertions are another form of concurrent assertion, added in IEEE Std 1800-2009 [7]; they also use the keywords assert property but
are placed inside a clocked always process. Placing the assertion in a clocked always process allows the concurrent assertion to inherit the clocking (sampling)
signal from the always process.
Design engineers should create the low-level and simple assertions, while verification engineers should create higher-level and perhaps more complex
assertions.
Where should assertions be used? - Between modules, and between the DUT and testbench, to check communication between the modules
and stimulus constraints. - They can also be used inside individual modules to verify the design, check corner cases and verify
assumptions.
Assertions should preferably be put in a separate bindfile and NOT placed directly in the RTL code.
Bindfiles:
How bindfiles work: In general, using bindfiles is doing indirect instantiation. The engineer binds (indirectly instantiates) one module inside another module
using the bind keyword.
To create a bindfile, declare a module that will encapsulate the assertion code (and other verification code if needed). The module needs access to all of the important
signals in the enclosing file, so all of the ports and internal signals from the enclosing file are declared as inputs to the bindfile.
- The bind command includes the bind keyword followed by the DUT module name:
bind fifo1
- This is followed by a description of how the bound module would be instantiated if it were placed directly in the module being bound to:
fifo1_asserts p1 (.*);
When creating bindfiles, it is a good idea to copy the DUT module to a DUT_asserts module, keep all existing input declarations, change all output declarations to input
declarations, and declare all internal signals as input declarations to the bindfile. The bindfile will sample the port and internal signals from the DUT.
It is not required to list all of the DUT signals in the asserts file, only those signals that will be checked by assertions; however, it is highly recommended to add ALL of the DUT signals
to the asserts file because it is common to add more assertions in the future that might require previously unused DUT signals.
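A minimal sketch of this bindfile style, assuming a hypothetical fifo1 DUT with push/pop/full/empty signals (the checks themselves are illustrative):

// Hypothetical assertion module for a DUT named fifo1. All DUT ports and
// internal signals of interest are declared as inputs to the bindfile.
module fifo1_asserts (
  input logic clk, rst_n,
  input logic push, pop,
  input logic full, empty
);
  a_no_push_when_full : assert property (@(posedge clk) disable iff (!rst_n) !(push && full));
  a_no_pop_when_empty : assert property (@(posedge clk) disable iff (!rst_n) !(pop  && empty));
endmodule

// Indirect instantiation: every instance of fifo1 gets a fifo1_asserts inside it.
module fifo1_binds;
  bind fifo1 fifo1_asserts p1 (.*);
endmodule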
The SystemVerilog language provides three important benefits over Verilog. 1. Explicit design intent: SystemVerilog introduces several constructs
that allow you to explicitly state what type of logic should be generated. 2. Conciseness of expression: SystemVerilog includes commands that allow you to specify
design behavior more concisely than previously possible. 3. A high level of abstraction for design: the SystemVerilog interface construct facilitates inter-module
communication. These benefits of SystemVerilog enable you to rapidly develop your RTL code, easily maintain your code, and minimize the occurrence of situations
where the RTL code simulates differently than the synthesized netlist. SystemVerilog allows you to design at a high level of abstraction. This results in improved
code readability and portability. Advanced features such as interfaces, concise port naming, explicit hardware constructs, and special data types ease verification
challenges.
Basic Testbench Functionality: The purpose of a testbench is to determine the correctness of the design under test (DUT). The following steps
accomplish this: generate stimulus, apply stimulus to the DUT, capture the response, check for correctness, and measure progress against the overall verification goals.
Classes: System Verilog provides an object-oriented programming model. System Verilog classes support a single-inheritance model. There is no facility
that permits conformance of a class to multiple functional interfaces, such as the interface feature of Java. System Verilog classes can be type-parameterized,
providing the basic function of C++ templates. However, function templates and template specialization are not supported.
The polymorphism features are similar to those of C++: the programmer may declare a function virtual to let a derived class gain control of the
function. Encapsulation and data hiding are accomplished using the local and protected keywords, which must be applied to any item that is to be hidden. By default,
all class properties are public. System Verilog class instances are created with the new keyword. A constructor, denoted by function new, can be defined. System
Verilog supports garbage collection, so there is no facility to explicitly destroy class instances.
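A small sketch of these class features (constructor, data hiding, single inheritance and a virtual method); the Packet/ErrPacket names are hypothetical:

class Packet;
  rand bit [7:0] addr;
  local bit [31:0] key;            // 'local' hides this property from outside the class

  function new(bit [7:0] a = 0);   // constructor
    addr = a;
  endfunction

  virtual function void print();
    $display("Packet addr=%0h", addr);
  endfunction
endclass

class ErrPacket extends Packet;    // single inheritance
  virtual function void print();   // overrides the base implementation
    $display("ErrPacket addr=%0h", addr);
  endfunction
endclass

module class_demo;
  initial begin
    Packet p = new(8'h5A);         // instances are created with new
    ErrPacket e = new();
    p.print();                     // calls Packet::print
    p = e;                         // base handle pointing at a derived object
    p.print();                     // virtual dispatch calls ErrPacket::print
  end
endmodule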
Why are always blocks not allowed in a program block? In System Verilog, you can put initial blocks in a program, but not always blocks.
This is a bit different from Verilog, for the reasons below: - System Verilog programs are closer to a program in C, with one entry point, than to Verilog's many
small blocks of concurrently executing hardware. - In a design, an always block might trigger on every positive edge of a clock from the start of simulation. In System
Verilog, a testbench has the steps of initialization, stimulating and responding to the design, and then wrapping up the simulation. An always block that runs continuously
would not fit this flow.
The Interface: It is the mechanism to connect the testbench to the DUT; it is essentially a named bundle of wires (e.g. connecting two hardware blocks with
physical wires). With the help of an interface block we can add new connections easily, there are no missed connections, and port lists are compact. It also carries directional
information in the form of modports (explained in the counter example) and clocking blocks.
If the interface instance is to be used in a program block, the data type of the signals should be logic. The reason is that signals within a program block are
almost always driven within a procedural block (initial). All signals driven within a procedural block must be of type reg, the synonym of which is logic. When a signal is
declared as logic, it can also be driven by a continuous assignment statement. This added flexibility of logic is generally desirable. There is an exception to the
above recommendation: if the signal is a bidirectional signal (inout), or has multiple drivers, then the data type must be wire (or another net type).
TIP: Use the wire type in case of multiple drivers. Use the logic type in case of a single driver.
An interface cannot contain module instances, but only instances of other interfaces.
The advantages of using an interface are as follows: - An interface is ideal for design reuse. When two blocks communicate with a specified
protocol using more than two signals, consider using an interface. - The interface takes the jumble of signals that you declare over and over in every module or
program and puts it in a central location, reducing the possibility of misconnecting signals. - To add a new signal, you just have to declare it once in the interface, not
in higher-level modules, once again reducing errors. - Modports allow a module to easily tap a subset of signals from an interface. You can also specify signal
directions for additional checking.
6
Modport: This provides direction information for module interface ports and controls the use of tasks and functions within certain modules. The directions of
ports are those seen from the perspective of the module or program. - Modports do not contain vector sizes or data types (a common error), only whether the connecting
module sees a signal as an input, output, inout or ref port.
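A short sketch of an interface with modports for a hypothetical memory-style bus (signal names are illustrative); note the directions are given from the perspective of the module that uses each modport:

interface mem_if (input logic clk);
  logic [9:0]  addr;
  logic [15:0] wdata, rdata;
  logic        we;

  // No vector sizes or data types here, only directions.
  modport dut (input clk, addr, wdata, we, output rdata);
  modport tb  (input clk, rdata, output addr, wdata, we);
endinterface

// Typical hookup in the top-level harness (sketch):
//   mem_if bus (clk);
//   memory  u_mem  (bus.dut);
//   test    u_test (bus.tb);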
Module Top: This file carries the top-level image of your whole design, showing all the modules connected to it and the ports being used for the design. The
interface and test programs are instantiated here in the harness file. Looking into a top-level harness file gives a detailed picture of any design: its
functional parameters, interfaces, etc.
Descriptions of some of the intermediate blocks
Environment: Contains the instances of all the verification components, and component connectivity is also done here. The steps required for the execution of each
component are performed in this class.
Coverage - Checks the completeness of the testbench. This can be improved by the use of assertions, which help check the coverage of the testbench and
generate suitable reports of the test coverage. The concept of coverage gets more involved when we deal with functional coverage, cover groups
and cover points. With coverage points, we can generate a coverage report of the design and gauge the strength of the verification.
Transactors - A transactor performs high-level operations, such as splitting burst operations into individual commands, or implementing a sub-layer protocol in a layered protocol, like the PCI Express
Transaction Layer over the PCI Express Data Link Layer, or TCP/IP over Ethernet. It also handles the DUT configuration operations. This layer also provides the necessary
information to the coverage model about the stimulus generated. Stimulus generated in the generator is high level, e.g. a packet with good CRC, length 5 and da 8'h0.
This high-level stimulus is converted into low-level data using packing. This low-level data is just an array of bits or bytes. It creates test scenarios, tests the
functionality and identifies the transactions through the interface.
Drivers - The drivers translate the operations produced by the generator into the actual inputs for the design under verification. Generators create inputs at a high
level of abstraction, namely as transactions like a read or write operation. The drivers convert this input into actual design inputs, as defined in the specification of the
design's interface. If the generator generates a read operation, then the read task is called, in which the DUT input pin "read_write" is asserted.
Monitor - The monitor reports protocol violations and identifies all the transactions. Monitors are of two types, passive and active. Passive monitors do not drive any
signals. Active monitors can drive the DUT signals. Sometimes this is also referred to as a receiver. The monitor converts the state of the design and its outputs to a
transaction abstraction level so it can be stored in a scoreboard database to be checked later on. The monitor converts pin-level activity into high-level transactions.
Checker: The monitor only monitors the interface protocol. It doesn't check whether the data is the same as the expected data, as the interface has nothing to do
with the data. The checker converts the low-level data to high-level data and validates the data. This operation of converting low-level data to high-level data is called
unpacking, which is the reverse of the packing operation. For example, data is collected from all 15 commands of a burst operation and then converted into
raw data; all the sub-field information is extracted from the data and compared against the expected values. The comparison status is sent to the scoreboard.
The Generator, Agent, Driver, Monitor and Checker are all classes, modelled as Transactors. They are instantiated inside the Environment class. For
simplicity, the test is at the top of the hierarchy, as is the program that instantiates the Environment Class. The functional coverage definition can be put inside or
outside the Environment class.
Scoreboard: The scoreboard is used to store the expected output of the device under test. It implements the same functionality as the DUT, using higher-level
constructs. Dynamic data types and dynamic memory allocation in SystemVerilog make it easy to write scoreboards. The scoreboard is used to keep track of how
many transactions were initiated, and how many of them passed or failed.
Randomization: What to randomize? The first things you may think of are the data fields. These are the easiest to create: just call $random. The problem is
that this approach has a very low payback in terms of bugs found: you only find data-path bugs, perhaps with bit-level mistakes. The test is still inherently directed.
The challenging bugs are in the control logic. As a result, you need to randomize all decision points in your DUT. Wherever control paths diverge, randomization
increases the probability that you'll take a different path in each test case.
Difference between rand and randc? Variables in a class can be declared random using the keywords rand and randc. Dynamic and
associative arrays can be declared using the rand or randc keywords. Variables declared with the rand keyword are standard random variables; their values are uniformly
distributed over their range. Variables declared with the randc keyword are random-cyclic: they cycle through all the values of their range in a random order before repeating any value.
randc supports only bit or enumerated data types, and the size is limited.
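A minimal sketch of the difference, using a hypothetical class:

class Channel;
  rand  bit [3:0] data;   // uniformly distributed; values may repeat at any time
  randc bit [1:0] sel;    // cyclic: 0..3 each appear once, in random order, before any repeats
endclass

module rand_demo;
  initial begin
    Channel c = new();
    repeat (8) begin
      void'(c.randomize());
      $display("data=%0d sel=%0d", c.data, c.sel);
    end
  end
endmodule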
Semaphores: A semaphore allows you to control access to a resource. Semaphores can be used in a testbench when you have a resource, such as a bus, that
may have multiple requestors from inside the testbench but, as part of the physical design, can only have one driver. In System Verilog, a thread that requests a key
when one is not available always blocks.
There are three basic operations on a semaphore: we create a semaphore with one or more keys using the new
method, get one or more keys with get, and return one or more keys with put.
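A brief sketch of the three operations (new, get, put) guarding a shared bus from two competing threads; the names and delays are illustrative:

module sem_demo;
  semaphore bus_sem = new(1);        // one key: only one owner of the bus at a time

  task automatic drive_bus(int id);
    bus_sem.get(1);                  // blocks until a key is available
    $display("[%0t] thread %0d owns the bus", $time, id);
    #10;
    bus_sem.put(1);                  // return the key
  endtask

  initial fork
    drive_bus(0);
    drive_bus(1);                    // waits until thread 0 calls put()
  join
endmodule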
Mailboxes: A mailbox is a communication mechanism that allows messages to be exchanged between processes or threads. Data can be sent to a mailbox by
one process and retrieved by another. Mailbox is a built-in class that provides the following methods: - Create a mailbox: new() - Place a message in a mailbox:
put() - Try to place a message in a mailbox without blocking: try_put() - Retrieve a message from a mailbox: get() or peek() - Try to retrieve a message from a
mailbox without blocking: try_get() or try_peek() - Retrieve the number of messages in the mailbox: num().
Eg: Generator using mailboxes

task generator(int n, mailbox mbx);
  Transaction t;
  repeat (n) begin
    t = new();
    .....
    mbx.put(t);
  end
endtask
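A matching consumer sketch for the other end of the mailbox (a driver-style task), assuming the same Transaction class:

task driver(int n, mailbox mbx);
  Transaction t;
  repeat (n) begin
    mbx.get(t);      // blocks until the generator puts a transaction, then removes it
    // ... drive t onto the DUT pins here
  end
endtask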
What is coverage? Simply put, coverage is a metric we use to measure verification progress and completeness. Coverage
metrics tell us what portion of the design has been activated during simulation (that is, the controllability quality of a
testbench). More importantly, coverage metrics identify portions of the design that were never activated during simulation,
which allows us to adjust our input stimulus to improve verification.
Coverage-driven verification Coverage-driven verification is a widely used methodology to tackle the growing
complexity of ASIC designs which add new features and improve performance with every product generation. It typically
involves the following steps
1. Development of a test plan incorporating the list of features to verify.
2. Creation of a smart environment with configurable parameters, random constrained stimulus, checkers and a
coverage model to track progress.
3. Addition of assertions to catch illegal scenarios.
4. Iteratively run simulations and analyze coverage metrics (code coverage and functional coverage).
Benefits
Coverage-driven approach provides measurable success parameters through coverage metrics. This is crucial especially with
the tough schedules to meet. In addition, using constrained random stimulus eliminates the time spent creating directed tests.
Coverage Classification
The two most common ways to classify coverage metrics are by their method of creation (such as explicit versus implicit), or by their origin of source (such
as specification versus implementation).
For instance, functional coverage is one example of an explicit coverage metric, which has been manually defined and then implemented by the
engineer. In contrast, line coverage and expression coverage are two examples of implicit coverage metrics, since their definition and implementation are
automatically derived and extracted from the RTL representation.
Coverage Metrics
There are two primary forms of coverage metrics in production use in industry today and these are:
- Code Coverage Metrics (Implicit coverage)
- Functional Coverage/Assertion Coverage Metrics (Explicit coverage)
Code Coverage Metrics: One of the advantages of code coverage is that it automatically describes the degree to which the source code of
a program has been activated during testing, thus identifying structures in the source code that have not been activated during testing. One of the key benefits of
code coverage, unlike functional coverage, is that creating the structural coverage model is an automatic process. Hence, integrating code coverage into your
existing simulation flow is easy and does not require a change to either your current design or verification approach.
Limitations:
One limitation of code coverage metrics is that you might achieve 100% code coverage during your regression run, which means that your testbench
provided stimulus that activated all structures within your RTL source code, yet there are still bugs in your design. For example, the input stimulus might have
activated a line of code that contained a bug, yet the testbench did not generate the additional required stimulus that propagates the effects of the bug to some point
in the testbench where it could be detected.
Another limitation of code coverage is that it does not provide an indication on exactly what functionality defined in the specification was actually
tested. For example, you could run into a situation where you achieved 100% code coverage, and then assume you are done. Yet, there could be functionality
defined in the specification that was never tested, or even functionality that was never implemented! Code coverage metrics will not help you find these situations.
Types of Code Coverage Metrics
Toggle Coverage Toggle coverage is a code coverage metric used to measure the number of times each bit of a register or wire has toggled its value. Although
this is a relatively basic metric, many projects have a testing requirement that all ports and registers, at a minimum, must have experienced a zero-to-one and
one-to-zero transition.
In general, reviewing a toggle coverage analysis report can be overwhelming and of little value if not carefully focused. For example, toggle coverage is
often used for basic connectivity checks between IP blocks. In addition, it can be useful to know that many control structures, such as a one-hot select bus, have
been fully exercised.
Line Coverage Line coverage is a code coverage metric we use to identify which lines of our source code have been executed during simulation. A line coverage
metric report will have a count associated with each line of source code indicating the total number of times the line has executed. The line execution count value is
not only useful for identifying lines of source code that have never executed, but also useful when the engineer feels that a minimum line execution threshold is
required to achieve sufficient testing.
Line coverage analysis will often reveal that a rare condition required to activate a line of code has not occurred due to missing input stimulus.
Alternatively, line coverage analysis might reveal that the data and control flow of the source code prevented the line from executing, either due to a bug in the code, or because it is dead code that is
not currently needed under certain IP configurations. For unused or dead code, you might choose to exclude or filter this code during the coverage recording and
reporting steps, which allows you to focus only on the relevant code.
Statement Coverage Statement coverage is a code coverage metric we use to identify which statements within our source code have been executed during
simulation. In general, most engineers find that statement coverage analysis is more useful than line coverage, since a statement often spans multiple lines of source
code, or multiple statements can occur on a single line of source code.
A metrics report used for statement coverage analysis will have a count associated with each line of source code indicating the total number of times the
statement has executed. This statement execution count value is not only useful for identifying lines of source code that have never executed, but also useful when
the engineer feels that a minimum statement execution threshold is required to achieve sufficient testing.
Block Coverage Block coverage is a variant on the statement coverage metric which identifies whether a block of code has been executed or not. A block is
defined as a set of statements between conditional statements or within a procedural definition, the key point being that if the block is reached, all the lines within the
block will be executed. This metric is used to prevent unscrupulous engineers from achieving a higher statement coverage by simply adding more statements to their
code.
Branch Coverage Branch coverage (also referred to as decision coverage) is a code coverage metric that reports whether Boolean expressions tested in control
structures (such as the if, case, while, repeat, forever, for and loop statements) evaluated to both true and false. The entire Boolean expression is considered one
true-or-false predicate regardless of whether it contains logical-and or logical-or operators.
Expression Coverage Expression coverage (sometimes referred to as condition coverage) is a code coverage metric used to determine if each condition
evaluated both to true and false. A condition is a Boolean operand that does not contain logical operators. Hence, expression coverage measures the Boolean
conditions independently of each other.
Focused Expression Coverage: Focused Expression Coverage (FEC), which is also referred to as Modified Condition/Decision Coverage (MC/DC), is a code
coverage metric often used by the DO-178B safety-critical software certification standard, as well as the DO-254 formal airborne electronic hardware
certification standard. This metric is stronger than condition and decision coverage. The formal definition of MC/DC as defined by DO-178B is: Every point of entry
and exit in the program has been invoked at least once, every condition in a decision has taken all possible outcomes at least once, every decision in the program
has taken all possible outcomes at least once, and each condition in a decision has been shown to independently affect that decision's outcome. A condition is
shown to independently affect a decision's outcome by varying just that condition while holding fixed all other possible conditions. [3] It is worth noting that
completely closing Focused Expression Coverage can be non-trivial.
Finite-State Machine Coverage: Today's code coverage tools are able to identify finite state machines within the RTL source code. Hence, this makes it
possible to automatically extract FSM code coverage metrics, such as the number of times each state of the state machine was entered,
the number of times the FSM transitioned from one state to each of its neighboring states, and even sequential arc coverage to identify state visitation transitions.
There are generally three main steps involved in a code coverage flow, which include:
Instrument the RTL code to gather coverage
Run simulation to capture and record coverage metrics
Report and analyze the coverage results
Part of the analysis step is to identify coverage holes, and determine if the coverage hole is due to one of three
conditions:
Missing input stimulus required to activate the uncovered code
A bug in the design (or testbench) that is preventing the input stimulus from activating the uncovered code
Unused code for certain IP configurations or expected unreachable code related during normal operating conditions
Functional Coverage Metrics
The objective of functional verification is to determine if the design requirements, as defined in our specification, are functioning as intended. The objective of
measuring functional coverage is to measure verification progress with respect to the functional requirements of the design. That is, functional coverage helps us
answer the question: Have all specified functional requirements been implemented, and then exercised during simulation?
Benefit:
One of the value propositions of constrained-random stimulus generation is that the simulation environment can automatically generate thousands of tests that would
have normally required a significant amount of manual effort to create as directed tests. However, one of the problems with constrained-random stimulus generation
is that you never know exactly what functionality has been tested without the tedious effort of examining waveforms after a simulation run. Hence, functional
coverage was invented as a measurement to help determine exactly what functionality a simulation regression tested without the need for visual inspection of
waveforms.
For example, functional coverage can be implemented with a mechanism that links to specific requirements defined in a specification. Then, after a
simulation run, it is possible to automatically measure which requirements were checked by a specific directed or constrained-random test as well as automatically
determine which requirements were never tested.
Limitations:
Since functional coverage is not an implicit coverage metric, it cannot be automatically extracted. Hence, this requires the user to manually create the coverage
model. From a high-level, there are two different steps involved in creating a functional coverage model that need to be considered:
1. Identify the functionality or design intent that you want to measure: addressed through verification planning
2. Implementing the machinery to measure the functionality or design intent: coding the machinery for each of the coverage items identified in the verification
planning step (for example, coding a set of SystemVerilog covergroups for each verification objective identified in the verification plan).
Scoreboard and Functional Coverage: The main goal of a verification environment is to reach 100% coverage of the functional coverage spec defined
in the verification plan. Based on functional coverage analysis, the random tests are then constrained to focus on corner cases to complete the functional
check. Coverage is a generic term for measuring progress toward completing design verification. Simulations slowly paint the canvas of the design as we try to cover all of
the legal combinations. The coverage tools gather information during a simulation and then post-process it to produce a coverage report. You can use this report to
look for coverage holes and then modify existing tests or create new ones to fill the holes.
Types of Functional Coverage Metrics The functional behavior of any design, at least as observed from any interface within the
verification environment, consists of both data and temporal components. Hence, from a high-level, there are two main types of functional coverage measurement
we need to consider: Cover Groups and Cover Properties.
Cover Group: A covergroup is like a user-defined type that encapsulates and specifies the coverage. It can be defined in a package, module, program,
interface or class. Once defined, multiple instances can be created using new; parameters to new() enable customization of different instances. In all cases, we must
explicitly instantiate it to start sampling. If the cover group is defined in a class, you do not give it a separate name when you instance it. A cover group comprises
cover points, options, formal arguments, and an optional trigger. A cover group encompasses one or more data points, all of which are sampled at the same time.
The two major parts of functional coverage are the sampled data values and the time when they are sampled. When new values are ready (such as when
a transaction has completed), your testbench triggers the cover group.
To calculate the coverage for a point, you first have to determine the total number of possible values, also known as the domain. There may be one value
per bin or multiple values. Coverage is the number of sampled values divided by the number of bins in the domain. A cover point that is a 3-bit variable has the
domain 0:7 and is normally divided into eight bins. If, during simulation, values belonging to seven bins are sampled, the report will show 7/8 or 87.5% coverage for
this point. All these points are combined to show the coverage for the entire group, and then all the groups are combined to give a coverage percentage for all the
simulation databases.
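A sketch of a covergroup matching the 3-bit example above (class and signal names are hypothetical): the coverpoint is auto-binned into eight bins, one per value, and sample() is called when a new value is ready.

class Monitor3;
  bit [2:0] port;                  // 3-bit value: domain 0..7

  covergroup cg_port;
    cp_port: coverpoint port;      // eight automatic bins, one per value
  endgroup

  function new();
    cg_port = new();               // must be instantiated before it can sample
  endfunction

  function void observe(bit [2:0] p);
    port = p;
    cg_port.sample();              // e.g. called when a transaction completes
  endfunction
endclass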
With respect to functional coverage, the sampling of state values within a design model or on an interface is probably the easiest to understand. We refer
to this form of functional coverage as cover group modeling. It consists of state values observed on buses, grouping of interface control signals, as well as registers.
The point is that the values that are being measured occur at a single explicitly or implicitly sampled point in time. SystemVerilog covergroups are part of the
machinery we typically use to build the functional data coverage models, and the details are discussed in the block level design example and the discussion of the
corresponding example covergroup implementations.
Assertion Coverage
The term assertion coverage has many meanings in the industry today. For example, some people define assertion coverage as the ratio of number of assertions to
RTL lines of code. However, assertion density is a more accurate term that is often used for this metric. For our discussion, we use the term assertion coverage to
describe an implementation of coverage properties using assertions.
Cross Coverage
Cross Coverage is specified between the cover points or variables. Cross coverage is specified using the cross construct.
Expressions cannot be used directly in a cross; a coverage point must be explicitly defined first.
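A short sketch of a cross between two explicitly defined coverpoints (the cmd/len signals and bins are illustrative):

module cross_cov_demo (input logic clk, input logic [1:0] cmd, input logic [4:0] len);
  covergroup cg_bus @(posedge clk);
    cp_cmd  : coverpoint cmd;                    // one automatic bin per command encoding
    cp_len  : coverpoint len {
                bins small = {[1:4]};
                bins large = {[5:16]};
              }
    cmd_len : cross cp_cmd, cp_len;              // every cmd bin against every len bin
  endgroup
  cg_bus cg = new();                             // instantiate to start sampling
endmodule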
CONSTRAINTS
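A minimal sketch of common SystemVerilog constraint constructs, using a hypothetical packet class (the fields, ranges and weights are illustrative):

class Pkt;
  rand bit [7:0]  len;
  rand bit [3:0]  kind;
  rand bit [31:0] addr;

  constraint c_len  { len inside {[1:64]}; }                 // range constraint
  constraint c_kind { kind dist { 0 := 60, [1:3] := 40 }; }  // weighted distribution
  constraint c_addr { addr[1:0] == 2'b00; }                  // word-aligned addresses
  constraint c_rel  { (kind == 0) -> (len <= 16); }          // implication between fields
endclass

module constraint_demo;
  initial begin
    Pkt p = new();
    if (!p.randomize() with { len > 8; })                    // in-line constraint at randomize time
      $error("randomization failed");
    $display("len=%0d kind=%0d addr=%0h", p.len, p.kind, p.addr);
  end
endmodule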
Clocking blocks have been introduced in SystemVerilog to address the problem of specifying the timing and synchronisation requirements of a
design in a testbench.
A clocking block is a set of signals synchronised on a particular clock. It basically separates the time related details from the structural, functional
and procedural elements of a testbench. It helps the designer develop testbenches in terms of transactions and cycles. Clocking blocks can only be
declared inside a module, interface or program.
The clocking construct is both the declaration and the instance of that declaration. Note that the signal directions in the clocking block within the testbench
are with respect to the testbench. So Q is an output of COUNTER, but a clocking input. Note also that widths are not declared in the clocking block, just the
directions.
The signals in the clocking block cb_counter are synchronised on the posedge of Clock, and by default all signals have a 4ns output (drive)
skew and a #1step input (sample) skew. The skew determines how many time units away from the clock event a signal is to be sampled or driven. Input
skews are implicitly negative (i.e. they always refer to a time before the clock), whereas output skews always refer to a time after the clock.
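A sketch of a clocking block along the lines described above (the counter-style signals, widths and skews are illustrative):

program counter_tb (input  logic Clock,
                    output logic Load, Enable,
                    output logic [3:0] Data,
                    input  logic [3:0] Q);
  clocking cb_counter @(posedge Clock);
    default input #1step output #4ns;  // sample just before the edge, drive 4 ns after it
    output Load, Enable, Data;         // driven by the testbench
    input  Q;                          // an output of the counter, but a clocking input
  endclocking

  initial begin
    cb_counter.Load <= 1'b1;           // synchronous drive through the clocking block
    @(cb_counter);                     // wait for the next clocking event
    $display("Q = %0d", cb_counter.Q); // value sampled just before the clock edge
  end
endprogram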
Clocking Block Drives
Clocking block outputs and inouts can be used to drive values onto their corresponding signals, at a certain clocking event and with the specified skew. An important
point to note is that a drive does not change the clocking block input of an inout signal. This is because reading the input always yields the last sampled value, and not
the driven value.
Synchronous signal drives are processed as nonblocking assignments. If multiple synchronous drives are applied to the same clocking block output or inout at the
same simulation time, a run-time error is issued and the conflicting bits are set to X for 4-state ports or 0 for 2-state ports.
Here are some examples using the driving signals from the clocking block cb:
The interface signals will have the same direction as specified in the clocking block when viewed from the testbench side (e.g. modport TestR), and reversed when
viewed from the DUT (i.e. modport Ram). The signal directions in the clocking block within the testbench are with respect to the testbench, while a modport declaration
can describe either direction (i.e. the testbench or the design under test).
Assertions are primarily used to validate the behaviour of a design. ("Is it working correctly?") They may also be used to provide functional
coverage information for a design ("How good is the test?"). Assertions can be checked dynamically by simulation, or statically by a separate property checker
tool, i.e. a formal verification tool that proves whether or not a design meets its specification. Such tools may require certain assumptions about the design's
behaviour to be specified.
In SystemVerilog there are two kinds of assertions: immediate (assert) and concurrent (assert property). Coverage statements (cover
property) are concurrent and have the same syntax as concurrent assertions, as do assume property statements. Another similar statement,
expect, is used in testbenches; it is a procedural statement that checks that some specified activity occurs. The three types of concurrent assertion
statement and the expect statement make use of sequences and properties that describe the design's temporal behaviour, i.e. behaviour over time, as
defined by one or more clocks.
Immediate Assertions
Immediate assertions are procedural statements and are mainly used in simulation. An assertion is basically a statement that something must be true, similar
to the if statement. The difference is that an if statement does not assert that an expression is true, it simply checks that it is true.
If the conditional expression of the immediate assert evaluates to X, Z or 0, then the assertion fails and the simulator writes an error message.
An immediate assertion may include a pass statement and/or a fail statement. If the pass statement is omitted, no action is taken when
the assert expression is true. If the pass statement exists,
it is executed immediately after the evaluation of the assert expression. The statement associated with an else is called a fail statement and is executed
if the assertion fails.
Note that you can omit the pass statement and still have a fail statement.
The failure of an assertion has a severity associated with it. There are three severity system tasks that can be included in the fail statement to specify a
severity level: $fatal, $error (the default severity) and $warning. In addition, the system task $info indicates that the assertion failure carries no
specific severity.
The pass and fail statements can be any legal SystemVerilog procedural statement. They can be used, for example, to write out a message, set an error
flag, increment a count of errors, or signal a failure to another part of the testbench.
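A sketch pulling these pieces together: immediate assertions with fail statements and explicit severities (the signals and conditions are illustrative):

module imm_assert_demo (input logic req, valid, input logic [3:0] grant);
  always_comb begin
    // Fail statement only; the default severity is $error.
    a_req_valid: assert (!(req && !valid))
      else $error("req asserted while valid is low");

    // Fail statement with a lower, explicit severity.
    a_grant_onehot: assert ($onehot0(grant))
      else $warning("grant is not one-hot or idle: %b", grant);
  end
endmodule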
Concurrent Assertions
The behaviour of a design may be specified using statements similar to these:
"The Read and Write signals should never be asserted together."
"A Request should be followed by an Acknowledge occurring no more than two clocks after the Request is asserted."
Concurrent assertions are used to check behaviour such as this. These are statements that assert that specified properties must be true. For example,
assert property (!(Read && Write));
asserts that the expression Read && Write is never true at any point during simulation.
Properties are built using sequences. For example,
assert property (@(posedge Clock) Req |-> ##[1:2] Ack);
where Req is a simple sequence (it's just a boolean expression) and ##[1:2] Ack is a more complex sequence expression, meaning that Ack is true on
the next clock, or on the one following (or both). |-> is the implication operator, so this assertion checks that whenever Req is asserted, Ack must be
asserted on the next clock, or the following clock.
Concurrent assertions like these are checked throughout simulation. They usually appear outside any initial or always blocks in modules, interfaces and
programs. (Concurrent assertions may also be used as statements in initial or always blocks. A concurrent assertion in an initial block is only tested on the
first clock tick.)
The first assertion example above does not contain a clock. Therefore it is checked at every point in the simulation. The second assertion is only checked
when a rising clock edge has occurred; the values of Req and Ack are sampled on the rising edge of Clock.
Implication
The implication construct (|->) allows a user to monitor sequences based on satisfying some criteria, e.g. attach a precondition to a sequence and evaluate
the sequence only if the condition is successful. The left-hand side operand of the implication is called the antecedent sequence expression, while the right-
hand side is called the consequent sequence expression.
If there is no match of the antecedent sequence expression, implication succeeds vacuously by returning true. If there is a match, for each successful match
of the antecedent sequence expression, the consequent sequence expression is separately evaluated, beginning at the end point of the match.
There are two forms of implication: overlapped using operator |->, and non-overlapped using operator |=>.
For overlapped implication, if there is a match for the antecedent sequence expression, then the first element of the consequent sequence expression is
evaluated on the same clock tick.
s1 |-> s2;
In the example above, if the sequence s1 matches, then sequence s2 must also match. If sequence s1 does not match, then the result is true.
For non-overlapped implication, the first element of the consequent sequence expression is evaluated on the next clock tick.
s1 |=> s2;
This is equivalent to:
s1 ##1 `true |-> s2
`define true 1
where `true is a boolean expression, used for visual clarity, that always evaluates to true.
Properties may be declared separately and referred to by name, for example:
property not_read_and_write;
  not (Read && Write);
endproperty
Complex properties are often built using sequences. Sequences, too, may be declared separately:
sequence request;
  Req;
endsequence
sequence acknowledge;
  ##[1:2] Ack;
endsequence
property handshake;
  @(posedge Clock) request |-> acknowledge;
endproperty
Assertion Clocking
Concurrent assertions (assert property and cover property statements) use a generalised model of a clock and are only evaluated when a clock
tick occurs. (In fact the values of the variables in the property are sampled right at the end of the previous time step.) Everything in between clock ticks is
ignored. This model of execution corresponds to the way an RTL description of a design is interpreted after synthesis.
A clock tick is an atomic moment in time and a clock ticks only once at any simulation time. The clock can actually be a single signal, a gated clock (e.g. (clk
&& GatingSig)) or other more complex expression. When monitoring asynchronous signals, a simulation time step corresponds to a clock tick.
The clock for a property can be specified in several ways:
o Explicitly specified in a sequence:
sequence s;
  @(posedge clk) a ##1 b;
endsequence
property p;
  a |-> s;
endproperty
o Explicitly specified in the property:
property p;
  @(posedge clk) a ##1 b;
endproperty
o Inferred from a procedural block:
property p;
  a ##1 b;
endproperty
always @(posedge clk) assert property (p);
o From a default clocking block:
default clocking cb @(posedge clk);
endclocking
property p;
  a ##1 b;
endproperty
Handling Asynchronous Resets
In the following example, the disable iff clause allows an asynchronous reset to be specified.
property p1;
  @(posedge clk) disable iff (Reset) not (b ##1 c);
endproperty
The not negates the result of the sequence following it. So, this assertion means that if Reset becomes true at any time during the evaluation of the
sequence, then the attempt for p1 is a success. Otherwise, the sequence b ##1 c must never evaluate to true.
Sequences
A sequence is a list of boolean expressions in a linear order of increasing time. The sequence is true over time if the boolean expressions are true at the
specific clock ticks. The expressions used in sequences are interpreted in the same way as the condition of a procedural if statement.
Here are some simple examples of sequences. The ## operator delays execution by the specified number of clocking events, or clock cycles.
The [*] operator is used to specify a consecutive repetition of the left-hand side operand.
The $ token can be used to extend a time window to a finite, but unbounded, range.
The [->] or goto repetition operator specifies a repetition over not-necessarily-consecutive clock ticks, for example a ##1 b [->1:3] ##1 c.
This means a is followed by any number of clocks where c is false, and b is true between one and three times, the last time being the clock before c is true.
The [=] or non-consecutive repetition operator is similar to goto repetition, but the expression (b in this example) need not be true in the clock cycle before c is
true.
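A few hedged examples of the operators just described (the signals a, b and c are illustrative):

module seq_ops_demo (input logic clk, a, b, c);
  sequence s_delay;     a ##2 b;                endsequence  // b two clocks after a
  sequence s_repeat;    a ##1 b [*3] ##1 c;     endsequence  // b on three consecutive clocks
  sequence s_unbound;   a ##[1:$] b;            endsequence  // b eventually follows a
  sequence s_goto;      a ##1 b [->1:3] ##1 c;  endsequence  // b one to three times, last b just before c
  sequence s_nonconsec; a ##1 b [=1:3] ##1 c;   endsequence  // like goto, but b need not immediately precede c

  c_goto: cover property (@(posedge clk) s_goto);            // count matches of one of the sequences
endmodule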
Combining Sequences: There are several operators that can be used with sequences:
The binary operator and is used when both operand expressions are expected to succeed, but the end times of the operand expressions can be different.
The end time of the and operation is the end time of the operand sequence that terminates last. A sequence succeeds (i.e. is true over time) if the boolean expressions
it contains are true at the specific clock ticks.
If s1 and s2 are sampled booleans and not sequences, the expression s1 and s2 succeeds if both s1 and s2 evaluate to true.
The binary operator intersect is used when both operand expressions are expected to succeed, and the end times of the operand expressions must be
the same.
The operator or is used when at least one of the two operand sequences is expected to match. The sequence matches whenever at least one of the operands
is evaluated to true.
The first_match operator matches only the first of possibly multiple matches for an evaluation attempt of a sequence expression. This allows all
subsequent matches to be discarded from consideration. In this example:
sequence fms;
  first_match(s1 ##[1:2] s2);
endsequence
whichever of (s1 ##1 s2) and (s1 ##2 s2) matches first becomes the result of sequence fms.
The throughout construct is an abbreviation for writing (Expression) [*0:$] intersect SequenceExpr,
i.e. Expression throughout SequenceExpr means that Expression must evaluate true at every clock tick during the evaluation
of SequenceExpr.
The within construct is an abbreviation for writing (1 [*0:$] ##1 SeqExpr1 ##1 1 [*0:$]) intersect SeqExpr2,
i.e. SequenceExpr1 within SequenceExpr2 means that SeqExpr1 must occur at least once entirely within SeqExpr2 (both the start and end points
of SeqExpr1 must be between the start and the end point of SeqExpr2).
`define true 1
property p_pipe;
logic v;
@(posedge clk) (`true,v=DataIn) ##5 (DataOut === v);
endproperty
In this example, the variable v is assigned the value of DataIn unconditionally on each clock. Five clocks later, DataOut is expected to equal the assigned
value. Each invocation of the property (here there is one invocation on every clock) has its own copy of v. Notice the syntax: the assignment to v is separated
from a sequence expression by a comma, and the sequence expression and variable assignment are enclosed in parentheses.
Coverage Statements
In order to monitor sequences and other behavioural aspects of a design for functional coverage, cover property statements can be used. The syntax of
these is the same as that of assert property. The simulator keeps a count of the number of times the property in the cover property statement holds or
fails. This can be used to determine whether or not certain aspects of the design's functionality have been exercised.
CovLabel: cover property (s1);
...
endmodule
SystemVerilog also includes covergroup statements for specifying functional coverage. These are introduced in the Constrained-Random Verification
Tutorial.
assert property
(@(posedge clk) $rose(in) |=> detect);
asserts that if in changes from 0 to 1 between one rising clock and the next, detect must be 1 on the following clock.
This assertion,
assert property
(@(posedge clk) enable == 0 |=> $stable(data));
asserts that whenever enable is 0 at a rising clock edge, data must remain unchanged (stable) on the following clock.
Binding
We have seen that assertions can be included directly in the source code of the modules in which they apply. They can even be embedded in procedural
code. Alternatively, verification code can be written in a separate program, for example, and that program can then be bound to a specific module or module
instance.
For example, suppose there is a module for which assertions are to be written:
module M (...);
// The design is modelled here
endmodule
The properties, sequences and assertions for the module can be written in a separate program:
program M_assertions(...);
// sequences, properties, assertions for M go here
endprogram
The syntax and meaning of M_assertions is the same as if the program were instanced in the module itself:
module M (...);
// The design is modelled here
M_assertions M_assertions_inst (...);
endmodule
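The actual attachment is then done with a single bind directive (port list elided here, as in the instance above), so the source of M itself is never edited:

bind M M_assertions M_assertions_inst (...);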
Universal Verification Methodology (UVM)
What is UVM?
UVM refers to the Universal Verification Methodology, introduced by Accellera and based on the Open Verification Methodology (OVM). It
is a methodology for performing functional verification through a supporting library of System Verilog code.
What are benefits of using UVM?
UVM offers a complete verification environment composed of reusable components and is part of a constrained-
random, coverage-driven methodology. In contrast, traditional HDL-based testbenches might wiggle a few input pins and rely on
manual inspection for checking correct operation. Even if they are automated, they do not offer a quantifiable way to
determine verification progress. Given the complexity of current designs, a completely random approach is also not reasonable to
meet the tight schedules.
UVM leverages the object oriented capabilities of System Verilog such as classes, constraints and covergroups to ease the
difficulties in verifying a complex design.
UVM is primarily simulation based. However, it can also be used alongside assertion, emulation or hardware acceleration based
approaches. The other approaches typically use a Verilog, System Verilog or System C language at abstraction levels such as
behavioral, gate level or register transfer level.
Another benefit of TBV (transaction-based verification, used with emulation) is that it allows the testbench to stream data to the DUT, which the transactor buffers
automatically. This further speeds up the execution of the testbench.
o With this methodology, it is possible to have multiple transactions active across multiple transactors
o Together, these transactors enable the emulator to process data continuously, which dramatically
increases overall performance to that of a pure ICE environment.
A point to note here is that in TBV, the back-end portion of the transactor and the DUT are located within the emulator. This
mandates that they both be written in synthesizable RTL.
Where is TBV used?
TBV can be used throughout the verification flow, from unit (block) level to SoC level. Common applications include:
Verification of large blocks, subsystems or entire SoCs
Driver development
Early hardware/software bring-up (this includes firmware, drivers, and OSs)
Full-chip power analysis and estimation
Arbiter verification
An arbiter is a commonly used design in circuits to control the access to a shared resource among multiple clients.
SOURCE: http://rtlery.com/sites/default/files/queueing_fifos_and_arbiter.png
Arbitration policies
Round Robin: This policy is generally used to improve fairness, i.e., giving all clients a good chance of being serviced
by the shared resource. A particular client will not be considered for arbitration again if it has just been serviced and
other clients have outstanding requests.
Priority: This policy guarantees that the important clients run first when the latency or application requirements are
known.
First Come First Serve (FCFS): This is a variation of the priority policy where priority is granted to the client that made
its request first.
Scenarios to verify
Apart from functionally verifying the arbiter algorithm stand-alone, the arbiter should also be verified at the application level by writing
assertions (see the sketch below). Adding assertions also ensures that the application requirements are met in terms of fairness and performance.
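A minimal SVA sketch of such checks for a hypothetical two-client arbiter, placed inside the arbiter module or a checker bound to it (the signal names req[1:0], gnt[1:0], clk, rst_n and the 8-cycle fairness bound are assumptions, not part of the original question):
// A grant is only ever given to a requesting client
assert property (@(posedge clk) disable iff (!rst_n) gnt[0] |-> req[0]);
assert property (@(posedge clk) disable iff (!rst_n) gnt[1] |-> req[1]);
// At most one client is granted at a time
assert property (@(posedge clk) disable iff (!rst_n) $onehot0(gnt));
// Fairness/performance: a pending request is granted within 8 cycles
assert property (@(posedge clk) disable iff (!rst_n) req[1] |-> ##[1:8] gnt[1]);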
What is REGISTER RENAMING?
Register renaming is a technique deployed in Out-Of-Order Processors (OOO). It eliminates the false data dependencies arising
from the reuse of architectural registers by successive instructions that do not have any real data dependencies between them.
Why use register renaming?
As mentioned earlier, it eliminates false (WAR and WAW) dependencies, i.e., the Write-After-Read and Write-After-Write hazards discussed in the data-hazards section below.
How is register renaming implemented?
When possible, the compiler detects such independent instructions and tries to assign them to different registers. However, there is only a
finite number of register names available in the assembly code. Many high-performance CPUs have more physical
registers than can be named directly in the instruction set, so they rename registers in hardware to expose additional
parallelism. In all renaming schemes, the machine converts the architectural registers referenced in the instruction stream into
tags. Where the architectural registers might be specified by 3 to 5 bits, the tags are usually 6 to 8 bit numbers. Because the
size of a register file generally grows as the square of the number of ports, the rename file is usually physically large and
consumes significant power.
Superscalar and VLIW processors: In VLIW processors, the decision is made by the compiler, which groups independent
instructions (words) into a Very Long Instruction Word. The onus is on the compiler to find and schedule independent
instructions for parallel execution. Therefore, the hardware implementation is simplified, leading to lower power consumption.
For superscalar processors, the decision is made by the hardware at run time, so the hardware implementation is more complicated.
Superscalar processors have multiple functional units to execute the same types of instructions in parallel.
Example: 4 adders can execute 4 addition instructions in parallel.
Disadvantages of VLIW:
It requires compiler support and extensive compiler effort to make the best use of the hardware.
It requires the compiler to perform branch prediction (speculation) and to add recovery code.
VLIW compilers can cause code bloat when there is a lot of dependence among the instructions, since this leads to
functional units executing NOPs.
PIPELINING and its Pros and Cons
What is pipelining?
In general, pipelining refers to a set of data processing elements connected in series, where the output of one element is the
input of the next; the stages operate concurrently, each working on a different item.
Questions:
What are the primary types of Register Dependencies/Data Hazards in a Pipelined System?
What are the remedies for each of these hazards?
Which type of data hazard is commonly observed in an in-order pipeline?
Here, we will use the term Read for a Load and Write for a Store operation.
1. Read After Write (RAW): This is observed when a read depends on the result of an earlier write that has not yet completed. It is the
hazard commonly observed even in an in-order pipeline and is also known as a True or Flow dependence.
pseudo-Assembly example:
1. R1 <- 10
2. R2 <- R1
Solution: In-order execution of the code sequence prevents this hazard from producing wrong results, but it typically stalls the
pipeline for a few clocks until the write result is available. Data Forwarding (Bypassing) is therefore the usual optimization for this hazard.
2. Write After Write (WAW): This is observed when two instructions write to the same register/address in close succession. It is also known
as an Output Dependence.
pseudo-Assembly example:
1. R1 <- 10
2. R1 <- 11
Solution: Squashing the earlier write (keeping only the updated value) in a structure like a write/store buffer is a micro-architectural optimization that
prevents multiple stores to the same address. Register Renaming, a very efficient technique employed in modern Out-of-Order systems, can also be
used to remove this hazard.
3. Write After Read (WAR): This is observed when a write to a register/location follows an earlier read of the same location; the write must wait
until the old value has been read. It is also known as an Anti Dependence (a false dependence).
pseudo-Assembly example:
1. R2 <- R1
2. R1 <- 10
Solution: Similar to WAW hazards, Register Renaming can also be used.
Today's processors use mechanisms built around load/store queues to resolve ambiguous memory dependences and to recover when
a dependence has been violated (see the register-renaming discussion above for background on dependences in out-of-order processors).
Solution
When a load that violated a memory dependence reaches the retirement point, the processor flushes the pipeline and restarts execution
from the load instruction. At this point, all previous stores have committed their values to the memory system, so the
load instruction now reads the correct value from the memory system, and any dependent instructions re-execute
using the correct value.
Instruction-Level Parallelism
What is Instruction-level parallelism (ILP)? A measure of how many instructions in a program a processor can execute simultaneously.
What are the approaches to instruction level parallelism?
Hardware
o Also known as dynamic parallelism
o Processor decides which instructions to execute in parallel at run time
o The Pentium processor implements dynamic parallelism
Software
o Also known as static parallelism
o Compiler decides which instructions to execute in parallel at compile time
o The Itanium processor implements static parallelism
ILP Example:
e = a + b //1
f = c + d //2
m = e * f //3
The result of instruction 3 cannot be calculated until instructions 1 and 2 have completed, since it depends on both of them. In contrast, instructions 1 and 2 do
not depend on any other operation and can therefore be executed in parallel.
Translation Lookaside Buffer
The CPU accesses the main memory to do a Page Table Walk (PTW) in case of a TLB miss.
In case of a TLB miss, the best case is that the desired translation is present in the page table, but the virtual-to-
physical translation entry is simply not in the TLB. In this case, all that needs to be done is to look up the main-memory page table to find
the requested translation and insert it into the TLB.
However, the worst case is that the page walk does not find a valid entry in the main-memory page table, which leads to a page
fault: the page does not exist in memory and must first be brought in by an I/O read operation
from disk. After that, the page table is updated with a Page Table Entry (PTE) reflecting the new page that has just been
brought into memory. The faulting memory operation that originally led to the TLB miss is then retried; it misses the TLB
again, but this time the Page Table Walk brings the entry into the TLB, finally resulting in a TLB hit.
To do a Page Table Walk, the CPU first reads the Page Table Base Register (PTBR) (the CR3 register on x86, for instance) to find the
starting address of the Page Table, and then indexes the Page Table using the Virtual Page Number extracted from the virtual address
(the page offset is later concatenated with the physical frame number to form the physical address).
Due to the latency involved in accessing a lower level of the memory hierarchy (DRAM or disk), these operations are time consuming,
so a well-functioning TLB is of prime importance.
This sequence of operations also shows that a TLB miss can be more expensive than an instruction or data cache miss, because it
requires not just a load from main memory but a full page walk, which itself requires several loads.
Multiple TLBs
With hardware-managed TLBs, the CPU itself walks the page tables.
In case of a page fault, the CPU raises a page-fault exception, which the operating system must handle.
With a hardware-managed TLB, the format of the TLB entries is not visible to software, and can change from CPU to CPU without
causing loss of compatibility for the programs.
Tomasulo's Algorithm
Key features: The following are the key features of Tomasulo's Algorithm: Reservation Stations, a Common Data Bus, distributed hazard detection and execution
control, and dynamic memory disambiguation.
Reservation Stations (RS)
Buffers in front of the functional units that hold instructions stalled on RAW hazards while they wait for their operands to become available.
Source operands can be values, or the names of the reservation station entries or load queue entries (in case of a memory read) that will produce the value.
o Both operands don't have to be available at the same time.
o When both operand values have been computed, an instruction can be dispatched to its functional unit.
RAW hazards eliminated by forwarding
o Source operand values that are produced after the registers are read are identified by the functional unit or load queue entry that will produce
them.
o Results are immediately forwarded to functional units on the common data bus.
o Consumers don't have to wait for the value to be written into the register file.
WAR and WAW hazards eliminated by using register renaming
o Name-dependent instructions refer to reservation station or load queue locations for their sources, not the registers (as above)
o The last writer to the register updates it
o More reservation stations than registers, so eliminates more name dependences than a compiler can & exploits more parallelism
Common Data Bus (CDB)
Connects functional units and the load queue to reservation stations, registers and the store queue.
Ships results to all hardware that could want an updated value.
Eliminates RAW hazards: consumers do not have to wait for registers to be written before obtaining a value.
Distributed hazard detection and execution control
Each reservation station decides when to dispatch instructions to its functional unit.
Each hardware data structure entry that needs a value from the common data bus grabs the value itself by snooping.
Reservation stations, store queue entries and registers have a tag saying where their data should come from.
When that tag matches the data producer's tag on the bus, the reservation stations, store queue entries and registers grab the data.
Dynamic memory disambiguation
The issue:
o Don't want loads to bypass stores to the same location.
The solution:
o Loads associatively check their addresses against the store queue.
o If an address matches, the load grabs the value directly from the store queue.
Tomasulo execution stages
Tomasulo's algorithm works in three stages, assuming the instruction has already been fetched: Issue, Execute and Write Result. With the addition of a Re-Order Buffer (ROB), a
fourth stage called Commit is added.
Issue
o Issue if no hazard; stall if hazard.
o Read registers for source operands.
Put into reservation stations if values are in them.
If not, put tag of the producing functional unit or load queue.
(renaming the registers to eliminate WAR and WAW hazards)
Execute
o Detect RAW hazards.
o Snoop on CDB for missing operands.
o Dispatch instruction to a functional unit when both operand values are ready.
o Execute the operation.
o Calculate effective address and start memory operation (load/store).
Write Result
o Broadcast result and reservation station tag (ID) on the CDB.
o Reservation stations, registers and store queue entries obtain the
value through snooping.
Advantages of Tomasulo's algorithm compared to Scoreboarding
Register renaming in hardware.
Reservation stations for all execution units.
A Common Data Bus (CDB) on which computed values are broadcast to all reservation stations that may need them.
These features avoid unnecessary stalls that would occur with scoreboarding and thus allow more efficient parallel execution and better performance
than scoreboarding.
CACHES
Tradeoffs
There are multifaceted tradeoffs considered while designing caches.
1. Caches are based on SRAM technology. SRAMs are much faster than DRAMs; however, their disadvantages include:
1. Lower density compared to DRAMs, owing to the use of ~6 transistors per bit compared to the 1 transistor per bit used in DRAM.
2. As a result of the lower density, the per-bit storage cost is higher in SRAMs compared to DRAMs.
2. Large caches provide a higher hit rate. However, as the size of the cache increases, the latency of the access circuits (comparators) increases drastically.
3. Multi-level caching helps improve the hit rate further, but it can be slower: if the data is not found in the first level, the next level must be accessed, which adds latency.
4. Typical scenarios where caches do not improve performance:
1. On every program change, the stale data of the previous program has to be flushed. You have to go through a phase of cold misses every time a fresh
program is loaded for execution.
2. If a particular workload has little locality, such as a streaming application (say, streaming a YouTube video online, where previously read data is rarely
re-read by seeking backwards). In this case, the benefit of the cache drops and the AMAT (Average Memory Access Time) increases drastically, because
cache content is rarely re-used and everything brought into the cache is eventually evicted without being re-read.
What are the types of Cache Misses? (4 Cs) In modern High Performance Computer Architecture (HPCA) literature, cache optimizations play a very
important role. Caches are another layer of faster (SRAM) memory added to speed up memory operations, which are traditionally bottlenecked by main
memory (DRAM), also known as the lower-level memory. The four Cs are compulsory (cold) misses, capacity misses, conflict misses and coherence misses.
What are Snooping and Directory Based Cache Coherency Protocols? Cache coherency is of prime importance in modern CPU design. There
are two types of cache coherency protocols, with a trade-off between them in terms of implementation complexity and scalability: snooping protocols
broadcast coherence traffic on a shared bus that every cache observes (simple, but limited scalability), while directory-based protocols keep a directory
that tracks which caches hold each block and send point-to-point messages only to the sharers (more complex, but scales to larger systems).
Types of Caches based on Construction
Direct-mapped caches
Set-associative caches
A memory location can be cached in any of the n ways (or slots) of the set it maps to.
Typically, the least significant bits of the memory location's block address are used as the set index into the cache; for a 2-way set-associative cache there are two entries for each
index (the tag/index/offset split is sketched after this list).
LRU is typically used as the replacement policy; for a 2-way cache it is especially simple, since only one bit needs to be stored per pair of entries.
Fully-associative caches
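As a small, hypothetical illustration of the tag/index/offset split used by direct-mapped and set-associative caches (the 32-bit address, 64-byte line and 256-set parameters are assumptions, not from the original notes):
module cache_addr_split (
  input  logic [31:0] addr,      // 32-bit address (assumption)
  output logic [5:0]  offset,    // byte offset within a 64-byte line
  output logic [7:0]  index,     // selects one of 256 sets
  output logic [17:0] tag        // remaining upper bits, compared on lookup
);
  assign offset = addr[5:0];
  assign index  = addr[13:6];
  assign tag    = addr[31:14];
endmodule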
Advantages of prefetching
1. Reducing effective latency
2. Improving resource utilization
3. Higher confidence of prefetch usage (depending on the workload)
Generally, prefetchers learn a pattern in the way the current workload accesses data by applying dynamic learning policies; the locality
and access footprint of the workload therefore train the prefetcher. Modern CPUs typically have prefetchers at each cache level of the memory
hierarchy. As an additional note, prefetching improves the latency for both instruction and data caches.
Interview Question
Design an L2 prefetcher for the below specification.
Inputs: Current PC, Valid, Hit
Outputs: Prefetch Address, Valid
Solution: If there is a Hit for the PC being sent into the block, we can send out PC+4 (next line) and PC+8 (next-to-next line). Otherwise, send out just PC+4 (next line).
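A minimal behavioral sketch of this next-line prefetcher (module and port names, the reset, and the fixed +4/+8 stride are assumptions following the description above; since the spec has a single prefetch address output, the hit case issues the two addresses over two cycles):
module l2_next_line_prefetcher (
  input  logic        clk,
  input  logic        rst_n,
  input  logic [31:0] pc_i,          // current PC (width assumed)
  input  logic        valid_i,       // PC valid
  input  logic        hit_i,         // L2 hit indication for pc_i
  output logic [31:0] pref_addr_o,   // prefetch address
  output logic        pref_valid_o   // prefetch address valid
);
  // On a miss, issue PC+4 (next line); on a hit, issue PC+4 this cycle and PC+8 the next.
  logic        pending;              // PC+8 still to be issued
  logic [31:0] pending_addr;

  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      pref_valid_o <= 1'b0;
      pending      <= 1'b0;
    end else if (valid_i) begin
      pref_addr_o  <= pc_i + 32'd4;
      pref_valid_o <= 1'b1;
      pending      <= hit_i;          // queue the second line only on a hit
      pending_addr <= pc_i + 32'd8;
    end else if (pending) begin
      pref_addr_o  <= pending_addr;
      pref_valid_o <= 1'b1;
      pending      <= 1'b0;
    end else begin
      pref_valid_o <= 1'b0;
    end
  end
endmodule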
Frequency Divide by 2 logic design
Verilog RTL code for a divide by 2 logic
module clk_div (clk_in, enable, reset, clk_out);
// Port declaration
input clk_in;
input reset;
input enable;
output clk_out;
// Port data type declaration
wire clk_in;
wire enable;
wire reset;
// Internal registers
reg clk_out;
// Code starts here
always @(posedge clk_in)
  if (reset) begin
    clk_out <= 1'b0;
  end else if (enable) begin
    clk_out <= !clk_out;
  end
endmodule
Transistors
What's the relationship between voltage and speed? The higher the supply voltage, the larger the transistor drive current, so the transistor switches on and off faster and the
circuit can run at a higher speed; conversely, lowering the voltage increases delay.
For wires, the required voltage scales with resistance, which scales with wire length: the longer the wire, the higher the voltage (i.e., the larger the voltage difference) needed to drive the signal to the other end within the same time.
What will happen if the PMOS and NMOS of the CMOS inverter circuit are interchanged with respect to their positions?
Assume that the PMOS and NMOS positions are interchanged.
- A pMOS is a switch that turns on when you apply a 0 at its gate.
- An nMOS is a switch that turns on when you apply a 1 at its gate.
Since in an NMOS the drain gets the higher voltage, in our case the drain is connected to VDD and the source becomes the output node.
Apply VDD (i.e., logic 1) to the gate. The NMOS turns ON and the output node charges towards VDD. But you need Vgs >= Vth to keep the NMOS in the ON state.
Currently Vg is at VDD and Vs is charging towards VDD.
Now, when Vs approaches VDD - Vth, you have Vgs = VDD - (VDD - Vth) = Vth. Any further rise at Vs would turn the NMOS off, so you never get a
strong 1 (i.e., VDD) at the output. The NMOS therefore passes a weak 1 (VDD - Vth).
You can apply a similar analysis to the PMOS and show that it passes a weak 0 (i.e., Vth above ground).
PS: The circuit would not actually work as an inverter, but as a buffer passing weak 1's and weak 0's.
Why/What is load capacitance in CMOS inverter? Load capacitance in a CMOS circuit is a combination of input capacitance of the following
circuit(s) and the capacitance of the interconnect. (For long interconnects things get more tricky as transmission line effects need to be taken into
consideration)
The effect of load capacitance is that it causes a transient current demand on the inverter output, which causes a number of secondary effects, two of which are:
- The output has a limited current capability, so this limits the maximum rate of change of the signal, slowing down the edges.
- The transient output current is drawn from the power supply and hence causes spikes in the power supply (since the power supply and its
interconnect are non-ideal and have series impedance). This is the reason why decoupling capacitors need to be connected between the power rails
close to the output stage.
Why does increasing transistor size reduce delay in a MOS gate? Delay in a gate can be simplified as the time it takes to discharge the load
capacitance that the gate (FET) is driving.
I = Q/t = C*V/t, so t = C*V/I
1) To first order, delay (time) is inversely proportional to drive current, so increasing the drive current reduces the delay.
2) Increasing the MOS width increases its drive current.
Therefore, increasing the MOS width increases its drive current, which reduces the discharge time of the load (i.e., reduces the delay).
If you want the delay through a gate to be small, you can make the gate bigger, which reduces its electrical fanout (the ratio of its load capacitance to its input capacitance).
However, keep in mind that other gates have to drive this gate's input capacitance Cin, so we cannot make the gate arbitrarily big. You cannot size one gate in isolation; you should
consider the full chain of logic. Typically there is an optimum sizing: in the case of a chain of inverters driving a large load capacitance, the optimal electrical
fanout per stage is found to be between 3 and 4.
How does transistor size affect clock speed? Clock speed is limited by the critical path, i.e., the slowest path in your chip. In layman's terms, you're only as strong as your
weakest link, and the critical path is that weakest link. If you run your clock faster than your critical path allows, you encounter setup violations along that path, which in
turn corrupt other paths, and the chip malfunctions.
This is where transistor size comes into play. While not directly affecting the clock, the size of your transistors affects path delays: a bigger transistor presents a larger gate
capacitance to the stage driving it and therefore takes longer to charge up, which can make paths, and therefore the critical path, slower (even though, as noted above, a wider
transistor also has more drive current, so there is an optimum size).
As transistor size (process geometry) shrinks, switching frequency goes up, i.e., transitions from 0 to 1 or 1 to 0 happen faster. A whole chip full of such high-frequency transistors
enables a high-frequency CPU.
Describe how a multi-bit synchronizer / async FIFO handles the variable delay of each bit.
The standard solution is to encode the pointer in Gray code, where for each increment of the pointer only one bit changes (so the variable delay cannot
cause false empty/full glitches as the bits settle). Note that this assumes a power-of-two depth FIFO; otherwise the Gray code may flip multiple bits when the pointer wraps from the
non-power-of-two top back to the bottom. Also note that in any standard asynchronous design, it is best not to have any combinational logic before a synchronizer
(always have a flop). The reason is that even if you think the design is glitch-proof, the synthesis tool could perform a strange optimization and create a circuit that
glitches when you are not expecting it (and that glitch can get sampled by the synchronizer and result in a false pulse on the other side). This can be avoided by writing
structural RTL and using DC constraints to ensure your glitch-proof circuit ends up in the netlist (though why not just avoid these cases altogether by adding the flop).
Also of note, to create the empty signal, the write pointer has to be converted to Gray code, flopped (see the note above about combinational logic before
synchronizers) and then sent through synchronizers into the read clock domain. The output of the synchronizers then has to be Gray-code decoded, so that it can
finally be compared to the read pointer to determine empty (a sketch of the Gray-code conversions follows). Generating the full signal is just the reverse. For power-of-two FIFOs, just add one extra bit to the pointers to
tell the difference between full and empty (if the extra bits are equal and the remaining pointer bits match, the FIFO is empty; if the extra bits differ and the remaining bits match, the FIFO is full).
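As a quick illustrative sketch of the Gray-code conversions mentioned above (the 4-bit width and function names are assumptions):
function automatic logic [3:0] bin2gray(input logic [3:0] bin);
  return bin ^ (bin >> 1);             // each Gray bit is the XOR of adjacent binary bits
endfunction

function automatic logic [3:0] gray2bin(input logic [3:0] gray);
  logic [3:0] bin;
  bin[3] = gray[3];                    // MSB is unchanged
  for (int i = 2; i >= 0; i--)
    bin[i] = bin[i+1] ^ gray[i];       // each binary bit folds in the next Gray bit
  return bin;
endfunction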
Enough about FIFOs and async-crossings.
A synchronous First In First Out (commonly referred to as FIFO) or a queue is an array of memory. Generally, it is used when the write and read side logic
operate at the same clock frequency.
Use Case
- to buffer data when the burst write operation is larger than the burst read operation or vice versa
- read operation is delayed with respect to the write
Interfaces
A FIFO typically has the following set of signals:
- Clock and reset
- Write enable and write data
- Read enable and read data
- Full and empty flags (outputs)
Scenarios to verify
FIFO is a commonly used logic in many designs. The major functional features which have to be verified are
- Single write and read operations as well as data correctness
- FIFO transitioning from empty to non-empty and vice versa
- The transition from non-empty to full and vice versa
- Burst read and burst write operations up to the maximum depth
- Empty to full and back to empty
Error conditions (assertion sketches for these follow below)
- Write operation when full: The client should wait for the full signal to go low before issuing more writes. Otherwise, data in the FIFO could be
overwritten or dropped.
- Read operation when empty: The client should wait for the empty signal to go low before issuing a read. Otherwise, the data read will be garbage.
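A minimal SVA sketch of these two checks, placed inside the FIFO module or a checker bound to it (the signal names clk, rst_n, wr_en, rd_en, full and empty are assumptions about the interface listed above):
// Never accept a write while full, never accept a read while empty
assert property (@(posedge clk) disable iff (!rst_n) !(wr_en && full));
assert property (@(posedge clk) disable iff (!rst_n) !(rd_en && empty));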
Concept: The idea is to detect a rising edge (a signal transition from logic 0 to logic 1). This can be done in two ways; one common approach is sketched below.
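A minimal sketch of one common approach (module and signal names are assumptions): register the incoming signal and flag the cycle where the current value is 1 and the previous value was 0.
module rise_detect (
  input  logic clk,
  input  logic rst_n,
  input  logic sig_in,
  output logic rise_pulse
);
  logic sig_q;
  always_ff @(posedge clk or negedge rst_n)
    if (!rst_n) sig_q <= 1'b0;
    else        sig_q <= sig_in;       // remember last cycle's value

  assign rise_pulse = sig_in & ~sig_q; // high for one cycle on a 0->1 transition
endmodule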
1. Level shifter cells:- These cells are used when signals need to cross between different voltage levels. Different blocks commonly run in
different voltage modes depending on their performance and power requirements.
1. Low-to-high level shifter cells:- These cells connect a low-voltage domain to a high-voltage domain.
2. High-to-low level shifter cells:- These cells connect a high-voltage domain to a low-voltage domain.
2. Isolation cells:- These cells isolate power-collapsed logic from powered-on logic. Power-collapsed logic drives Xs, and these unknown
digital values should not propagate into powered-on logic. Isolation cells can be of input or output type, and come in several flavours
(a behavioral sketch of a clamp-low cell follows after this list):
1. Clamp-low isolation cell:- When the clamp signal is asserted, these cells clamp their output to a digital 0.
2. Clamp-high isolation cell:- When the clamp signal is asserted, these cells clamp their output to a digital 1.
3. Clamp-keeper isolation cell:- These cells hold the value present before the clamp signal was asserted; they behave like
sequential elements.
3. Retention registers:- In a power-collapsed state, registers that are retainable hold their values. These cells are designed to work on dual
rails, and the always-on rail enables them to hold their values, allowing the system to recover from a powered-down state. State and
configuration registers are typically retained.
4. Power switches:- Power switches are used to turn the power to a block on or off. They enable power collapse, which saves both leakage and
switching power; this type of architecture is called power gating. Power switches can be PMOS head switches or NMOS foot switches, and they
typically have higher resistance to limit leakage from the voltage rails.
5. LDO (low-dropout regulator):- LDOs are used to regulate voltage and are quite stable in operation. They are used for voltage scaling.
6. Voltage rail shifters:- These can switch the output voltage rail between different input rails, for example a higher-voltage rail for high
performance and a lower-voltage rail for low power. Since changing the voltage of a rail is slow and its settling time is large, voltage rail
shifters can be used to switch quickly to a lower-voltage rail for lower power or to a higher-voltage rail for higher performance.
7. Clock gating cells:- To save dynamic power, clock gating cells gate the clock to an idle block. There are also self-gating clocks, which
depend on logic that enables or disables the clock in a feedback fashion. Clock power is a significant portion of total power. Typically, when the
D and Q of the flip-flops remain at constant values, the clock is shut off to save power; this is gating at the cell level, and there are higher-level
power-saving architectures as well.
8. PMIC:- A power management IC is typically used to supply the voltage rails to the chip. These rails must be stable, low-noise and within the
operating margin. The PMIC takes power from a battery or another power source and uses converters (such as DC-DC buck/boost converters,
pulse-width modulators, etc.). It also performs voltage scaling and power-source selection, and in some cases it can charge the device
battery.
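As referenced above, a minimal behavioral sketch of a clamp-low isolation cell (module and port names are assumptions; real implementations are library cells with power-aware pins):
module iso_clamp_low (
  input  logic data_in,   // signal from the power-collapsed domain
  input  logic iso_en,    // isolation enable (clamp signal)
  output logic data_out   // signal into the always-on domain
);
  assign data_out = iso_en ? 1'b0 : data_in;  // clamp to 0 while isolation is asserted
endmodule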
Example 1: freq_write = 20 MHz, freq_read = 10 MHz, burst = 80 bytes
Time taken to write 80 bytes (t1) = 80/20 = 4 us
Time taken to read 80 bytes (t2) = 80/10 = 8 us
FIFO_depth = (t2 - t1) * smaller_freq = 4 * 10 = 40
Using the formula, FIFO_depth = 80 - 80*(10/20) = 40
Example 2: Same read and write clock frequency; write burst = 80 bytes in 100 clocks, read burst = 8 bytes in 10 clocks
If there is no burst overlap and considering the worst-case burst:
No. of bytes written in 80 clocks = 80
No. of bytes read in 80 clocks = 8*8 = 64
Hence, FIFO_depth = 80 - 64 = 16
If there is a burst overlap, the maximum write burst can be 160 bytes across 200 clocks; considering the worst-case burst:
No. of bytes written in 160 clocks = 160
No. of bytes read in 160 clocks = 8*16 = 128
Hence, FIFO_depth = 160 - 128 = 32
// generate 100 Hz pulse chain from 50 MHz
reg [18:0] count_reg = 0;
reg out_100hz = 0;
always @(posedge clk_50mhz or posedge rst_50mhz) begin
  if (rst_50mhz) begin
    count_reg <= 0;
    out_100hz <= 0;
  end else begin
    out_100hz <= 0;
    if (count_reg < 499999) begin
      count_reg <= count_reg + 1;
    end else begin
      count_reg <= 0;
      out_100hz <= 1;  // one-cycle pulse every 500,000 clocks
    end
  end
end

// generate 100 Hz square wave from 50 MHz
reg [17:0] count_reg = 0;
reg out_100hz = 0;
always @(posedge clk_50mhz or posedge rst_50mhz) begin
  if (rst_50mhz) begin
    count_reg <= 0;
    out_100hz <= 0;
  end else begin
    if (count_reg < 249999) begin
      count_reg <= count_reg + 1;
    end else begin
      count_reg <= 0;
      out_100hz <= ~out_100hz;  // toggle every 250,000 clocks -> 100 Hz square wave
    end
  end
end
Summary of SystemVerilog Extensions to Verilog
SystemVerilog adds important new constructs to Verilog-2001, including:
New data types: byte, shortint, int, longint, bit, logic, string, chandle.
Typedef, struct, union, tagged union, enum
Dynamic and associative arrays; queues
Classes
Automatic/static specification on a per variable instance basis
Packages and support for Compilation Units
Extensions to Always blocks for modelling combinational, latched or clocked processes
Jump Statements (return, break and continue)
Extensions to fork-join, disable and wait to support dynamic processes.
Interfaces to encapsulate communication
Clocking blocks to support cycle-based methodologies
Program blocks for describing tests
Randomization and constraints for random and directed-random verification
Procedural and concurrent assertions and Coverage for verification
Enhancements to events and new Mailbox and Semaphore built-in classes for inter-process communication.
The Direct Programming Interface, which allows C functions to be called directly from SystemVerilog (and vice versa) without using the PLI.
Assertions and Coverage Application Programming Interfaces (APIs) and extensions to the Verilog Procedural Interface (VPI); details of these are outside
the scope of the SystemVerilog Golden Reference Guide.