Seminar 3
ON
ASYNCHRONOUS TECHNOLOGIES FOR SYSTEM ON CHIP DESIGN
BY
AKASHDEEP
GUIDED BY:
ECED, SVNIT
CERTIFICATE
This is to certify that Mr. AKASHDEEP, bearing Roll No. U12EC003, of B.Tech IV, 7th
Semester, has successfully and satisfactorily presented the UG Seminar and submitted the
report on the topic entitled “ASYNCHRONOUS TECHNOLOGIES FOR SYSTEM ON
CHIP DESIGN” in partial fulfillment of the degree of Bachelor of Technology (B.Tech) in
Dec. 2015.
Head,
ECED, SVNIT.
This report introduces the main design principles, methods, and building blocks for asynchronous
very-large-scale-integration (VLSI) systems, with an emphasis on communication and
synchronization. At first, systems on chip (SoCs) will be globally asynchronous and locally
synchronous (GALS). But the complexity of the various asynchronous/synchronous interfaces
required in a GALS design will eventually lead to totally asynchronous solutions.
In this report we discuss four main areas. First, we give an overview of system-on-chip design,
including descriptions of the major approaches, the basic challenges that commonly arise in SoC
design, and the solutions used to overcome them. We also discuss the motivating factors behind
this development. Next, we briefly summarize key design methodologies, processes, and flows.
We also briefly describe some of the advanced, next-generation concepts that are emerging for
SoCs.
Asynchronous circuits with the only delay assumption of isochronic forks are called quasi-delay-
insensitive (QDI). QDI is used as the basis for asynchronous logic. We will discuss asynchronous
handshake protocols for communication and the notion of validity/neutrality tests. We will also
discuss basic building blocks for sequencing, storage, function evaluation and buses.
A system on chip (SoC) is a system on an IC that integrates software and hardware IP using more
than one design methodology. SoC design includes embedded processor cores and a significant
software component, which lead to additional design challenges. In addition to the IC, an SoC
consists of software and an interconnection structure for integration.
System-on-chip design is significantly more complex. Chip designs have reused design elements
for the last 20 years; SoC design has involved the reuse of more complex elements at higher
levels of abstraction. Block-based design, which involves partitioning, designing, and assembling
SoCs using a hierarchical block-based approach, has used the Intellectual Property (IP) block as
the basic reusable element [4].
SoC: more of a system than a chip. In addition to the IC, an SoC consists of software and an
interconnection structure for integration. An SoC may consist of all or some of the following:
Processor/CPUs (cores)
On-chip interconnection (busses, network, etc.)
Analog circuits
Accelerators or application specific hardware modules
ASIC logic
Software – OS, Application, etc.
Firmware [2]
The large parameter variations across a chip will make it prohibitively expensive to control
delays in clocks and other global signals. Also, issues of modularity and energy consumption
plead in favor of asynchronous solutions at the system level. It is now generally agreed that the
sizable very large scale integration (VLSI) systems [systems-on-chip] of the nanoscale era will
not operate under the control of a single clock and will require asynchronous techniques.
In this report we introduce the main design principles, methods, and building blocks for
asynchronous VLSI systems, with an emphasis on communication and synchronization. Such
systems will be organized as distributed systems on a chip consisting of a large collection of
components communicating by message exchange. Therefore, it places a strong emphasis on
network and communication issues, for which asynchronous techniques are particularly
well-suited.
We hope that after reading this report, the designer of an SoC will be familiar enough with these
techniques to no longer hesitate to use them. Even those adepts of GALS who are adamant not to
let asynchrony penetrate further than the network part of their SoC must realize that network
architectures for SoCs are rapidly becoming so complex as to require the mobilization of the
complete armory of asynchronous techniques [1].
Here we define some important terms that will be used frequently in the following sections:
A digital circuit is asynchronous when no clock is used to implement sequencing. Such circuits
are also called clockless. The various asynchronous approaches differ in their use of delay
assumptions to implement sequencing.
Asynchronous circuits with the only delay assumption of isochronic forks are called quasi-delay-
insensitive (QDI). We use QDI as the basis for asynchronous logic.
An asynchronous circuit in which all forks are assumed isochronic corresponds to what has been
called a speed-independent circuit: a circuit in which the delays in the interconnects (wires and
forks) are negligible compared to the delays in the gates.
Self-timed circuits are asynchronous circuits in which all forks that fit inside a chosen physical
area, called an equipotential region, are isochronic.
A circuit is delay-insensitive (DI) when its correct operation is independent of any assumption on
delays in operators and wires except that the delays are finite and positive.
An SoC is a system on an IC that integrates software and hardware IP using more than one design
methodology. SoC design includes embedded processor cores and a significant software
component, which lead to additional design challenges.
There are many challenges in system-on-chip design; some of the important ones are described
below [3].
2.8 EM (Electromigration):
“In the future, more designs will be EM-limited.”
• The “self-heating” thermal profile of high-switching-activity, high-frequency devices
requires detailed modeling to determine the (local) increase in metal temperature. EM
reliability analysis will become increasingly complex.
• The delay in EUV lithography requires new techniques for decomposition and
patterning (e.g., TPL, sidewall-based). Expect much more engineering participation in
layout strategies to be multi-patterning compliant:
  - Additional restricted design rules
  - Complex coloring requirements and verification
SoCs are complex distributed systems in which a large number of parallel components
communicate with one another and synchronize their activities by message exchange.
Synchronous (clocked) logic brings a simple solution to the problem by partially ordering
transitions with respect to a succession of global events (clock signals) so as to order conflicting
read/write actions. In the absence of a global time reference, asynchronous logic has to deal with
concurrency in all its generality, and asynchronous logic synthesis relies on the methods and
notations of concurrent computing.
There exist many languages for distributed computing. The high-level language used in this
report is called Communicating Hardware Processes (CHP). It is used widely, in one form or
another, in the
design of asynchronous systems. We introduce only those constructs of the language needed for
describing the method and the examples, and that are common to most computational models
based on communication.
A system is composed of concurrent modules called processes. Processes do not share variables
but communicate only by send and receive actions on ports.
A send port of a process (say, port R of process p1) is connected to a receive port of another
process (say, port L of process p2) to form a channel. A receive command on port L is denoted
L?y. It assigns to local variable y the value received on L. A send command on port R, denoted
R!x, assigns to port R the value of local variable x. The data item transferred during a
communication is called a message. The net effect of the combined send R!x and receive L?y is
the assignment y := x together with the synchronization of the send and receive actions.
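The net effect described above can be sketched in software. The following is a minimal Python model of a slack-zero channel in which a send (R!x) and a receive (L?y) rendezvous, with the combined effect y := x. It is purely illustrative; the class and method names are our own, not part of CHP.

```python
import threading

class Channel:
    """Toy model of a slack-zero CHP channel: send (R!x) and
    receive (L?y) must rendezvous; the net effect is y := x."""
    def __init__(self):
        self._value = None
        self._send_ready = threading.Semaphore(0)
        self._recv_done = threading.Semaphore(0)

    def send(self, x):               # models R!x
        self._value = x
        self._send_ready.release()   # offer the message
        self._recv_done.acquire()    # block until the receiver has taken it

    def recv(self):                  # models L?y
        self._send_ready.acquire()   # block until a sender is ready
        y = self._value
        self._recv_done.release()    # release the blocked sender
        return y

# Two "processes": a sender thread and the main thread as receiver.
ch = Channel()
sender = threading.Thread(target=ch.send, args=(42,))
sender.start()
y = ch.recv()                        # y := x, synchronized with the send
sender.join()
```

Note that neither side can complete its action alone: the semaphores force the send and receive to synchronize, which is exactly the slack-zero behavior.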
The value of a variable is changed by an explicit assignment to the variable, as in x := expr. For a
Boolean b, b↑ and b↓ stand for b := true and b := false, respectively.
CHP and HSE provide two composition operators: the sequential operator, as in S1; S2, and the
parallel operator. Unrestricted use of parallel composition would cause read/write conflicts on
shared variables, so CHP restricts the use of concurrency in two ways. The parallel bar ║, as in
S1║S2, denotes the parallel composition of processes. CHP also allows a limited form of concurrency
inside a process, denoted by the comma, as in S1, S2. The comma is restricted to program parts
that are noninterfering: if S1 writes x, then S2 neither reads x nor writes x.
The selection command is a generalization of the if statement. It has an arbitrary number (at least
one) of clauses, called “guarded commands,” Bi → Si, where Bi is a Boolean condition and Si is
a program part. The execution of the selection consists of: 1) evaluating all guards and 2)
executing the command Si with the true guard Bi. In this version of the selection, at most one
guard can be true at any time. There is also an arbitrated version where several guards can be
true. In that case, an arbitrary true guard is selected.
In both versions, when no guard is true, the execution is suspended: the execution of the
selection reduces to a wait for a guard to become true. Hence, waiting for a condition to be true
can be implemented with the selection [B → skip], where skip is the command that does nothing
but terminates. A shorthand notation for this selection is [B].
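A guarded-command selection can be mimicked in ordinary code. The Python sketch below (our own naming, illustrative only) evaluates all guards and runs the command attached to the true one; when no guard is true it returns None, standing in for the suspended wait of [B → skip].

```python
def select(guarded_commands, arbitrated=False):
    """Model of the CHP selection: guarded_commands is a list of
    (guard, command) pairs of zero-argument callables."""
    true_clauses = [(g, s) for g, s in guarded_commands if g()]
    if not true_clauses:
        return None          # no true guard: execution suspends (waits)
    if not arbitrated:
        # non-arbitrated version: guards must be mutually exclusive
        assert len(true_clauses) == 1
    _, command = true_clauses[0]   # arbitrated: any true guard may win
    return command()

x = 5
result = select([(lambda: x > 0, lambda: "positive"),
                 (lambda: x < 0, lambda: "negative")])
```

In a real circuit the "suspended" case is an actual wait for a guard to become true; returning None is a deliberate simplification of this model.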
Slack matching is an optimization by which simple buffers are added to a system of distributed
processes to increase the throughput. A pipeline is a connected subgraph of the process graph
with one input port and one output port. The static slack of a pipeline is the maximal number of
messages the pipeline can hold. A pipeline consisting of a chain of n simple buffers has a static
slack of n.
Each CHP process is refined into a partial order of signal transitions, i.e., transitions on Boolean
variables. The HSE (handshaking expansion) notation differs from CHP only in that it allows
only Boolean variables, and send and receive communications are replaced with their
handshaking expansions in terms of the Boolean variables modeling the communication wires.
The modeling
of wires introduces a restricted form of shared variables between processes (the variables
implementing channels).
The input variables li and ri can only be read. The output variables lo and ro can be read and
written.
Stability and noninterference are the two properties of a production-rule set (PRS) that guarantee
that the circuits operate correctly, i.e., without logic hazards. A hazard is the possibility of an
incomplete transition.
How do we guarantee the proper execution of a production rule G → t? In other words, what can
go wrong and how do we avoid it? Two types of malfunction may take place: 1) G may cease to
hold before transition t has completed, as the result of a concurrent transition invalidating G, and
2) the complementary transition t′ of t is executed while the execution of t is in progress, leading
to an undefined state. We introduce two requirements, stability and noninterference, that
eliminate these two sources of malfunction.
Any concurrent execution of a stable and noninterfering PRS is equivalent to the sequential
execution model in which, at each step of the computation, a PR with a true guard is selected and
executed. The selection of the PR should be weakly fair, i.e., any enabled PR is eventually
selected for execution.
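The sequential execution model just described is easy to turn into a toy simulator. In this Python sketch (names are ours, not from the source), a production rule is a (guard, variable, value) triple; at each step one enabled rule is fired, with random choice standing in, loosely, for weak fairness on terminating examples.

```python
import random

def run_prs(rules, state, steps=1000, seed=0):
    """Sequential execution model of a stable, noninterfering PRS:
    repeatedly pick an enabled production rule (a true guard whose
    transition would actually change the state) and execute it."""
    rng = random.Random(seed)
    for _ in range(steps):
        enabled = [(var, val) for guard, var, val in rules
                   if guard(state) and state[var] != val]
        if not enabled:
            break                    # computation has quiesced
        var, val = rng.choice(enabled)
        state[var] = val             # fire the transition var^ or var_
    return state

# A tiny PRS: a -> b+ and b -> c+ (a ripple of up-transitions).
rules = [(lambda s: s["a"], "b", True),
         (lambda s: s["b"], "c", True)]
final = run_prs(rules, {"a": True, "b": False, "c": False})
```

Because the example PRS is stable, every fair interleaving reaches the same final state, which is what makes the sequential model sound.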
The existence of a sequential execution model for QDI computations greatly simplifies reasoning
about, and simulating, those computations. Properties similar to stability are used in other
theories of asynchronous computations, in particular, semimodularity and persistency. At the
logical level, the execution of a transition x↓ whose guard contains x invalidates its own guard.
(Such production rules are therefore called self-invalidating.) We exclude self-invalidating
production rules, since, in most implementations, they would violate the stability condition.
The fork (x, x1, x2) is isochronic: a transition on x1 causes a transition on y only when c is true,
and a transition on x2 causes a transition on z only when c is false. Hence, certain transitions on
x1 and on x2 are not acknowledged, and therefore a timing assumption must be used to
guarantee the proper completion of those unacknowledged transitions.
Unfortunately, the class of computations in which all transitions are acknowledged is very
limited. Consider the example of Fig. 6. Signal x is forked to x1, an input of gate Gy with output
y, and to x2, an input of gate Gz with output z. A transition x↑ when c holds is followed by a
transition y↑, but not by a transition z↑, i.e., transition x1↑ is acknowledged but transition x2↑
is not, and vice versa when ¬c holds. Hence, in either case, a transition on one output of the fork is
not acknowledged. In order to guarantee that the unacknowledged transition completes without
violating the specified order, a timing assumption called the isochronicity assumption has to be
introduced, and the forks that require that assumption are called isochronic forks.
Let us first implement a ″bare″ communication between processes p1 and p2: no data is
transmitted. (Bare communications are used as a synchronization point between two processes.)
In that case, channel (R, L) can be implemented with two wires: wire (ro, li) and wire (lo, ri).
(The wires that implement a channel are also called rails.) This is shown in Fig. 4.1.
The simplest handshake protocol implementing the slack-zero communication between R and L
is the so-called two-phase handshake protocol, also called non return to zero (NRZ). The
protocol is defined by the following handshake sequence Ru for R and Lu for L:
Ru : ro ↑; [ri]
Lu : [li]; lo ↑
Given the behavior of the two wires (ro,li) and (lo,ri) , the only possible interleaving of the
elementary transitions of Ru and Lu is ro ↑; li ↑; lo ↑; ri ↑.
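This unique interleaving can be checked with a short simulation. The Python sketch below is our own illustration, not a circuit: it steps through the two-phase handshake, propagating each wire, and records the order of transitions.

```python
trace = []

def simulate_two_phase():
    """One NRZ (two-phase) handshake: Ru: ro^; [ri]   Lu: [li]; lo^."""
    trace.clear()
    ro = li = lo = ri = False        # initial state of the four wires
    ro = True;  trace.append("ro+")  # Ru starts: ro goes up
    li = ro;    trace.append("li+")  # wire (ro, li) propagates
    lo = True;  trace.append("lo+")  # Lu: wait [li] satisfied, lo goes up
    ri = lo;    trace.append("ri+")  # wire (lo, ri) propagates; [ri] holds
    return ro, li, lo, ri            # all handshake variables end up true

final_state = simulate_two_phase()
```

The final state, with all four variables true, is exactly why the next handshake must use the down-going protocol described below.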
This interleaving is a valid implementation of a slack zero execution of R and L, since there is no
state in the system where one handshake has terminated and the other has not started. But now all
handshake variables are true, and therefore the next handshake protocol for R and L has to be
Rd : ro ↓; [¬ri]
Ld : [¬li]; lo ↓
The use of the two different protocols is possible if it can be statically determined (i.e., by
inspection of the CHP code) which are the even (up-going) and odd (down-going) phases of the
communication sequence on each channel. But if, for instance, the CHP program contains a
selection command, it may be impossible to determine whether a given communication is an
even or odd one.
A straightforward solution is to always reset all variables to their initial value (zero). Such a
protocol is called four-phase or return-to-zero (RZ). R is implemented as Ru; Rd and L as Lu; Ld
as follows:
R : ro ↑; [ri]; ro ↓; [¬ri]
L : [li]; lo ↑; [¬li]; lo ↓
In this case, the only possible interleaving of transitions for a concurrent execution of R and L is
ro ↑; li ↑; lo ↑; ri ↑; ro ↓; li ↓; lo ↓; ri ↓.
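The eight-transition cycle can likewise be stepped through in code. The sketch below (illustrative Python, our own naming) runs one four-phase cycle and confirms that every wire returns to zero.

```python
cycle = []

def simulate_four_phase():
    """One RZ (four-phase) handshake:
       R: ro^; [ri]; ro_; [~ri]     L: [li]; lo^; [~li]; lo_."""
    cycle.clear()
    ro = True;  cycle.append("ro+")
    li = ro;    cycle.append("li+")   # wire (ro, li) propagates
    lo = True;  cycle.append("lo+")
    ri = lo;    cycle.append("ri+")   # wire (lo, ri) propagates
    ro = False; cycle.append("ro-")   # return-to-zero phase begins
    li = ro;    cycle.append("li-")
    lo = False; cycle.append("lo-")
    ri = lo;    cycle.append("ri-")
    return ro, li, lo, ri             # every variable back to zero

reset_state = simulate_four_phase()
```

The first four transitions correspond to the slack-zero communication between Ru and Lu, the last four to the one between Rd and Ld, which is the observation behind reshuffling.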
Again, it can be shown that this interleaving implements a slack-zero communication between R
and L. It can even be argued that this implementation is in fact the sequencing of two slack-zero
communications: the first one between Ru and Lu, the second one between Rd and Ld. This
observation will be used later to optimize the protocols by a transformation called reshuffling.
Let us now deal with the case when the communication also entails transmitting data, for
instance, by sending on R (R!x) and receiving on L (L?y). A solution immediately comes to mind:
let us add a collection of data wires next to the handshake wires. The data wire (rd, ld) is
indicated by a double arrow in Fig. 4.2. The protocols are as follows:
This protocol relies on the timing assumption that the order between rd := x and ro↑ in the
sender is maintained at the receiver: when the receiver has observed li to be true, it can assume
that ld has been set to the right value, which amounts to assuming that the delay on wire (ro, li) is
always safely longer than the delay on wire (rd, ld). Such a protocol is called bundled-data. The
efficiency of bundled-data versus DI codes is a hotly debated issue.
In the absence of timing assumptions, the protocol cannot rely on a single wire to indicate when
the data wires have been assigned a valid value by the sender. The validity of the data has to be
encoded with the data itself. A DI data code is one in which the validity and neutrality of the data
are encoded within the data. Furthermore, the code is chosen such that when the data changes
from neutral to valid, no intermediate value is valid; when the data changes from valid to neutral,
no intermediate value is neutral. Such codes are also called separable. There are many DI codes,
but two are almost exclusively used on chip: the dual-rail and 1-of-N codes.
In a dual-rail code, two wires, bit.0 and bit.1, are used for each bit of the binary representation of
the data.
In a 1-of-N code, one wire is used for each value of the data. Hence, a two-bit data word is
encoded with four wires, one per value from 0 to 3: the wire whose index equals the value of the
word is raised, and the other three stay low.
For a Boolean data-word, dual-rail and 1-of-N are obviously identical. For a 2-bit data word,
both dual-rail and 1-of-4 codes require four wires. For an N-bit data word, dual-rail requires 2*N
wires. If the bits of the original word are paired and each pair is 1-of-4 encoded, this coding also
requires 2*N wires. An assignment of a valid value to a dual-rail-coded word requires 2*N
transitions, but only N transitions in the case of a 1-of-4 code.
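The wire and transition counts above are easy to verify. The following Python sketch (helper names are our own) encodes a word both ways and counts wires and per-cycle transitions, where each raised wire goes up once and back down once in a four-phase cycle.

```python
def dual_rail(bits):
    """Dual-rail: bit 0 -> (bit.0, bit.1) = (1, 0); bit 1 -> (0, 1)."""
    return [(1, 0) if b == 0 else (0, 1) for b in bits]

def one_of_four(bits):
    """Pair the bits; each pair of value v (0..3) raises wire v of four."""
    assert len(bits) % 2 == 0
    return [tuple(1 if k == 2 * hi + lo else 0 for k in range(4))
            for hi, lo in zip(bits[0::2], bits[1::2])]

def wires(code):
    """Total number of wires used by the encoding."""
    return sum(len(group) for group in code)

def rz_transitions(code):
    """Transitions per four-phase cycle: 2 per raised wire (up + down)."""
    return 2 * sum(sum(group) for group in code)

word = [1, 0, 1, 1]                  # an N = 4 bit data word
```

Both encodings of the 4-bit word use 8 wires, but the dual-rail word costs 8 transitions per cycle against 4 for the 1-of-4 word, matching the 2*N versus N figures above.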
The 1-of-N code, also called one-hot, is a special case of a larger class of codes called k-out-of-
N. Instead of using just one true bit out of N code bits, as is done in the 1-of-N code, we may use
k true bits, 0 < k < N, to represent a valid code value. The maximal number of valid values for a
given N is obtained by choosing k = N/2, as Sperner has proved.
The choice of a DI code in the design of a system on a chip is dictated by a number of practical
requirements. First, the tests for validity and neutrality must be simple. The neutrality test is
simple: as in all codes, the unique neutral value is the set of all zeroes or the set of all ones. But
the validity test may vary greatly with the code. Second, the coding and decoding of a data word
must be simple. Third, the overhead in terms of the number of bits used for a code word
compared to the number of bits used for a data word should be kept reasonably small.
Finally, the code should be easy to split: a code word is often split into portions that are
distributed among a number of processes. For example, a processor instruction may be
decomposed into an opcode and several register fields. It is very convenient if the portions of a
code word are themselves a valid code word. This is the case for the dual-rail code for all
partitions, and for the 1-of-4 code for partitions down to a quarter-byte.
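For the dual-rail code, the validity and neutrality tests and the splitting property can be stated in a few lines. A Python sketch (our own naming), representing each bit as its (bit.0, bit.1) rail pair:

```python
def is_neutral(code):
    """Neutral: every rail of every bit is low (the all-zero value)."""
    return all(rail == 0 for pair in code for rail in pair)

def is_valid(code):
    """Valid: each bit has exactly one of its two rails raised."""
    return all(sum(pair) == 1 for pair in code)

word = [(0, 1), (1, 0), (0, 1)]   # dual-rail encoding of bits 1, 0, 1
```

A word in transition from neutral to valid (some bits raised, some not) passes neither test, which is the separability property, and any slice of a dual-rail word is itself a valid dual-rail word, which is the splitting property.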
The combination of four-phase handshake protocol and DI code for the data gives the following
general implementation for communication on a channel. In this generic description, we use
global names for both the sender and receiver variables. A collection of data wires called data
encodes the message being sent. A single acknowledge wire ack is used by the receiver to notify
the sender that the message has been received. This wire is called the enable wire when it is
initialized high.
The three basic building blocks are: 1) a circuit that sequences two bare communication actions
(the sequencing of any two arbitrary actions can be reduced to the sequencing of two bare
communications); 2) a circuit that reads and writes a single-bit register; and 3) a circuit that
computes a Boolean function of a small number of bits.
5.1 Sequencer:
The basic sequencing building block is the sequencer process, also called the left–right buffer,
p1 : *[L; R], which repeatedly performs a bare communication on its left port L followed by a
bare communication on its right port R. The two ports are connected to an environment which
imposes no restriction on the two communications. The simplest implementation is when both
ports are active. (The reason is that a handshake on a passive port is initiated by the environment
and therefore requires extra effort to be synchronized.)
Now, all the states that need to be distinguished are uniquely determined and we can generate a
PR set that implements the HSE. This leads to the two solutions shown in Figs. 5.1 and 5.2. In
the first solution, the state variable x is implemented with a C-element; in the second, with
cross-coupled NOR gates.
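The C-element used here as the state-holding element can be modeled as a function of its two inputs and its previous output. A minimal Python sketch, capturing only the gate-level behavior:

```python
def c_element(a, b, prev):
    """Muller C-element: drives high when both inputs are high, low
    when both are low, and otherwise holds its previous output."""
    if a and b:
        return True
    if not a and not b:
        return False
    return prev       # inputs disagree: the output keeps its old value
```

The state-holding behavior on disagreeing inputs is exactly what lets the C-element implement the state variable x in the sequencer.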
All other forms of the left–right buffer are derived from the active–active buffer by changing an
active port into a passive one. The conversion is done by a simple C-element. The passive–active
buffer is shown in Fig. 5.3.
We have already mentioned that the down-going phase of a four-phase handshake is solely for
the purpose of resetting all variables to their initial (neutral state) values, usually false. The
designer therefore has some leeway in the sequencing of the down-going actions of a
communication with respect to other actions of an HSE. The transformation that moves a part of
a handshake sequence in an HSE is called reshuffling. It is an important transformation in
asynchronous system synthesis as many alternative implementations of the same specification
can be understood as being different reshufflings of the same initial HSE.
The interest of this reshuffling is that it leads to a very simple implementation: a single
C-element with its output replicated as both lo and ro, as shown in Fig. 5.4.
Another (less drastic) reshuffling of the original HSE, shown in Fig. 5.5, admits a
two-C-element implementation.
But reshuffling may also reduce the slack of a pipeline stage when it is applied to an input port
and an output port, for instance, L and R in the simple buffer. Hence, reshuffling a buffer HSE is
usually a tradeoff between reducing the circuit complexity on the one hand, and reducing the
slack on the other hand, thereby reducing the throughput.
Next, we implement a register process that provides read and write access to a single Boolean
variable, x. The environment can write a new value into x through port P, and read the current
value of x through port Q. Read and write requests from the environment are mutually exclusive.
As shown in Fig. 16, input port P is implemented with two input wires, p:1 for receiving the value
true, and p:0 for receiving the value false; and one acknowledge wire, po. Output port Q is
implemented with two output wires, q:1 for sending the value true, and q:0 for sending the value
false; and one request wire, qi. Variable x is also dual-rail encoded as the pair of variables xt; xf.
An n-bit register R is built as the parallel composition of n one-bit registers ri. Each register ri
produces a single write-acknowledge signal wack. All the acknowledge signals are combined by
an n-input C-element to produce a single write-acknowledge for R.
The completion tree puts a delay proportional to log n elementary transitions on the critical cycle.
Combined with the write-acknowledge circuit itself, the completion tree constitutes the
completion detection circuit, which is the main source of inefficiency in QDI design. Numerous
efficient implementations of completion detection have been proposed. The read part of the n-bit
register is straightforward: the read-request signal is forked to all bits of the register.
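The logarithmic depth of the completion tree is easy to see in a model. The Python sketch below is our own illustration, with the up-phase of each 2-input C-element simplified to an AND; it combines n acknowledge signals and reports the number of tree stages.

```python
def completion_tree(acks):
    """Combine per-bit write-acknowledges with a tree of 2-input
    C-elements (up-phase only, so each stage acts as an AND).
    Returns the combined acknowledge and the tree depth, which is
    about ceil(log2 n) stages for n inputs."""
    level = list(acks)
    depth = 0
    while len(level) > 1:
        paired = [a and b for a, b in zip(level[0::2], level[1::2])]
        if len(level) % 2:           # odd signal passes through unpaired
            paired.append(level[-1])
        level = paired
        depth += 1
    return level[0], depth
```

For an 8-bit register the tree is 3 stages deep, and the combined acknowledge rises only once every bit has acknowledged, which is the completion-detection behavior described above.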
However, the increasing variability of modern technology requires increasing delay margins for
safety. It is the authors’ experience that after accounting for all margins, the total delay of
bundled data is usually longer than the completion-tree delay, and bundled data gives up the
robustness of QDI [1].
In order to pipeline successive computations of the function, the stage must have slack between
input ports and output ports. In this section, we present two different approaches to the design of
asynchronous pipelines.
The second approach is aimed at fine-grain high throughput pipelines. The data path is
decomposed into small portions in order to reduce the cost of completion detection, and for each
portion, control and data path are integrated in a single component, usually a precharge half
buffer. The implementation of a pipeline into a collection of fine-grain buffers is based on data-
driven decomposition [6].
In its simplest form, a pipeline stage receives a value x on port L and sends the result of a
computation, f (x), on port R. The design of a pipeline stage combines all three basic operations:
sequencing between L and R, storage of parameters, and function evaluation. A simple and
systematic approach consists of separating the three functions.
A control part implements the sequencing between the bare ports of the process and provides a
slack of one in the pipeline stage.
Simplicity and generality are the strengths of the previous approach to pipeline design; it allows
quick circuit design and synthesis. However, the approach puts high lower bounds on the cycle
time, forward latency, and energy per cycle.
First, the inputs on L and the outputs on R are not interleaved in the control, putting all eight
synchronizing transitions in sequence. Second, the completion-tree delay, which is proportional
to the logarithm of the number of bits in the data path, is included twice in the handshake cycle
between two adjacent pipeline stages.
The handshake sequences of L and R are reshuffled with respect to each other so as to overlap
some of the transitions and eliminate the need for explicit registers for input data, and the data
path is decomposed into independent slices so as to reduce the size of the completion trees and
improve the cycle time.
The purpose of this report was to expose the SoC architect to a comprehensive set of standard
asynchronous techniques and building blocks for SoC interconnects and on-chip communication.
Although the field of asynchronous VLSI is still in development, the techniques and solutions
presented here have been extensively studied, scrutinized, and tested in the field; several
microprocessors and communication networks have been successfully designed and fabricated.
The techniques are here to stay. The basic building blocks for sequencing, storage, and function
evaluation are universal and should be thoroughly understood. At the pipeline level, we have
presented two different approaches: one with a strict separation of control and data path, and an
integrated one for high throughput. Different versions of both approaches are used.
At the system level, issues of slack, choices of handshakes and reshuffling affect the system
performance in a profound way. We have tried to make the designer aware of their importance.
At the more fundamental level, issues of stability, isochronic forks, validity and neutrality tests,
and state encoding must be understood in order for the designer to avoid the recurrence of the
hazard malfunctions that plagued early attempts at asynchrony.
While we realize very well that the engineer of an SoC has the freedom and duty to make all
timing assumptions necessary to get the job done correctly, we also believe that, from a didactic
point of view, starting with the design style from which all others can be derived is the most
effective way of teaching this still vastly misunderstood but beautiful VLSI design method.
IP Intellectual property
DI Delay insensitive