Distributed Simulation of Asynchronous Hardware: The Program Driven Synchronisation Protocol

Georgios K. Theodoropoulos
Key Words: asynchronous hardware, CSP, occam, modelling, distributed simulation, synchronisation
1. INTRODUCTION
A digital system is typically designed as a collection of subsystems, each performing a
different computation and communicating with its peers to exchange information. Before
a communication transaction takes place, the subsystems involved need to synchronise,
namely to wait for a common control state to be reached, which guarantees the validity of
data exchanged.
In synchronous systems, the synchronisation of communicating subsystems is achieved
by means of a global clock whose transitions define the points in time when communication
transactions can take place. The operation of a synchronous system proceeds in lockstep,
with the different subsystems being activated to perform their computations in a strict,
predefined order [42]. Synchronous VLSI design, however, is approaching a critical point,
with clock distribution becoming an increasingly costly and complicated issue and power
consumption rapidly emerging as a major concern.
Another digital design philosophy allows subsystems to communicate only when it
is necessary to exchange information. The operation of the system does not proceed in
lockstep, but rather is asynchronous; each sub-system operates at its own rate synchronising
with its peers only when it needs to exchange information. This synchronisation is not
achieved by means of a global clock but rather, by the communication protocol employed.
This protocol is typically in the form of local request and acknowledge signals which
provide information regarding the validity of data signals.
Although asynchronous design techniques have been explored since at least the mid-1950s
[44, 14, 17, 34], they have not hitherto been established as a major philosophy in
digital design. This failure was mainly related to the difficulty of enforcing specific orderings
of operations and of dealing with circuit hazards and dynamic states in an asynchronous,
non-deterministic environment [30]. Recently, however, there has been a resurgence of
interest in asynchronous design techniques, due to the significant potential benefits that the
elimination of global synchronisation may offer to issues such as clock distribution, power
consumption, performance and modularity [26].
Various asynchronous digital design techniques have been developed; they are typically
categorised by the timing model (namely, the assumptions made regarding circuit and
signal delays), the signalling protocol (namely, the sequence of events which must take
place in a communication transaction between two elements), and the technique they
employ for the transfer of data between two elements (namely, the encoding of the value of
each bit transmitted during a communication transaction). In his influential 1988 Turing
award lecture, Ivan Sutherland introduced Micropipelines, a new conceptual framework
for designing asynchronous systems [51]. In-depth surveys of existing asynchronous
methodologies may be found in [6, 30, 5]. Additionally, the Asynchronous Online Logic
Home Page maintained by the AMULET group at the University of Manchester provides
continuous, up-to-date information regarding asynchronous systems research [3].
A number of asynchronous architectures have been developed, including one at Caltech
[40], NSR [8] and Fred [47] at the University of Utah, STRiP at Stanford University [19],
Sun’s Counterflow pipeline processor [50], FAM [12] and TITAC [45] at the University of
Tokyo and the Tokyo Institute of Technology respectively, Hades at the University of
Hertfordshire [22] and Sharp’s Data-Driven Media Processor [49]. The AMULET group at
the University of Manchester have developed a series of asynchronous implementations of
the ARM RISC processor using Sutherland’s Micropipelines. The AMULET1 microprocessor
[60] was the first asynchronous implementation of a commercial instruction set; AMULET2e
[27] sought to improve performance via an improved design, while AMULET3i [28]
aims at embedded applications.
The quest to exploit the potential advantages offered by asynchronous logic has
revealed a need for modelling and simulation techniques appropriate for the asynchronous
design style. Thus, the recent interest in asynchronous design has fuelled intense
research activity aimed at developing such techniques for modelling
and simulating asynchronous systems. I-Nets [43], Petri Nets (e.g. the Petrify tool [16]),
Signal Transition Graphs (STGs) [13] (e.g. the Versify [57] and SIS [48] tools), State
Transition Diagrams (e.g. the MEAT tools [18]), ST-V (the Self-Timed Verilog developed
by Cogency Technology Inc. [15]) and CCS (by Birtwistle et al. at the University of Leeds
[39]) are some of the tools and formalisms that have been employed in asynchronous
logic design.
[Figure 1: The two-phase bundled data handshake interface: a Sender and a Receiver connected by Request, Acknowledge and Data wires.]
Communicating Sequential Processes (CSP) [32], in particular, the concurrent process
algebra developed by Tony Hoare for the specification of parallel systems, has been
extensively advocated as a suitable means for describing asynchronous behaviour. Several
asynchronous modelling approaches and systems have been developed which use CSP-based
notations, including Martin’s [41, 40, 9], Hulgaard’s [33] and Brunvand’s [7] work,
trace theory [21, 20], Delay-Insensitive algebra [37], Tangram [56], SHILPA [29] and
LARD [23].
Contributing to this effort, and motivated by the increasing debate regarding the potential
use of CSP for this purpose, this paper addresses timing and synchronisation issues arising
from the parallel semantics of CSP when the latter is employed for the simulation of large
and complex asynchronous systems. These issues have been largely overlooked by the
advocates of CSP for this purpose and have been addressed for the first time by the research
presented in this paper. The research forms part of a larger project, which has investigated
the suitability of occam [35], the executable counterpart of CSP, for the modelling and
simulation of asynchronous architectures, developing a generic modelling methodology
for this purpose [53, 55]. The investigation targeted asynchronous systems based
on Sutherland’s Micropipelines, using the AMULET1 microprocessor as a testbed; however,
the results may also be applied to any asynchronous design methodology.
The rest of the paper is organised as follows: Section 2 provides a brief overview of
Sutherland’s Micropipelines. Section 3 discusses the role of simulation
in asynchronous VLSI design, placing emphasis on the use of CSP and occam for this
purpose. Section 4 outlines a generic framework which has been developed to support the
rapid modelling of asynchronous micropipelined systems using CSP/occam, while section
5 discusses the application of this framework to the development of the occarm model of the
AMULET1 microprocessor. Section 6 discusses timing and synchronisation problems in
CSP/occam models of asynchronous systems and section 7 argues for the need for a new
synchronisation approach to address these problems. Section 8 introduces such an approach
and section 9 describes its application in the occarm model and presents
performance results. Finally, section 10 epitomises the conclusions drawn.
2. MICROPIPELINES
Using Micropipelines, the asynchronous architecture is designed as a set of simple,
data processing elastic pipelines, whose stages operate asynchronously and exchange data
via a two-phase bundled data handshake synchronisation protocol (figure 1). Two-phase
signalling recognises and responds to transitions of the voltage on a wire, regardless of
whether the transition is rising or falling; a transition is referred to as an event.
Ivan Sutherland also proposed a set of event control blocks for the design of control
circuits in micropipelined systems, as well as event controlled storage elements to be used
in such systems. The event control blocks include the Muller-C, Select, Call, Toggle, Xor
and the Arbiter (figure 2).
[Figure 2: Event control blocks. Muller-C elements provide the AND function of events; XOR provides the OR function for events; SELECT steers events according to its boolean input; TOGGLE steers events to its outputs alternately; CALL allows two independent clients R1 and R2 to share a procedure R, a matching done event being returned on either D1 or D2 when the procedure is done; the ARBITER performs the mutual exclusion function.]
An event controlled storage element is the Capture-Pass latch, depicted in figure 2g.
The latch is controlled by two control signals, namely Capture (C) and Pass (P). Initially
the latch is in its transparent state, where the input is connected through to the output (i.e.
Din = Dout). When an event is issued on the Capture wire the input-output connection is
interrupted, the data is “latched”, and an event is issued on the Cd (Capture done) signal to
indicate the change of state in the latch (i.e. from transparent to opaque); the latched data
does not change with subsequent data input changes. When an event arrives on the Pass
wire, the input is connected back through to the output, thus making the latch transparent
again; this change is indicated by an event on the Pd (Pass done) signal. The capture-pass
cycle may repeat, with events arriving alternately on the C and P wires.
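The alternating C/Cd/P/Pd discipline maps naturally onto a sequence of channel communications. The following fragment is a minimal illustrative sketch of this event sequence (the data path is omitted and all names are assumed; it is not part of the occarm model):

PROC capture.pass (CHAN OF BOOL c, cd, p, pd)
  BOOL event:
  WHILE TRUE
    SEQ
      c ? event        -- Capture event: the latch turns opaque
      cd ! TRUE        -- Capture done: signal the change of state
      p ? event        -- Pass event: the latch turns transparent again
      pd ! TRUE        -- Pass done
: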
[Figure: A micropipeline (a) without and (b) with data processing. Register stages are connected by request (Rin, R1, R2, ...) and acknowledge (Ain, A1, A2, ...) signals; in (b), combinational LOGIC blocks are placed between stages and DELAY elements in the request path match the processing delays.]
request and acknowledge signals moving in opposite directions. CSP overcomes this
problem:
[Figure: A micropipeline register stage together with its control logic and data processing element (DPE), connected to its neighbours by RDin and RDout request/data bundles.]
for the control/data processing logic; the control logic and the DPE may be modelled as
one process, with the DPE being a procedure called by the control process.
Figure 5 illustrates the register occam model. The model makes use of two PAR
statements, one to model the Muller-C element and one to model the fork on the Ain/Rout
wire. Two channels are used in each direction, one for the data/request bundle and one for
the acknowledgement signal. The latter is required to keep the register processes tightly
coupled and synchronised, as otherwise the control process would act as a buffer,
introducing an extra pipeline stage in the model that does not exist in the physical system
[53, 55]. In the case of pipelines without processing, the acknowledgement channel is not
required.
A multi-stage Micropipeline may be modelled by means of a parallel replication of the
register process.
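As an indication of the shape of this model (the figure 5 listing itself aside), the following is a hedged sketch of a register process for a pipeline without processing, assuming an INT data bundle and BOOL acknowledgements; all channel names are illustrative:

PROC register (CHAN OF INT rd.in, rd.out, CHAN OF BOOL ack.in, ack.out)
  INT data:
  BOOL ack:
  SEQ
    rd.in ? data             -- the stage starts empty: accept the first bundle
    WHILE TRUE
      SEQ
        -- for a processing stage, a DPE procedure would be applied
        -- to data at this point, e.g. dpe (data)
        PAR                  -- the fork on the Ain/Rout wire
          ack.in ! TRUE      -- acknowledge the predecessor (Ain)
          rd.out ! data      -- forward the request/data bundle (Rout)
        PAR                  -- the Muller-C element: wait for both events
          rd.in ? data       -- next request/data bundle (Rin)
          ack.out ? ack      -- acknowledgement from the successor
:

The parallel replication mentioned above then takes the familiar occam form (a fragment, with n the number of stages):

[n+1]CHAN OF INT bundle:
[n+1]CHAN OF BOOL ack:
PAR i = 0 FOR n
  register (bundle[i], bundle[i+1], ack[i], ack[i+1])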
The control logic is inherently concurrent; different parts of the circuit operate concur-
rently while, within each part, events take place in a deterministic sequential order, i.e. the
control logic implements a partial ordering of events. The simulation model should have the
same degree of concurrency as the physical circuit. The control logic may be implemented
as a network of communicating processes, with the occam PAR (parallel execution) and
SEQ (sequential execution) constructs being used within each process to implement the
partial ordering of events of the circuit.
[Figure: AMULET1 and its memory interface: address and control out, read data in, together with the FIQ, IRQ, Dabt0, Dabt1 and initialise signals.]
The number of these processes depends on the degree of modularity and fidelity required in the simulation model.
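For instance, a control block in which event a must precede event b, while event c is unordered with respect to both, maps directly onto these constructs (a fragment with illustrative channel names):

PAR
  SEQ              -- a strictly precedes b
    out.a ! TRUE
    out.b ! TRUE
  out.c ! TRUE     -- c is concurrent with both a and b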
[Figure: The internal organisation of AMULET1, comprising the Address Interface (AddC, incrementer, PC Pipe, PC holding latches), the Data Interface (Dout, DataIn, byte alignment/replication, MemCP FIFO, IPipe), the Primary Decode (Dec2, Dec3, NGen, Immediate Pipe, LSM primary), the Register Bank and the Execution Unit (multiplier, shifter, ALU, CPSR logic), interconnected via the W bus.]
in the figures) and acknowledgement (dotted arrows) channels. To illustrate the modelling
of control logic, figures 11 and 12 depict the modelling of one of the control modules of
the Address Interface (AddC).
For a detailed description of the structure and operation of occarm the reader is referred
to [53].
[Figure: The process graph of the occarm model. Single channels model wires and channel pairs (Req/Data + Ack) model bundles; the processes include Decode1, Decode2, Decode3, RegBank, NGen, ImmPipe, APipe, XPipe, DatInt, AddInt, WrtCtrl and their associated buffers, with Dabt0, Dabt1, ALUgo, PCcol, LsmTrm and Wlx among the connecting signals.]
intense. Indeed, if the simulated time is not the synchronisation agent in the simulator,
different event orderings may only be achieved by using occam real-time Timers to change
the relative scheduling of the occam processes [35]. The major drawback of this approach
is that small delays cannot guarantee the intended effects and behaviour in the model, as
these delays are only approximate. Furthermore, real-time delays have a direct effect on the
performance of the simulator: large delays that would guarantee the planned process
scheduling would also severely degrade the simulator’s performance.
It has been shown [38] that a distributed system consisting of asynchronous concurrent
processes will not violate the causality principle if each merge process consumes and
processes event messages in non-decreasing timestamp order (the local causality constraint);
the processing of an out-of-order event is referred to as a preemption.
In a Micropipelined architecture, micropipelines may be merged in one of the following
ways:
Synchronous merge. A functional module has to wait for all input data to become
available before it starts its operation. This is the case when a Muller-C element is used for
the corresponding request events. In the simulation model, the occam process has to wait
for all input channels to “fire” and therefore no preemptions occur, as the process is in a
position to select the message with the smallest timestamp.
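In occam terms, a synchronous merge might be sketched as follows, using the same pseudo-calls as the arbiter listings later in the paper (timestamp and process are assumed helpers):

SEQ
  PAR                          -- the Muller-C: both inputs must fire
    in.1 ? msg1
    in.2 ? msg2
  IF                           -- consume in non-decreasing timestamp order
    timestamp (msg1) <= timestamp (msg2)
      SEQ
        process (msg1)
        process (msg2)
    TRUE
      SEQ
        process (msg2)
        process (msg1)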
[Figure: The internal process structure of the occarm modules: (a) AddInt (AddC, MAReg, MRReg, LSMreg, MemCtrl, PC Pipe, PC holding latches, incrementer, DInCtrl); (b) DatInt (MemCP, IPipe, DestCtrl, Dout, DataIn); (c) Decode1 (Dec1CtrlA, Dec1CtrlB, RdGen, ReadLock, LSMPr, ImmPipe); (d) RegBank (ReadCtrl, WrtCtrl, register cells, AFifo, MFifo, OReg); (e) Decode3 (Dec3Ctrl, Ctrl3); (f) Decode2 (Dec2Ctrl, NGenCtrl, CPreg, OPreg, RReg).]
Data dependent merge. The functionality of the system dictates the order in which
messages from different source processes should be consumed and processed. This situation
is implemented in hardware using a combination of a Select and a Call or Xor. The process
in this case behaves as a single-input module, hence causality is not violated.
[Figure 11: The flow graph of the AddC control module, comprising blocks P1 and P2.1 to P2.6.]
[Figure 12: The occam skeleton of the AddC control module:]

SEQ
  ALT                              -- ARBITER
    PCr ? ...
      P1                           -- PC loop
    Wr ? ...
      IF                           -- P2.1
        FALSE
          SEQ
            P2.2                   -- data transfer
        TRUE
          SEQ
            IF                     -- P2.3
              TRUE
                SEQ
                  WHILE Ntrm = FALSE   -- LSM loop
                    SEQ
                      P2.4
                      P2.5
              FALSE
                SEQ
                  P2.6             -- data transfer: APipe
:
[Figure: (a) An arbitrated merge: an Arbiter and a Call allow two request/data bundles, RDin1 and RDin2, to share the processing logic, the result leaving on RDout; (b) a fragment of the corresponding occam process, which serves RDin1 and RDin2 and processes in.data from either link.]
Arbitrated merge. The order of arrival defines the order of consumption. If events
from two micropipelines arrive at the same time, an arbitrary choice is made.
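In the model, the occam ALT construct mirrors this behaviour directly; a sketch using the channel names of the figure above:

WHILE TRUE
  ALT                          -- the hardware arbiter: order of arrival wins
    RDin1 ? in.data
      SEQ                      -- the shared procedure (the Call block)
        process (in.data)
        RDout ! in.data
    RDin2 ? in.data
      SEQ
        process (in.data)
        RDout ! in.data

It is precisely this consumption in order of arrival, rather than in timestamp order, that gives rise to preemptions.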
of the AMULET1 pipelines, while figure 14 illustrates the distribution of preemptions over
time, namely the number of times that a preemption was detected (the preemption count)
and the corresponding accumulated preemption magnitude for each of the three arbiters of
AMULET1.
[Figure 14: Preemption counts and accumulated preemption magnitudes against simulated time (ns), including panels (b) AddInt and (c) WrtCtrl.]
the computation and communication semantics of the latter, an exploitation which implies
a data-driven operation of the occam processes in the model. Furthermore, in the case
of asynchronous hardware models, preemption problems are restricted to the arbiter
processes of the model; the deployment of existing generic protocols would therefore
impose unnecessary overhead and complexity on the model.
The Program Driven Synchronisation Protocol introduced in this paper aims at enforcing
temporal coherency in CSP/occam asynchronous hardware models while adhering to the
data-driven operation of the processes in the model.
The Instruction Lookahead Set of any particular link in the system is directly defined by
the architecture’s specification and thus, may become available to the arbiter processes of
the simulation model in advance. Based on the ILS of their input links, arbiter processes
may directly make decisions regarding the potential arrival of messages, provided of course
that they are also informed of the instructions being executed in the system.
Rule 8.1. An arbiter process is allowed to block and wait for an event on its input
link l during the execution of an instruction I if and only if I ∈ ILS_l.
The above rule ensures that arbiter processes block only for instructions that are likely to
generate the corresponding events. However, depending on the status of the system, such
an event might not occur during the instruction’s execution; in this case Null messages are
required, otherwise the arbiter process will remain blocked and the simulation model will
deadlock. The following rule is concerned with the production of Null messages:
Rule 8.2. A Null message will be sent to link l of an arbiter process if and only
if the arbiter expects an event on l based on the ILS_l, and for the current state of the system the
event will not be generated.
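On the sender’s side, Rule 8.2 might take the following shape (a sketch: the predicates instr.in.ILS and event.will.occur stand for tests of the executed instruction and of the state of the simulated system, and null.msg carries only a timestamp):

IF
  instr.in.ILS AND event.will.occur
    link ! real.msg            -- the normal event message
  instr.in.ILS AND (NOT event.will.occur)
    link ! null.msg            -- Null message: unblocks the arbiter (Rule 8.2)
  TRUE
    SKIP                       -- the arbiter is not waiting on this link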
The two rules above specify the behaviour of arbiter processes and their peers when
their interaction depends on the executed instructions. However, not all events in an
asynchronous system occur in an instruction dependent fashion. Indeed, certain parts
of the system may operate autonomously, irrespective of which instructions are being
executed; the PC loop in the AMULET1 processor is an example of such an autonomous
unit. In this case it is the state of the simulated system that dictates the behaviour of the
arbiter process:
Rule 8.3. An arbiter process is allowed to block and wait for an event on its input
link l which fires in an instruction-independent way, if and only if the state of the system
guarantees that a message will be issued on l.
PROC PDSP_Arbiter()
  PROC Select(msg1, msg2)
    SEQ
      IF
        msg2_pending=TRUE              -- both messages held: compare timestamps
          SEQ
            IF
              timestamp(msg1)<timestamp(msg2)
                SEQ
                  process(msg1)
                  msg1_pending:=FALSE
              timestamp(msg1)>timestamp(msg2)
                SEQ
                  process(msg2)
                  msg2_pending:=FALSE
              timestamp(msg1)=timestamp(msg2)
                SEQ                    -- simultaneous events: arbitrary choice
                  make_random_selection(msg1,msg2)
        msg2_pending=FALSE
          SEQ
            IF
              msg2_expected=TRUE       -- the ILS predicts an event on In2:
                SEQ                    -- block until it (or a Null message) arrives
                  In2?msg2
                  msg2_pending:=TRUE
              msg2_expected=FALSE      -- no event predicted on In2: msg1 is safe
                SEQ
                  process(msg1)
                  msg1_pending:=FALSE
  :
  WHILE TRUE
    SEQ
      IF
        msg1_pending=TRUE
          SEQ
            Select(msg1,msg2)
        msg2_pending=TRUE
          SEQ
            Select(msg2,msg1)
        TRUE                           -- nothing held: accept whichever input fires
          SEQ
            ALT
              In1?msg1
                msg1_pending:=TRUE
              In2?msg2
                msg2_pending:=TRUE
:
However, it does not ensure that the concurrency of the simulated system is sufficiently
exploited to increase the potential of the simulation model for high performance.
Indeed, as soon as it predicts that a message is expected on one of its input links, say In2, the
arbiter process will stop accepting messages arriving on its other input link In1 until
the expected message on In2 arrives. As a consequence, all the processes that are part of
the path that leads to In1 will block and wait, and the pipelines at the output side of the
arbiter process will starve; during this time, large parts of the simulator will remain idle.
A solution to this problem is to provide arbiter processes with some indication as to
when in the simulated future a message they expect will actually arrive. This information
would enable them to consume a number of events occurring on the In1 link before they block
on In2, thus increasing the concurrency of the simulation model.
This information can be obtained by taking into account the propagation delays in the
architecture being simulated. An event generated by an instruction will propagate through
a number of pipeline stages before it reaches an arbiter. The path followed by the event
is completely defined by its parent instruction; the latency of the path at any
particular time, however, depends on the number of elements in the micropipelines involved
and is thus non-deterministic.
Consequently, it is not feasible to know in advance the exact time required for an event
to propagate through a given micropipeline. However, there is a lower bound to this time,
namely the latency of the micropipeline when, at the moment of the event’s entry, it is
empty. Based on this observation, a distance-based [4] form of lookahead may be defined,
namely the Minimum Latency Lookahead:
Definition 8.2. The Minimum Latency Lookahead of a link l, MLL_l, is defined
as the total propagation delay of the path leading to l when the pipelines of the path are
empty:

MLL_l = Σ d_i, where d_i is the propagation delay of the i-th pipeline stage in the path.
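For example, a path of three empty pipeline stages with propagation delays of 2, 3 and 5 ns gives MLL_l = 10 ns: a message entering the path at simulated time t cannot arrive on l with a timestamp smaller than t + 10 ns.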
The Instruction Lookahead Set of a link informs the corresponding arbiter process
whether a message should be expected on that link; the Minimum Latency Lookahead
reveals when in the simulated future the expected message may arrive. Based on the
Minimum Latency Lookahead, the following rule may be specified:
Rule 8.4. An arbiter process will not process a message m1 received on its input
link l1, but will instead block and wait for a message m2 expected on its other input link
l2, if and only if the timestamp of m2 as predicted by MLL_l2 is less than or equal to
the timestamp of m1.
Rule 8.4, combined with the ALT statement in the main loop of the PDSP arbiter process,
will enable arbiter processes to process messages with appropriate timestamps as soon as
they arrive.
Figure 16 depicts the Select process when the MLL is taken into account.
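A sketch of the corresponding logic, assuming the arbiter maintains predicted.t2 (the entry time of the generating instruction plus the MLL of the In2 path) as its prediction for the expected message, and compare.and.process abbreviates the timestamp comparison of the Select procedure above:

IF
  msg1_pending AND msg2_pending
    compare.and.process (msg1, msg2)
  msg2_expected AND (predicted.t2 <= timestamp (msg1))
    SEQ                        -- Rule 8.4: m2 may precede m1, so block for it
      In2 ? msg2
      msg2_pending := TRUE
  TRUE
    SEQ                        -- m2 cannot precede m1: process m1 at once
      process (msg1)
      msg1_pending := FALSE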
[Figure 17: The AMULET1 memory interface in occarm: PC values arriving on PCch flow from AddC through MAReg and MemCtrl to memory, while acknowledgements and piggybacked instructions (PCk/Ik pairs) return towards the IPipe via MRReg; Wch carries values from WrtCtrl.]
[Figure 18: The PC loop: the PC Pipe continuously recirculates R15 values through AddC and supplies them to the datapath, with Wch coupling the datapath (W) back to AddC.]
the rest of the processor. Thus, on the PCch channel there will be a continuous, instruction
independent flow of messages.
The role of the PC Pipe is to provide the processor’s datapath with the R15 values
required for the execution of instructions as depicted in figure 18. If, for any reason, the
datapath stalls, instructions will start to backlog and as a result, the PC Pipe will become
full and will remain so for as long as the datapath stalls. During this period no further PC
values will be allowed to enter the PC Pipe and thus no Acknowledgment signal will be
issued on the PCch channel. The datapath may stall in the following cases, as illustrated in
figure 19:
If the datapath fills up; this will occur as a result of the Decode3 and WrtCtrl processes
waiting for the abort and Wlx signals respectively.
If an ILS_Wch instruction is followed by register read operations which refer to locked
registers.
If an ILS_Wch instruction is followed by instructions which activate the ALUgo signal.
During the execution of load/store multiple instructions.
In order to avoid deadlock situations, it is essential that AddC be able to decide whether
it should wait for yet another message from PCch or whether the PC Pipe has become full
and thus no more messages will be sent on the PCch link (PDSP Rule 8.3). In order to do
that, AddC needs to possess information regarding the possible invalidation of instructions
that have entered the system. This information is provided by both Decode1 and Decode3
by means of extra messages sent via dedicated, buffered links that have been introduced in
the occarm model.
Decode3 informs AddC of the possible changes in the value of the Current Processor
Status that could result in the rejection of instructions, as illustrated in figure 20a.
The messages issued by Decode1 (figure 20b) aim to inform AddC of the exact time of
arrival of the PCcol signal from Decode3, whereupon instructions may start being rejected
in Decode1.
[Figure 19: The stall paths of the datapath: Decode1 and Decode3 (with the ALUgo, R15, Dabt1 and LsmTrm signals), the RegBank, RReg, WReg and WrtCtrl, with Wch and Wlx linking back to AddC.]
Messages arriving on the Wch channel are primarily sent to AddC from the datapath (through
WrtCtrl) as a result of the execution of ILS_Wch instructions and carry either branch target
or data transfer addresses. For data transfer operations whose destination register is R15, a
second message will be sent over Wch, namely the new value of the Program Counter from
memory. According to the PDSP algorithm (PDSP Rules 8.1 and 8.4), when AddC detects
an ILS_Wch instruction it blocks until it receives the corresponding messages on Wch. If
however the ILS_Wch instruction is not executed or the memory fails to respond (i.e.
an abort occurs), the expected messages will never be issued, thus leaving AddC blocked
and causing the simulator to deadlock. There are two reasons why an instruction may
not be executed, namely if its colour does not match that of the processor or if it fails its
condition codes.
All the instructions whose execution may change the operating colour of the processor,
either by explicitly writing a new value to the PC (i.e. the branch target address) or by
causing an abort, belong to the Instruction Lookahead Set of the Wch channel. Thus, an
ILS_Wch instruction will suffer a colour mismatch only if it follows another ILS_Wch
instruction which has changed the processor’s colour. For instructions that explicitly change
the PC, the new colour is provided to AddC with the branch target address, making the decision
as to whether an ILS_Wch instruction will be discarded straightforward. If however the colour
changes due to an abort, AddC has no direct knowledge of this change; Decode3
will receive the abort signal from memory and will change the PCcol, rejecting subsequent
instructions. In this case a Null message must be sent by Decode3 to inform AddC of the
occurrence of an abort and the colour change (PDSP Rule 8.2).
As described in the previous section, AddC is provided by Decode3 with all the CPS-related
information required to predict the fate of subsequent instructions regarding their
condition codes; this is performed via the dedicated link and lasts until a result is produced
by Decode3 and forwarded to RReg, as illustrated in figure 20a. Thus, for subsequent
ILS_Wch instructions that fail their condition codes, Null messages are required to be sent
by Decode3 to inform AddC of this event. Once a Null message is sent to AddC, no more
messages of this kind will be issued for ILS_Wch instructions subsequently invalidated
in Decode3, until a valid instruction is executed. This pattern will be followed until the
next ILS_Wch instruction to produce a result is encountered, whereupon the production of
CPS-related messages to AddC will recommence. This is illustrated in figure 20c, where a
complete picture of the interaction between Decode3 and AddC is provided.
[Figure 20: The interaction between Decode3, Decode1 and AddC: (a) instruction valid bits and colours in the CPS-related messages sent to AddC; (b) the messages issued by Decode1 around the arrival of the PCcol signal; (c) the complete Decode3-AddC message pattern. The sets used in the figure are S1 = {TST, TEQ, CMP, CMN, MSR, MRS}; S2 = {I : the execution of I produces a result to RReg}; ILS1_Wch = {I : I does not activate aborts and, if invalidated in Ctrl3, I still produces a message to AddC}; ILS2_Wch = {I : no message is sent to AddC if I is invalidated in Dec3}.]
[Figure: (a) the I/O and memory processes of the simulation testbed, connected to the model via the monitor[] channels; (b) the placement of the occarm and monitoring processes on the host transputers T0 to T8.]
architectures¹. Like all synthetic benchmarks, Dhrystone tries to match the average behaviour
(i.e. the average frequency of operations and operands) of a large set of real programs; thus,
the results obtained may be considered representative of the average behaviour of occarm
and PDSP too.
TABLE 1
Performance of occarm - Benchmark: Dhrystone (1 loop)
Arbiter Model Elapsed Time (minutes) Speedup
Occarm (single) Occarm (multi)
Table 1 presents the performance results achieved for the different occarm configurations
and arbiter models used. The results represent the mean values derived from ten runs of
the simulators in question.
Clearly, the best performance results are achieved when no attempt is made to enforce
temporal precision in the model (the poor speedup achieved by the distribution of occarm
onto the multiple processors of the T-Rack is attributed to the characteristics of both the
simulated architecture and the host machine, and has been analysed in detail in [53, 55]).
1 A discussion on the different benchmarks used in connection with computer architecture research may be
found in e.g. [31] pp. 45-48.
The deployment of PDSP and the incorporation of preemption-free arbiter models in the
occarm model, with no attempt to exploit the MLL, has resulted in a 20% and a 16.39% decrease
in the performance of the single and multiple transputer configurations of the simulator
respectively. The difference in performance reduction between the two configurations is
attributed to the fact that the distribution of processes onto multiple transputers reduces the
context switching overhead and the waiting times of blocked processes.
When the Minimum Latency Lookahead is exploited, smaller performance reductions of
15.27% and 12.8% respectively are observed for the single and multiple transputer configurations
of the simulator. The impact of the distribution of processes onto multiple
transputers on the simulation performance is smaller here, since the exploitation of
the MLL enables processes to be scheduled for longer periods, thus alleviating some of the
context switching.
10. EPILOGUE
Asynchronous logic is being viewed as an increasingly viable alternative digital design
approach which promises to liberate VLSI systems from clock skew problems, offer the
potential for low power and high performance and encourage a modular design philosophy
which makes incremental technological migration a much easier task. The advent of
easily available custom ASIC technology in the form of VLSI or FPGAs has greatly
facilitated the implementation of asynchronous circuits. However, asynchronous logic
has to overcome several obstacles before it is established as a mainstream digital design
style. One such obstacle is the lack of modelling languages and simulation techniques
suitable for asynchronous design. Fundamentally, conventional, sequential, synchronous
hardware description languages are not suitable for describing concurrent non-deterministic
asynchronous behaviour.
The concurrent, asynchronous, process-based model of computation of CSP and occam,
with its support for non-deterministic behaviour and its point-to-point, synchronous,
unbuffered inter-process communication, is particularly suitable for describing the
concurrent, non-deterministic behaviour of asynchronous hardware systems and provides a
natural and convenient means for the rapid construction of asynchronous hardware models.
However, the exploitation of the relationship between CSP/occam and asynchronous
hardware systems trades temporal accuracy for ease of modelling. Although the topological
characteristics of the modelled hardware system map naturally onto the model, the temporal
characteristics do not, a situation that may lead to causality errors in the model.
Addressing this problem, this paper has introduced the Program Driven Synchronisation
Protocol (PDSP), a generic theoretical framework for the development of accurate arbiter
processes which eliminate preemptions, while preserving the data-driven philosophy of the
CSP/occam architectural models.
PDSP is based on the conservative, deadlock avoidance approach. It exploits the
characteristics of the simulated architecture to enable arbiter processes to predict their simulated
future based on the instructions executed by the model; the term “Instruction Lookahead”
has been introduced to refer to this concept. The exploitation of Instruction Lookahead
ensures that Null messages are issued only when necessary, and are directed only to arbiter
processes; this departs from conventional conservative techniques, where typically all
processes are required to handle and generate a regular flow of Null messages throughout the
model.
The success of PDSP depends on the ability to exploit the “Instruction Lookahead”
properties of the simulated architecture. The application of PDSP to the occarm
simulation model has demonstrated that such an exploitation is feasible, even for systems
of AMULET1’s complexity. However, it requires extensive knowledge of the operation
of the simulated architecture. This may be considered the major drawback of PDSP, though
it is a problem typical of the conservative framework and not specific to PDSP.
APPENDIX A
TABLE 1
AMULET1 Pipeline Occupancy (Dhrystone (1 loop))
Pipe Size Occupancy (%)
1 Item 2 Items 3 Items
Single Multi Asim Single Multi Asim Single Multi Asim
Ipipe 5 25.35 21.66 14.78 35.53 33.04 26.07 27.13 30.71 37.37
PCPipe 2 37.72 34.47 31.10 47.62 53.31 60.68
Xpipe 3 37.13 36.88 40.30 9.24 9.43 10.59 0.55 0.92 1.20
Dec1.RegB 1 74.60 77.04 52.43
Dec1.RegA 1 7.99 7.85 5.04
Dec2.OP 1 89.62 90.44 69.06
Dec2.Rsh 1 0 0 0
NGen 2 10.05 10.36 3.53 3.95 3.87 0
RB.OReg 1 50.91 52.27 40.75
RB.Wreg 1 78.40 79.62 63.35
RB.Ireg 1 47.20 50.10 61.80
RB.Afifo 3 37.24 37.48 36.17 1.09 1.10 2.56 0 0 0
RB.Mfifo 4 22.63 22.45 22.90 3.04 3.27 2.91 0.69 0.65 0.77
Dec3 3 27.70 25.65 NA 52.26 52.36 NA 12.95 15.36 NA
CPreg 1 57.67 57.83 65.62
OPreg 1 56.21 56.42 55.41
MemCP 5 56.75 57.55 60.21 7.88 7.76 2.87 0 0 0
MRReg 1 32.64 33.30 22.39
MAReg 1 57.29 57.49 42.17
Dout 3 10.31 10.48 13.09 0.43 0.44 0.09 0 0 0
PCHLat 2 95.65 95.67 90.11 0.60 0.77 0.76
LSMreg 1 6.79 6.70 4.42
APipe 2 5.17 5.30 7.65 0.17 0.16 0.20
WReg 1 14.31 15.23 18.71
DataIn 1 8.23 7.64 5.94
RReg 1 29.25 29.14 23.76
ImmPipe 2 40.17 41.32 27.63 13.15 14.30 4.13
TABLE 2
AMULET1 Pipeline Stalls (Dhrystone (1 loop))
Total Number Stall Period (ns)
Pipe of Messages Average Min
Single Multi Asim Single Multi Asim Single Multi Asim Single
REFERENCES
1. F. A. Almeida, P. H. Welch, A Parallel Emulator for a Multiprocessor Dataflow Machine, in “Proceedings of
the World Transputer Congress 1994", Como, 1994, pp. 259-272.
2. J. M. Alonso, et al., Conservative Parallel Discrete Event Simulation in a Transputer-Based Multicomputer,
in “Proceedings of the World Transputer Congress 1993", Aachen, 1993, pp. 636-650.
3. The AMULET Group, World Wide Web Home Page, URL: http://www.cs.man.ac.uk/amulet/index.html
4. R. Ayani, A Parallel Simulation Scheme Based on the Distance Between Objects, in “Proceedings of the 1989
SCS Multiconference on Distributed Simulation", SCS Simulation Series, 1989, pp. 113-118.
5. G. Birtwistle, A. Davis, eds., “Asynchronous Digital Circuit Design", Springer-Verlag, 1995.
6. J. A. Brzozowski, C.-J. H. Seger, “Asynchronous Circuits", Springer-Verlag, 1995.
7. E. Brunvand, M. Starkey, An Integrated Environment for the Design and Simulation of Self Timed Systems,
in “Proceedings of VLSI 1991", 1991, pp. 4a.2.1-4a.3.1.
8. E. Brunvand, The NSR Processor, in “Proceedings of the 26th Annual Hawaii International Conference on
System Sciences", Maui, Hawaii, 1993, pp. 428-435.
9. S. M. Burns, A. J. Martin, Synthesis of Self-Timed Circuits by Program Transformations, “Technical Report
5253:TR:87", Computer Science Department, Caltech, 1987.
10. P. C. Capon, J. R. Gurd, A. E. Knowles, ParSiFal: A Parallel Simulation Facility, “IEE Colloquium: The
Transputer: Applications and Case Studies", IEE Digest, 1986/91, 23rd May 1986.
11. K. M. Chandy, J. Misra, Distributed Simulation: A Case Study in the Design and Verification of Distributed
Programs, IEEE Transactions on Software Engineering, 5, 5, September 1979, pp. 440-452.
12. K. R. Cho, K. Okura, K. Asada, Design of a 32-bit Fully Asynchronous Microprocessor (FAM), in “Proceedings
of the 35th Midwest Symposium on Circuits and Systems", Washington D.C., 1992, pp. 1500-1503.
13. T. A. Chu, “Synthesis of Self-timed VLSI Circuits from Graph-Theoretic Specifications", Ph.D. Thesis
(MIT/LCS/TR-393), M.I.T., June 1987.
14. W. A. Clark, C. E. Molnar, Macromodular Computer Systems, in R. W. Stacy, B. D. Waxman, eds., “Biomedical
Research", Academic Press, 1974, Chapter 3.
15. Cogency Technology Inc., World Wide Web Home Page, URL: http://www.cogency.com
16. J. Cortadella et al., Petrify: A Tool for Manipulating Concurrent Specifications and Synthesis of Asynchronous
Controllers, IEICE Transactions on Information and Systems, E80-D(3), March 1997, pp. 315-325.
17. A. Davis, The Architecture and System Method of DDM1: A Recursively Structured Data-Driven Machine, in
“Proceedings of the 5th Annual Symposium on Computer Architecture", Palo Alto, CA, 1978, pp. 210-215.
18. A. Davis, S. M. Nowick, Synthesizing Asynchronous Circuits: Practice and Experience, in [5], pp. 104-150.
19. M. E. Dean, STRiP: A Self-Timed RISC Processor, “Technical Report CSL-TR-92-543", Computer Systems
Laboratory, Stanford University, July 1992.
20. D. L. Dill, Trace Theory for Automatic Hierarchical Verification of Speed-Independent Circuits, ACM Distin-
guished Dissertations, MIT Press, 1989.
21. J. C. Ebergen, A Formal Approach to Designing Delay-Insensitive Circuits, Distributed Computing, 5 (3),
1991, pp. 107-119.
22. C. J. Elston, et al., Hades - Towards the Design of an Asynchronous Superscalar Processor, in “Proceedings
of the 2nd Working Conference on Asynchronous Design Methodologies", London, 1995, pp. 200-209.
23. P.B. Endecott, S.B. Furber, Modelling and Simulation of Asynchronous Systems using the LARD Hardware
Description Language, in “Proceedings of the 12th European Simulation Multiconference", Manchester, 1998,
Society for Computer Simulation International, pp. 39-43.
24. A. Ferscha, S. K. Tripathi, Parallel and Distributed Simulation of Discrete Event Systems, “Technical Report
CS.TR.3336", University of Maryland, August 1994.
25. R. Fujimoto, Parallel Discrete Event Simulation, Communications of the ACM, 33(10), 1990, pp. 31-53.
26. S. B. Furber, Computing Without Clocks, In [5], pp. 211-262.
27. S. B. Furber et al., AMULET2e: An Asynchronous Embedded Controller, in “Proceedings of Async ’97
Conference", IEEE Computer Society Press, 1997, pp. 290-299.
28. J. D. Garside et al., AMULET3 Revealed, in “Proceedings of Async ’99 Conference", IEEE Computer Society
Press, 1999, pp. 51-59.
29. G. Gopalakrishnan, V. Akella, Specification, Simulation, and Synthesis of Self-Timed Circuits, in “Proceedings
of the 26th Hawaii International Conference on System Sciences", 1993, pp. 399-408.
56. C. H. Van Berkel, J. Kessels, M. Roncken, R. Saeijs, F. Schalij, The VLSI-Programming Language Tangram
and its Translation into Handshake Circuits, in “Proceedings of EDAC", 1991, pp. 384-389.
57. “VERSIFY Release 2.0", Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya,
Barcelona, Spain, November 1998, World Wide Web Home Page, URL: http://www.ac.upc.es/vlsi/versify/
58. R. P. Weicker, Dhrystone, A Synthetic Systems Programming Benchmark, Communications of the ACM,
27(10), 1984, pp. 1013-1030.
59. T. Werner, A. Venkatesh, Asynchronous Processor Survey, IEEE Computer, 30(11), 1997, pp. 67-76.
60. J. V. Woods, P. Day, S. B. Furber, J. D. Garside, N. C. Paver, S. Temple, AMULET1: An Asynchronous
ARM Microprocessor, IEEE Transactions on Computers, 46(4), 1997, pp. 385-398.