
Distributed Simulation of Asynchronous Hardware:

The Program Driven Synchronisation Protocol


Georgios K. Theodoropoulos
School of Computer Science
The University of Birmingham
Birmingham B15 2TT
United Kingdom
E-mail: gkt@cs.bham.ac.uk

Synchronous VLSI design is approaching a critical point, with clock distribution becoming an increasingly costly and complicated issue and power consumption rapidly emerging as a major concern. Hence, recently, there has been a resurgence of interest in asynchronous digital design techniques which promise to liberate digital design from the inherent problems of synchronous systems. This activity has revealed a need for modelling and simulation techniques suitable for the asynchronous design style. The concurrent process algebra Communicating Sequential Processes (CSP) and its executable counterpart, occam, are increasingly advocated as particularly suitable for this purpose. However, the parallel, distributed semantics of CSP and occam introduce synchronisation problems in the model. This paper presents the Program Driven Synchronisation Protocol, which seeks to address causality and synchronisation problems and enforce temporal coherency in distributed CSP/occam models of asynchronous hardware systems.

Key Words: asynchronous hardware, CSP, occam, modelling, distributed simulation, synchronisation

1. INTRODUCTION
A digital system is typically designed as a collection of subsystems, each performing a
different computation and communicating with its peers to exchange information. Before
a communication transaction takes place, the subsystems involved need to synchronise,
namely to wait for a common control state to be reached, which guarantees the validity of
data exchanged.
In synchronous systems, the synchronisation of communicating subsystems is achieved
by means of a global clock whose transitions define the points in time when communication
transactions can take place. The operation of a synchronous system proceeds in lockstep,
with the different subsystems being activated to perform their computations in a strict,
predefined order [42]. Synchronous VLSI design however is approaching a critical point,
with clock distribution becoming an increasingly costly and complicated issue and power
consumption rapidly emerging as a major concern.
Another digital design philosophy allows subsystems to communicate only when it
is necessary to exchange information. The operation of the system does not proceed in
lockstep, but rather is asynchronous; each sub-system operates at its own rate synchronising
with its peers only when it needs to exchange information. This synchronisation is not
achieved by means of a global clock but rather, by the communication protocol employed.
This protocol is typically in the form of local request and acknowledge signals which
provide information regarding the validity of data signals.
Although asynchronous design techniques have been explored since, at least, the mid 1950s [44, 14, 17, 34], they have not hitherto been established as a major philosophy in digital design. This failure was mainly due to the difficulty of enforcing specific orderings of operations and of dealing with circuit hazards and dynamic states in an asynchronous, non-deterministic environment [30]. However, recently, there has been a resurgence of
interest in asynchronous design techniques, due to the significant potential benefits that the
elimination of global synchronisation may offer to issues such as clock distribution, power
consumption, performance and modularity [26].
Various asynchronous digital design techniques have been developed, which are typically
categorised by the timing model (namely, the assumptions made regarding the circuit and
signal delay), the signalling protocol (namely, the sequence of events which must take
place in a communication transaction between two elements), and the technique they
employ for the transfer of data between two elements (namely, encoding the value of
each bit transmitted during a communication transaction). In his influential 1988 Turing
award lecture, Ivan Sutherland introduced Micropipelines, a new conceptual framework
for designing asynchronous systems [51]. In-depth surveys of existing asynchronous methodologies may be found in [6, 30, 5]. Additionally, the online Asynchronous Logic Home Page maintained by the AMULET group at the University of Manchester provides continuous, up-to-date information regarding asynchronous systems research [3].
A number of asynchronous architectures have been developed, including one at Caltech [40], NSR [8] and Fred [47] at the University of Utah, STRiP at Stanford University [19], Sun's Counterflow pipeline processor [50], FAM [12] at Tokyo University, TITAC [45] at the Tokyo Institute of Technology, Hades at the University of Hertfordshire [22] and Sharp's Data-Driven Media Processor [49]. The AMULET group at the University of Manchester has developed a series of asynchronous implementations of the ARM RISC processor using Sutherland's Micropipelines. The AMULET1 [60] microprocessor was the first asynchronous implementation of a commercial instruction set; AMULET2e [27] sought to improve performance via an improved design, while AMULET3i [28] aims at embedded applications.
The quest for the exploitation of the potential advantages offered by asynchronous logic has revealed a need for modelling and simulation techniques appropriate for the asynchronous design style. Thus, the recent interest in asynchronous design has fuelled intense research activity aiming to develop techniques for modelling and simulating asynchronous systems. I-Nets [43], Petri Nets (e.g. the Petrify tool [16]), Signal Transition Graphs (STGs) [13] (e.g. the Versify [57] and SIS [48] tools), State Transition Diagrams (e.g. the MEAT tools [18]), ST-V (the Self-Time Verilog developed by Cogency Technology Inc. [15]) and CCS (by Birtwistle et al. at the University of Leeds
[39]) are some of the tools and formalisms that have been employed in asynchronous logic design.

FIG. 1. The Two-Phase Bundled Data Interface Protocol
Communicating Sequential Processes (CSP) [32], in particular, the concurrent process algebra developed by Tony Hoare for the specification of parallel systems, has been extensively advocated as a suitable means for describing asynchronous behaviour. Several asynchronous modelling approaches and systems have been developed which use CSP-based notations, including Martin's [41, 40, 9], Hulgaard's [33] and Brunvand's [7] work, trace theory [21, 20], Delay-Insensitive algebra [37], Tangram [56], SHILPA [29] and LARD [23].
Contributing to this effort, and motivated by the increasing debate regarding the potential use of CSP for this purpose, this paper addresses timing and synchronisation issues arising from the parallel semantics of CSP when the latter is employed for the simulation of large and complex asynchronous systems. These issues have been largely overlooked by the advocates of CSP for this purpose and are addressed for the first time by the research presented in this paper. The research forms part of a larger project which has investigated the suitability of occam [35], the executable counterpart of CSP, for the modelling and simulation of asynchronous architectures, developing a generic modelling methodology for this purpose [53, 55]. The investigation targeted asynchronous systems based on Sutherland's Micropipelines, using the AMULET1 microprocessor as a testbed; however, the results may also be applied to other asynchronous design methodologies.
The rest of the paper is organised as follows: Section 2 provides a brief overview of Sutherland's Micropipelines. Section 3 provides a discussion of the role of simulation in asynchronous VLSI design, placing emphasis on the use of CSP and occam for this purpose. Section 4 outlines a generic framework which has been developed to support the rapid modelling of asynchronous micropipelined systems using CSP/occam, while section 5 discusses the application of this framework for the development of the occarm model of the AMULET1 microprocessor. Section 6 discusses timing and synchronisation problems in CSP/occam models of asynchronous systems and section 7 argues for the need for a new synchronisation approach to address these problems. Section 8 introduces such an approach and section 9 describes the application of this approach in the occarm model and presents performance results. Finally, section 10 summarises the conclusions drawn.

2. MICROPIPELINES
Using Micropipelines, the asynchronous architecture is designed as a set of simple, data processing elastic pipelines, whose stages operate asynchronously and exchange data via a two-phase bundled data handshake synchronisation protocol (figure 1).
FIG. 2. Event Processing Blocks and the Capture-Pass Storage Element: (a) Muller-C (provides the AND function for events); (b) Xor (provides the OR function for events); (c) Select (steers events according to its boolean input); (d) Toggle (steers events to its outputs alternately); (e) Call (allows two independent clients R1 and R2 to share a procedure R; when the procedure is done, a matching done event is returned on D1 or D2); (f) Arbiter (performs the mutual exclusion function); (g) the Capture-Pass storage element

Two-phase signalling recognises and responds to transitions of the voltage on a wire, regardless of whether the transition is rising or falling; a transition is referred to as an event.
Ivan Sutherland also proposed a set of event control blocks for the design of control
circuits in micropipelined systems as well as event controlled storage elements to be used
in such systems. The event control blocks include the Muller-C, Select, Call, Toggle, Xor
and the Arbiter (figure 2).
An event controlled storage element is the Capture-Pass latch, depicted in figure 2g. The latch is controlled by two control signals, namely Capture (C) and Pass (P). Initially the latch is in its transparent state, where the input is connected through to the output (i.e. Din = Dout). When an event is issued on the Capture wire, the input-output connection is interrupted, the data is "latched", and an event is issued on the Cd (Capture done) signal to indicate the change of state in the latch (i.e. from transparent to opaque); the latched data does not change with subsequent data input changes. When an event arrives on the Pass wire, the input is connected back through to the output, thus making the latch transparent again; this change is indicated by an event on the Pd (Pass done) signal. The capture-pass cycle may repeat, with events arriving alternately on the C and P wires.
FIG. 3. Micropipelines Without and With Processing Elements: (a) a simple FIFO Micropipeline; (b) a Micropipeline with processing logic and delay units between stages

The simplest micropipeline is a series of Capture-Pass registers connected together to form a FIFO structure, as depicted in figure 3(a). A micropipeline may perform processing
on the data, by interposing the necessary logic between adjacent register stages (figure
3(b)). A delay unit is typically used to slow down the request event and give the data
enough time to arrive at the register before the request, thus guaranteeing the validity of
data captured by the register (the bundled data constraint).

3. THE ROLE OF SIMULATION IN ASYNCHRONOUS DESIGN


Simulation modelling languages and tools for synchronous logic design have under-
pinned the development of ever more complex synchronous VLSI circuits. In the case
of asynchronous systems, the role of simulation is even more crucial as their concur-
rent, non-deterministic behaviour makes any attempt to reason about their correctness and
performance a very complicated task.
The concurrent behaviour of asynchronous systems renders them susceptible to dead-
locks, a problem typical in parallel systems but not an issue in systems synchronised by a
global clock and operating in lock step.
Furthermore, performance evaluation is a much more complex task in asynchronous systems than in their synchronous counterparts. In the latter, benchmark execution times
are easy to interpret based on the number of clock cycles and the existence of a critical path.
Delays in the critical path can determine the clock period while non-critical path delays
have no effect on the performance of the system. In contrast, the temporal behaviour in
asynchronous systems is much more difficult to understand and interpret as delay inter-
dependencies are much more complex. Delays in one module may often be masked by
occasional longer delays in another module, while the accumulation of delays through a
chain-reaction in a non-deterministic concurrent environment may have a chaotic effect on
system performance.
This complexity renders simulation an essential tool in the endeavour to gain an insight
and understanding of the behaviour of asynchronous systems.
The simulation of digital systems in general, and computer architectures in particular, has
long been categorized among the highly computation intensive applications. The same is
true for the simulation of asynchronous microprocessors. For the testing and evaluation of
the AMULET1 design, for instance, more than 4 million instruction cycles were simulated
[46], a number which corresponds to many hours of simulation. Hence, a parallel approach to simulation could contribute significantly to reducing the duration and cost of the design cycle.
Asynchronous hardware systems are an excellent candidate for distributed simulation.
The concurrent operation of the different subsystems of an asynchronous system, the in-
herent parallelism within each subsystem and the lack of any global synchronisation, are
characteristics which support the concurrent execution of events in a simulation model.
In his flashback simulation approach [52], Sutherland attempts to exploit these characteristics of asynchronous systems and allow "out-of-order" processing of events to increase simulation speed; however, his simulation retains its sequential nature and is intended for execution on conventional von Neumann computers.

3.1. CSP and occam


Although synchronous languages and tools can, and indeed have, been used for asynchronous hardware too (e.g. both VHDL and Verilog have been used for the modelling of the AMULET microprocessors), their application in that context is proving awkward and inefficient.
Assuming a correct implementation of the communication protocol, at the Register
Transfer or higher levels, an asynchronous system may be viewed as a network of concurrent
modules communicating via synchronous, unbuffered communication. The modules are
data-driven; each module will start computation as soon as data is available on its input
wires, and will signal when the result has been computed. At these levels of abstraction, it
would thus be desirable to be able to model the structure and behaviour of the system and
consider communication between any two modules as an abstract atomic action, without
explicitly describing the protocol involved.
Using a conventional synchronous hardware description language such as VHDL how-
ever, where the communication between functional blocks is modelled using signals, one
would have to explicitly describe the behaviour of both the data, and the request and ac-
knowledge signals utilised by the asynchronous protocol. This may prove difficult and
inefficient, e.g. when trying to model concurrent communications involving interleaved
request and acknowledge signals moving in opposite directions. CSP overcomes this
problem:

• CSP supports a concurrent, process-based, asynchronous, non-deterministic model of computation which exactly matches the structure and behaviour of hardware built using asynchronous logic.
• In CSP, the communication between different modules is by means of a point-to-point, synchronous and unbuffered channel. This behaviour directly reflects the interaction between subsystems in asynchronous hardware, where a sender and a receiver rendezvous before they physically exchange data via wires, which are memoryless media. Thus the asynchronous communication protocol is modelled implicitly in the semantics of the channel, as the sketch below illustrates.
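As a minimal illustration (the channel and variable names are assumptions for this sketch, not taken from the paper's models), the complete request-acknowledge-data exchange of figure 1 collapses into a single occam rendezvous; both parties block until the communication completes, which is precisely the synchronisation the hardware protocol provides:

  INT latch:
  CHAN OF INT bundle:
  PAR
    bundle ! 42      -- sender: request and data in one atomic action
    bundle ? latch   -- receiver: completion of the rendezvous doubles as the acknowledge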

The advantages of using occam as an asynchronous hardware description language have been presented and analysed in detail elsewhere [53, 54, 55]:

• Occam is primarily a general purpose programming language. Thus, a specification developed using occam is automatically an executable simulation model of the asynchronous system. No extra simulation engine is required.
• Occam forms a practical realisation of CSP and, consequently, maintains the strong relationship with regard to communication and computation between CSP and asynchronous systems.
• Occam allows explicit description of parallel as well as sequential computation. This explicit control of concurrency, which extends down to the command level, along with its simple but powerful syntax and "send" and "receive" commands, makes occam ideal for describing digital systems (indeed, occam has been employed for modelling digital systems at various levels, e.g. [1]).
• Occam is a parallel programming language and thus may be used to perform distributed simulation in order to achieve high simulation performance (occam has indeed been used as a programming language for building distributed simulations in different applications, e.g. [2]).

4. THE MODELLING FRAMEWORK


Within the proposed modelling framework, the asynchronous system is modelled at the Register Transfer Level as a network of concurrent occam processes, topologically identical to the asynchronous system, with each occam process corresponding to a different functional module of the system and communicating with its peers via timestamped messages. This approach is analogous to the "Logical Process Paradigm" typically employed in distributed simulation modelling [25, 24].
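As a hedged sketch of what such a message might look like (the protocol name and fields are illustrative assumptions, not occarm's actual declarations), each channel can be declared with a sequential protocol pairing a simulated timestamp with the data word being modelled:

  PROTOCOL BUNDLE IS INT; INT:     -- simulated timestamp; data/control word
  PROC Send(CHAN OF BUNDLE out, VAL INT sim.time, word)
    out ! sim.time; word           -- every event message carries its simulated time
  :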
At the Register Transfer Level, a Micropipeline with processing may be generally viewed
as depicted in figure 4. The sending register outputs its contents, consisting of data and
control bits, onto the data bus and produces a request event (request wires are indicated in
the figure by solid lines, while acknowledge wires are denoted by dotted lines). The control bits are used by the control logic to direct the request event to its correct destination,
activating if necessary the data path elements (DPEs, e.g. ALUs, multipliers, shifters etc.)
of the circuit. Data passes through the DPEs and propagates to the next stage. This general Micropipeline may be modelled by three occam processes, two for the registers and one for the control/data processing logic; the control logic and the DPE may be modelled as one process, with the DPE being a procedure called by the control process.
FIG. 4. Micropipeline With Processing: A High Level View

PROC Register(CHAN OF BUNDLE RDin, RDout,
              CHAN OF ACK Ain, Aout)
  SEQ
    RDin ? Data
    WHILE TRUE
      SEQ
        PAR            -- fork
          RDout ! Data
          Ain ! any
        PAR            -- Muller-C
          RDin ? Data
          Aout ? any
:

FIG. 5. Micropipeline With Processing: The Register Model

Figure 5 illustrates the register occam model. The model makes use of two PAR statements, one to model the Muller-C element and one to model the fork on the Ain/Rout wire. Two channels are used in each direction, one for the data/request bundle and one for the acknowledgement signal. The latter is required to keep the register processes tightly coupled and synchronised; otherwise, the control process would act as a buffer, introducing an extra pipeline stage into the model that does not exist in the physical system [53, 55]. In the case of pipelines without processing, the acknowledgement channel is not required.
A multi-stage Micropipeline may be modelled by means of a parallel replication of the
register process.
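A minimal sketch of such a replication (assuming a simplified Register process without the acknowledgement channels, as for pipelines without processing, and n >= 2 stages):

  PROC Fifo(VAL INT n, CHAN OF BUNDLE in, out)
    [n-1]CHAN OF BUNDLE link:        -- internal stage-to-stage channels
    PAR
      Register(in, link[0])          -- first stage
      PAR i = 0 FOR n-2              -- middle stages, replicated in parallel
        Register(link[i], link[i+1])
      Register(link[n-2], out)       -- last stage
  :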
The control logic is inherently concurrent; different parts of the circuit operate concurrently while, within each part, events take place in a deterministic sequential order, i.e. the control logic implements a partial ordering of events. The simulation model should have the same degree of concurrency as the physical circuit. The control logic may be implemented as a network of communicating processes, with the occam PAR (parallel execution) and SEQ (sequential execution) commands being used within each process to implement the partial ordering of events of the circuit. The number of these processes depends on the degree of modularity and fidelity required in the simulation model.
FIG. 6. The AMULET1 Interface


5. MODELLING AMULET1: THE OCCARM MODEL


By deploying the modelling framework outlined above, occarm, an occam simulation model of the AMULET1 microprocessor, has been developed (the name of the model derives from the combination of the words occam and ARM). AMULET1 is the first in the series of AMULET microprocessors and the world's first asynchronous implementation of a commercial instruction set.
Figures 6, 7 and 8 illustrate respectively the interface, the internal organisation and the physical layout of the 1.2 micron implementation of AMULET1. The processor comprises five major units, namely the address interface, the data interface, the execution unit, the register bank and the primary decode. The execution unit consists of two major stages, namely Decode2 (Dec2-Ctrl2), which controls the operation of the shifter and multiplier units of the processor, and Decode3 (Dec3-Ctrl3), which controls the ALU. Detailed descriptions of the AMULET1 microprocessor are provided in [60, 46].
Occarm describes AMULET1 at the Register Transfer Level. It executes ARM6 machine
code produced by a standard ARM compiler. Instructions enter the simulator as 32-bit
integer numbers in hexadecimal format. Instruction decoding is performed by means of
PLA models, which are implemented as two-dimensional arrays of Boolean values.
Occarm has been implemented as a hierarchy of occam processes, with each process
modelling a different functional module of AMULET1. Its top-level process structure
graph is depicted in figure 9, while figure 10 illustrates its internal structure (the reader is
invited to relate these two figures with figures 7 and 8). The AddInt and DatInt processes model AMULET1's address and data interface units respectively. The datapath is modelled by four processes, namely Decode1, Decode2, Decode3 and RegBank. Decode1 describes the primary decode unit, while Decode2 and Decode3 model the two major components of the execution unit of the processor. The RegBank process incorporates the functionality of the register bank. WrtCtrl models the operation of AMULET1's write bus arbitration logic.
FIG. 7. The AMULET1 Internal Organization

All the registers of AMULET1 have been modelled using the generic register model, with interprocess communication being performed using pairs of request/data (solid arrows in the figures) and acknowledgement (dotted arrows) channels. To illustrate the modelling of control logic, figures 11 and 12 depict the modelling of one of the control modules of the Address Interface (AddC).
For a detailed description of the structure and operation of occarm the reader is referred
to [53].
FIG. 8. The AMULET1 Processor Physical Layout

6. TIMING AND SYNCHRONISATION


The distributed semantics of CSP/occam introduces the problem, typical in distributed
simulations, of ensuring that the simulation respects the partial order over events dictated
by the causality principle in the simulated architecture [25, 24]. Indeed, typically, the
order in which the occam channels fire in the model will not adhere strictly to the order in
which the corresponding events take place in the simulated system. In the modelling and
simulation of computer systems, simulated time plays a dual role: as a synchronising agent,
to drive the simulation engine, and as a quantifier, to provide the means for the performance
evaluation of the architecture.
Since at the Register Transfer level, the correct operation of the asynchronous hard-
ware system does not depend on a global clock, simulated time is not required for the
synchronisation of the CSP/occam processes. Processes can be entirely data-driven and
self-scheduling, and be synchronised by the protocol employed in the communication se-
mantics of occam, in the same way that the communication protocol employed in the
asynchronous system synchronises the different functional modules. Thus, accurate time
modelling is not required for synchronisation.
Simulated time however is still needed as a quantifier, to provide for the performance
evaluation of the simulated asynchronous system, and errors in simulated time would
introduce inaccuracies in the evaluation of the simulated architecture. Thus, accurate
simulated time modelling is required if the simulation is to be used for a more extensive and
elaborate evaluation of the performance characteristics of the asynchronous architecture.
The requirement to test the architecture for potential deadlocks by modifying the delays in the system to achieve different event orderings [26] makes this necessity even more intense.
FIG. 9. Occarm Top Level Process Graph

Indeed, if the simulated time is not the synchronisation agent in the simulator,
different event orderings may only be achieved by using occam real time Timers to change
the relative scheduling of the occam processes [35]. The major drawback of this approach
is that small delays cannot guarantee the intended effects and behaviour in the model, as
these delays are only approximate. Furthermore, real time delays have a direct effect on the
performance of the simulator. Thus, large delays that would guarantee the planned process
scheduling, would also have a severe effect on the performance of the simulator.
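For concreteness, a minimal sketch of this real-time delay approach (the delay value and the channel and variable names are assumptions): an occam TIMER is read and a wait is inserted before an output, perturbing the relative scheduling of the processes only approximately:

  TIMER clock:
  INT now:
  VAL INT delay IS 1000:             -- ticks; only approximately honoured
  SEQ
    clock ? now                      -- read the current real time
    clock ? AFTER now PLUS delay     -- block for at least 'delay' ticks
    out ! msg                        -- the event is now issued 'late'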
It has been shown [38] that a distributed system consisting of asynchronous concurrent processes will not violate the causality principle if each merge process consumes and processes event messages in non-decreasing timestamp order (the local causality constraint; the processing of an out-of-order event is referred to as a preemption).
In a Micropipelined architecture, micropipelines may be merged in one of the following
ways:

• Synchronous merge. A functional module has to wait for all input data to become available before it starts its operation. This is the case when a Muller-C element is used for the corresponding request events. In the simulation model, the occam process has to wait for all input channels to "fire" and therefore no preemptions occur, as the process is in the position to select the message with the smallest timestamp (a minimal sketch of this case is given after the list).
FIG. 10. Occarm Internal Structure: (a) AddInt; (b) DatInt; (c) Decode1; (d) RegBank; (e) Decode3; (f) Decode2

• Data dependent merge. The functionality of the system dictates the order in which messages from different source processes should be consumed and processed. This situation is implemented in hardware using a combination of a Select and a Call or Xor. The process in this case behaves as a single-input module, hence causality is not violated.
FIG. 11. AddC: Control Circuit (processes P1 and P2.1-P2.6)

SEQ
  ALT                            -- ARBITER
    PCr ?
      P1                         -- PC loop
    Wr ?
      P2.1                       -- IF
        FALSE
          SEQ
            P2.2                 -- data transfer
        TRUE
          SEQ
            P2.3                 -- IF
              TRUE
                SEQ
                  WHILE Ntrm = FALSE   -- LSM loop
                    SEQ
                      P2.4
                      P2.5
              FALSE
                SEQ
                  P2.6           -- data transfer: APipe
:

FIG. 12. AddC: Occam Process - Arbiter modelled as an ALT

PROC Arbitrated.Merge(CHAN OF BUNDLE RDin1, RDin2, RDout,
                      CHAN OF ACK Ain1, Ain2, Aout)
  -- data.delay : time taken to process the data
  -- ack.delay  : time taken for the acknowledge signal to propagate
  SEQ
    clock := 0
    WHILE TRUE
      SEQ
        ALT    -- arbiter
          RDin1 ? in.data
            SEQ
              -- process data
              out.data(timestamp) := max(in.data(timestamp), clock) + data.delay
              RDout ! out.data
              Aout ? ack
              ack(timestamp) := ack(timestamp) + ack.delay
              clock := ack(timestamp)
              Ain1 ! ack
          RDin2 ? in.data
            SEQ
              -- process in.data in a similar way as above
:

FIG. 13. Modelling Arbiters as occam ALT Constructs: (a) the arbitrated merge circuit (Arbiter and Call); (b) the Arbitrated.Merge occam process

• Arbitrated merge. The order of arrival defines the order of consumption. If events from two micropipelines arrive at the same time, an arbitrary choice is made.
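The synchronous merge case referred to above may be sketched as follows (a minimal illustration in the style of the paper's models, with assumed names; not occarm code). The process blocks until both inputs have fired, Muller-C fashion, and then forwards the messages in timestamp order, so no preemption can occur:

  PROC Sync.Merge(CHAN OF BUNDLE In1, In2, Out)
    WHILE TRUE
      SEQ
        PAR                          -- Muller-C: wait for all inputs to fire
          In1 ? msg1
          In2 ? msg2
        IF                           -- consume in timestamp order
          timestamp(msg1) <= timestamp(msg2)
            SEQ
              Out ! msg1
              Out ! msg2
          TRUE
            SEQ
              Out ! msg2
              Out ! msg1
  :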

A straightforward way to model the functionality and nondeterministic behaviour of arbiters within the data-driven operation of the occam processes is the occam ALT construct, which provides for the nondeterministic choice of messages from different channels (see figures 11 and 12). However, the order in which the occam channels in the ALT construct fire will typically be different from the order in which the corresponding events would occur in the arbiter, in which case preemptions occur.
To evaluate and assess the significance of these preemptions and their impact on the simulation when arbiters are modelled as ALT constructs (see figure 13) and no attempt is made to enforce temporal coherency and precision in the model, a quantitative analysis of the timing error introduced in the simulation has been performed by (meta-)executing the Dhrystone benchmark on occarm (see also section 9.2). The values measured were those typically used for the performance evaluation of asynchronous architectures, such as the occupancy and idle and stall periods of the micropipelines and the performance of the system. These values were compared with those obtained from a conventional, sequential discrete event simulator of AMULET1 written in Asim, ARM's in-house simulation language. The results of this analysis have been reported in detail in [53, 55]. In summary, it has been found that the error introduced in the model, for various measured values, ranges from 10% to 25%, an error which is considered significant and unacceptable, even at this high level of simulation. As an indication, appendix A presents the measured occupancy and stall periods of the AMULET1 pipelines, while figure 14 illustrates the distribution of preemptions over time, namely the number of times that a preemption was detected (the preemption count) and the corresponding accumulated preemption magnitude for each of the three arbiters of AMULET1.
FIG. 14. Occarm: Preemption Count and Magnitude (preemption count and accumulated error in ns versus simulated time in ns, for (a) Decode1, (b) AddInt and (c) WrtCtrl)


7. THE NEED FOR A NEW SYNCHRONISATION TECHNIQUE


A number of synchronisation techniques have been developed to ensure that the local causality constraint is not violated. These are traditionally classified into two broad categories, namely conservative and optimistic [25, 24].
Conservative techniques [11] require that a merge process blocks until there is a message on each of its input links; it then selects and processes the message with the smallest timestamp. However, deadlocks may occur if an expected message is never issued. Two main approaches have been devised to deal with this problem, namely deadlock detection and correction, and deadlock avoidance, whereby timestamped Null messages are sent over the links of the model to enable merge processes to decide whether it is safe to process pending messages.
Optimistic approaches, such as Time Warp [36], detect and recover from causality errors rather than strictly avoiding them. Typically, upon detecting a preemption, the processes in the model "roll back" in simulated time, undoing their illegal actions.
However, these techniques generally assume a different simulation philosophy, whereby the model is typically executed by a set of simulation engines which implement the synchronisation protocol. This philosophy is incompatible with the rationale and the very principle behind using CSP/occam as a specification language, which is the exploitation of the computation and communication semantics of the latter, an exploitation which implies a data-driven operation of the occam processes in the model. Furthermore, in the case of asynchronous hardware models, preemption problems are restricted to the arbiter processes of the model and therefore the deployment of existing generic protocols would impose an unnecessary overhead and complexity on the model.
The Program Driven Synchronisation Protocol introduced in this paper aims at enforcing temporal coherency in CSP/occam asynchronous hardware models while adhering to the data-driven operation of the processes in the model.

8. THE PROGRAM DRIVEN SYNCHRONISATION PROTOCOL


The Program Driven Synchronisation Protocol (PDSP) seeks to exploit the characteristics
of the simulated system in order to enable the development of accurate arbiter models
involving the minimum number of processes required for this purpose. The processes of
the model remain entirely data driven. PDSP follows a conservative, deadlock avoidance
philosophy.

8.1. The Basis


Von Neumann computer architectures, synchronous or asynchronous, are deterministic
systems: they accept as input instructions which they execute sequentially in a specific and
predefined order.
Each instruction defines the steps that are required for its execution as well as the
behaviour of each functional module of the architecture. Consequently, the type and
sequence of events that occur in the system at any time are determined by the executing
instructions.
This ability to predict events in the architecture, based on the information provided by the program under execution, forms the basis of the Program Driven Synchronisation Protocol; by looking at the instructions being executed, the arbiter processes of the simulation model can decide whether an event is expected on a particular input link and thus whether blocking on this link would result in a deadlock.
The key concept in the Program Driven philosophy is the "Instruction Lookahead Set", which is defined as follows:

Definition 8.1. The Instruction Lookahead Set of a link λ is the set of instructions whose execution will potentially result in an event occurring on λ:

ILS_λ = {instruction I : I generates an event on link λ}.

An instruction I is referred to as an ILS_λ instruction if and only if I ∈ ILS_λ.

The Instruction Lookahead Set of any particular link in the system is directly defined by the architecture's specification and thus may become available to the arbiter processes of the simulation model in advance. Based on the ILS of their input links, arbiter processes may directly make decisions regarding the potential arrival of messages, provided of course that they are also informed of the instructions being executed in the system.
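A minimal sketch of how this might look in an arbiter process (the table name and indexing are hypothetical, in the pseudocode style of the paper's figures): the ILS of a link is precomputed as a boolean table indexed by instruction class, and membership is tested before deciding to block on the link:

  VAL []BOOL ils.wch IS [TRUE, TRUE, FALSE, TRUE]:  -- one flag per instruction class
  IF
    ils.wch[class(I)]              -- I ∈ ILS_Wch ?
      ...                          -- the arbiter may block and wait on Wch
    TRUE
      ...                          -- no event will arrive on Wch for this instruction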

8.2. The Rules


Based on the Instruction Lookahead Set defined above, the behaviour of arbiter processes
with regard to message consumption may be specified as follows:
Rule 8.1. An arbiter process A is allowed to block and wait for an event on its input link λ during the execution of an instruction I if and only if I ∈ ILS_λ.

The above rule ensures that arbiter processes block only for instructions that are likely to generate the corresponding events. However, depending on the status of the system, such an event might not occur during the instruction's execution; in this case, Null messages are required, otherwise the arbiter process will become blocked and the simulation model will deadlock. The following rule is concerned with the production of Null messages:
Rule 8.2. A Null message will be sent to link λ of the arbiter process A if and only if A expects an event on λ based on ILS_λ, and for the current state of the system the event will not be generated.
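A producer-side sketch of this rule (hypothetical names, in the style of the paper's figures): when the current instruction belongs to the ILS of the link but the state of the system means the real event will not be generated, a timestamped Null message is sent instead, so the blocked arbiter can advance safely:

  IF
    in.ils AND (NOT event.will.fire)
      SEQ
        null.msg(timestamp) := clock   -- carry the sender's local simulated time
        Wch ! null.msg                 -- Rule 8.2: explicit Null message
    in.ils
      Wch ! real.msg                   -- normal, instruction-generated event
    TRUE
      SKIP                             -- I not in ILS_Wch: the arbiter will not block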

The two rules above specify the behaviour of arbiter processes and their peers when their interaction depends on the executed instructions. However, not all events in an
asynchronous system occur in an instruction dependent fashion. Indeed, certain parts
of the system may operate autonomously, irrespective of which instructions are being
executed; the PC loop in the AMULET1 processor is an example of such an autonomous
unit. In this case it is the state of the simulated system that dictates the behaviour of the
arbiter process:
Rule 8.3. An arbiter process A is allowed to block and wait for an event on its input link λ which fires in an instruction independent way, if and only if the state of the system guarantees that a message will be issued on λ.

8.3. The PDSP Arbiter Process


The basic functionality of an arbiter process with regard to the Program Driven Synchro-
nisation Protocol is depicted in figure 15.
Upon receiving a message on one of its links (e.g. message msg1 on In1), the arbiter invokes the Select process to determine whether the processing of this message would cause a preemption.
If there is a pending message msg2 already received from the other input, then the
message with the minimum timestamp is selected to be processed and forwarded to the
arbiter process’ output; if both timestamps have the same value, the selection is made in a
random fashion to emulate the behaviour of the corresponding hardware arbiter.
If however no pending message exists, but a positive prediction (based on the Instruction Lookahead) is made regarding its potential arrival, the arbiter process blocks and waits until this second message arrives. The arrival of this message provides the arbiter process with the information required to proceed with its operation and enable Select to make a decision, namely the next timestamp on its other input link.

8.4. Improving PDSP Performance


The basic algorithm described in figure 15 enables arbiter processes to receive and process messages arriving on their input links in increasing timestamp order, always selecting the message with the smallest timestamp, thus guaranteeing the accurate, preemption-free operation of the simulation model.
PROC PDSP_Arbiter()
  PROC Select(msg1, msg2)
    SEQ
      IF
        msg2_pending = TRUE
          SEQ
            IF
              timestamp(msg1) < timestamp(msg2)
                SEQ
                  process(msg1)
                  msg1_pending := FALSE
              timestamp(msg1) > timestamp(msg2)
                SEQ
                  process(msg2)
                  msg2_pending := FALSE
              timestamp(msg1) = timestamp(msg2)
                SEQ
                  make_random_selection(msg1, msg2)
        msg2_pending = FALSE
          SEQ
            IF
              msg2_expected = TRUE
                SEQ
                  In2 ? msg2
                  msg2_pending := TRUE
              msg2_expected = FALSE
                SEQ
                  process(msg1)
                  msg1_pending := FALSE
  :
  WHILE TRUE
    SEQ
      IF
        msg1_pending = TRUE
          SEQ
            Select(msg1, msg2)
        msg2_pending = TRUE
          SEQ
            Select(msg2, msg1)
        TRUE
          SEQ
            ALT
              In1 ? msg1
                msg1_pending := TRUE
              In2 ? msg2
                msg2_pending := TRUE
:

FIG. 15. The PDSP Arbiter Process

However, it does not ensure that the concurrency of the simulated system is sufficiently
exploited to increase the potential of the simulation model for high performance.
Indeed, as soon as it predicts that a message is expected on one of its input links, say In2, the arbiter process will stop accepting any messages arriving on its other input link In1 until the expected message on In2 arrives. As a consequence, all the processes that are part of the path that leads to In1 will block and wait, and the pipelines at the output side of the arbiter process will starve; during this time, large parts of the simulator will remain idle.
PROC Select(msg1, msg2)
  SEQ
    ...
    msg2_expected = TRUE
      SEQ
        IF
          timestamp(msg1) < MLL_timestamp(msg2)
            SEQ
              process(msg1)
              msg1_pending := FALSE
          timestamp(msg1) >= MLL_timestamp(msg2)
            SEQ
              In2 ? msg2
              msg2_pending := TRUE
    msg2_expected = FALSE
      SEQ
        process(msg1)
        msg1_pending := FALSE
    ...
:

FIG. 16. PDSP: Taking MLL into Account

A solution to this problem is to provide arbiter processes with some indication as to when in the simulated future a message they expect will actually arrive. This information would enable them to consume a number of events occurring on the In1 link before they block on In2, thus increasing the concurrency of the simulation model.
This information can be obtained by taking into account the propagation delays in the architecture being simulated. An event generated by an instruction will propagate through a number of pipeline stages before it reaches an arbiter. The path followed by the event is completely defined by its parent instruction; the latency of the path at any particular time, however, depends on the number of elements in the micropipelines involved and is thus non-deterministic.
Consequently, it is not feasible to know in advance the exact time required for an event to propagate through a given micropipeline. However, there is a lower bound to this time, namely the latency of the micropipeline when, at the moment of the event's entry, it is empty. Based on this observation, a distance-based [4] form of lookahead may be defined, namely the Minimum Latency Lookahead:
Definition 8.2. The Minimum Latency Lookahead of a link λ, MLL_λ, is defined as the total propagation delay of the path leading to λ when the pipelines of the path are empty:

MLL_λ = Σ_i d_i, where d_i is the propagation delay of the i-th pipeline stage in the path.
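For illustration, with assumed (not measured) figures: if the path to λ traverses three pipeline stages with minimum stage delays of 2 ns, 3 ns and 1 ns, then MLL_λ = 2 + 3 + 1 = 6 ns; an event entering the path at simulated time t cannot reach λ before t + 6 ns, so the arbiter may safely process any pending message on its other link whose timestamp is smaller than t + 6 ns.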

The Instruction Lookahead Set of a link informs the corresponding arbiter process
whether a message should be expected on that link; the Minimum Latency Lookahead
reveals when in the simulated future the expected message may arrive. Based on the
Minimum Latency Lookahead, the following rule may be specified:
Rule 8.4. An arbiter process A will not process a message m1 received on its input link λ1, but will instead block and wait for a message m2 expected on its other input link λ2, if and only if the timestamp of m2, as predicted by MLL_λ2, is less than or equal to the timestamp of m1.

Rule 8.4, combined with the ALT statement in the main loop of the PDSP arbiter process, will enable arbiter processes to process messages with appropriate timestamps as soon as they arrive.
Figure 16 depicts the Select process when the MLL is taken into account.

9. APPLYING PDSP TO THE OCCAM SIMULATION MODEL OF AMULET1


In order to validate PDSP and demonstrate its applicability, the protocol has been employed in the occarm simulation model of the AMULET1 microprocessor. AMULET1 makes use of three arbiters: in the address interface unit (AddInt), in the first execution stage (Decode1) and in the write bus arbitration logic (WrtCtrl). The remainder of the paper discusses the application of PDSP to develop an accurate, preemption-free model of the address interface. For a detailed description of the deployment of PDSP in all arbiter processes of the occarm model, the reader is referred to [53].

9.1. PDSP: The Address Interface Arbiter


The address interface (figure 10a) is responsible for providing all the address information to memory. It operates as an autonomous unit, issuing sequential instruction addresses to maintain a steady flow of prefetched instructions to the processor. Instructions arriving from memory through the data interface rendezvous with their associated R15 value (PC) extracted from the address interface (see figures 9 and 10), whereupon they enter the datapath for execution.
There are two cases wherein an instruction which has entered the processor will be invalidated and rejected, and as a result its execution will never take place:
• The instruction fails its condition codes in Decode3. In the ARM architecture, all instructions are conditionally executed. Their execution depends on the outcome of the comparison between the Current Processor Status (CPS) arithmetic flags and the condition field of the instruction word. In AMULET1, the test of the condition flags is performed in Decode3.
• The colour of the instruction does not match that of the processor. AMULET1 maintains a "colour" bit (at Decode3) which changes each time the instruction flow changes (e.g. due to a branch or an exception). Instructions are also "coloured" and, if their colour does not match that of the processor at any particular moment, they are discarded. To make this mechanism more efficient, the new colour is also sent to Decode1 (via the PCcol signal) to enable the rejection of invalid instructions before they enter the datapath. The invalidation and rejection of instructions may occur either in Decode1 or in Decode3; the choice depends on the exact point in time that the PCcol signal from Decode3 is detected by Decode1, and thus it is nondeterministic.
The Address Interface employs arbitration (the AddC process, see figures 10 and 11) to allow addresses arriving from the datapath on the Wch channel to break the PC loop (consisting of AddC, MAReg, the Incrementer, the PC Holding Latches and the PC0 register of the PC Pipe) and gain access to the MAReg.
FIG. 17. Dynamic Provision of Instruction Lookahead Knowledge (values forwarded to MAReg; instructions piggybacked on the acknowledgements returning to AddC)

9.1.1. Providing Instruction Lookahead Information


An address produced by AddC is sent to Memory following the path from AddC through MAReg and DatInt. If it is an instruction address, the instruction message from memory enters the processor following the path from Memory through DatInt to Decode1. AddC is not in the path followed by the instruction and therefore has no direct information as to which instructions have entered the system; in order to apply PDSP, a mechanism is required to provide AddC with this information and enable it to make decisions regarding the potential arrival of messages on its input links.
There are two approaches to provide instruction lookahead information to the AddC
process, namely static and dynamic. In the former, a copy of the benchmark assembly code
is incorporated in AddC in advance, before the simulation starts. Control flow changes can
be easily followed as all branch target addresses are sent to memory via AddC itself.
A neat and efficient mechanism to dynamically provide AddC with instruction lookahead information is to take advantage of the contra-flow of the Acknowledgement messages, as illustrated in figure 17. An address message produced by AddC will propagate to MAReg and to Memory, and from there to DatInt; DatInt will generate an Acknowledgement message which will follow the opposite direction back to AddC. This Acknowledgement message can be used to carry the corresponding instruction to AddC; no communication overhead is generated, as the Acknowledgement messages would be sent anyway.
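A minimal sketch of such piggybacking (the protocol declaration and names are illustrative assumptions, not occarm's actual code): the acknowledgement channel is widened so that each Acknowledgement carries the instruction word it acknowledges, giving AddC its lookahead information for free:

  PROTOCOL ACK IS INT; INT:          -- acknowledge timestamp; piggybacked instruction word

  -- DatInt side: acknowledge the address and piggyback the fetched instruction
  Aout ! ack.time; instr.word

  -- AddC side: receive the acknowledgement and update the lookahead state
  Ain ? ack.time; instr.word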

9.1.2. The PCch Link


The PCch channel carries the Acknowledgement signal from the PC Pipe, which is issued each time the current circulating PC in the PC loop is latched by the first register of the PC Pipe. The operation of the PC loop is autonomous and independent of the operation of the rest of the processor. Thus, on the PCch channel there will be a continuous, instruction-independent flow of messages.
FIG. 18. The Address Interface - Datapath Loop

The role of the PC Pipe is to provide the processor’s datapath with the R15 values
required for the execution of instructions as depicted in figure 18. If, for any reason, the
datapath stalls, instructions will start to backlog and, as a result, the PC Pipe will become full and will remain so for as long as the datapath stalls. During this period no further PC values will be allowed to enter the PC Pipe and thus no Acknowledgement signal will be issued on the PCch channel. The datapath may stall in the following cases, as illustrated in figure 19:
• If the datapath fills up; this will occur as a result of the Decode3 and WrtCtrl processes waiting for the aborts and Wlx signals respectively.
• If an ILS_Wch instruction is followed by register read operations which refer to locked registers.
• If an ILS_Wch instruction is followed by instructions which activate the ALUgo signal.
• During the execution of load/store multiple instructions.
In order to avoid deadlock situations, it is essential that AddC be able to decide whether it should wait for yet another message from PCch or whether the PC Pipe has become full and thus no more messages will be sent on the PCch link (PDSP Rule 8.3). In order to do that, AddC needs to possess information regarding the possible invalidation of instructions that have entered the system. This information is provided by both Decode1 and Decode3 by means of extra messages sent via dedicated, buffered links that have been introduced in the occarm model.
Decode3 informs AddC of the possible changes in the value of the Current Processor Status that could result in the rejection of instructions, as illustrated in figure 20a.
The messages issued by Decode1 (figure 20b) aim to inform AddC of the exact time of arrival of the PCcol signal from Decode3, whereupon instructions may start being rejected in Decode1.

9.1.3. The Wch Link


The Instruction Lookahead Set of the Wch channel is:

ILS_Wch = {B, BL, SWI, LDR, STR, LDM, STM, data processing with PC as destination register}.
FIG. 19. Stalling of the Datapath

Messages arriving on the Wch channel are primarily sent to AddC from the datapath (through WrtCtrl) as a result of the execution of ILS_Wch instructions and carry either branch target or data transfer addresses. For data transfer operations whose destination register is R15, a second message will be sent over Wch, namely the new value of the Program Counter from memory. According to the PDSP algorithm (PDSP Rules 8.1 and 8.4), when AddC detects an ILS_Wch instruction it blocks until it receives the corresponding messages on Wch. If however the ILS_Wch instruction will not be executed or the memory fails to respond (i.e. an abort occurs), the expected messages will never be issued, thus leaving AddC blocked and causing the simulator to deadlock. There are two reasons why an instruction may not be executed, namely if its colour does not match that of the processor or if it fails its condition codes.
All the instructions whose execution may change the operating colour of the processor - either by explicitly writing a new value to the PC (i.e. the branch target address) or by causing an abort - belong to the Instruction Lookahead Set of the Wch channel. Thus, an ILS_Wch instruction will suffer a colour mismatch only if it follows another ILS_Wch instruction which has changed the processor's colour. For instructions that explicitly change the PC, the new colour is provided to AddC with the branch target address, making the decision as to whether an ILS_Wch instruction will be discarded straightforward. If however the colour changes due to an abort, AddC has no direct knowledge of this change; Decode3 will receive the abort signal from memory and will change the PCcol, rejecting subsequent instructions. In this case a Null message must be sent by Decode3 to inform AddC of the occurrence of an abort and the colour change (PDSP Rule 8.2).
As described in the previous section, AddC is provided by Decode3 with all the CPS-related information required to predict the fate of subsequent instructions regarding their condition codes; this is performed via the dedicated link and lasts until a result is produced by Decode3 and forwarded to RReg, as illustrated in figure 20a. Thus, for subsequent ILS_Wch instructions that fail their condition codes, Null messages are required to be sent by Decode3 to inform AddC of this event.
DISTRIBUTED SIMULATION OF ASYNCHRONOUS HARDWARE 25

[Figure omitted: (a) the CPS-related messages sent by Decode3 to AddC and the conditions under which their issue stops and resumes; (b) the message sent by Decode1 to AddC upon arrival of the PCcol signal; (c) the complete Decode3-AddC interaction. The instruction sets used are:
S1 = {TST, TEQ, CMP, CMN, MSR, MRS}
S2 = {I : the execution of I produces a result to RReg}
ILS1Wch = {I : I does not activate aborts and, if invalidated in Decode3, I still produces a message to AddC}
ILS2Wch = {I : no message is sent to AddC if I is invalidated in Decode3}]

FIG. 20. AddC: The Instruction Dependent Generation of PDSP messages

Once a Null message has been sent to AddC, no more messages of this kind are issued
for ILSWch instructions subsequently invalidated in Decode3, until a valid instruction is
executed. This pattern is followed until the next ILSWch instruction to produce a result is
encountered, whereupon the production of CPS-related messages to AddC recommences;
figure 20c provides a complete picture of this interaction between Decode3 and AddC.
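This stop/start pattern amounts to a one-bit state machine in Decode3, sketched below; the status.in channel, carrying 1 for a validly executed instruction and 0 for an invalidated ILSWch instruction, is an assumed abstraction of Decode3's internal state.

   -- Suppression of redundant Null messages (a sketch, assumed protocol).
   PROC null.filter (CHAN OF INT status.in, null.to.addc)
     BOOL issuing:
     SEQ
       issuing := TRUE
       WHILE TRUE
         INT valid:
         SEQ
           status.in ? valid
           IF
             (valid = 0) AND issuing
               SEQ
                 null.to.addc ! 0   -- first Null after the colour change
                 issuing := FALSE   -- stop: later invalidations stay silent
             valid = 1
               issuing := TRUE      -- a valid instruction restarts issuing
             TRUE
               SKIP                 -- already suppressed: no message
   :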

9.2. Validation of PDSP


The computer which hosts the occarm simulation model is the T-Rack, a reconfigurable,
64-transputer machine built at Manchester University in the context of the UK
Alvey ParSiFal project to support the simulation of computer architectures [10]. Occarm has
been configured to execute on both single (a 20MHz T414 transputer) and multiprocessor
T-Rack configurations, as depicted in figures 21(a) and 21(b) respectively. In the latter, 6
of the 64 transputers of the T-Rack execute occarm, 1 executes an extra I/O process and the
remaining 57 forward monitoring information.
For the validation of PDSP, the Dhrystone [58] benchmark has been used.

[Figure omitted: (a) the single-transputer configuration, with the host (a Sun 3/160) holding the trace, memory and results files and communicating with occarm over switched links; (b) the multi-transputer configuration, in which the occarm processes are mapped onto the T-Rack necklace (F denotes a forward process on the monitoring path) as follows:

     Occarm Process       Node
     Decode1 + Decode2      1
     AddInt                 2
     Decode3 + DatInt       3
     RegBank                4
     WrtCtrl                5
     Memory                 6
     I/O Process           -1 ]

FIG. 21. The Single and Multi-Transputer Configuration of Occarm

Dhrystone is a synthetic benchmark which has traditionally been used for the evaluation of
computer architectures1. Like all synthetic benchmarks, it tries to match the average behaviour
(i.e. the average frequency of operations and operands) of a large set of real programs; the
results obtained may thus be considered representative of the average behaviour of occarm
and PDSP.

TABLE 1
Performance of occarm - Benchmark: Dhrystone (1 loop)

                      Elapsed Time (minutes)
Arbiter Model    Occarm (single)   Occarm (multi)    Speedup
ALT                   1.72              1.02           1.69
PDSP                  2.15              1.22           1.76
PDSP-MLL              2.03              1.17           1.73
Table 1 presents the performance results achieved for the different occarm configurations
and arbiter models used. The results represent the mean values derived from ten runs of
the simulators in question.
Clearly, the best performance results are achieved when no attempt is made to enforce
temporal precision in the model (the poor speedup achieved by the distribution of occarm
onto the multiple processors of the T-Rack is attributed to the characteristics of both the
simulated architecture and the host machine, and has been analysed in detail in [53, 55]).

1 A discussion of the different benchmarks used in computer architecture research may be
found in, e.g., [31], pp. 45-48.

The deployment of PDSP and the incorporation of preemption-free arbiter models in the
occarm model, with no attempt to exploit MLL, has resulted in a 20% and a 16.39% decrease
in the performance of the single and multiple transputer configurations of the simulator
respectively. The difference in performance reduction between the two configurations is
attributed to the fact that the distribution of processes onto multiple transputers reduces the
context switching overhead and the waiting times of blocked processes.
When the Minimum Latency Lookahead is exploited, smaller performance reductions of
15.27% and 12.8% respectively are observed for the single and multiple transputer
configurations of the simulator. The impact of the distribution of processes onto multiple
transputers on the simulation performance is smaller in this case, since the exploitation of
the MLL enables processes to be scheduled for longer periods, thus alleviating some of the
context switching.

10. EPILOGUE
Asynchronous logic is being viewed as an increasingly viable alternative digital design
approach which promises to liberate VLSI systems from clock skew problems, offer the
potential for low power and high performance and encourage a modular design philosophy
which makes incremental technological migration a much easier task. The advent of
easily available custom ASIC technology in the form of VLSI or FPGAs has greatly
facilitated the implementation of asynchronous circuits. However, asynchronous logic
has to overcome several obstacles before it is established as a mainstream digital design
style. One such obstacle is the lack of modelling languages and simulation techniques
suitable for asynchronous design. Fundamentally, conventional, sequential, synchronous
hardware description languages are not suitable for describing concurrent non-deterministic
asynchronous behaviour.
The concurrent, asynchronous, process-based model of computation of CSP and occam,
with its support for non-deterministic behaviour and its point-to-point, synchronous,
unbuffered inter-process communication, is particularly suitable for describing the
concurrent, non-deterministic behaviour of asynchronous hardware systems and provides a
natural and convenient means for the rapid construction of asynchronous hardware models.
However, the exploitation of the relationship between CSP/occam and asynchronous
hardware systems trades temporal accuracy for ease of modelling. Although the topological
characteristics of the modelled hardware system map naturally onto the model, the temporal
characteristics do not, a situation that may lead to causality errors in the model.
Addressing this problem, this paper has introduced the Program Driven Synchronisation
Protocol (PDSP), a generic theoretical framework for the development of accurate arbiter
processes which eliminate preemptions while preserving the data-driven philosophy of the
CSP/occam architectural models.
PDSP is based on the conservative, deadlock avoidance approach. It exploits the
characteristics of the simulated architectures to enable arbiter processes to predict their
simulated future based on the instructions executed by the model; the term “Instruction
Lookahead" has been introduced to refer to this concept. The exploitation of Instruction
Lookahead ensures that Null messages are issued only when necessary, and are directed
only to arbiter processes; this departs from conventional conservative techniques, where
typically all processes are required to handle and generate a regular flow of Null messages
throughout the model.

The success of PDSP depends on the ability to exploit the “Instruction Lookahead"
properties of the simulated architecture. The application of PDSP to the occarm
simulation model has demonstrated that such exploitation is feasible, even for systems
of the AMULET1's complexity. However, it requires extensive knowledge of the operation
of the simulated architecture. This may be considered the major drawback of PDSP, though
the problem is typical of the conservative framework rather than specific to PDSP.

APPENDIX A
TABLE 1
AMULET1 Pipeline Occupancy (Dhrystone (1 loop))

                         Occupancy (%)
                   1 Item                  2 Items                 3 Items
Pipe        Size   Single  Multi  Asim     Single  Multi  Asim     Single  Multi  Asim
Ipipe         5    25.35   21.66  14.78    35.53   33.04  26.07    27.13   30.71  37.37
PCPipe        2    37.72   34.47  31.10    47.62   53.31  60.68
Xpipe         3    37.13   36.88  40.30     9.24    9.43  10.59     0.55    0.92   1.20
Dec1.RegB     1    74.60   77.04  52.43
Dec1.RegA     1     7.99    7.85   5.04
Dec2.OP       1    89.62   90.44  69.06
Dec2.Rsh      1     0       0      0
NGen          2    10.05   10.36   3.53     3.95    3.87   0
RB.OReg       1    50.91   52.27  40.75
RB.Wreg       1    78.40   79.62  63.35
RB.Ireg       1    47.20   50.10  61.80
RB.Afifo      3    37.24   37.48  36.17     1.09    1.10   2.56     0       0      0
RB.Mfifo      4    22.63   22.45  22.90     3.04    3.27   2.91     0.69    0.65   0.77
Dec3          3    27.70   25.65  NA       52.26   52.36  NA       12.95   15.36  NA
CPreg         1    57.67   57.83  65.62
OPreg         1    56.21   56.42  55.41
MemCP         5    56.75   57.55  60.21     7.88    7.76   2.87     0       0      0
MRReg         1    32.64   33.30  22.39
MAReg         1    57.29   57.49  42.17
Dout          3    10.31   10.48  13.09     0.43    0.44   0.09     0       0      0
PCHLat        2    95.65   95.67  90.11     0.60    0.77   0.76
LSMreg        1     6.79    6.70   4.42
APipe         2     5.17    5.30   7.65     0.17    0.16   0.20
WReg          1    14.31   15.23  18.71
DataIn        1     8.23    7.64   5.94
RReg          1    29.25   29.14  23.76
ImmPipe       2    40.17   41.32  27.63    13.15   14.30   4.13
TABLE 2
AMULET1 Pipeline Stalls (Dhrystone (1 loop))

              Total Number                       Stall Period (ns)
              of Messages               Average                 Min                Max
Pipe        Single  Multi  Asim     Single  Multi  Asim     Single Multi Asim    Single Multi
IPipe          992   1018  1079     12      12     12         12    12    12       12
PCPipe         992   1018  1075     81.24   83.44  68.85      12    12    12      700
XPipe          251    254   244     14.35   12.08  12.88      12    12    12       80
Dec1.RegB      973    985   954     29.47   32.16  23.48      12    12    12      214
Dec1.RegA      147    147   144     36.10   35.73  46.21      12    12    12       69
Dec2.OP        973    985   953     58.35   61.12  67.33      12    12    12      322
Dec2.Rsh         0      0     0      0       0      0          0     0     0        0
NGen           117    117   113     12      12     12         12    12    12       12
RB.OReg        973    985   953     43.51   44.16  25.53      12    12    12      134
RB.Wreg       1034   1046  1012     12      12     12.70      12    12    12       12
RB.Ireg       1034   1046  1012     16.03   17.03  50.89      12    12    12      170
RB.Afifo       458    465   454     12      12     12         12    12    12       12
RB.Mfifo       178    181   172     12      12     12         12    12    12       12
Dec3           880    892   860     12.45   13.57  39.05      12    12    12       65   118
CPreg          879    891   859     32.40   15.10  40.62      12    12    12      150
OPreg          879    891   859     14.35   14.86  27.44      12    12    12       75
MemCP         1156   1182  1258     12      12     12         12    12    12       12
MRReg         1156   1182  1240     16.42   16.45  12         12    12    12       45
MAReg         1343   1372  1421     29.56   29.03  42.51      12    12    21      128
Dout           166    166   163     12      12     12         12    12    12       12
PCHLat         991   1017  1079     12      12     12         12    12    12       12
LSMreg         111    111   109     13.02   13.7   21.66      12    12    12       14
APipe           73     73    73     12      12     12         12    12    12       12
WReg           342    345   331     12      12     12         12    12    12       12
DataIn         186    189   179     12      12     15.76      12    12    14       12
RReg           716    725   700     12      12     12.53      12    12    12       12
ImmPipe        499    507   482     12.26   12.53  12         12    12    12       29

REFERENCES
1. F. A. Almeida, P. H. Welch, A Parallel Emulator for a Multiprocessor Dataflow Machine, in“Proceedings of
the World Transputer Congress 1994", Como, 1994, pp. 259-272.
2. J. M. Alonso, et al., Conservative Parallel Discrete Event Simulation in a Transputer-Based Multicomputer,
in “Proceedings of the World Transputer Congress 1993", Aachen, 1993, pp. 636-650.
3. The AMULET Group, World Wide Web Home Page, URL: http://www.cs.man.ac.uk/amulet/index.html
4. R. Ayani, A Parallel Simulation Scheme based on the Distance Between Objects, in “Proceedings of the 1989
SCS Multiconference on Distributed Simulation", SCS Simulation Series, 1989, pp. 113-118.
5. G. Birtwistle, A. Davis, eds., “Asynchronous Digital Circuit Design", Springer Verlag, 1995.
6. J. A. Brzozowski, C-J. H. Seger, “Asynchronous Circuits", Springer Verlag, 1995.
7. E. Brunvand, M. Starkey, An Integrated Environment for the Design and Simulation of Self Timed Systems,
in “Proceedings of VLSI 1991" 1991, pp. 4a.2.1-4a.3.1.
8. E. Brunvand, The NSR Processor, in “Proceedings of the 26th Annual Hawaii International Conference on
System Sciences", Maui, Hawaii, 1993, pp. 428-435.
9. S. M. Burns, A. J. Martin, Synthesis of Self-Timed Circuits by Program Transformations, “Technical Report
5253:TR:87", Computer Science Department, Caltech, 1987.
10. P. C. Capon, J. R. Gurd, A. E. Knowles, ParSiFal: A Parallel Simulation Facility, “IEE Colloquium: The
Transputer: Applications and Case Studies", IEE Digest, 1986/91, 23rd May 1986.
11. K. M. Chandy, J. Misra, Distributed Simulation: A Case Study in the Design and Verification of Distributed
Programs, IEEE Transactions on Software Engineering, 5, 5, September 1979, pp. 440-452.
12. K. R. Cho, K. Okura, K. Asada, Design of a 32-bit Fully Asynchronous Microprocessor (FAM), in “Proceedings
of the 35th Midwest Symposium on Circuits and Systems", Washington D.C., 1992, pp. 1500-1503.
13. T. A. Chu, “Synthesis of Self-timed VLSI Circuits from Graph-Theoretic Specifications", Ph.D Thesis
(MIT/LCS/TR-393), M.I.T., June 1987.
14. W. A. Clark, C. E. Molnar, Macromodular Computer Systems, R. W. Stacy, B.D, Waxman, eds., “Biomedical
Research", Academic Press, 1974, Chapter 3.
15. Cogency Technology Inc., World Wide Web Home Page, URL: http://www.cogency.com
16. J. Cortadella, et. al., Petrify: A Tool for Manipulating Concurrent Specifications and Synthesis of Asyn-
chronous Controllers, IEICE Transactions on Information and Systems, E80-D(3), March 1997, pp. 315-325.
17. A. Davis, The Architecture and System Method for DDM1: A Recursively Structured Data-Driven Machine, in
“Proceedings of the 5th Annual Symposium on Computer Architecture", Palo Alto, CA, 1978, pp. 210-215.
18. A. Davis, S. M. Nowick, Synthesizing Asynchronous Circuits: Practice and Experience, in [5], pp. 104-150.
19. M. E. Dean, STRiP: A Self-Timed RISC Processor, “Technical Report CSL-TR-92-543", Computer Systems
Laboratory, Stanford University, July 1992.
20. D. L. Dill, Trace Theory for Automatic Hierarchical Verification of Speed-Independent Circuits, ACM Distin-
guished Dissertations, MIT Press, 1989.
21. J. C. Ebergen, A Formal Approach to Designing Delay-Insensitive Circuits, Distributed Computing, 5 (3),
1991, pp. 107-119.
22. C. J. Elston, et al., Hades - Towards the Design of an Asynchronous Superscalar Processor, in “Proceedings
of the 2nd Working Conference on Asynchronous Design Methodologies", London, 1995, pp. 200-209.
23. P.B. Endecott, S.B. Furber, Modelling and Simulation of Asynchronous Systems using the LARD Hardware
Description Language, in “Proceedings of the 12th European Simulation Multiconference", Manchester, 1998,
Society for Computer Simulation International, pp. 39-43.
24. A. Ferscha, S. K. Tripathi, Parallel and Distributed Simulation of Discrete Event Systems, “Technical Report
CS.TR.3336", University of Maryland, August 1994.
25. R. Fujimoto, Parallel Discrete Event Simulation, Communications of the ACM, 33(10), 1990, pp. 31-53.
26. S. B. Furber, Computing Without Clocks, In [5], pp. 211-262.
27. S. B. Furber, et. al., AMULET2e: An Asynchronous Embedded Controller, in “Proceedings of Async ’97
Conference", IEEE Computer Society Press, 1997, pp. 290-299.
28. J. D. Garside, et al., AMULET3 Revealed, in “Proceedings of Async’99 Conference", IEEE Computer Society
Press, 1999, pp. 51-59.
29. G. Gopalakrishnan, V. Akella, Specification, Simulation, and Synthesis of Self-Timed Circuits, in “Proceedings
of the 26th Hawaii International Conference on System Sciences", 1993, pp. 399-408.
32 GEORGIOS K. THEODOROPOULOS

30. S. Hauck, Asynchronous Design Methodologies: An Overview, “Technical Report UW-CSE-93-05-07",


University of Washington, April 1993.
31. J. L. Hennessy, D. A. Patterson, “Computer Architecture: A Quantitative Approach", Morgan Kaufmann
Publishers Inc., 1990.
32. C.A.R. Hoare, “Communicating Sequential Processes", Prentice Hall International, 1985.
33. H. Hulgaard, S. M. Burns, Bounded Delay Timing Analysis of a Class of CSP Programs with Choice, in
“Proceedings of the International Symposium on Advanced Research in Asynchronous Circuits and Systems",
1994.
34. R. N. Ibbett, P. C. Capon, The Development of the MU-5 Computer System, Communications of the ACM,
21(1), 1978, pp. 13-24.
35. “Occam 2 Reference Manual", Inmos, Prentice Hall International, 1988.
36. D. Jefferson, H. Sowizral, Virtual Time, ACM Transactions on Programming Languages and Systems, 7(3),
July 1985, pp. 404-425.
37. M. B. Josephs, J. T. Udding, Delay-Insensitive Circuits: An Algebraic Approach to their Design, “Lecture
Notes in Computer Science", Vol. 458, 1990, pp. 342-366.
38. L. Lamport, Time, Clocks and the Ordering of Events in Distributed Systems, Communications of the ACM,
21(7), 1978, pp. 558-565.
39. Y. Liu, J. Aldwinckle, K. Stevens, and G. Birtwistle, Designing Parallel Specifications in CCS, in “Proceedings
of the Canadian Conference on Electrical and Computer Engineering", Vancouver, 1993.
40. A. J. Martin, et al., Design of an Asynchronous Microprocessor, in “Proceedings of the Decennial Caltech
Conference on VLSI: Advanced Research in VLSI", 1989, pp. 351-373.
41. A. J. Martin, Synthesis of Asynchronous VLSI Circuits, J.Staunstrup, editor, Formal Methods for VLSI Design
(North Holland, 1990).
42. C. A. Mead, L. A. Conway, Introduction to VLSI Systems (Addison Wesley, 1980).
43. C. E. Molnar, T-P. Fang, Synthesis of Reliable Speed-Independent Circuit Modules: I. General Method for
Specification of Module-Environment Interaction and Derivation of a Circuit Realisation, “Technical Report
297", Computer Systems Laboratory, Institute for Biomedical Computing, Washington University, St. Louis,
1983.
44. D. E. Muller, W. S. Bartky, A Theory of Asynchronous Circuits, “Digital Computer Laboratory 75", University
of Illinois, November 1956.
45. T. Nanya, et al., TITAC: Design of a Quasi-delay-Insensitive Microprocessor, IEEE Design and Test of
Computers, 11(2), 1994, pp. 50-63.
46. N. C. Paver, “The Design and Implementation of an Asynchronous Microprocessor", Ph.D Thesis, Department
of Computer Science, University of Manchester, 1994.
47. W. F. Richardson, E. Brunvand, Fred: An Architecture for a Self-Timed Decoupled
Computer, “Technical Report UUCS-95-008", University of Utah, May 1995. Available at:
ftp://ftp.cs.utah.edu/techreports/1995/UUCS-95-008.ps.Z
48. E. M. Sentovich, et. al., SIS: A System for Sequential Circuit Synthesis, “Technical Report UCB/ERL
M92/41", U.C. Berkeley, May 1992.
49. “Sharp’s Data-Driven Media Processor", World Wide Web Home Page, URL:
http://www.sharpsdi.com/DDMPhtmlpages/DDMPmain.html
50. R. F. Sproull, I. E. Sutherland, C. E. Molnar, The Counterflow Pipeline Processor Architecture, IEEE Design
and Test of Computers, 11(3), 1994, pp. 48-59.
51. I. E. Sutherland, Micropipelines, Communications of the ACM, 32(6), 1989, pp. 720-738.
52. I. E. Sutherland, Flashback Simulation, “Research Report SunLab 93:0285", Sun Microsystems Laboratories,
Inc., August 1993.
53. G. Theodoropoulos, “Strategies for the Modelling and Simulation of Asynchronous Computer Architec-
tures", Ph.D Thesis, Department of Computer Science, University of Manchester, 1995. Available at:
ftp://ftp.cs.man.ac.uk/pub/amulet/theses/theo95-phd.ps.Z
54. G. Theodoropoulos, J. V. Woods, Occam: An Asynchronous Hardware Description Language?, in “Proceedings
of the 23rd IEEE Euromicro Conference on New Frontiers of Information Technology", Budapest,
Hungary, September 1997.
55. G. Theodoropoulos, Modelling and Distributed Simulation of Asynchronous Hardware, Simulation Practice
and Theory Journal, July 2000, pp. 741-767.
56. C. H. Van Berkel, J. Kessels, M. Roncken, R. Saeijs, F. Schalij, The VLSI-Programming Language Tangram
and its Translation into Handshake Circuits, in “Proceedings of EDAC", 1991, pp. 384-389.
57. “VERSIFY Release 2.0", Departament d’Arquitectura de Computadors, Universitat Politècnica de Catalunya,
Barcelona, Spain, November 1998, World Wide Web Home Page, URL: http://www.ac.upc.es/vlsi/versify/
58. R. P. Weicker, Dhrystone, A Synthetic Systems Programming Benchmark, Communications of the ACM,
27(10), 1984, pp. 1013-1030.
59. T. Werner, A. Venkatesh, Asynchronous Processor Survey, IEEE Computer, 30(11), 1997, pp. 67-76.
60. J. V. Woods, P. Day, S. B. Furber, J. D. Garside, N. C. Paver, and S. Temple, AMULET1: An Asynchronous
ARM Microprocessor, IEEE Transactions on Computers, 46(4), 1997, pp. 385-398.
