Computer Architecture and Organization
To My Father
Deryck East
1921–1989
IAN EAST
University of Buckingham
PITMAN PUBLISHING
128 Long Acre, London WC2E 9AN
A Division of Longman Group UK Limited
© I. East 1990
First published in Great Britain 1990
This edition published in the Taylor & Francis e-Library, 2005.
“To purchase your own copy of this or any of Taylor & Francis or Routledge’s collection of
thousands of eBooks please go to www.eBookstore.tandf.co.uk.”
British Library Cataloguing in Publication Data
East, Ian
Computer architecture and organization.
I. Title
004
Preface
2.2.3 Objects 34
2.3 Structured programming 37
2.3.1 Primitives 37
2.3.2 Constructs 39
2.3.3 Partitions 42
2.4 Standard programming languages 44
2.4.1 Modula 2 44
2.4.2 Occam 46
3 Machine language 51
3.1 Nature 51
3.1.1 Translation from programming language 51
3.1.2 Structure 52
3.1.3 Interpretation 53
3.1.4 Instructions 55
3.1.5 Operands 58
3.2 Simple architectures 62
3.2.1 Sequential processing 62
3.2.2 Parallel processing 63
3.2.3 Modular software 67
3.3 Instruction set complexity 69
3.3.1 Reduced instruction set computer (RISC) 69
3.3.2 Complex instruction set computer (CISC) 72
3.3.3 Comparison of RISC and CISC 73
Appendix
A ASCII codes 347
B Solutions to exercises 349
Bibliography 389
Index 392
Preface
In particular I felt that the architecture of the Transputer could not be ignored
because of the degree of its innovation and the fact that it renders parallel
computing accessible and affordable, even to the individual.
As a result of the absence of a suitable text I felt myself duty bound to attempt
the production of one. At the highest level my intention was to produce, not a
“bible” on the subject, but a course text, containing sufficient material to provide
a solid introduction but not so much as to overwhelm. It is intended as support for
self-contained introductory courses in computer architecture, computer
organization and digital systems. A detailed table of contents and substantial
• Switch
• Element (Gate, flip-flop)
• Component (Register, ALU, counter, shift-register etc.)
• Processor
The distinction between mainframe, mini and micro computers has been
considered irrelevant to an introductory course or text. VLSI technology is in any
case blurring the boundaries. I have intentionally omitted any treatment of highly
complex designs in order to concentrate on fundamental ideas and their
exploitation. More can be learned from simpler examples. The Transputer and
Berkeley RISC have shown how simplification can defeat increased complexity.
Scientists simplify, engineers complicate!
The programming model assumed as a default is that of the procedure since
procedural languages are currently most commonly taught first to students of
computer science. Buckingham has opted for Modula-2. In addition the process
+ message-passing model is used when discussing concurrency support.
“Architecture” is taken to mean those features of the machine which a
programmer needs to know, such as programmer’s architecture, instruction set
and addressing modes. “Organization” is taken to mean all features which give
rise to the characteristic behaviour and performance of the machine and/or the
way in which its components are connected together. The distinction between the
two definitions is not adhered to with absolute rigidity.
The new ANSI/IEEE standard logic symbols are not employed since it is felt
that the traditional ones are easier to learn, make simple systems more clear and
are still in widespread use. The new symbols are clearly intended to simplify the
The approach taken in this text differs from the recommendations. Many more
additional topics are treated than are left out. Overlap at the module level may not
mean exact correspondence to topics within. Treatment of alternatives, of
equivalent merit, within a module is considered sufficient. For example in
Subject area eight/module three: “Computer architecture survey” a different
collection of machines will be found from that in Chapter 8.
It was never the intention to match the book to part of any preconceived
curriculum. Course syllabus should be a matter for the course tutor. I hope very
much that the text proves useful to others who share the experience of that
responsibility.
1.1
Systems
1.1.1
Definition and characterization
State
State is a concept fundamental to physics. It is the instantaneous configuration
of a system. The simplest example is that of a tetrahedron resting on a table.
There are four states corresponding to the four sides it may rest on. It is possible
to label each side with a symbol and use it to remind yourself about one of four
things. In other words it constitutes a four-state memory. Memories are labelled
physical states.
One kind of symbol you could use, to label the tetrahedron, would be the
numeric digits {0, 1, 2, 3}. It is then possible to use it as a 1-digit memory! We
are now able to store a multi-digit, base 4, number by employing one tetrahedron
for each digit. The group used to store a single value is called a register. The
statespace is the range of values possible and is determined by the number of
ways in which the states of the tetrahedra may be combined. N of them allows
4^N values, 0…(4^N−1). If we wish to store many values we simply use many words.
State, or memory, is not necessarily used to store numbers. Symbols, or
symbol combinations, may represent characters or graphic objects. Alternatively
they may represent objects in the real world. The combined state may then be
used to represent relations between objects.
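The arithmetic of the tetrahedron register can be made concrete with a short sketch (in Python, rather than the Modula-2 used later in this book). The function names are invented for illustration; each tetrahedron holds one base-4 digit, and N of them span a statespace of 4^N values.

```python
def encode(value, n):
    """Encode an integer as n base-4 digits, least significant first —
    one digit per tetrahedron in the register."""
    assert 0 <= value < 4 ** n, "value outside the register's statespace"
    digits = []
    for _ in range(n):
        digits.append(value % 4)
        value //= 4
    return digits

def decode(digits):
    """Recover the integer stored in a base-4 register."""
    return sum(d * 4 ** i for i, d in enumerate(digits))

print(encode(27, 3))      # 27 = 3 + 2*4 + 1*16, so [3, 2, 1]
print(decode([3, 2, 1]))  # 27
print(4 ** 3)             # statespace of a 3-tetrahedron register: 64 values
```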
1 Electric analogue computers were once common and still find application today.
4 CHAPTER 1. COMPUTATION
Process
A process describes the behaviour pattern of an object. It consists of a sequence
of events which is conditionally dependent on both its initial state and
communication with its environment, which may be modelled as a number of
other processes. Processes are said to start, run and then terminate.
Each possesses private state, which is inaccessible to any other, and a number
of channels by which to communicate externally. The complete set of symbols
which may be stored must include those which may be input or output.
The named description of a process is called a procedure. The fundamental
indivisible atomic, or primitive, events of which any process is capable may be
categorized…
• Assignment
• Input
• Output
Protocol
A stream is a sequence of symbols. It may or may not be infinitely long. If finite
then it is terminated by a special value, End Of Transmission (EOT). The length
of the stream is not known a priori to either the receiver or transmitter. Upon any
given clock tick only the current symbol in the stream is accessible to either.
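A stream with its terminating EOT can be sketched as a pair of Python generators. The choice of `None` as the EOT value is an assumption for illustration; the point is that the receiver discovers the length only by meeting EOT, never in advance.

```python
EOT = None  # a special value outside the symbol alphabet marks End Of Transmission

def transmit(symbols):
    """Yield a finite stream, one symbol per 'clock tick', then EOT."""
    yield from symbols
    yield EOT

def receive(stream):
    """Consume symbols until EOT; only the current symbol is ever accessible."""
    received = []
    for symbol in stream:
        if symbol is EOT:
            break
        received.append(symbol)
    return received

print(receive(transmit("abc")))   # ['a', 'b', 'c']
```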
Returning to the manufacturing analogy, it may be necessary to represent a
1.1.2
Classes of system
Causal systems
Physics tells us that output can never be simultaneous with, or precede, the input
which causes it. This is called the law of causality. It is one of the most
fundamental laws of physics. All real, natural and artificial systems obey it.
There is no explanation of it. The universe is simply made that way.
For a computer, causality means that every output value takes some interval of
time to derive from the input stream on which it depends. The interval must
always be shorter than that between clock ticks at the level of the primitive
action.
Linear systems
There are several specific classes of system which are of interest. Knowledge of
them helps in analysing problems. One we shall mention now is that of systems
which exhibit linearity. In a linear system one can add together several inputs
and get an output which is simply the sum of those which would have resulted
from each separate input. Very few natural systems are linear. However there is a
growing range of applications which are benefiting from the simplicity, and thus
low cost, of linear processors2.
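Superposition — the defining property of linearity — can be demonstrated with a toy pair of systems. Both systems here are invented for illustration: a two-point moving average is linear, while squaring is not.

```python
def moving_average(xs):
    """A linear system: each output is the mean of the current and previous input."""
    prev = 0
    out = []
    for x in xs:
        out.append((x + prev) / 2)
        prev = x
    return out

def square(xs):
    """A nonlinear system, for contrast."""
    return [x * x for x in xs]

a = [1, 2, 3]
b = [4, 5, 6]
summed = [x + y for x, y in zip(a, b)]

# Superposition holds for the linear system...
assert moving_average(summed) == [x + y for x, y in zip(moving_average(a), moving_average(b))]
# ...and fails for the nonlinear one.
assert square(summed) != [x + y for x, y in zip(square(a), square(b))]
```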
Deterministic systems
A deterministic system is one whose output is predictable with certainty given
prior knowledge of some or all preceding input. All conventional computers are
deterministic systems.
An example of a deterministic system is a processor running a process which
adds integers on two input channels placing the result on a single output channel.
Here one need only know the input symbols on the current clock tick to predict
(determine) the result output on the next one.
Another system, whose process is the computation of the accumulated sum of
all integers input on a single stream, requires prior knowledge of all symbols
input since process start.
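The two deterministic systems just described can be sketched directly (in Python, for illustration). The adder needs only the symbols on the current tick; the accumulator's output can be predicted only with the whole input history.

```python
def adder(tick_a, tick_b):
    """Deterministic and memoryless: the current inputs alone fix the output."""
    return tick_a + tick_b

def accumulator(history):
    """Deterministic but stateful: predicting the output requires every
    symbol input since the process started."""
    total = 0
    for x in history:
        total += x
    return total

print(adder(3, 4))                # 7
print(accumulator([1, 2, 3, 4]))  # 10
```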
Stochastic systems
A stochastic system is one where it is only possible to determine the probability
distribution of the set of possible symbols at the output. In other words the
output at each clock tick is a random variable.
Although it is the more general system, the stochastic computer is rare and
very much just the subject of research within the field of artificial intelligence at
the time of writing this book3.
1.2
Automata
1.2.1
Switches
Normally-open switches
A switch is the simplest system which exhibits state. Of the many types of
switch, the one we shall consider is the normally-open switch (Figure 1.2).
Imagine a light switch which is sprung so that the light is immediately
extinguished on removal of your finger.
It only closes, allowing electrons to flow, in response to pressure from your
finger, thus turning on the light.
Now for just a little physics. Some energy must be expended in closing a
normally-open switch. Our simple model has one output and two inputs. Are both
inputs the same?
Anything that flows has kinetic energy. Something which may potentially flow
is said to have potential energy. In other words, some energy is available which
may be converted into kinetic energy.
We have already met one fundamental law of physics in the law of causality.
Here we have a second one. The first law of thermodynamics requires the
conservation of energy. It therefore tells us that flow can only occur if some
potential energy is available. A stream of water flows down a hill because the water
has gravitational potential energy available.
To understand our normally-open switch we observe that potential energy
must be available at both inputs. In this sense both inputs are the same. They
differ in that flow may occur through one and not the other. Different kinds of
potential energy may characterize each input. For example, a light switch uses
electrical potential energy (voltage) on its flow input and mechanical potential
energy (finger pressure) on its other input.
Another example of a normally-open switch would in fact allow us to
construct an entire (albeit rather slow) computer. A pressure-operated valve is
designed to switch air pressure. It may be referred to as a pneumatic switch. The
interesting property of this switch is that the output of one may operate others.
The number of others which may be operated defines the fan-out of the switch.
By now it should come as little surprise that the fundamental building block of
all electronic systems, the transistor, is in fact no more than a switch. It is very
important to understand that a computer can be built using any technology
capable of implementing a normally-open switch4. There is nothing special to
computation about electronics. Nor is it true that only artificial computers
exist. Biological evolution may be regarded as a form of computation.
Computers are often rather noisy. Most of the noise comes from a cooling
system which is usually just a fan. There is a fundamental reason for this…
switches consume5 energy. Yet another fundamental law of physics, the second
law of thermodynamics, tells us that no machine may do work without producing
heat. This heat is being produced continuously. If nothing is done to remove it
the system will overheat. Computers get hotter the faster they run!
There are a lot of switches in a computer. Hence it is very important for the
technology chosen to offer a switch which requires as little energy to operate as
possible. Designs usually involve a trade-off between power consumption6 and
speed.
Switch operation is the most fundamental event in computation. Therefore the
operation speed of the switch will limit the speed of the computer. Biological
switches (e.g. the neurons in our brains) switch rather slowly. They take ~10^-3 s.
It appears that the best an electronic switch can do is ~10^-9 s. Optical switches,
recently developed in the UK, promise switching in ~10^-12 s. The reader should
have noticed the contrast between the switching speed of the neuron and that of
the transistor. The capabilities of the human brain compare somewhat favourably
with those of current computers. It is obvious that we are doing something
wrong!
Memory
“State” is really just another term for memory. The number of states of a system
is equal to the number of things it can remember. States are memories. We may
label them how we like, e.g. by writing symbols on the sides of our tetrahedra.
latch output state. Removing the applied potential does not affect the latch state
which now remains “1”. In other words, it remembers it!
The reader should verify that the latch will similarly remember the application
(input) of “0”.
Logic gates
Logic gates are devices which implement systems with binary input and output
values. The presence or absence of a potential, at either input or output, is used to
infer the truth or otherwise of a proposition. A full treatment of the functions
they may implement and how they are used in the construction of computers is
left for Part II of this book. It is however appropriate now to illustrate how our
simple normally-open switches may be used to construct logic gates. Figure 1.4
shows how AND, OR and NOT gates may be made.
Output is “1” from AND only if A and B are “1”. Output is “1” from OR only
if A or B is “1”. NOT merely inverts a single input.
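The construction of Figure 1.4 can be sketched as a toy model (in Python, for illustration). The `switch` function below is an assumption: it idealizes a normally-open switch that conducts only while its control input is "1". AND is two switches in series, OR is two in parallel, and NOT uses a pull-up that supplies "1" unless the switch diverts the flow.

```python
def switch(control, flow_in):
    """Normally-open switch: flow passes only while the control input is 1."""
    return flow_in if control == 1 else 0

def AND(a, b):
    # two switches in series: flow must pass through both
    return switch(b, switch(a, 1))

def OR(a, b):
    # two switches in parallel: flow may pass through either
    return 1 if switch(a, 1) or switch(b, 1) else 0

def NOT(a):
    # a pull-up supplies 1 unless the switch diverts the flow away
    return 0 if switch(a, 1) else 1

for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b))
print(NOT(0), NOT(1))   # 1 0
```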
In fact logic gates merely define standard ways in which switches may be
connected together. Their usefulness is that they allow formal (mathematically
1.2.2
Finite automata
An automaton is any entity which possesses state and is able to change that state
in response to input from its environment. It is a discrete system whose next state
depends both on input and its current state. Figure 1.5 illustrates the idea from
both functional and procedural points of view.
Imagine you are monitoring an instrument which is equipped with three lights,
each of which may be only green or red. It is your job to interpret the meaning of
the pattern of lights according to a small instruction set. Let us say that the
instrument is designed to detect military aircraft and identify them as either
friend or foe. A certain pattern is interpreted as “aircraft detected”.
Subsequently, some patterns mean “friend”, some mean “foe” and the rest mean
“failure to identify”. These are the states, you are the (4-state) automaton and the
light patterns form a symbol alphabet.
Automata are fundamental building blocks of computers. They may be found
implemented in both software and hardware. The automaton process may be
described as a series of IF…THEN… statements inside a REPEAT…
UNTIL…FALSE loop (infinite loop). These form the instruction set of the
automaton. Each one is thus composed of…
…where the condition is in two parts, state and input symbol. The action is
simply the updating of state. To summarize, a procedural description of the
behaviour of an automaton is…
The automaton may have output only in the sense that the state may be visible
externally. Typically, output forms the input to another automaton. In our
analogy the second might be a 2-state automaton enabling a missile to be fired at
a “foe”. A formal definition of an automaton must consist of the following…
• Set of symbols
• Set of states
• Set of instructions
The set of states allowed forms the statespace, the set of symbols the alphabet. It
is programmed simply by specification of the instruction set. No ordering of
instructions is needed.
Finite automata7 are simply automata with a finite number of states and may
alternatively be described by a state transition graph which defines how the
system proceeds from one state to the next, given both state and input symbol.
Part II describes their design.
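The aircraft-detection automaton of the analogy above can be sketched as an instruction set (in Python, for illustration). The light-pattern symbols and the transition table are invented; the book does not specify them. Each instruction is a (state, symbol) → state entry, and no ordering among instructions is needed.

```python
# Hypothetical instruction set: IF (state, input symbol) THEN new state.
# Light patterns are abbreviated to made-up three-letter symbols (G/R per light).
instructions = {
    ("idle", "GGG"): "detected",
    ("detected", "GGR"): "friend",
    ("detected", "GRR"): "foe",
    ("detected", "RRR"): "unidentified",
}

def run(state, symbols):
    """The REPEAT…UNTIL FALSE loop, cut short here when the symbol stream ends."""
    for symbol in symbols:
        # no matching instruction: the state is simply left unchanged
        state = instructions.get((state, symbol), state)
    return state

print(run("idle", ["GGG", "GRR"]))   # 'foe'
```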
1.2.3
Turing Machines
Structure
The Turing Machine is composed of three subsystems. One of these is a
processor, which is much like an automaton except that it can also output a
symbol distinct from its state. Symbols are input from, and output to, a linear
memory composed of a sequence of memory “cells”. The processor is also able
to move one position in either direction along memory.
The linear memory forms a second subsystem which contains symbols drawn
from an alphabet. One of the symbols is special and usually termed blank. Each
memory cell is capable of storing a single symbol. The operation of the machine
is cyclic. A single cell is read, then one of just three actions is carried out.
The third subsystem is a channel which allows the processor to read or write a
memory cell. Note that moving along the linear memory may equally well be
performed by…
7 In the field of hardware design automata are usually referred to as state machines.
It may help to imagine the processor as a “rubber stamp” stamping new symbols
onto a tape (linear memory) which depend on the current one it sees and an
internal memory (Figure 1.6).
Programming
As with automata, instructions are solely of the if…then kind. The condition part
of the instruction is simply made up of the state and the symbol which has been
read. The action part has three possibilities only…
The input to a Turing Machine is the contents of the memory at the start. The
output is the memory contents when it halts. If it fails to halt then the function is
not computable. The program is the (unordered) set of instructions.
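A minimal Turing Machine interpreter can be sketched in a few lines of Python. The encoding of the action part — one of ("write", s), ("move", ±1) or ("halt",) — and the toy program below are assumptions for illustration, not the book's notation.

```python
BLANK = " "

def turing_machine(program, tape, state="start"):
    """Run an (unordered) instruction set: each entry maps (state, symbol read)
    to (new state, action)."""
    tape = dict(enumerate(tape))   # linear memory, extended with blanks on demand
    head = 0
    while True:
        symbol = tape.get(head, BLANK)
        state, action = program[(state, symbol)]
        if action[0] == "halt":
            break
        if action[0] == "write":
            tape[head] = action[1]
        else:                      # ("move", step)
            head += action[1]
    return "".join(tape.get(i, BLANK) for i in range(min(tape), max(tape) + 1))

# A toy program that inverts a string of binary digits, halting on the blank.
flip = {
    ("start", "0"): ("wrote1", ("write", "1")),
    ("start", "1"): ("wrote0", ("write", "0")),
    ("wrote1", "1"): ("start", ("move", 1)),
    ("wrote0", "0"): ("start", ("move", 1)),
    ("start", BLANK): ("start", ("halt",)),
}
print(turing_machine(flip, "0110"))   # '1001'
```

The input is the initial tape, the output is the tape when the machine halts — exactly the definition above.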
A particular Turing Machine is defined by specifying the following…
Computability
One model of computation is that of a process which evaluates a function. Not
all functions are computable. There is no known way of distinguishing the
incomputable from the computable. Each problem must be investigated in its
own right. The Turing Machine is a useful model for such investigation because
of the Church thesis [Church 36] which may be paraphrased thus…
The term “effectively” implies that it must be possible to write a program for a
computer to achieve the evaluation of the function. In other words it must be
possible to describe the process of evaluation. For a good introduction to the
subject of computability see [Rayward-Smith 86].
1.2.4
Cellular automata
Origin
Here the reader may be introduced to part of the legacy of another great person
who helped create a science of computation: J. von Neumann. Von Neumann
wished to compare living entities with artificial systems. The element of life is
the living cell. Cellular automata were his invention to promote understanding
of self-replicating systems [von Neumann 66].
Interest
Just as the study of finite automata promotes understanding of sequential
computation, the study of cellular automata promotes that of parallel
computation. It is an area of research which is progressing rapidly, at the time of
writing, and promises machines which might earn the adjective “intelligent”.
Structure
An automata network is any graph of finite automata which evolves by means of
discrete interactions which are both mutual and local. A graph G is defined as a
set of sites S together with a neighbourhood system F. Hence G={S, F}.
Programming
There is no global program for a cellular automata network. Cells share a
common local program which describes how to interact with their neighbours to
update their (purely local) state. It may define either a deterministic or a
stochastic process.
Perhaps the most frequently investigated deterministic process is that of the
game of Life (Figure 1.7) [Conway 82]. Here the graph is a set of sites (cells)
If the state of the entire graph is displayed as an image (two dimensional array
of brightness, or colour, values) behaviour cycles emerge as patterns which move
or repeat themselves. Those cycles which infinitely repeat are known as limit
cycles.
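One step of the game of Life can be sketched compactly (in Python, for illustration) by treating the graph as a set of live sites; every cell runs the same local program over its eight neighbours. The "blinker" shown is a limit cycle of period two.

```python
from collections import Counter

def life_step(live):
    """One tick of Conway's Life on an unbounded grid.
    `live` is the set of (x, y) sites whose purely local state is 'alive'."""
    neighbours = Counter(
        (x + dx, y + dy)
        for x, y in live
        for dx in (-1, 0, 1) for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # the shared local program: birth on 3 neighbours, survival on 2 or 3
    return {cell for cell, n in neighbours.items()
            if n == 3 or (n == 2 and cell in live)}

blinker = {(0, 1), (1, 1), (2, 1)}          # a horizontal row of three
print(life_step(blinker))                    # flips to a vertical row
print(life_step(life_step(blinker)) == blinker)   # True: a limit cycle
```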
1.3
Processes
1.3.1
Nature
Events
The atomic (indivisible) element which characterizes a process is the event. The
observed behaviour of any process is considered to be a discrete sequence of
events. Each event is idealized to be instantaneous. It is the accomplishment of
an action in response to a single command of the process specification.
The alphabet of a process is the set of all events of which it is capable. For
instance, the alphabet of a Turing Machine is simply {All defined modify
events, move forward, move backward, halt}.
In order to be useful, a process must possess one special event in its alphabet…
succeed. Passage of this event means successful termination. The equivalent for
the Turing Machine is halt.
Example process
Recall that a process is the behaviour pattern of an object. This consists of a
sequence of events which is conditionally dependent on both its initial state and
communication with its environment. A process starts, runs and then terminates.
For an example of a process, consider an economy (Figure 1.8). We here adopt
an extremely naïve model. A number of supplier/manufacturer/consumer chains
run concurrently without communicating with each other. Each supplier inputs
raw materials and then outputs goods (perhaps refined raw materials or
components) to a manufacturer from whom it also inputs money. A manufacturer
simply inputs from its supplier and outputs goods to its customer, from whom it
also inputs money. Part of the definition of this process, not rendered explicit in
the diagram, is that it cannot output money until it is first input. The customer
inputs goods, for which it pays, and then outputs waste. Another omission from
the diagram is any hint of how the process or any of its subordinates terminate.
The alphabet of the process includes communication events between pairs of
subordinates which cause update of its internal state. Input and output events for
subordinates must be synchronized to form a communication transaction internal
to the parent process. The effect is then to update the state of the parent.
Traces
The actual sequence of events observed is called a trace and will end with a
special succeed event if the process terminates.
REPEAT
  CASE input OF
    5p: output small bar
    10p: output big bar
UNTIL FALSE

An example trace start for the process is the following. It is just one
possibility of many…

input 5p
output small bar
input 10p
output big bar
input 10p
output big bar
…
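The chocolate-machine process and its trace can be sketched as a Python generator (an illustration, not the book's notation): the trace is simply the sequence of events the process emits, and terminating the input stream stands in for termination of the process.

```python
def vending_machine(coins):
    """The chocolate-machine process: the yielded values form its trace."""
    for coin in coins:
        yield "input " + coin
        if coin == "5p":
            yield "output small bar"
        elif coin == "10p":
            yield "output big bar"

trace = list(vending_machine(["5p", "10p", "10p"]))
print(trace)
# ['input 5p', 'output small bar', 'input 10p', 'output big bar',
#  'input 10p', 'output big bar']
```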
Recursion
Processes may sometimes possess a recursive definition. This means that they
may be defined in terms of themselves. A very simple example of such a
definition is that of a clock whose alphabet is composed of a single tick
event and which never terminates.
This would be formally described by…
The initial event must always be specified and is called a guard. The idea is that
the process is a solution of the equation. Equations may be manipulated rather
like algebra. Mutual recursion is the name given to a definition specified as the
solution of a set of simultaneous equations.
Primitive processes
Primitive processes are the simplest things we need consider. In the process
model there are just three primitive processes (in addition to STOP and SKIP)…
• Assignment
• Input
• Output
Assignment refers to the assignment of purely local state. In the process model of
software there can be no global variables and no references to non-local
variables whatsoever. All resources (e.g. a database) must either be distributed or
belong solely to a single process.
Just as assignment is of a variable to an expression, output is of an expression
and the corresponding input is of its value into a variable. The correspondence of
assignment to input and output is no coincidence and shows that all computation
may be regarded as communication, as we shall see.
Construct processes
Constructs may be employed to specify a process in terms of subordinates. The
possibilities are as follows…
• Sequence
• Parallel
• Selection
• Iteration
• Guard event
• Condition
• Expression value
1.3.2
Concurrency
Synchronization
Any pair of processes running in the same window of time are said to be
concurrent (Figure 1.11). Of interest are those processes which interact by
communication.
Consider the economy example above. The manufacturer process is unable to
send money to its supplier until it has first received money from its customer. As
a result the supplier must wait idle until the manufacturer is ready and the
transaction may proceed. This is what is meant by synchronization.
Communication is delayed until both parties are ready.
Of course, the constraint of payment before supply may result in the supplier
delaying the sending of goods until payment has been received. In other words,
the supplier will not be ready for that transaction until after the other has
occurred.
Deadlock
One of the most insidious problems which can occur with the specification of
concurrent systems is that of deadlock. The classic example of this is the dining
philosophers problem (Figure 1.9).
A number of philosophers (say three) share a common dining room. Each has
his own seat and fork and is right handed. They eat nothing but spaghetti which
requires two forks with which to serve a helping. Being either stupid or utterly
selfish, they are incapable of assisting each other and so must each make use of
the fork of another or starve to death if all dine together. The problem is that they
cannot all serve themselves at the same time8. If they do not talk to each other
and reach agreement how can they avoid starvation?
Each philosopher dining may be described as a process defined by the
following procedure…
The philosophers come and go as they please and the dining system will work
except when they all start together! If they all start simultaneously, no-one will
be able to commandeer a second fork. They will all STOP and never succeed. This
situation is an example of deadlock.
One solution is to create a new process whose task it is simply never to allow a
full table at any one time. This is an example of a monitor process which permits
secure access to a shared resource. Although it prevents the philosophers from
deadlocking (starving), it enforces a degree of sequentiality which is obviously
not maximally efficient.
8 A similar situation has been used to illustrate the difference between heaven and hell.
The same physical scenario is present in both. In hell they starve, in heaven they eat!
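The monitor solution can be sketched with threads (in Python, for illustration; the book's own concurrency notation is Occam-like). A counting semaphore plays the monitor, never admitting a full table: with at most two of three philosophers seated, no cycle of fork-holding can form, so every philosopher eventually eats.

```python
import threading

FORKS = [threading.Lock() for _ in range(3)]
butler = threading.Semaphore(2)   # the monitor: never allow a full table
meals = []

def philosopher(i, helpings=5):
    left, right = FORKS[i], FORKS[(i + 1) % 3]
    for _ in range(helpings):
        with butler:              # ask the monitor for a seat first
            with left, right:     # then pick up both forks and eat
                meals.append(i)

threads = [threading.Thread(target=philosopher, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(meals))   # 15: everyone ate five helpings, no deadlock
```

Removing the `with butler:` line restores the possibility of deadlock: all three may seize their first fork simultaneously and wait forever for the second.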
1.3.3
Communication
Successive processes
Communication is fundamental to computation. Computation may be regarded as
purely assignment and communication at all levels, from hardware upwards.
In the purely procedural model of computation, and hence pure procedural
programming languages, communication is rarely formalized. No concurrency is
allowed since usually only a single processor is assumed available.
Communication is reduced to that between processes which run in sequence, i.e.
successive processes. The only way in which they may communicate is by means
of a shared variable called a buffer. Figure 1.10 shows the relationship between
successive processes and the variable they share which belongs to their mutual
parent process.
The situation is like that in a market when a vendor is waiting for an expected
customer who is to purchase his last item of stock. If the vendor knows the
customer cannot arrive until he has already sold the rest of his stock, it is
obviously secure and efficient for him to leave the purchase in an agreed location
for the customer to collect. This is only secure if the agreed location is not known
to anyone else. Leaving the purchase is equivalent to assigning a value to a
buffer at the previously declared (agreed) location. The location and the data
type form the protocol for the transaction (communication event).
This form of communication is said to be asynchronous because sending and
receiving take place at different times. Rather than characterize processes as
successive or concurrent it is sufficient, and arguably more meaningful, to
simply relate them by whether their communication is asynchronous or
synchronous. Communication between successive processes is asynchronous and
uses a shared variable called a buffer.
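The market-vendor analogy can be sketched directly (in Python, for illustration): the buffer belongs to the parent, and the two successive child processes touch it at different times. The function names are invented.

```python
def parent():
    """Two successive processes communicating through a shared buffer
    that belongs to their mutual parent."""
    buffer = {}              # the agreed location, known only within the parent

    def vendor():            # runs first: leaves the goods at the agreed location
        buffer["goods"] = "last item of stock"

    def customer():          # runs later: collects them at a different time
        return buffer.pop("goods")

    vendor()                 # send ...
    return customer()        # ... then receive — asynchronous communication

print(parent())   # 'last item of stock'
```

This is secure only because, as in the analogy, the agreed location is not known to anyone else: the buffer is local to the parent.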
Concurrent processes
Communication between concurrent processes is a different problem.
Figure 1.11 illustrates communication between concurrent processes within their
time window of overlap. The input and output events of receiver and sender
processes respectively are synchronized to form a single event of their mutual
parent. Communication between concurrent processes is synchronous and uses a
channel.
Communication, whether synchronous or asynchronous, may be categorized
by the number of senders and receivers as follows…
• One-to-one
• One-to-many (broadcast)
• Many-to-one (multiplex)
Further reading
Computers may be modelled as a network of processors running processes which
communicate with each other and an environment which may be modelled in the
same way. The process model of computation may be applied to the lowest level
of computation even unto the level of the normally-open switch.
The model very briefly introduced here originates with [Hoare 78]. A full and
fascinating account is given in [Hoare 85], which the reader is strongly
recommended to read.
Exercises
Question one
i Protocols may be layered one above the other. For example natural language
employs rules collectively called a syntax to determine the valid structure of
symbols called words in each sentence. Part of this protocol is the rule that a
sentence is a stream with EOT “.”. Below that protocol lies another, part of
which takes the form of a dictionary defining the valid structure of symbols
called characters in each word.
If both writer and reader of this book are regarded as processes, what is the
protocol used in their communication?
ii Detail any other kinds of protocol you can think of, which are used for
channels of communication in the everyday world.
Question two
i Summarize the instruction set of the (human) automaton discussed in Section
1.2.2.
ii Suggest an implementation of this automaton in Modula-2 or other
structured procedural programming language.
Question three
Show how normally-open switches may be used to implement a NAND logic
gate, which has output “1” unless both inputs are “1”. Use only two switches.
Question four
i Describe an example, of your own, of a process where some of the subordinate
processes run concurrently. Having done that, now describe an example where
all the subordinate processes run successively.
ii A process is composed of four subordinate processes, A, B, C, D. The
following communication paths must exist…
• A–B (asynchronous)
• B–D (synchronous)
• C–D (asynchronous)
• C–A (synchronous)
Two processors are available, which do not share memory but which possess
physical (hardware) communication channels (one input and one output channel
each). How must the processes be assigned to processors?
iii Is there any possibility of deadlock?
Chapter 2
Software engineering
2.1
Projects
2.1.1
Engineering design process
The following universal design process is employed in all fields of engineering…
1. Analyse
2. Design
3. Implement
4. Verify
• Procedure
• Process
• Object
• Relation
• Function
In any one design, all modules must be of the same kind to make interfaces
between them possible.
Some means of dividing the implementation into corresponding software
partitions must be available which permits separate development and separate
verification.
Verification is usually the greater problem. One can rarely be certain that an
implementation is totally correct. The best that is often possible is to do some
tests to verify that the requirements specification is met with the largest possible
set of inputs. Exhaustive verification means testing the system output for every
legal input and is usually prohibitively expensive. It is often possible to classify
input. If so, a test set of inputs may be selected with members from each class.
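The selection of a test set by classifying input can be sketched as follows. This is a minimal illustration only (in Python, with an invented routine and invented class boundaries), not a method prescribed by the text: rather than test every legal input, one representative is drawn from each class, together with the boundary values between classes.

```python
# Hypothetical routine under test: absolute value.
def absolute(x):
    return -x if x < 0 else x

# Input classes: negative, zero, positive. One representative from each,
# plus values at the class boundaries, stands in for exhaustive testing.
test_set = [-100, -1, 0, 1, 100]
for x in test_set:
    assert absolute(x) >= 0          # output is never negative
    assert absolute(x) == absolute(-x)   # symmetric about zero
```

The test set has five members, whereas exhaustive verification over (say) 32-bit integers would require over four thousand million.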
2.1.2
Organization
A project itself may be thought of as a process whose output is an
implementation of a system meeting the requirements input. This time, however,
the process is carried out by people. The implementation for a particular
requirement is known as an application.
Projects make use of the design process described above. Each sub-process is
usually carried out by a separate team. The reader should think of a line leading
from requirement to verified solution and a sequence of sub-processes along it. At
the design stage the requirement meets prior knowledge and a very poorly
understood process known as human thought produces a solution.
The processes of analysis, design, implementation and verification, when
applied to computer systems, are collectively known as software engineering. A
readable text on software engineering is [Sommerville 85].
Few systems are simple to engineer; most are complex. Projects are usually
constrained to a tight time scale because of the need for a product to reach the
market before its competition. As a result it is essential to distribute effort across
a team.
Three problems arise here…
• Distribution: How may the project be broken down between team members?
• Interface: How may team members work together effectively?
• Co-ordination: How may the project now be managed?
The problem may be summarized as how to get a ten man-year job done in a
single year!
There is no single clear answer to any of these questions. They are still the
subject of much research. The ideas in modular software engineering go some
way towards a solution.
2.1.3
Languages
Requirements
The concept of language is central to both computer science and software
engineering. There is need for a language to communicate each of the following
things…
Figure 2.2: Top-down task decomposition diagram for the composition of this book
• Requirements specification
• Design specification
• Implementation (programmer readable)
• Implementation (machine readable)
The language used to specify requirements is usually the local natural language
(e.g. English). Design and implementation languages should be compatible since
the programmer must translate the design specification into an implementation.
A semantic gap is said to separate the two implementation languages, that of the
programmer and that of the machine. Its width depends on the design philosophy
of the machine, and it is crossed via translation. The machine language is in
terms of capabilities of the machine itself, which will reflect both its design
philosophy and the model of computation chosen.
2.2
Modular systems design
2.2.1
Tasks
Balanced loading
The principle of balanced loading may be roughly stated as follows…
It does not help at all if we decompose a task without making its offspring
significantly easier to perform. Ideally all tasks at all levels should be of equal
difficulty to minimize the overall effort required and maximize its distribution.
These ideas can be usefully illustrated by the analysis of management
structure. Figure 2.3 (lower) represents the better structure. It may not
necessarily be very good in practice. It is easy to imagine that one middle
manager may have less to do than colleagues at the same level. The result would
be an imbalance of load between them. There is no easy way to assess relative
task loading. An initial design may have to be revised after it has been tested.
Although it may appear easy, obtaining a decomposition of a task which is
balanced in both dimensions is far from it. It requires skill on the part of the
designer which takes experience to gain.
Population growth
Following balanced loading, a second principle of top-down task decomposition
may be stated…
Each partition of a task should be into between two and five offspring.
The size of the “box” population should not grow too quickly or too slowly.
Between two and five children per parent is recommended.
No single analysis should consider more than a few levels. A sensible number
is three. Each terminal (lowest level) task should subsequently be the subject of a
further analysis if necessary.
To continue with the management analogy, compare the two alternative
structures in Figure 2.3. It is hard to imagine how balanced loading could ever be
achieved with the upper structure. Still worse is that extra, unnecessary,
interaction must occur. The path of command is twice as long as in the lower
structure.
Interaction
Once a satisfactory top-down diagram is obtained, the interface between each
task and its parent must be rendered clear within the specification of both. A
third principle is necessary…
The precise description of the interaction between parent and offspring tasks is
called the interface.
The ideas detailed here are based on those of Edward Yourdon. For a full
account of top-down design see [Yourdon & Constantine 78].
Modules
The top-down diagram is progressively developed until each task may be
delegated to a single engineer (or team of engineers) who may be expected to
produce a verified system prior to a declared deadline. The software to perform
each task delegated in this way is called a module. Each module will eventually
exist in the following guises…
• Definition module
• Implementation source module
• Implementation object module
The definition module defines the module interface, whilst the implementation
source module contains the software itself in humanly readable form using a
programming language. The implementation object module contains the same but
in machine readable form using a machine language.
To summarize, in order to render a large requirement manageable we break it
down into tasks until a point is reached where individuals (or small teams) are able
to implement software to perform each one. The system design is then a number
of definition modules, corresponding to each task, which each specify the
interface to both parent and offspring. It must be emphasized that all tasks at all
levels are represented by modules. The first definition module to be written is the
topmost.
Top-down decomposition is applicable with all programming models.
However, tasks are most easily related to procedures (which perform them). In
the procedural model the communication of…
• Procedures
• Variables
• Constants
• Data types
…is undertaken. A parent is said to import any of these from its offspring. An
offspring is said to export them to its parent.
The reader must be wary of confusion here. Modules are purely descriptive.
They offer a hierarchical method of describing the required system for human
benefit only. For the sake of performance (i.e. fast execution) the machine should
be as unhindered as possible by inter-module boundaries. It is not possible to
have modular software without some performance diminution, so modularity may
need to be traded off against performance.
In order to reduce development cost it is vital that software be reused
whenever possible. Library modules of previously written software should be
built up and used later. To summarize…
The benefits of modules are twofold. Firstly, modules make the development of
large systems manageable. Secondly, reusable software drastically reduces
development cost. Modules should be thought of as hardware “chips”. They
prevent “reinvention of the wheel” and unnecessary duplication of effort. They
form a software resource which need only be developed once and then used in
conjunction with any (correctly interfaced) higher level modules. The interface,
as detailed in the definition module, should be all that it is necessary to know in
order to use a library module. It should not be necessary to even have available
the corresponding implementation source module.
2.2.2
Processes
Partitions
There is one other benefit gained from data flow analysis. A means of process
oriented design is afforded which conforms to the notion of stepwise refinement.
Each node on the data flow graph may be reduced to a new subgraph. This
continues until processes are reached which may be implemented without further
reduction. In other words the system is partitioned into a network of processes.
Each process may then be separately developed, maintained, verified and reused
as a software module.
2.2.3
Objects
Nature
The reader is expected to be familiar with the procedural programming model
where software specifies system behaviour in terms of procedures which act
upon data structures which may be either static or dynamic. The object oriented
programming model unifies data and procedure with the concept of an object.
The idea is rather like an extension of that of a record by allowing procedure
fields which alone may act upon the associated data fields. These are termed
methods and are invoked only by sending the object an appropriate message.
State is said to be encapsulated with methods and can only be updated by the
arrival of a message. For instance, a graphics system may require the ability to
draw a circle. A circle object is defined which responds to the message “draw”
by invoking a suitable method to update the screen (state). Polymorphism is the
name given to the ability of different objects to respond differently to the same
message. A line object may also respond to “draw” with a completely different
result.
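The circle and line example can be sketched concretely. The following is an illustration only (in Python, a language with direct support for the object oriented model; the class and method names are invented for the purpose): each object encapsulates its own state, and both respond to the same "draw" message with their own method.

```python
# Each object encapsulates state (its fields) together with the methods
# which alone may act upon that state.
class Circle:
    def __init__(self, centre, radius):
        self.centre, self.radius = centre, radius   # encapsulated state
    def draw(self):                                 # method invoked by "draw"
        return f"circle at {self.centre}, radius {self.radius}"

class Line:
    def __init__(self, start, end):
        self.start, self.end = start, end
    def draw(self):                                 # same message, different method
        return f"line from {self.start} to {self.end}"

# Polymorphism: the same message sent to different objects
# produces completely different results.
for shape in (Circle((0, 0), 5), Line((0, 0), (3, 4))):
    print(shape.draw())
```

The sender of "draw" need know nothing of which kind of object receives it; this is what allows a graphics system to treat a heterogeneous collection of shapes uniformly.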
Objects in the real world may be represented by rendering an abstract data
type which represents not just its state but also the operations which may be
performed on it. A system is represented as a network of communicating objects
very similar to one composed of processes. The only truly significant difference
is that objects possess a property known as inheritance. It is possible to declare
the…
• State
• Methods
• Messages
• Encapsulation
• Message passing
• Polymorphism
• Inheritance
See [Thomas 89] for a more thorough and very readable introduction. The
archetype object-oriented language is Smalltalk [Goldberg & Robson 83]. The
use of objects in system design is thoroughly treated in [Meyer 88].
The implementation of objects using a procedural language is limited since
message passing and inheritance are not explicitly supported. However [Stubbs &
Webre 87] is an excellent introduction to the implementation of abstract data
types given such limitations.
Lists
A list is a dynamic abstract data structure, i.e. it may change in size while the
process to which it belongs is running. List elements may be physically either
sequence associated or pointer associated depending on implementation.
Arguably the list is the simplest object of all. The minimum set of messages it
should recognize is…
• Insert
• Remove
Both are followed by a key which identifies the element affected. Typically,
other messages would allow an element to be inspected without removal, or a
check to be made that a given key is present.
A list is referred to as a generic type of data structure. In other words it defines
a class of object. Lists are linear and thus are ordered. Each element has precisely
one successor and one predecessor except those at each end. Extra messages
may be defined which exploit the structure best for a given application.
Stack
At the purely data level a stack is identical to a list. It is also a linear, ordered
structure. It is its message protocol that differs (see Figure 2.5).
The minimum set of stack messages is…
• Push
• Pop
Queue
The last object class to be introduced has the following minimum message set…
• Enqueue
• Serve
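The minimum message sets of the three object classes can be sketched together. This is an illustrative sketch only (in Python; the representations chosen, a mapping for the keyed list and arrays for stack and queue, are assumptions of this sketch, since the text leaves the implementation open):

```python
# Keyed list: insert and remove identify the element affected by key.
class List:
    def __init__(self):
        self.items = {}                  # key -> element
    def insert(self, key, element):
        self.items[key] = element
    def remove(self, key):
        return self.items.pop(key)

# Stack: identical data, different message protocol (last in, first out).
class Stack:
    def __init__(self):
        self.items = []
    def push(self, element):
        self.items.append(element)
    def pop(self):
        return self.items.pop()

# Queue: again linear and ordered, but first in, first out.
class Queue:
    def __init__(self):
        self.items = []
    def enqueue(self, element):
        self.items.append(element)
    def serve(self):
        return self.items.pop(0)
```

Note that at the purely data level all three are linear, ordered structures; only the message protocol distinguishes them, which is precisely the point made above.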
2.3
Structured programming
2.3.1
Primitives
Assignment
Variables are distinct, named items within the memory subsystem of a computer.
The name of a variable is a symbol which denotes its location. A variable is said
to have a value which must be chosen from a set known as its range. Two
variables with ranges which differ in size or content are said to be of different
data type.
Variables of a process constitute its state. Values of variables may change
while the process to which they belong runs.
Constants are symbols which denote a value which remains unchanged while
the process to which they belong runs. Their type must be chosen from those of
the variables belonging to the same process. Each is then said to be compatible with
variables of that type.
Binary operators are primitive functions which take two arguments and evaluate
a result. A simple example is the add operator, which when applied to the
arguments {3, 4} evaluates to 7. The set of operators supplied in a programming
language will depend on the applications for which it is designed. For instance, a
language used to develop mathematical applications must have, at very least, a
full set of algebraic operators. One used to develop graphics or image processing
applications will need (at least) a full set of logical operators.
Expressions are combinations of operators, variables and constants which are
evaluated by evaluating each operator in turn. The arguments of each binary
operator are themselves expressions. A variable or constant is just a
primitive expression.
The order in which expressions must be evaluated will be dictated by the
syntax (rules) of the language. It is common to use parentheses to allow the
programmer to indicate any particular meaning. In order to reduce ambiguity the
language usually defines an operator precedence which defines the order in
which operators are evaluated. Responsibility for removing all ambiguity usually
rests with the programmer.
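The interplay of precedence and parentheses can be seen in a two-line example (shown in Python, whose precedence rules for these operators are typical of most programming languages):

```python
# The multiply operator has higher precedence than add, so without
# parentheses the language evaluates the multiplication first.
without_parentheses = 2 + 3 * 4      # evaluated as 2 + (3 * 4)
with_parentheses = (2 + 3) * 4       # parentheses override precedence
print(without_parentheses, with_parentheses)   # 14 20
```

A careful programmer parenthesizes wherever the intended order of evaluation is not immediately obvious to a reader.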
Assignment means the assignment of a value to a variable. In procedural
programming an assignment statement commands the assignment to occur at a
specific point in the procedure. The variable to which the value is assigned
appears on the left hand side of a symbol denoting assignment. On the right hand
side is an expression, e.g.…
answer:=42
Input
See Chapter 1 for a discussion of the process model of computation.
An input command causes the assignment of a value input (over a channel) to
a variable. In the Occam programming language (see 2.4.2) the variable name
appears on the right hand side, the channel name on the left and a symbol
denoting input lies in between, e.g.…
c.keyboard ? key.pressed
Output
Point-to-point communication is the most fundamental within any network. It is
that between just two nodes and in one direction only. It constitutes a single event
for the network but one event each for sender and receiver (output and input).
Out of this, many-to-one (multiplexing) and one-to-many (demultiplexing or
broadcast) communication can be achieved.
An output command causes the output of the value of an expression onto a
channel. Consequently the channel name and expression must be stated. In
Occam the expression appears on the right hand side, the channel name on the
left and a symbol denoting output lies in between, e.g. …
c.screen ! 6*7
Hence this means “output the value of ‘6*7’ onto channel ‘c.screen’”.
2.3.2
Constructs
Sequence
The SEQ construct is used to directly define an event sequence. Each statement
is a command causing one or more events to occur (see Figure 2.7).
Each statement may be a primitive or a procedure name and may be regarded
as a process which must terminate before the next starts.
Parallel
The PAR construct is the means by which the programmer specifies which
processes are to run concurrently. It must itself be considered a sequential
process whose event order will depend upon communication between the
component processes.
The events of a process defined by a parallel construct are the assignments
belonging to each component process and each communication (i.e. input +
output) between them (see Figure 2.8).
Once again, each statement within the construct is either a primitive or a
procedure name.
Selection
Selection means choosing one process from a number of stated possibilities. The
choice is made immediately before the one chosen is to start. There are three
ways in which the choice may be made…
2.3.3
Partitions
Modules
Modules are simply a means of partitioning software for human benefit. They do
not reflect any partition relevant to execution. The whole point of a module is
that an individual (or team) may take responsibility for its separate development
and separate test. They have the added advantages of reusability between
applications (like hardware chips) and maintainability. To facilitate reusability,
there should ideally be no difficulty in interfacing modules when they are
designed, implemented or run.
Each of these should have its own, clearly delineated, section in the source
module (see Figure 2.10).
Procedures
Procedures offer a means of further partitioning software. Once again they exist
to ease development, maintenance and modification. They do not enhance
performance but if anything reduce it. Like modules they are to some extent
reusable. They should be thought of as implementing a single task. Neither the task
nor the procedure should be bigger than a single human can understand easily at
one time. A guideline recommended by the author is that the procedure source
should all fit on the screen of a terminal at once. All statements of a program
should be thought of as belonging to one or other procedure.
Procedure parameters are variables on whose value the precise action of the
procedure depends. Value parameters are variables whose current value is passed
to the procedure. These variables thus remain unaltered after procedure
execution. Reference parameters are variables whose location is passed,
allowing the procedure to alter their value.
The scope of a variable is the software partition in which it is visible.
Variables declared within a procedure are called local variables hence their
scope is that procedure only. The same goes for constants. Parameters are said to
have local scope only. Local scope usually includes any procedures declared
locally, although this is not true of Occam.
2.4
Standard programming languages
2.4.1
Modula 2
Primitives
Assignment is the only primitive in Modula 2. There is no support for
concurrent or parallel processing at the primitive level. Expression evaluation
is well supported
with operators to support mathematical, graphics and text processing
applications.
Constructs
All the purely sequential constructs discussed above are available. A sequence is
defined as any set of statements separated by “;”. There is no support at the
construct level either for concurrency or parallel processing. (No PAR or ALT.)
In addition to WHILE, REPEAT and FOR constructs Modula 2 has another
iteration construct…LOOP. Multiple exits are possible using a command EXIT
which may be conditionally selected. Use of the LOOP construct is not
recommended for the inexperienced programmer!
Partitions
Software partitions are the strength of Modula 2! Procedures and modules are
fully supported. Functions are not quite the same as in other languages. They
occur in Modula 2 in the guise of function procedures which are able to take
reference as well as value parameters (caution!) and unfortunately are unable to
return structured data objects. The effect of this can be simulated by using a
reference parameter (although it will lead to less readable source code) or by
returning a pointer to the structured data.
The procedure data type is provided allowing procedures to be passed as
parameters. This is a powerful feature which should be used with great care.
Modula 2 was the first language to provide modules as an explicit, readable
part of the language. It is possible to clearly delineate between all four sections
listed above.
Concurrency
It is at this level that support is found for concurrent processing though not for
parallel processing. A special module, guaranteed always available, called
“System” exports two procedures and one data type which allow extra
processes, called coroutines, to run quasi-concurrently with a procedure
defined in the usual way. These are…
• NewProcess: A procedure which creates, but does not start, a new process. It
must be passed a parameter of procedure type which defines the new process
and returns a variable of process type
• Transfer: A procedure which transfers control between two processes
specified by their descriptors which are passed as value parameters
• Process: A data type used for process descriptors
• Secure communication
• Mutual exclusion (of processes from shared resources)
• Synchronization
It is certainly for experts only. Unfortunately, even experts make mistakes which
in this area can be disastrous.
Support for concurrent processing in a programming language should be
provided at the primitive level. The same facilities should also (transparently)
support parallel processing and be simple to use. This is only possible if the idea
of shared resources is abandoned and each process has its own private memory.
Computer architects must provide support for both concurrent processing
(scheduling, soft channels) and parallel processing (hard channels).
Applicability
Modula 2 is especially recommended for large projects. It renders large projects
manageable because of the availability of modules for design, specification and
implementation. The high degree of the source readability eases its maintenance.
Development costs may be cut by exploitation of module reusability.
The range of operators available makes it appropriate to mathematical, textual
and graphics applications. Not discussed above are facilities for systems level
programming which are reasonable.
As a vehicle for applications requiring concurrent processing it is usable (with
great care) by experts only. It is not equipped for parallel processing at all.
Further reading
The purpose of this brief summary is to set the background for subsequent
discussion of hardware architectural support of modular, procedural
programming. It is not the intent to be thorough.
The discussion should be enough for a student who has successfully traversed
a course in procedural programming using any fully block-structured language
(e.g. Pascal).
[Knepley & Platt 85] is recommended as a readable tutorial, accessible to any
student, with a fair amount of example code. [Wirth 85] is the official summary
given by the author of the language, Niklaus Wirth, and is recommended as a
tutorial and reference text for those experienced in at least one other
programming language. Modula 2 is now well established. Many good texts are
thus available.
2.4.2
Occam
Primitives
Probably the most important and novel feature of Occam is that it supports
concurrency at the primitive level. There are three main primitives…
• Assignment
• Input
• Output
…together with the special processes SKIP and STOP. All five are fully
fledged processes in their own right but, from the point of view of the
programmer, are indivisible.
Data types available unfortunately differ between variable and channel. For
instance, variable length strings and records are supported as channel protocols
but not as the data types of variables.
Constructs
Occam succeeds beautifully in being at once both simple and extremely
expressive in its facilities to construct sequential processes. Construction of a
sequence of subordinate processes, i.e. where the event ordering is explicitly
stated, is made possible using a SEQ construct. Iteration is provided via a
WHILE construct, conditionally terminated in the usual way.
Construction of a single sequential process from a number of concurrent ones
is enabled with the PAR construct. All specified processes start together. The
PAR itself terminates only when all the subordinate processes have terminated.
If just one of them STOPs so does the PAR. It is essential to understand that,
although its subordinate processes run concurrently, the PAR itself forms a
single process which is sequential like any other.
All three forms of selection (mentioned earlier) are available. CASE selects a
process from a list by expression value. IF selects from a list by the truth of an
associated condition. Care is required here! Should no condition evaluate true
the process STOPs. A wise programmer thus adds the condition TRUE and an
associated SKIP process to the list. There exists a second problem with IF. Given
more than one condition succeeding in the list specified, which process is run? In
Occam it is the one following the first successful condition. Nothing is gained by
nesting IF constructs in Occam. All conditions may simply be entered in the list
of a single construct. This has the effect of improving readability.
Selection by channel is supported by the ALT (ALTernative) construct. A list
of input guards is specified each with an associated process. The process
associated with the first ready guard is the one which runs. If no guard is able to
terminate the ALT is suspended until one can.
Each guard may be supplemented with a condition to selectively exclude
inputs from consideration. A guard may be an input from a timer process, which
outputs time. The AFTER qualifier may be used to prevent the guard from
succeeding until after a stated time thus effectively allowing a “timeout” process
to be triggered if no other input arrives.
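The behaviour of an ALT with a timeout guard can be loosely sketched in a conventional language. The following Python fragment is an analogy only, not Occam (a blocking read with a timeout stands in for the combination of an input guard and an AFTER-qualified timer guard; the function name is invented):

```python
# Analogy only: select the process for the input guard if a value
# arrives in time, otherwise trigger the "timeout" process.
import queue

def alternative(channel, timeout_seconds):
    try:
        value = channel.get(timeout=timeout_seconds)
        return f"input guard ready: {value}"     # process for the input guard
    except queue.Empty:
        return "timeout process triggered"       # AFTER-style timeout guard

ready = queue.Queue()
ready.put(42)
print(alternative(ready, 0.1))          # input guard ready: 42
print(alternative(queue.Queue(), 0.1))  # timeout process triggered
```

A real Occam ALT selects among many guards simultaneously; this sketch shows only the two-way choice between one input and the timeout.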
Replication
Replication of SEQ, IF, PAR and ALT construct level processes is supported. If
the design calls for a number of procedurally identical processes, which differ
only in elements chosen from arrays of either channels (PAR, ALT) or variables
(SEQ, IF), extremely concise and elegant programs may be written. Replication
renders Occam very expressive.
Partitions
We have already met one of the weaknesses of Occam in its limited set of data
types for variables. A second weakness is arguably in its support of partitions to
benefit software design. There are no modules which can encapsulate packages
of procedures for reuse between applications. The effect can, of course, be
achieved by shuffling text between files using an appropriate editor. Occam does
however go part way towards the concept of a module. Individual procedures
may be separately compiled and then linked into any number of applications.
Procedures in Occam are well supported as named processes which take
reference parameters by default. Value parameters are also supported. If
invoked as a subordinate of a SEQ, a procedure must be passed variables with
which to communicate with its neighbours; if as a subordinate of a PAR, it
must be passed the appropriate channels.
Functions also form part of Occam. Only value parameters are allowed. All
parameters must be variables and not channels. Hence functions which
communicate together must communicate via variable sharing and hence must
run successively. Neither may they contain any nondeterminacy in the form of
ALT or PAR constructs.
Neither procedures nor functions may be recursively defined. This constitutes
a third weakness in that it places a limit on its expressivity, especially for
mathematical applications.
Applicability
Concurrent functions or procedures cannot share variables and thus cannot inflict
side effects upon each other. A newcomer to Occam might well start off with the
notion of it as “Pascal with no shared variables”. The idea of doing away with
the sharing of any state between processes is central and responsible for Occam
being able to offer simple yet unrestricted access to the power of parallel
processing.
Event processing is the essence of creating systems which effect real time
control over automation (e.g. production lines, robots, washing machines,
watches, kettles etc). Such systems are known as embedded systems and now
account for the greatest and fastest growing number of computers. Languages
such as Ada and Occam were originally designed for real time control and
embedded systems.
Problems in the real world almost always possess a high degree of
concurrency. It is the exception which is purely composed of successive
processes. Occam is the simplest programming language with which to
adequately describe such problems. It is much simpler to describe a problem
composed of concurrent, communicating sequential processes with a language
which is equipped for it than to mentally transform it into a purely sequential
form. Unfortunately the majority of practising software engineers are firmly
rooted in thinking sequentially and thus often find Occam difficult and natural
concurrency obscure.
The simplicity of Occam, together with its expressive power over the behaviour
of both natural and machine systems, gives it an accessible range of
applications, limited only perhaps in scale by its limited modularity.
Further reading
We can only afford a brief summary of Occam here, enough to serve subsequent
discussion of hardware architectures which support the process model of
computation. [Burns 88] is an excellent tutorial which includes a comparison
with Ada. [Fountain & May 87] offers an alternative introduction. [Inmos 88#1]
is the official language definition and manual which also contains many useful
examples. It is absolutely indispensable.
Exercises
Question one
i Define in your own words the meaning of the following terms and how they
may be used in the engineering of software…
• Module
• Top-down task diagram
• Data flow diagram
Question two
i Explain the evolution of a software project in terms of the various phases of its
life.
ii Identify the phase at which the following must be specified…
• Definition modules
• Pseudocode
Question three
i What are library modules and why are they commercially important to software
engineering?
ii In what form would you expect to use library modules?
Question four
If neither procedures nor modules should affect the machine execution of a
program, why should a machine language support them?
Chapter 3
Machine language
3.1
Nature
3.1.1
Translation from programming language
Machine language is used for the description of the system process which may
be presented to the raw machine. On the other hand, a programming language is
used for a description of the same thing but presented in a manner
understandable to a human. The programming language should also render the
software modularity clear to a human. In other words the boundaries and
interfaces of procedures and modules should be opaque in the programming
language but transparent in the machine language.
The language understood by different machines varies greatly. Portability of
programs between computers is achieved via use of common programming
languages. The machine language program is not portable, except between
machines of identical design.
The problem of informing a machine of the procedure you wish it to follow is
much like that of informing a person, who speaks another language, how to
perform a task. The language of the speaker (programmer) must be translated
into that of the listener (machine). If the translation is carried out word by word,
as it is spoken, it is referred to as interpretation. If it takes place after everything
has been said it is referred to as compilation. Figure 3.1 illustrates the problem.
The difference between the programming and machine language is called the
semantic gap. It is possible to argue that hardware architecture should evolve so
as to narrow this gap. This raises the question of the choice of programming
language. Since it is now comparatively cheap to develop a new processor, a
number of new designs are appearing which are optimized to close the semantic
gap for a variety of programming languages.
An alternative view is that the semantic gap is an inevitable consequence of
the conflict between the machine requirements (for performance) and those of
52 CHAPTER 3. MACHINE LANGUAGE
humans (for maintainability and ease of development). In this case the machine
language design may take more account of the requirements of the compiler code
generator, which must generate machine language “code” automatically. Where
alternatives exist, it must then be easy for the generator to choose between
them logically.
3.1.2
Structure
A machine language program takes the form of a stream of independent
instructions. They are conventionally encoded as binary numbers and are
executed sequentially in the order in which they are found. Each is made up of
two parts…
• Operation
• Operands (0, 1, 2 or 3)
the previous operation (e.g. a subtraction) was zero or not. Hence, for example,
two numbers may be compared and subsequent action rendered dependent upon
whether or not they share a common value. All processor state is stored
collectively in the processor state register (PSR).
Machine code for the running process is stored in a large linear memory
(Figure 3.2) which may be referenced randomly as an array. The array index is
called an address and each memory cell a location. Array bounds are simply zero
and its size (minus one). The address of the next instruction to execute is stored
in another register, wide enough to accommodate any address, called the program
counter (PC) (Figure 3.3). Sequencing is obtained automatically by incrementing
the program counter after each instruction is executed.
Conditional instructions may be used to modify the control flow by
conditionally updating the program counter. All selection and iteration constructs
may be implemented using a single instruction, the conditional branch, which
adds its single operand to the program counter if the condition succeeds but
otherwise does nothing.
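Taken together, automatic sequencing and the conditional branch can be sketched as a toy interpreter. The tuple encoding and the instruction names below are invented for illustration and belong to no real machine.

```python
# A toy interpreter: each instruction is (opcode, operand); "br" always
# adds its offset to the program counter, "cb" adds it only when the
# condition holds, and the counter is incremented automatically after
# every fetch.

def run(program, condition):
    pc = 0
    trace = []                      # opcodes in execution order
    while 0 <= pc < len(program):
        op, operand = program[pc]
        pc += 1                     # automatic sequencing
        trace.append(op)
        if op == "br" or (op == "cb" and condition):
            pc += operand           # modify the control flow
    return trace

# Skipping one instruction when the condition succeeds:
prog = [("nop", 0), ("cb", 1), ("nop", 0), ("nop", 0)]
```

With the condition holding, the `cb` causes the instruction after it to be skipped; otherwise execution simply falls through.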
Almost all modern computers make use of the idea that the machine code
should reside in the same memory device as data. Obviously care must be taken
that the two occupy distinct areas of memory. However shared memory
simplifies the architecture and, hence lowers the cost, of the whole computer.
Computers using this principle are often referred to as von Neumann machines,
after the person credited with the innovation. Those which employ separate
memories for code and data are referred to as Harvard machines.
3.1.3
Interpretation
It is the function of the processor control unit to interpret machine language. In
other words it translates each instruction, one at a time, into a sequence of
physical microoperations. There may be two parallel components to a micro-
operation…
• Register transfer
• Control of functional unit
(1) r0 → alu.in.0
(2) r1 → alu.in.1, alu(add)
(3) alu.out → r1
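One plausible reading of this sequence (assuming the register contents flow into the ALU inputs and the sum is returned to r1) can be sketched as explicit transfers on a dictionary of storage elements; the element names follow the text.

```python
# Each microoperation is a register transfer, possibly in parallel with
# control of a functional unit (here, telling the ALU to add).

def step1(s):
    s["alu.in.0"] = s["r0"]                         # r0 -> alu.in.0

def step2(s):
    s["alu.in.1"] = s["r1"]                         # r1 -> alu.in.1
    s["alu.out"] = s["alu.in.0"] + s["alu.in.1"]    # alu(add), in parallel

def step3(s):
    s["r1"] = s["alu.out"]                          # alu.out -> r1

state = {"r0": 2, "r1": 3, "alu.in.0": 0, "alu.in.1": 0, "alu.out": 0}
for micro in (step1, step2, step3):
    micro(state)
# state["r1"] is now 5
```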
3.1.4
Instructions
function may be computed using just a few binary logical operators. In fact the
set {and, or, not} is sufficient.
In reality any operator requires a sequence of operations (and hence some
time) to be evaluated. For example, the function plus (a, b) will usually require
transferring the values of a and b to the input of a physical adding device and then
transferring the result to plus. In addition the adder must be activated at the right
moment. It must run some process to generate a sum. We postpone these
problems until Part II. It is vital to understand that the machine language
represents the operation of the hardware at its highest physical level, not its
lowest. Operator instructions may give the effect of “instant” evaluation when
executed, but in fact many distinct register transfer operations may be required1.
Control flow
The use of a conditional branch instruction to modify the contents of the program
counter depending on processor state is discussed above. We now turn to how it
may be used to implement selection and iteration constructs.
Shown below are code segments for the machine language implementation of
both WHILE and IF…THEN…ELSE constructs. In order to avoid binary notation,
mnemonics are used for all instructions.
WHILE construct…
; start
cb <exit>
…
br <start>
; exit
IF…THEN…ELSE construct…
cb <else>
…
br <exit>
; else
…
; exit
<exit> denotes the offset to the next instruction in memory. <else> denotes the
offset to the code to be executed should the condition fail. Note how much more
convenient it is to have an instruction which branches only if the condition fails.
Of course the actual condition needed may be the negated version of the one
available.
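The offsets in the WHILE pattern can be computed mechanically from the lengths of the condition and body code, which is how a compiler code generator would lay it out. The list-of-tuples encoding and the function name below are invented for illustration.

```python
# Lay out a WHILE loop as: condition code, cb <exit>, body code,
# br <start>. Offsets are relative to the already-incremented program
# counter, as in the text.

def lower_while(cond_code, body_code):
    code = list(cond_code)
    # branch past the body and the back-branch if the condition fails
    code.append(("cb", len(body_code) + 1))
    code.extend(body_code)
    # branch back to re-evaluate the condition
    code.append(("br", -(len(cond_code) + len(body_code) + 2)))
    return code
```

For a one-instruction condition and one-instruction body this yields `cb 2` and `br -4`, which return control exactly to the start of the condition code.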
Linkage
In order to ease the engineering of software, it is necessary to provide support for
procedure invocation. Procedure code, at the level of the machine, is referred to
as a subroutine. Invocation requires the following…
• Branching to subroutine
• Returning from subroutine
• Passing parameters
The first is directly implemented with another branch instruction which we shall
give the mnemonic bsr. It is beyond the scope of this chapter to discuss support
for nested procedure invocation. The method adopted here is simply to save the
incremented program counter value in a register as the return address. As well as
doing this, bsr adds its operand to the program counter register in order to enter
the subroutine.
Returning from subroutine is achieved by the ret instruction which simply
copies the return address back into the program counter and must be the last in
the subroutine. The thread of control is shown in Figure 3.4.
Parameters may be passed by placing them in general purpose registers prior
to bsr.
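The linkage just described can be sketched with a single link register (so, as noted, nested invocation is not supported). The class and register names are illustrative.

```python
# bsr saves the incremented program counter as the return address and
# adds its operand to enter the subroutine; ret copies it back.

class Machine:
    def __init__(self):
        self.pc = 0
        self.link = 0

    def bsr(self, offset):
        self.pc += 1            # incremented past the bsr itself
        self.link = self.pc     # save the return address
        self.pc += offset       # enter the subroutine

    def ret(self):
        self.pc = self.link     # resume just after the bsr

m = Machine()
m.pc = 10
m.bsr(5)     # call a subroutine 5 instructions ahead
# m.pc == 16, m.link == 11
m.ret()
# m.pc == 11
```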
Application support
The fourth group of instructions is that which support a given set of applications.
For example, graphical applications require block move operations.
Although not strictly essential, such a group is necessary if the design is to be
competitive as a product. Many manufacturers now offer a range of co-
processors which extend the instruction set or enhance the performance of a
given subset for a specified applications group.
3.1.5
Operands
Number of operands
The number of operands depends on a design decision and on the instruction
itself. It is possible for a design to require zero operands. This assumes that the
computer organization is such that the operand(s) have a predetermined source
(e.g. on top of a stack) and the result a predictable destination. At the other
extreme three operands may be required for each operator (two arguments and
one result).
There is disagreement among computer architects as to which is the better
number. Fewer operands generally means shorter code but can mean more micro-
operations per instruction.
Storage class
The arguments to the instruction describe where to find the operands. Memory
devices are usually grouped into storage classes. The concept of storage class
represents the programmer’s view of where the operand resides. A hardware
engineer usually only perceives devices.
For example, the only complete computer we have so far met, the Turing
Machine, has just two storage classes, processor state and the linear memory. We
have met only two storage classes for a real modern computer, the register file
and linear “main” memory.
3.1. NATURE 59
• Program memory
• Register
• Workspace
The actual set found depends strongly upon its architecture design but the above
is typical.
Because current technology cannot meet all the requirements of typical
applications with a single device, real computers have two, three or more
distinct memory devices. At least some of these will be available as distinct
operand storage classes. Others will require special software to communicate
with external processors which can gain direct access.
Access class
The access class of an operand is the manner in which it is referenced. It
describes what happens to it and whether it is updated. There follows a
summary…
• Read (R)
• Write (W)
• Read-Modify-Write (RMW)
Read access implies that the operand value remains unaltered by the reference
and simply has its value used, for example as an operand of an operator. A two-
operand addition instruction may be defined to overwrite the second operand
with the result. The first operand is of access class read and the second read-modify-
write. An example of write access is the destination of a store.
The access class of each operand must be specified in the definition of each
instruction since it depends almost totally on the nature of the operation
performed.
Addressing modes
Each instruction encodes an operation. Operations act on operands. Also
encoded within the instruction is how to find the operands. For instance, if just
one operand is required and it resides in memory then a further instruction field
must communicate this fact and an instruction extension must indicate its
address. An instruction is therefore a record with the following fields…
• Opcode
• Addressing mode(s)
• Address(es)
The addressing mode defines the storage class of the operand. When the full
address of an operand in memory is encoded it is called absolute or direct
addressing. Absolute addressing has
become progressively less common since the whole machine code program will
need editing if the data area is moved. Relative addressing, where data is
referenced via offsets from a workspace pointer, removes this difficulty. Only
the pointer need be changed. Similarly code (or data) in program memory may
be addressed relative to the program counter. Everything may be accessed via an
offset from one or other register, be it data, a subroutine or construct code
segment. Position independent code is said to result from such an architecture.
Immediate mode indicates to the processor that constant data follows in an
instruction extension instead of an offset or an address. It may be thought of as a
special case of program counter relative addressing. However, the data should be
regarded as contained within the instruction.
Some addressing modes do not require qualification with an address. For
example register addressing may be encoded as a distinct mode for each
register. Hence no further qualification is required.
There follows a summary of common addressing modes…
• Immediate
• Register
• Workspace relative
• Program counter relative
• Absolute (direct)
• Indexing
• Indirection
array shown has only one dimension. Multi-dimensional arrays require one index
per dimension specified, which are then used to calculate the element address.
(Memory has just one dimension so some mapping is necessary.)
Indirect addressing means, instead of specifying the operand address,
specifying the address of the address. Indirection is like crossing a pond via
stepping stones. Each stone represents one level of indirection.
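The modes summarized above can be sketched as an effective-address calculation. The mode names, register names and encodings below are invented for illustration; immediate and register modes supply the operand without an address, so they are omitted.

```python
# Compute the memory address of an operand for some common modes.
# Registers and memory are plain Python structures.

def effective_address(mode, field, regs, memory):
    if mode == "absolute":
        return field                      # field is the address itself
    if mode == "workspace":
        return regs["wp"] + field         # offset from workspace pointer
    if mode == "pc-relative":
        return regs["pc"] + field         # offset from program counter
    if mode == "indexed":
        base, index_reg = field
        return base + regs[index_reg]     # base plus index register
    if mode == "indirect":
        return memory[field]              # the address of the address
    raise ValueError(mode)

regs = {"wp": 100, "pc": 40}
memory = {7: 200}
```

Moving the workspace simply means changing `wp`; the workspace-relative offsets in the code need no editing, which is the position-independence argument made above.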
3.2
Simple architectures
3.2.1
Sequential processing
Programmer’s architecture refers to just that set of registers which a machine
level programmer needs to know about. Architecture more generally refers to
everything such a programmer needs to know, including the available instruction
set, addressing modes etc.
Figure 3.7 shows the programmer’s architecture for purely sequential
processing. As may be seen, very little is needed. The size of the general purpose
register file shown is arbitrary although it is generally agreed that about eight
registers are sufficient for expression evaluation. They may still be used for
parameter passing if subroutines are not called while an expression is being
evaluated.
The following addressing modes should suffice and permit position
independent code…
• Immediate
• Register
• Workspace relative
(Optional index modifier)
3.2.2
Parallel processing
• It terminates
• It waits for communication
• It has been executing for more than a timeslice
kept with two pointers. Queue front points to the head of the queue (the next
member to leave) and queue back points to its tail (Figure 3.9).
Two new instructions are required to support a process ready queue: endp
terminates a process and causes the next in the queue to be despatched to the
processor; startp places a newly ready process at the back of the queue. These
can both be made very simple if the queue entry is just a pointer to the value
required in the program counter for that process to run or continue.
The change of process on a processor requires a context switch. The context is
simply the processor state and process state. When a process terminates there is
obviously no need to save context. If it is just suspending, to resume later,
context must be saved and later restored. Context switching is reduced almost to
zero by two simple expedients. First, maintain all process state in workspace
except when evaluating expressions. Second, forbid suspension while expression
evaluation is in progress. As a result no process state is in registers when
suspension occurs except program counter and workspace pointer. The program
counter may be saved at a reserved offset within workspace and the workspace
pointer used as process identifier. A context switch is thus very fast because it
only involves saving and loading just two registers, WS and PC.
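This minimal context switch can be sketched directly, assuming (as an illustration) that the reserved workspace offset for the saved program counter is zero.

```python
# The program counter is saved at a reserved slot in the suspending
# process's workspace; the workspace pointer itself then serves as the
# process identifier.

PC_SLOT = 0   # assumed reserved workspace offset for the saved PC

def suspend(memory, wp, pc):
    memory[wp + PC_SLOT] = pc     # save PC in workspace
    return wp                     # workspace pointer names the process

def resume(memory, wp):
    return memory[wp + PC_SLOT]   # reload the saved PC

memory = {}
pid = suspend(memory, wp=500, pc=64)
# later the scheduler despatches the process again:
# resume(memory, pid) == 64
```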
The implementation of PAR and ALT constructs is beyond the scope of this
chapter and must wait for Part III which also considers the architecture support
for concurrency in considerably greater depth and detail.
Communication
Where only one processor is involved there must be a mechanism establishing
synchronization. The problem is exactly like meeting up with a friend during the
day when only the location of the rendezvous has been established. Suppose that
you arrive to find your friend is not yet there, i.e. you have arrived first.
Obviously you must suspend the process which involves the meeting. In order
not to suspend all your processes you sensibly decide to leave a note giving your
location for the rest of the day. Having arrived, your friend is able to find you
and the suspended process is free to resume.
There is no need for the system to maintain a list of suspended processes since
each may be rescheduled by the second process to arrive at the rendezvous,
which knows its identity.
To summarize, a process may be…
• Running
• Ready
• Suspended
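The rendezvous can be sketched with a single channel word that is either empty or holds the identifier of the first process to arrive; all names below are illustrative.

```python
# The first process to arrive leaves its identifier in the channel and
# suspends; the second finds that identifier, clears the channel and
# makes the suspended process ready again.

EMPTY = None

def arrive(channel, ready_queue, pid):
    """Return True if this process must suspend, False if it may proceed."""
    if channel["waiting"] is EMPTY:
        channel["waiting"] = pid          # leave a note, then suspend
        return True
    other = channel["waiting"]            # first arriver's identity
    channel["waiting"] = EMPTY
    ready_queue.append(other)             # reschedule the suspended process
    return False

chan = {"waiting": EMPTY}
queue = []
assert arrive(chan, queue, "P1") is True      # P1 arrives first, suspends
assert arrive(chan, queue, "P2") is False     # P2 finds the note
assert queue == ["P1"]                        # P1 is ready again
```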
3.2.3
Modular software
Refer to Chapter 2 for a full account of the use of modules in software
engineering.
There is a performance overhead with invocation of a procedure in another
module due to the extra level of indirection required. Remember that modules
serve the purposes of development, maintenance and cost, not the application
performance. As such they may be thought of as the software equivalent of
hardware chips.
A module descriptor is required to support linkage with other modules and is
made up from the following pointers…
• Program base
• Static base
• Link table
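The extra level of indirection can be sketched as a lookup through the descriptor's link table. The descriptor layout follows the three pointers above, while the entry numbering and procedures are invented for illustration.

```python
# A call into another module goes through that module's descriptor:
# the link table maps an entry number to the procedure, costing one
# extra lookup per external call.

def make_descriptor(program_base, static_base, link_table):
    return {"program": program_base, "static": static_base,
            "links": link_table}

def call_external(descriptor, entry, *args):
    proc = descriptor["links"][entry]     # the extra indirection
    return proc(*args)

maths = make_descriptor(program_base=0x4000, static_base=0x8000,
                        link_table={0: abs, 1: max})
# call_external(maths, 0, -5) == 5
```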
3.3
Instruction set complexity
3.3.1
Reduced instruction set computer (RISC)
Description
RISC philosophy calls for little more than the minimum necessary set of
instructions which enables all possible programs to be compiled. Correct
selection of instructions is therefore critical and has been helped by inspection of
their usage by existing compiler code generators. Instructions themselves are kept
simple
and are of common, small size. To summarize the principal features of a RISC
design…
– Programming language
– Compiler code generation
All of these features interact in such a way as to coexist harmoniously. The small
instruction set promotes rapid decoding and execution. Decoding requires less
hardware which instead may be devoted to a large register file. Hence the
register file may be large enough to be used for local variables as well as
expression evaluation.
Just because the instruction set is small does not mean that it cannot
effectively reduce the semantic gap. Much progress may be made by
simultaneous consideration of language, compiler and architecture. An effective
match may be maintained between the architecture and compiler code generation
strategies. The arrival of RISC in fact champions at least two causes…
A more complete description of the features and principles of RISC design must
wait for Part III.
Reducing processor complexity has two motivations. It improves reliability
and reduces the development time and so gets a new processor to market more
quickly. The author has first hand experience of more traditional devices arriving
years late and still with serious flaws. Indeed this had become the rule rather than
the exception. How much trust can one put in a device when notice of a serious
defect is announced long after it was brought to market? Lives now frequently
depend on computer reliability. RISC design has already been demonstrated as
an answer to this problem. A relatively small company in the UK has recently
developed its own RISC. The Acorn ARM arrived on time and for four years (at
time of writing) has proved free from “bugs”. It also delivers a cost/performance
ratio which is an embarrassment to its non-RISC competitors.
We now turn to the underlying reason why a smaller instruction set is able to
deliver an enhanced performance…
Exploitation of locality
It is not the case that a RISC delivers improved performance because more,
faster instructions simply reduce the total execution time from that required by
fewer, slower ones. It is the choice of instructions to implement that counts.
Because RISC designers set out to have fewer, they had to think more carefully
about which are really needed and which are not essential.
Temporal locality is an extremely important concept for processing. It is the
property of software to reference the same set of stored items in memory within a
given time window (Figure 3.13).
The process model of execution encourages this by only allowing references to
local variables and procedure parameters. Structured programming also strongly
enhances the effect since a loop body will reference the same variables on each
iteration.
The RISC philosophy is to reap the maximum possible benefit from temporal
locality. The idea is that, on process start (or procedure entry), all local variables
are loaded into registers where they may be accessed much more rapidly.
Afterwards they are stored back to memory. This is called a load/store memory access
scheme. If, in addition, parameters are passed in registers then spatial locality is
introduced. References will all be to the same spatial window. The idea may be
extended by keeping all the local variables of the currently executing group of
processes in an enlarged register file.
The load/store scheme plus the drive for simplicity require that only one or
two addressing modes are provided. However these are defined in a flexible
manner so that cunning use of registers may be used to create synthesized
addressing modes.
The most important consequence of load/store access, and its associated single
addressing mode, is that memory references are minimized. This is responsible
for a large proportion of the performance improvement seen. A secondary
consequence is that the compiler code generation is simplified. Another
mechanism, called register windowing, almost eliminates the (conventionally
History
The RISC story began with an IBM project around 1975 called the 801, [Radin
83]. This was never sold commercially. However the results were used in the
IBM RT which is now available. The term RISC, together with many of the
fundamental ideas, surfaced in the Berkeley designs called RISC I and RISC II,
[Patterson & Ditzel 80].
An account of the Berkeley projects, and a fascinating comparison of several
recent academic and commercial RISC systems, may be found in [Tabak 87].
Also very highly recommended is [Colwell et al. 85] which seeks to define a
RISC, separates and assesses the various associated ideas and gives a thorough
contrast with CISCs.
3.3.2
Complex instruction set computer (CISC)
Description
In CISC design no attempt is made to minimize the size of the instruction set. As
well as there being a larger set, the instructions themselves are more complex.
Many different formats are possible and a plethora of addressing modes are
provided. For example the DEC VAX architecture supplies addressing modes
which auto-increment or auto-decrement an array index after, or before, carrying
out an operation.
The result is a complex processor executing instructions rather more slowly
than its RISC counterpart. However, some of the instructions and addressing
modes achieve much more, when they are required. Given both application and
compiler which make use of the powerful features, a superior performance may
be demonstrated by a CISC.
The chief characteristics of a CISC are…
Here, the focus of attention for performance increase is the application. The high
cost and low speed of reference of memory which prevailed for many years
persuaded designers to turn their attention to reducing code size and the number
of memory references by “migrating” subroutines to hardware implementation.
Programming language and the compiler/architecture interface were rarely
considered. New instructions could not replace old ones because of the need for
upwards compatibility. (Old code had to still run on new machines.) Some new
instructions actually operated more slowly than the series of old ones they were
supposed to replace, though this was not generally the case. Many of the new
instructions were highly specialized, thus rarely used. As complexity increased,
so design and development time rapidly increased, as did the incidence of design
error.
Applications support
Applications specific machine language extensions or (performance)
enhancements are usually readily available. For example, graphics applications
frequently require moving, or applying a common logical operator to, a block of
memory at a time. A special addressing mode may be supplied to cope with this.
Similar provision is often found nowadays for string processing and for
mathematical (floating-point) operators.
3.3.3
Comparison of RISC and CISC
Closure of the semantic gap and applications support look very good in the
catalogue and have proved popular. However, once the CISC machine language
is implemented, an application may not run faster than it would on a RISC. The
reasons are twofold. Firstly, the powerful instructions take time to translate into a
sequence of primitive operations. Secondly, all that CISC complexity could have
been exchanged for a larger register file, to gain greater benefit from locality, or
even for another separate processor. The latter would reap benefit if parallelism
exists at the problem level and hence at the algorithmic level.
To summarize, the advantages of CISC are improvements in…
• Code length
• Application specific performance
• Upwards compatibility with older machines
Exercises
Question one
i When a book is translated from one natural language to another, is it
interpretation or compilation?
ii Explain what is meant by semantic gap. In your own words, summarize the
arguments for and against designing a computer architecture to reduce it.
Question two
The instruction…
means…
Add the contents of register zero, in the register file, to the memory
location whose address is 0400 and put the result back in register zero.
State the storage class, access class and addressing mode of each of the two
operands.
Question three
A rather basic processor has only the following instructions…
• nand Bitwise logic operator which operates on its one and only register and
an operand in memory. Its result is placed automatically in the register.
• shift Shift operator which shifts the contents of the register left by one bit.
• load…direct addressed operand from memory to register.
• store…operand in register to direct addressed memory location.
• branch…on the condition that the processor state is set, the number of words
forward or back specified by the immediate addressed operand.
It also has only a one bit processor state which is cleared if the result of a nand
is zero and set otherwise. Both memory locations and the register are four bits
wide.
Write a program which computes the AND of two variables stored in memory
whose addresses may be referenced symbolically as x and y. Ensure that a zero
result clears the processor state and sets it otherwise.
Part II
4.1
Notation
4.1.1
Pure binary
Binary words
A number is written as a sequence of digits which are collectively referred to as a
word. Symbols, called numerals, must be decided upon to represent each distinct
digit value from zero to a maximum which is one less than the base. The arabic
system, using base ten, is the one with which we are all familiar. Digits increase
in significance from right to left, representing increasing powers of the base.
Pure binary notation uses base two. The arabic symbols “0” and “1” denote
binary digit (bit) values. The quantity of bits required to represent a number is
given by log₂ of its value rounded upwards to the nearest integer. For example…
Each time a single extra bit becomes available, the range of values which may be
represented doubles since the quantity of available states doubles. A value is just
the label of a state.
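The doubling of range with each extra bit, and the bit count for a given value, can be sketched directly. (Python's int.bit_length rounds log₂ upward, and for a value that is an exact power of two it gives one more bit than the formula above, since that value is a new state beyond the old range.)

```python
# Each extra bit doubles the quantity of available states; a value is
# just the label of a state.

def states(bits):
    return 2 ** bits                    # distinct states of the word

def bits_needed(value):
    return max(1, value.bit_length())   # bits to hold a non-negative value

assert states(8) == 256
assert states(9) == 2 * states(8)       # one extra bit doubles the range
assert bits_needed(255) == 8
assert bits_needed(256) == 9
```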
Physical representation
Pure binary notation is special to current computer technology because it directly
corresponds to the physical data representation inside any contemporary
computer. The reason for this is that it is comparatively easy to store and
78 CHAPTER 4. DATA REPRESENTATION AND NOTATION
Word partitions
It has become customary to divide up word width into standard “chunks” called
bytes and nibbles where…
1 It should be pointed out that it is quite possible (though less easy) to render machines
which physically represent values using bases other than two. A memory cell would have
to be devised with a number of stable states equal to the new base. Communication
elements (wires) would face similar requirements. An electrical solution might use a
number of voltage ranges to distinguish states.
4.1. NOTATION 79
Each location in a memory map (see Figure 4.1) shares the common word
width of the machine, which is nowadays usually an integral number of bytes.
The problem for humans using binary notation is that numbers take a lot of
writing. For example, the number which is written 65,535 in decimal notation
becomes 1111111111111111 in binary! It is somewhat relieved by using a point
(not to be confused with the binary point) to break up the word into fields (e.g.
1111.1111.1111.1111) but life can still get extremely tedious.
4.1.2
Hexadecimal
Notation/representation correspondence
If writing numbers in pure binary notation is too tedious, why then do we not use
decimal? Given, say, a 4-bit binary word, every state has a corresponding 2-digit
decimal value. Sadly, not every 2-digit decimal value has a corresponding 4-bit
binary state. The redundant values mean that we would have to take great care
when using decimal notation. It is simply too inconvenient.
Consider a 4-bit word carefully. It has 16 (2⁴) states. Wouldn’t it be nice if we
had a notation where each digit had 16 states. We could then use a single digit to
denote the value represented by the word. Every possible digit value would
correspond to just one state and vice versa. What we require then is a base
sixteen, or hexadecimal, notation, often abbreviated to just “hex”.
Digit symbols
The symbols used for hexadecimal digits are partly arabic numerals and partly
alphabetic characters…
Just in case you are confused, remember that the concept of number is simply a
consequence of that of counting. A new digit is required to the left of the old
when the count reaches a chosen value which is called the base. The separate
historical origins of writing and counting explain why we write text from left to
right but numbers from right to left.
8-bit → 00₁₆…FF₁₆
16-bit → 0000₁₆…FFFF₁₆
32-bit → 0000.0000₁₆…FFFF.FFFF₁₆
Leading zeros are included to indicate the word width. Note how a point may be
used to aid clarity by breaking up a value into fields of four hex digits, which
each correspond to 4×4 bits.
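Such fixed-width hexadecimal, with leading zeros and point-separated fields of four digits, can be sketched as a formatting routine; the function name is invented for illustration.

```python
# Format a value as fixed-width hex: one digit per 4 bits, leading
# zeros to indicate the word width, a point between fields of four
# digits.

def hex_fields(value, bits):
    digits = bits // 4                        # one hex digit per 4 bits
    s = format(value, "0{}X".format(digits))  # zero-padded, upper case
    groups = [s[max(0, i - 4):i] for i in range(len(s), 0, -4)]
    return ".".join(reversed(groups))

# hex_fields(65535, 16) == "FFFF"
# hex_fields(65535, 32) == "0000.FFFF"
```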
Figure 4.1 depicts a (currently) not uncommon memory map where the data
word width is one byte and the address word width is four bytes. (One byte
requires two hex digits.)
It should be clear from this how awkward life would be if decimal notation were
used for either address or data. One would have to think carefully whether a
particular value actually corresponded to a real address (or real data).
4.1.3
Octal
Some machines use a word width which is divisible by three instead of four.
Each three bits can physically take 8 (2³) states. Hence it is sensible to employ a
notation of base 8 and divide the word up into 3-bit fields, the value of each one
denoted by a single octal digit. Once again there will be a one-to-one
correspondence between notation and state. For example…
9-bit → 000₈…777₈
18-bit → 000.000₈…777.777₈
27-bit → 000.000.000₈…777.777.777₈
As before, leading zeros are included to indicate word length. This time points are
used to separate fields of 3×3 bits.
Obviously bases other than eight or sixteen are possible. However, small bases
are not very useful and those given by powers of two larger than four require too
many symbols per digit.
Figure 4.2 illustrates the alternative notations and their correspondence to
groups within the word. It also serves as an example of translation between them
via pure binary.
Lastly, it is true to say that hex has now become more useful than octal. This
is because of the success of 8-bit machines which have subsequently evolved
into 16-bit, 32-bit and even 64-bit versions.
4.2
Primitive data types
4.2.1
Integer
Sign-magnitude
We now turn to the means of representing integers including negative as well as
positive values.
There are only two states needed to represent sign so let’s use one bit
within the word specifically to represent it. Make it the leftmost bit since
the sign is usually written leftmost. All states with this bit set are labelled
negative. The state of all bits to the right of the sign bit are simply labelled
with their pure binary value.
Figure 4.3 shows a pure binary and a sign/magnitude labelling of states. There
are two serious drawbacks and one advantage to this simple scheme.
As pointed out earlier, the very concept of “number” is drawn from that of
counting. Similarly that of addition implies counting forwards or incrementing a
number. On the “clock” representation of states depicted, counting corresponds
to a hand moving clockwise. The modulus of the register is simply its pure binary
range. For the register depicted this is simply 16 (2⁴). A pure binary addition
whose true result exceeds the modulus wraps around, and is said to be correct
modulo 16.
Unfortunately an addition which crosses the boundary between sign values gives
the wrong answer using the sign/magnitude labelling.
The second drawback is that there are two representations of zero. It is clearly
inefficient and confusing to use two states for a single value.
Sign/magnitude representation of signed integers has one redeeming feature. It
is extremely easy to negate a number. All that is necessary is to invert the
leftmost bit. To summarize…
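Negation by inverting the leftmost bit can be sketched for the 4-bit word of the example.

```python
# Sign/magnitude negation on an n-bit word: only the sign bit changes.

def sm_negate(word, bits=4):
    return word ^ (1 << (bits - 1))   # invert the leftmost (sign) bit

# With 4 bits, +3 is 0011 and -3 is 1011:
assert sm_negate(0b0011) == 0b1011
assert sm_negate(0b1011) == 0b0011    # negation is its own inverse
```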
Twos-complement
The “clock” diagram for the twos-complement state labelling scheme is depicted
in Figure 4.4.
So how is this labelling arrived at? All that we do is…
Figure 4.3: Pure binary and sign/magnitude state labelling represented as a clock face
1. Subtract the value we wish to represent from the modulus (16)
2. Increment the register (move hand clockwise) from zero that number of
times
There is a single major advantage and one disadvantage with this scheme. First
the disadvantage. Negation is hard! One way is to make use of the definition of
twos-complement…
The procedure for labelling the clock diagram is consistent with this.
The negation operation derived directly from the definition unfortunately turns
out to be very hard to achieve automatically. It is also dependent upon the word
width (or modulus). A much better way (for human and machine) is to employ
the following procedure…
1. Ones-complement
2. Increment
…which yields our new algorithm for twos-complement negation and shows that
it is independent of word width. Unlike the earlier method, we may apply it
without concerning ourselves with the word width of the operand.
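The ones-complement-then-increment procedure can be sketched as follows, assuming a 4-bit word (names are illustrative):

```python
WIDTH = 4
MASK = (1 << WIDTH) - 1        # 1111, giving a modulus of 16

def tc_negate(state):
    """Twos-complement negation: ones-complement, then increment.
    The same two steps apply whatever the word width; the mask only
    truncates the result to the register."""
    return ((state ^ MASK) + 1) & MASK

# Negating +3 (0011) gives 1101, the state labelled -3 on the clock diagram
assert tc_negate(0b0011) == 0b1101
# Agreement with the definition: subtract the value from the modulus (16)
assert tc_negate(3) == (16 - 3) % 16
```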
Now for the overwhelming advantage of twos-complement representation of
integers, which has resulted in its universal acceptance and use. Addition works
across sign boundaries! Every state has a single unique labelling. Note that the
twos-complement range of a register of modulus M is −M/2…+M/2−1. That of the
register depicted is −8…+7. Addition of twos-complement signed values will give
the correct result as long as they, and the result, are within range.
You are strongly urged to try some simple sums yourself using the clock
diagram for twos-complement. (Recall that addition simply means counting
forward by rotating the hand clockwise and subtraction means counting
backwards by rotating the hand anticlockwise.)
To summarize the pros and cons of twos-complement signed integer
representation…
Neither overflow nor underflow can occur if the arguments are of different sign
and are in range themselves. This observation leads to a simple rule for detecting
over/underflow…
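One plausible form of the rule, sketched for a 4-bit word (the function name is mine): trouble can only arise when both operands share a sign and the result's sign differs.

```python
WIDTH = 4
SIGN = 1 << (WIDTH - 1)
MASK = (1 << WIDTH) - 1

def add_overflows(a, b):
    """Over/underflow of twos-complement addition: possible only when both
    operands share a sign and the result's sign differs from theirs."""
    s = (a + b) & MASK                      # modulo-16 addition
    return (a & SIGN) == (b & SIGN) and (s & SIGN) != (a & SIGN)

assert add_overflows(0b0111, 0b0001)        # +7 + +1 overflows (+8 > +7)
assert add_overflows(0b1000, 0b1111)        # -8 + -1 underflows (-9 < -8)
assert not add_overflows(0b1000, 0b0111)    # differing signs: always safe
```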
4.2.2
Character
Printing characters
All that is necessary to represent printing characters is to decide a standard
labelling of states. The equivalent pure binary state label is referred to as the
code for the corresponding character.
It is desirable that some subsets are arranged in order to ease text processing
algorithm design. For example, a program which swaps character case will benefit
from the codes for ‘a…z’ and ‘A…Z’ being in order. Alphabetical ordering also
facilitates the sorting of character strings.
Codes will also be needed for other printing characters such as space,
punctuation and numeric (decimal) characters.
Appendix A contains a table of the American Standard Code for Information
Interchange, more commonly known as ASCII, which is now the almost
universal code used for representation of both printing and control characters.
A useful summary is…
‘a’…‘z’ → 61₁₆…7A₁₆
‘A’…‘Z’ → 41₁₆…5A₁₆
‘0’…‘9’ → 30₁₆…39₁₆
Space → 20₁₆
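These orderings make simple text algorithms easy. For instance, a case-swapping sketch exploiting the fact that the upper and lower case codes differ only in one bit (an illustration, not from the text):

```python
def swap_case(text):
    """Swap letter case: 'A'..'Z' (41-5A hex) and 'a'..'z' (61-7A hex)
    differ only in bit 5 (20 hex), so XOR with 20 hex toggles case."""
    out = []
    for ch in text:
        code = ord(ch)
        if 0x41 <= code <= 0x5A or 0x61 <= code <= 0x7A:
            code ^= 0x20       # toggle the case bit
        out.append(chr(code))
    return ''.join(out)

assert swap_case('Hello, World!') == 'hELLO, wORLD!'
```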
Control characters
Control codes are included for controlling text-oriented devices, such as…
• Display screens
• Keyboards
• Printers
• Line
• File
Unfortunately the standard does not stipulate precisely what codes are used to
delimit text lines and files. As far as lines are concerned, either or both LF and
CR may be used. The situation is worse for file delimitation. The operating
system or compiler designer has a free hand here. As a result they all do it
differently!
The UNIX operating system uses LF alone as a line delimiter and nothing at
all as a file delimiter2.
4.2.3
Real
Fixed-point representation
There are three things we should care (and thus think) about when deciding the
numeric data type for a variable…
Dynamic range is the difference between the largest and smallest values
the variable might possibly assume as the process runs, and may be
denoted ΔV.
Resolution determines the difference between a desired, true value and
the nearest one that can be represented, and may be measured as the
difference in value represented by neighbouring states, denoted δV.
Precision is the ratio of the difference between the value represented by
a state and that by its neighbour to the value itself, and may be denoted δV/V.
2 The process which is reading it is sent a signal instead. However a process (for instance
a user at a keyboard) writing out a file signals its end using ASCII EOT.
The range has merely been translated by an amount equal to the implicit scale
factor. Because of the implicit scale factor the programmer must scale abstract
level quantities (e.g. distances, angles etc.) before representing them at the
virtual level as fixed-point reals. This is an inconvenience to the programmer but
not to the user, who cares more about the performance and the cost of the system
than any tough time the poor programmer had tailoring it!
Floating-point representation
In floating-point representation a word is partitioned into two fields, as shown in
Figure 4.5. A mantissa is represented (labelled) as a fixed-point value with the
point assumed to the right of the leftmost bit. This means that the mantissa
represents numbers between zero and two. Secondly, an exponent is represented
as an integer value (point to the right of the rightmost bit).
The exponent may be thought of as an explicit scale factor in contrast to the
implicit one associated with any fixed-point representation.
Floating-point representation sacrifices resolution for dynamic range! In
other words if you want resolution it is much better to use a fixed-point
representation. If you want dynamic range beyond that available in a
fixed-point representation, floating-point is the answer.
Floating-point degeneracy
A little thought will reveal a problem with the floating-point type. There are
many representations for most values!
Consider a 4-bit mantissa with a 4-bit exponent. Shifting the mantissa right
one bit is exactly equivalent to dividing by two. Incrementing the exponent by
one is exactly equivalent to multiplying by two. If we do both the value
represented will remain utterly unchanged. For example…
The existence of more than one state per value is called degeneracy. The
ambiguity so caused is unacceptable. It is removed by enforcing a rule which
states that the leftmost bit in the mantissa must always be set. Thus the mantissa
always represents values between one and two. Representations obeying this rule
are said to be of normalized form. All operators on floating-point variables must
normalize their result.
Because it is always set, there is no point in including the MSB. Hence it is
simply omitted. It is referred to as the hidden bit. The normalized floating-point
representation of 12 is thus 1000:0011 = 1.5×2³ = 12.
When set, the MSB of the mantissa now contributes 0.5 and we have an extra
bit of precision available. Neat trick eh!
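A sketch of decoding a normalized value with the hidden bit, using the 4-bit mantissa example above (the function is illustrative, and the exponent is taken as unsigned here, as in the example of 12):

```python
MANT_BITS = 4                  # stored mantissa bits, hidden bit excluded

def decode_normalized(mantissa, exponent):
    """Value of a normalized word: an implied 1 sits to the left of the
    stored mantissa bits, so the significand is 1.mmmm in binary."""
    significand = 1 + mantissa / (1 << MANT_BITS)
    return significand * 2 ** exponent

# The example in the text: 1000:0011 = 1.5 x 2^3 = 12
assert decode_normalized(0b1000, 0b0011) == 12.0
```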
Signing
Unfortunately twos-complement signing of the mantissa results in slow
normalization. Hence we resort to the sign-magnitude approach and include a
sign bit.
Signing of the exponent uses neither twos-complement nor sign-magnitude
representation. The reason for this is that it is highly desirable to compare two
floating-point values as if they were sign-magnitude integers. Sign-magnitude
encoding of the exponent is therefore not a possibility since it introduces a
second sign-bit. Twos-complement is not on since the pure binary interpretations
are not ordered by value (e.g. the representation of 1 appears less than that of −1).
The exponent may be signed using excess M/2 representation, where M is the
modulus of the exponent field. A value is arrived at simply by subtracting M/2
from the pure binary interpretation. Two floating-point values using excess M/2
exponent representation may be compared by simply comparing the pure binary
interpretations of each entire word. The clock diagram for the state labelling of a
4-bit register is given in Figure 4.6.
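A sketch of excess-M/2 encoding for a 4-bit exponent field, showing that pure binary order matches value order (names are mine):

```python
EXP_BITS = 4
M = 1 << EXP_BITS              # modulus of the exponent field (16)

def excess_encode(value):
    """Excess-M/2: the stored pure binary state is value + M/2."""
    return value + M // 2

def excess_decode(state):
    """Recover the value by subtracting M/2 from the pure binary state."""
    return state - M // 2

assert excess_decode(excess_encode(-3)) == -3
# Pure binary order of the states matches value order, which is why
# whole floating-point words may be compared as if they were integers:
states = [excess_encode(v) for v in range(-M // 2, M // 2)]
assert states == sorted(states)
```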
NaN may be used to signal invalid operations such as divide by zero. Another
special state, not included in the table above, is used to indicate a result too small
to be encoded in the usual way. Instead it is encoded via a denormalized
mantissa, whose validity is indicated by a zero exponent.
There is also a 64-bit double precision standard (see Figure 4.7).
It should be noted that not all computers adhere to the standard. For instance
there are some very common machines to be found which use excess M/2
exponent representation.
Floating-point operations
There are three phases in the algorithm for addition or subtraction…
Similarly there are three phases in the algorithm for multiplication or division…
In each middle phase the mantissas are processed exactly as if they represented
pure binary values. The position of the point remains fixed and in no way affects
the operation.
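As an illustration only, the add/subtract phases might be sketched like this, using integer significands with a hidden bit and truncating normalization (a simplification, not the full IEEE procedure; all names are mine):

```python
MANT_BITS = 4                  # stored bits; the significand has 5 with the hidden bit
LIMIT = 1 << (MANT_BITS + 1)   # significand must stay below 100000 binary

def fp_add(m1, e1, m2, e2):
    """Add two values given as (significand, exponent) pairs, significands
    being 5-bit integers scaled by 2^-4. Three phases: align, add, normalize."""
    # Phase 1: align -- shift the significand with the smaller exponent right
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1
    m2 >>= (e1 - e2)
    # Phase 2: add the significands as if they were pure binary values
    m, e = m1 + m2, e1
    # Phase 3: normalize the result (truncating any bit shifted out)
    while m >= LIMIT:
        m >>= 1
        e += 1
    return m, e

# 1.375 = 1.0110 x 2^0 and 2.75 = 1.0110 x 2^1; the 4-bit mantissa
# truncates the exact sum 4.125 to 4.0 = 1.0000 x 2^2
m, e = fp_add(0b10110, 0, 0b10110, 1)
assert (m, e) == (0b10000, 2)
```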
Summary
The following observations are noted about floating-point representation of the
real data type…
It must be emphasized that the direct representation of very large or very small
quantities without scaling is advantageous to the programmer but not to the user,
who cares more about cost and performance which will be lost! Choosing
floating-point real typing is justified if the abstract quantity calls for high
dynamic range and roughly constant precision.
4.3
Structured data types
4.3.1
Sequence association
Arrays
Given a linear memory organization, the sequence-associated data structure
known as an array needs very little work to implement. In fact a memory map is
itself an array, whose index is the current address.
All that is needed to represent an array is to specify its start and its end
(Figure 4.8). The start and end are known as the array bounds. The only other
thing required is a variable elsewhere in memory to act as an index. To
summarize, an array representation consists of…
The bounds form a simple example of a data structure descriptor. The linear
organization of memory takes care of finding neighbour elements. (All structures
are graphs of one kind or another.)
What we mean by an array is a sequence of elements of common type. No
hardware can guarantee that. It is up to the programmer to write well behaved
software.
Records
Records are represented in much the same manner as arrays. Actual
representations vary with compiler implementation but bounds remain at least
part of the descriptor and an index is still required. A record differs from an array
in that differing types of variable are allowed to appear within it.
Each entry in a record is referred to as a field (e.g. a floating-point
representation may be thought of as a record consisting of three fields…sign,
mantissa and exponent).
Security
Two forms of security should be provided…
• Type checking
• Bounds checking
4.3.2
Pointer association
Pointer type
A pointer is just the address of an item of data (either elementary or structured).
It is the counterpart of an index but within a pointer-associated data structure. It
references an individual element (Figure 4.9).
Abstract data structures may be constructed using records, one or more fields
of which contain a pointer. They present an alternative to the use of array indices
and permit truly dynamic structures to be built and referenced.
Security
Exactly the same criteria for security of access apply as to sequential structures.
However both bounds checking and type checking are much more difficult when
using a linear spatial memory map.
Programmer declared type labels tagged to pointers are used in most
conventional languages to facilitate type checking. But since these are static they
can only delineate the element type of which the structure is composed. They do
not delineate the type of the whole dynamic structure.
Dynamic type tagging of pointers fulfils part of our security requirement but
provides no means of checking whether a variable (perhaps structured itself) is to
be found within another.
The action of assigning (moving) a pointer structure is very difficult to render
secure and impossible to render efficient given a linear spatial memory map.
Exercises
Question one
i What is the smallest register word length such that a whole number each of
octal and hex digits may denote a valid state for every value?
ii Write down the octal and hex notations for each of the following numbers…
• 011001010100001100100001
• 000111110101100011010001
iii Write down the pure binary notations for the following numbers…
• 8000.002C₁₆
• 076.543.210₈
Question two
i Prove that the arithmetic negation of a twos-complement value is equivalent to
the ones-complement incremented by one.
ii Give the interpretation in decimal notation of the following pure binary
states assuming first sign-magnitude, then excess-M/2 and finally twos-
complement interpretation…
• FF₁₆
• C9₁₆
Question three
i ASCII was designed at a time when terminals were just teletypes, which simply
printed everything they were sent on paper and transmitted everything typed
immediately. In what ways is it inadequate for present day terminals and how is
it extended to remedy the situation?
ii ASCII is very convenient for transmitting textual data. What is the problem
with transmitting raw binary data (e.g. a machine executable file)? Suggest a
solution which makes use of ASCII.
Question four
i What is meant by degeneracy of a number coding system? Give two examples
of degenerate number coding systems.
ii Give, in hex notation, normalized single precision IEEE floating-point
representations for the following…
• −1.375
• −0.375
• −0.34375
Question five
i Summarize the circumstances when it is sensible to make use of floating-point,
rather than fixed-point, number representation, and those when it is not.
ii Perform the floating-point addition of 1.375 and 2.75 showing details of
each phase of the operation. (Assume IEEE standard single precision
representation.)
Chapter 5
Element level
5.1
Combinational systems
5.1.1
Specification
Propositional logic
Propositions are assertions which are either true or false. Assertion is typically
denoted in natural language by verbs, most often the verb “to be”. In a formal
language (i.e. a mathematical one) it is denoted by “=”. No action is implied.
Examples are…
Propositions do not allow one to express assertions about whole classes of object
or about exceptions within classes. Such assertions are called predicates.
Connectives are the means of joining propositions together. Because they are
associative it is sufficient that they are binary. Because there are four possible
ways of combining two binary objects there are sixteen (2⁴) connectives (see
Table 5.1). In addition to the binary connectives a single unary one is defined
called logical negation (Table 5.2).
Formulæ are formed by joining propositions together with connectives. For
instance…
Table 5.1: Truth table for all sixteen binary logical connectives
A B  F  A↓B  ¬A∧B  A∧¬B  A∧B  ¬A  ¬B  A≡B  A⊕B  B  A  A|B  A→B  A←B  A∨B  T
F F  F   T    F     F     F    T   T   T    F   F  F   T    T    T    F   T
F T  F   F    T     F     F    T   F   F    T   T  F   T    T    F    T   T
T F  F   F    F     T     F    F   T   F    T   F  T   T    F    T    T   T
T T  F   F    F     F     T    F   F   T    F   T  T   F    T    T    T   T
Formulæ themselves are propositions and thus are either true or false. Consider
especially that F3 may be either true or false.
Classification of formulæ is possible into one or other of the following
categories…
• Tautologies are always true (e.g. The cat is fat OR the cat is thin)
• Contradictions are always false (e.g. The cat is fat AND the cat is thin)
• Consistencies may be either true or false (e.g. Fred is a cat AND the cat is
fat)
Boolean algebra
Boolean variables are variables with range {1,0}. It is hoped that the reader has
already met and used them as a data type of a programming language. Boolean
algebra is designed to appear much like ordinary algebra in order to make it easy
to write and manipulate truth functions. Boolean operators are primitive
functions, out of which expressions may be constructed, and comprise the single
sufficiency set…
• Boolean expression
• Truth table
The truth table is a unique specification, i.e. there is only one per function. There
are, however, often many expressions for each function. For this reason it is best
to initially specify a requirement as a truth table and then proceed to a Boolean
expression.
Boolean functions are important to computer architecture because they offer a
means of specifying the behaviour of systems in a manner which allows a
modular approach to design. Algebraic laws may be employed to transform
expressions. Table 5.4 shows a summary of the laws of Boolean algebra.
5.1.2
Physical implementation
Logic gates
If we are going to manufacture a real, physical system which implements a
Boolean function, we should first query the minimum set of physical primitive
systems needed.
We know from a theorem of propositional logic that any truth function may be
formulated using the set of operators {AND, OR, NOT}. This operator set must
therefore be implemented physically. Each device is known as a logic gate (or
just gate). Figure 5.1 depicts the standard symbols for {AND, OR, NOT}.
As an alternative to implementing the {AND, OR, NOT} set, either {NAND}
or {NOR} alone may be implemented. Figure 5.2 depicts the standard symbols
used for these logic gate devices. Note that Boolean algebra does not extend to
either of these operators1.
Implementing a truth function by use of combining logical operators is known
as combinational logic. It was pointed out earlier that a propositional logic formula,
to express a given function, which uses either NAND ( | ) or NOR ( ↓ ) alone
will usually contain more terms than if it were to use AND, OR, NOT alone. It was
also pointed out that, in a sense, it would thus be less efficient. However,
physical truth is the province of the engineer and not the mathematician.
Manufacturing three things the same is much cheaper than three things different.
The reader is recommended to open a catalogue of electronic logic devices and
compare the cost of NAND or NOR gates to those of AND, OR and NOT. Using
102 CHAPTER 5. ELEMENT LEVEL
Figure 5.1: Logic gates implementing the {AND, OR, NOT} sufficiency set of operators
NAND or NOR to implement a combinational logic system usually turns out to
be more efficient in the sense of minimizing production cost, which is of course
the most important sense of all.
Figure 5.2: Logic gates implementing the {NAND} or {NOR} sufficiency set of operators
The laws of Boolean algebra summarized in Table 5.4 comprise…
• Complementation
• Zero
• Identity
• Idempotence
• Absorption
• DeMorgan’s laws
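All of the laws named can be checked exhaustively over the two truth values; a quick sketch:

```python
from itertools import product

for a, b in product((False, True), repeat=2):
    # DeMorgan's laws
    assert (not (a and b)) == ((not a) or (not b))
    assert (not (a or b)) == ((not a) and (not b))
    # Complementation and zero
    assert (a and not a) == False and (a or not a) == True
    # Identity and idempotence
    assert (a and True) == a and (a or False) == a
    assert (a and a) == a and (a or a) == a
    # Absorption
    assert (a and (a or b)) == a and (a or (a and b)) == a
```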
The last combinational logic gate to mention does not constitute a sufficiency
set.
The exclusive-or operator is so useful that it is to be found implemented as a
logic gate (Figure 5.3).
It is useful because…
Once again Boolean algebra does not include this operator2. However, with care
it is possible to extend it to do so. The ⊕ symbol is used to denote the XOR
operator. It is commutative…
These two are called minterms. Any Boolean function may be written as either…
Minterms are easily deducible from a truth table, simply by writing down the
pattern (as if spoken in English) which produces each 1 of function value. The
value of the function is 1 if the first minterm is 1 OR the second OR the third…
and so on.
Thus the combinational logic implementation of XOR, using just AND, OR and
NOT, is just…
Figure 5.4: Exclusive-or function implemented using AND,OR and NOT gates
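The sum-of-minterms form of XOR can be checked directly against the function it is meant to realize (a sketch; the function name is mine):

```python
from itertools import product

def xor_from_minterms(a, b):
    """OR of the two minterms read from the XOR truth table:
    (A AND NOT B) OR (NOT A AND B)."""
    return (a and not b) or (not a and b)

# The expression agrees with XOR for every input combination
for a, b in product((False, True), repeat=2):
    assert xor_from_minterms(a, b) == (a != b)
```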
AND/OR structures
Because every truth function may be realized as either the standard product of
sums or the standard sum of products it is not uncommon to find structures of the
form depicted in Figure 5.5. They are called two-level structures. The first level
produces the minterms or maxterms, the second yields the function itself.
Recall DeMorgan’s laws…
Recall also that economy demands use of a single type of gate, either NAND or
NOR. Application of DeMorgan’s laws results in a very useful design transform.
We may, it turns out, simply replace every gate in a standard product of sums
design by NOR, or every gate in a standard sum of products by a NAND,
without affecting its function. The equivalent networks for those in Figure 5.5 are
shown in Figure 5.6.
As a result it is necessary to work out how to access individual bits and groups of
bits within a word. In order to achieve access within word boundaries we
introduce bitwise logical operators which operate on each bit of a word
independently but in parallel. The technique used is known as bit masking. We will
consider three kinds of access…Set, Clear and Invert.
First let’s deal with Set. The initial step requires the creation of a mask. This is
a constant binary word with a 1 in each position which requires setting and a 0 in
every other. For example 1111.0000₂ would be used as a mask to set the most
significant nibble (MSN) of a 1-byte word. This is shown below (left) along with
how to clear and invert it.
Note that the mask needed with XOR for Invert is the same as is used with OR
for Set but its inverse is required with AND for Clear.
5.1.3
Physical properties
Logic polarity
Boolean states ({0,1}) must be represented in some way by physical states in
order for the computer to possess memory. In addition to those physical states,
some kind of potential energy must be employed to communicate state, which
implies that we must decide on two standard potentials, one for logic 1 and one
for logic 0. One must be sufficiently greater than the other to ensure rapid flow.
• Positive logic simply means the use of the high potential for logic 1
• Negative logic simply means the use of the low potential for logic 1
It is very easy to become confused about the meaning of design schematics using
negative logic4. Remember that design and implementation are separate issues.
Design may take place without the need to know the logic polarity of the
implementation.
Sometimes part of a design needs to be shown as negative logic. The logic
symbols used to indicate this are shown in Figure 5.7. The “bubble” simply
indicates inversion and has been seen before on the “nose” of NAND and NOR
gates.
Negative logic gates are shown with their equivalents in positive logic. Each
equivalence is simply a symbolic statement of each of DeMorgan’s laws. This is
the easy way to remember them!
Propagation delay
It is hoped that the reader is familiar with the concept of function as a mapping
from a domain set to a range set. If we are to construct physical systems which
realize functions (e.g. logical operators) we must prove that we understand what
that concept means.
There are two interpretations of a function. The first is that of association. Any
input symbol, belonging to an input alphabet, is associated with an output
symbol, belonging to an output alphabet. For example, imagine a system where
the input alphabet is the hearts suit of cards and the output alphabet is clubs. A
function may be devised which simply selects the output card which directly
corresponds to (associates with) that input.
The second interpretation is that of a process whose output is the value of the
function. For example, both input and output alphabets might be the set of
integers and the function value might be the sum of all inputs up to the current
instant. An algorithm must be derived to define a process which may then be
physically implemented.
The association interpretation is rather restrictive, but for fast implementation
of simple primitive operators may offer a useful approach.
The purpose of the above preamble is to make it clear that, since a process
must always be employed to physically evaluate a function, time must elapse
between specification of the arguments and its evaluation! This elapsed time is
called the propagation delay.
4 Positive logic is the norm and should be assumed unless stated otherwise.
5.1. COMBINATIONAL SYSTEMS 109
Race condition
Logic gates are typically combined into gate networks in order to implement
Boolean functions. Because of the gate propagation delay a danger exists that can
bring about unwanted non-determinacy in the system behaviour.
Figure 5.8 illustrates the two different outputs which can result from the same
input. It is history that decides the output! Although we might make use of this
property later for implementing memory, it is totally unwanted when
implementing a truth function (independent of time or history).
A simple precaution at the design phase will eliminate race conditions from
deciding output. The following rule must be obeyed…
Races may result in unstable states as well as stable ones, as we shall see later.
which it flows is simply a wire. “Charging” thus refers to the build up of charge
in a capacitance at the switching input of a transistor.
The charging time is totally dependent on the operating potential of the switch
and is responsible for the majority of the propagation delay of any gate.
The first law of thermodynamics may be (very coarsely) stated…
…and is popularly known as the law of conservation of energy. The second law
of thermodynamics basically says that all machines warm up as they operate and
may be (very coarsely) stated…
The first law implies that we cannot build a logic gate, a computer or any other
device without a power supply. Gates consume energy each time an internal
switch operates. The second law of thermodynamics states that we cannot build a
logic gate, a computer or any other machine which does not warm up when it
operates.
Race oscillation
Race oscillation is related to the race condition discussed earlier. It was
mentioned that, in addition to obtaining more than one possible output state, it
was possible to obtain unstable output states. Figure 5.9 shows how this may
arise.
Let’s suppose we begin with an input state which brings about an output
state of 1 after one gate propagation delay, tpd. The input is then immediately
disconnected. The output is fed back to the input, forming a cycle in the network,
via any gate which will generate an extra delay. This feedback signal causes the
output state to be reset to 0. The cycle of events repeats itself indefinitely. It is an
example of a recursive process called oscillation.
The frequency of oscillation is the number of complete cycles per second. A
little thought shows that, for the network shown, this is given by…
5 For those who failed their physics…power is energy “consumed” per unit time.
Fan-out
As discussed above, gates are fabricated from normally-open switches which
require a certain potential to close. The input must be charged for a period of
time until this potential is reached and the switch closes. The charging time is
thus dependent on two factors…
• Flow rate
• Operation potential
The rate of flow out of a gate is dependent on the available energy from the
power supply to the gate. Some constriction or resistance to flow is essential
since gates must share the available energy.
The flow rate into the gate depends on that out of the previous one. The fan-out
is the number of gates which may be connected to the output of a single gate and
still work reliably. The flow available at the output of a single gate is fixed and
determined by its effective output resistance. Connecting more than one gate
input to it implies dividing up this flow. As a result, a longer charging time is
required for each successor gate and their propagation delay is thus increased.
Fan-out and propagation delay must therefore be traded off against one another.
5.2
Sequential systems
5.2.1
Physical nature
Synchronization
Processes are entities which possess state and which communicate with other
objects. They are characterized by a set of events of which they are capable.
All this may seem a little formal and mysterious. Not so! You are surrounded
by such processes. For instance, a clock or watch (digital or analogue) is such a
beast. It has state which is simply some means of storing the time. This is likely
to be the orientation of cog wheels if it is analogue and a digital memory if it is
digital. It also communicates with other processes, such as yourself, by means of
some sort of display. Note that this is a one way communication, from the clock
to you. There is likely also to be some means for you to communicate to it, e.g.
in order to adjust it. Its reception of the new time and your transmission of it are
events which belong to the clock and you respectively. The two combined form a
single event of an abstract object of which both the clock and you are part (e.g.
your world).
All events fall into three classes…
• Assignment of state
• Input
• Output
Right at the heart of a computer there is a device which causes a single pre-
defined process to occur as a result of each command (or instruction) from an
alphabet (or instruction set). It is called a control unit. It is discussed in detail in
Chapter 7.
This section forms the first step towards an understanding of it, without which
it is impossible to truly understand a computer. The reader is warned to
remember the above definition of a process and to always return to it when
confused. Any process, except those composed of single events, may be broken
down into smaller processes which may run concurrently or successively. In this
part of the book we shall discover how to assemble more and larger hardware
processes until we are able to construct a complete computer. From that point
on, larger processes are constructed in software.
Two (or more) processes are said to run synchronously if some of their events
occur simultaneously (i.e. at the same instant of time). It is impossible for
processes to synchronize without communication. The necessary communication
may not be direct. Indeed it is inefficient to synchronize a large number of
processes directly. A much better way is to have all of them share a common
clock.
The world may be thought of as a (very) large collection of processes which
are synchronized to a single common clock (or time base). It is the job of a
standards institute to define this clock to a high degree of precision. To most of
us the clock on the wall suffices.
Synchronous communication is…
Temporal logic
A theory is needed on which to base usable methods of specifying systems which
change with time. Temporal logic is at present an active field of research which
promises many applications within computer science, such as…
The successor operator operates on two instants and evaluates true if the first
precedes the second. The timed truth function of an instant and a proposition
evaluates true if the proposition is true at that instant. This function effectively
maps the truth of the proposition for all time.
Temporal operators are unary operators which establish the truth of a
proposition with respect to time. Intuitively these operators must fall into the
following categories…
Assertions that something will always be true in the future and/or the past may
be constructed via negation. For example, such a formula might be used to state “From now on,
it will rain forever.”. It may be interpreted6 literally as “It is not true that, at some
future time, it will not rain.”. The timed truth function decides the truth or
otherwise of the assertion. In reality, in this case, it would take forever to
evaluate. This clearly places some sort of constraint on the range of applications
of temporal logic. It would seem well suited to the formal specification of system
behaviour.
One problem which exists for system specification is that of how to specify
precisely when an event, within the system process, is to occur. It is possible
either to specify that an event coincides with an instant or that it occurs within
the interval between the instant and its predecessor. What is needed is a set of
temporal operators which suits the needs of the process model computation.
This means that it should be tailored to specify and reason about sequences of
events. Such a set is composed of three unary and two binary operators as
follows…
• Temporal operators (○, □, ◇, U, S)
• Boolean connectives (∧, ∨, ¬)
• Set of atomic propositions
6 Note how, in English, we often have to resort to phrases, such as “from now on”, to
Properties of temporal logic allow the abstraction of requirements and thus the
proper separation of specification from design. It is possible to merely consider
what is wanted, without consideration of the constraints of what is available for
implementation.
One great advantage of such an approach is that theorems of temporal and
propositional logic may be applied to the specification to verify that it is
consistent, i.e. that no pair of requirements conflict. This is a form of temporal
reasoning.
A second formidable advantage is that the same reasoning may be applied to
transform the specification into a design given…
The ideas discussed here are inadequate for such ventures. For instance, predicate
logic is used, to achieve the expressivity required, rather than simple
propositional logic. The (brave) reader is referred to [Halpern et al. 83] and
[Moszkowski 83].
Although the Mealy machine is the more general formulation, the Moore
machine is just as general in application but may require more state (memory).
Figure 5.13 depicts a schematic of both.
Table 5.6: State transition tables for Moore and Mealy state machines
        Moore machine                     Mealy machine
State  Output  Next state         State  Output       Next state
               x=0   x=1                 x=0   x=1    x=0   x=1
A      0       A     B            A      0     0      A     B
5.1. COMBINATIONAL SYSTEMS 119
5.2.3
Physical implementation
Requirements
In order to implement simple processes as real physical (i.e. hardware) systems
we may use the model of one or other type of state machine.
Synchronous communication will be used for the sake of simplicity (using
nothing more complicated than a wire, given electrical technology).
This reduces our problems to those of implementing…
• 1-bit memory
• Clock
A 1-bit memory is a bistable whose state may be written and read synchronously
with a clock. A clock is an unstable system, oscillating between two states and
outputting ticks to all components of the system.
From these two primitives it is possible to build systems implementing simple
communicating sequential processes. Systems constructed in this way are
referred to as sequential logic.
Clock
Figure 5.9 offered a possible solution for a clock. (Feedback, the thing to be
avoided in combinational logic, is the key to sequential logic.) This system has
two states, neither of which is stable. The rate at which it oscillates between
them is totally dependent on tpd, the propagation delay of the gates employed.
Here lies a problem.
7 This is rather like picking up the mail on coming home in the evening. Who cares when
the postman came! One only cares about what the mail actually is.
120 CHAPTER 5. ELEMENT LEVEL
8 Like water flowing downhill into the lake behind a dam until a sufficient weight of
1-bit memory
It is very important to realize that any bistable system may be used to implement
a 1-bit memory. Such a system may be made from normally-open switches, as
shown in Figure 1.3. Figure 5.15 shows the same bistable in gate form.
The state adopted when energy is supplied will depend on which gate (switch)
wins the race. However, the system may be disturbed into one or other state by
some external process as we shall see.
In order to consider the function of 1-bit memory cells we shall consider
various types. Because we know all about gates now, we shall use them to
implement memory. Figure 5.16 (upper) shows a bistable gate configuration
which should be compared with Figure 5.8. The two inputs are Reset and Set.
Their function should be plain from the state transition table below…
Figure 5.16: RS latch fabricated from both NOR and NAND gates
It is equally common to use NAND gates to implement a latch, in which case the
inputs are negative logic and thus labelled S̄ and R̄. Figure 5.16 (lower) depicts
such a latch and shows how it is usually drawn, with inputs and outputs on
opposite sides.
The system behaviour with RS=00 is exactly the same as that depicted in
Figure 5.8. When power is first applied the output is determined solely by
whichever is the faster gate (regardless of by how little). Subsequently it is
determined by the previous state, i.e. its memory. The behaviour with RS=11 is
rather boring and not usable. Depending on the gate type used, both outputs are
0 or both are 1. As a result it is usual to treat this input configuration as illegal and
take care to ensure that it can never occur. The remaining two possible input
configurations are used to write data into the latch. As long as RS=00, previously
written data will be “remembered”.
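The behaviour just described can be checked with a small simulation. This is a sketch of my own (not from the text): the cross-coupled NOR pair of Figure 5.16 (upper) is modelled by iterating the feedback loop until the outputs settle.

```python
# Sketch: a NOR-gate RS latch, settled by iterating the feedback loop.

def nor(a, b):
    return int(not (a or b))

def rs_latch(R, S, q=0, qbar=1):
    """Settle the cross-coupled NOR pair; returns (Q, Qbar)."""
    for _ in range(4):                      # a few passes suffice to settle
        q, qbar = nor(R, qbar), nor(S, q)
    return q, qbar

assert rs_latch(R=0, S=1) == (1, 0)             # write a 1 (Set)
assert rs_latch(R=0, S=0, q=1, qbar=0) == (1, 0)  # RS=00: state remembered
assert rs_latch(R=1, S=0, q=1, qbar=0) == (0, 1)  # write a 0 (Reset)
```

Trying `R=1, S=1` in this model drives both outputs to 0, illustrating why that input configuration is treated as illegal.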
The disadvantages of an RS latch are twofold…
The primitive systems discussed in the remainder of this chapter are all built
around the RS latch since it is the simplest possible bistable system which may
be written as well as read. We will show how the idea may be extended to enable
synchronous communication with other primitive systems, and simple
programming of function.
A modification to the RS latch which frequently proves very useful (e.g. in
system initialization) is the provision of extra inputs for…
Figure 5.17: RS latch with added set, clear and enable inputs
The input gating allows a clock to switch it on and off so that the system will
ignore data except when it is enabled. When enabled, output will follow input
with only the propagation delay, which we assume to be small compared to the
clock period. This property is referred to as transparency and has an effect on
communication which is sometimes useful and sometimes highly undesirable. A
second device connected to the output of a transparent one will receive its input
effectively simultaneous with the first. A latch is a transparent memory cell
which implies that it cannot be both read and written simultaneously. To
summarize…
The two halves of a clock-cycle are known as phases. We can use two D latches
to construct a D flip-flop as shown in Figure 5.20. Note that the state transition
table is exactly the same as for the D latch. The differences are that input and
output are now synchronized to a clock and that the new device may be
simultaneously read and written.
The second problem is that some way of disabling the memory is required so
that the memory may be commanded to input or do nothing. The solution to this
one is simply to duplicate the enable connections of both latches and join one
from each together.
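The master-slave arrangement of Figure 5.20 can be sketched behaviourally. This is an assumed model, not a gate-level transcription: two transparent D latches are enabled on opposite phases of the clock, so the old value may be read while the new one is written.

```python
# Sketch: a D flip-flop from two D latches enabled on opposite clock phases.

class DLatch:
    """Transparent D latch: output follows input while enabled."""
    def __init__(self):
        self.q = 0
    def step(self, d, enable):
        if enable:
            self.q = d
        return self.q

class DFlipFlop:
    """Master-slave pairing: never transparent end-to-end."""
    def __init__(self):
        self.master = DLatch()
        self.slave = DLatch()
    def tick(self, d):
        out = self.slave.q                 # phase 1: slave holds the old value
        self.master.step(d, 1)             # ...while the master samples D
        self.slave.step(self.master.q, 1)  # phase 2: slave copies the master
        return out                         # the value read on this tick

ff = DFlipFlop()
assert ff.tick(1) == 0    # old state read while the new one is written
assert ff.tick(0) == 1
```

Note how each `tick` returns the state held before that tick: this is exactly the property that lets the flip-flop, unlike the latch, be read and written simultaneously.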
Values are said to be clocked in and out on each clock tick. Synchronous
communication with other components of a system is therefore possible.
10 See Chapter 3.
Feedback is once again employed to determine the state. Its effect is perhaps best
understood as preventing the RS latch from being both set and reset
simultaneously. RS latches may be used with added set and clear inputs which
may be used asynchronously, to implement JK flip-flops. Table 5.7 shows the
functions which may be programmed.
The set and clear inputs are called direct inputs and act asynchronously, i.e. their effect is
immediate regardless of the clock. They allow the state of the flip-flop to be
ensured at a time prior to the start of a process.
One major problem arises with the JK flip-flop. It is known as the ones-catching
problem (Figure 5.23). An input transient state is recorded by an enabled JK flip-
flop. D flip-flops do not suffer from this problem. A transient D=1 causes a
transient Set input to the RS latch (inside the D latch) which is quickly followed
by a Reset input.
We would very much like input states to be inspected only at discrete points in
time so that systems may be noise immune and so that synchronous
communication may be rendered secure. An edge-triggered design for the D flip-
flop is shown in Figure 5.24. The explanation of its operation is left as an
exercise.
Figure 5.25 shows an edge-triggered JK flip-flop design. No input which fails
to persist over a state transition is recorded by the flip-flop. An explanation of its
operation is also left as an exercise. We shall see later how this device may form
a useful programmable processor within a computer.
Lastly, an edge-triggered T flip-flop may be implemented simply by
substituting an edge-triggered D flip-flop in Figure 5.21.
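The four behaviours which the JK flip-flop may be programmed to perform (tabulated in Table 5.7, not reproduced here) follow from the standard next-state function Q′ = J·Q̄ + K̄·Q. A sketch, assuming the conventional edge-triggered behaviour in which the state is updated only on a clock edge:

```python
# Sketch of the standard JK next-state function: hold, reset, set or toggle.

def jk_next(j, k, q):
    """Next state on a clock edge: Q' = J.not(Q) + not(K).Q"""
    return int((j and not q) or (not k and q))

q = 0
q = jk_next(1, 0, q)    # JK=10: set
assert q == 1
q = jk_next(0, 0, q)    # JK=00: hold
assert q == 1
q = jk_next(1, 1, q)    # JK=11: toggle
assert q == 0
q = jk_next(0, 1, q)    # JK=01: reset
assert q == 0
```

Because the function is evaluated only at the edge, a transient input between edges is never recorded, which is the point of the edge-triggered designs of Figures 5.24 and 5.25.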
5.2.4
Physical properties
Time intervals
Nothing in reality can ever happen truly instantaneously. No gate can change
state instantaneously. Enough potential must be accumulated at each switch
within it to cause it to operate. For this reason the behaviour of any flip-flop is
characterized by three time intervals…
• Setup time
• Hold time
• Propagation delay
Figure 5.27: Setup, hold and propagation delay time intervals of a flip-flop
Power considerations
The same comments apply to flip-flops as did to logic gates, under this heading.
Thermodynamics tells us that it is impossible to change the state of something
without converting potential energy into kinetic or some other form of energy,
i.e. to do work. It also tells us that it is impossible to do work without heat
dissipation. Flip-flops thus both consume energy and get hot when they operate.
The same comments about fan-out apply to flip-flops as were made with
regard to logic gates. The output of each is only able to drive a limited number of
inputs to other devices, depending on both its own output flow resistance and the
input potential required to operate the others in the time available. Fan-out and
speed of operation must be traded-off by the designer.
Depending on the physical bistable employed a flip-flop may require energy to
maintain one (or both) of its operating states. If the actual state itself of the
bistable is not to be used to drive the output, then it is possible for the device to
retain its state when the power supply is removed. Such a memory is referred to
as non-volatile. A non-volatile bistable memory has a potential function like that
shown in Figure 5.11 and must exhibit hysteresis. A volatile memory has no
hysteresis and must be maintained in its upper state by some other source of
energy (such as a stretched spring). Energy is always required to change the
state of any bistable, but not necessarily to maintain it.
Power considerations of memory devices may be summarized…
• Consumption
• Dissipation
• Fan-out
• Volatility
Exercises
Question one
Use Boolean algebra to prove that…
1. A two-level NAND gate structure may be used to implement a standard sum
of products
2. A two-level NOR gate structure may be used to implement a standard
product of sums
(Use the examples shown in Figures 5.5 and 5.6 reduced to just four inputs in
each case).
Question two
i Show how a bitwise logical operator might be used to negate an integer stored
in memory using sign-magnitude representation.
ii Describe how the same operator is used in the negation of a value
represented in twos-complement form.
Question three
Show how both NAND and NOR gates may be used to implement the Boolean
functions…
• NOT
• OR
• AND
Question four
i Using Boolean algebra, derive implementations of the following function using
only…
• NAND
• NOR
ii Devices which implement this truth function are available commercially and
are referred to as and-or-invert gates. Prove that, alone, they are sufficient to
implement any truth function.
Question five
i Expand the NAND implementation of the RS latch, shown in Figure 5.16, into
a network of normally-open switches.
ii Discuss the requirements specification that an engineer would need in order
to implement a practical device.
Question six
Explain, in detail, the operation of the system shown in Figure 5.28.
Question seven
i Explain the difference between a latch and a flip-flop. What are the limitations
of latches when used as 1-bit memories?
ii Explain, in detail, the operation of the edge-triggered D flip-flop of
Figure 5.24.
6.1
Combinational system design
6.1.1
Boolean algebra
Specification
The behaviour of a purely combinational system must be determined by the
specification of a function. If the system input and output channels are of binary
digital form then we may regard each as a set of Boolean variables. Each output
may thus be specified as a Boolean function of the input variables. It is simplest
to generalize by specifying each output as a function of all input variables.
Table 6.1: Truth table for three input, and four output, variables
A B C Q1 Q2 Q3 Q4
0 0 0 1 0 0 0
0 0 1 1 1 1 0
0 1 0 1 1 1 1
0 1 1 1 0 1 0
1 0 0 0 0 1 0
1 0 1 0 1 1 1
1 1 0 0 1 1 1
1 1 1 0 0 0 1
The easiest way to derive the function is to specify the truth table. On the left-
hand side are written all permutations of the values of input variables. Given n
input variables there will thus be 2^n rows. Columns correspond to the variables
themselves. On the right-hand side are written the desired values for each output variable.
138 CHAPTER 6. COMPONENT LEVEL
Table 6.1 shows an example. Note that Q1 is in fact only a function of A and is
quite independent of B and C. Q2 is only a function of B and C and is
independent of A. Only Q3 and Q4 need be expressed as functions of all three
input variables.
You should now pause for thought about the precise meaning of a truth table
such as the one shown. Writing a “1” in a right-hand column indicates that the
function value will be 1 if the input variables take the values found in the same
row on the left-hand side. For example, {A,B,C}={0,0,1} is the first condition in
the table upon which Q3=1. The output variable is asserted given the truth of any
one of the conditions. Hence the function may be defined by the Boolean
expression formed from the OR of all of them.
The expression so formed is called the standard sum of products. Each
product is called a minterm. Thus the specification of the Boolean function called
Q3 may be deduced from the truth table to be…
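The same deduction can be carried out mechanically. A sketch (illustrative only, not the book's notation): collect one minterm per row of Table 6.1 in which Q3 is 1, then OR them together.

```python
# Sketch: the standard sum of products for Q3 of Table 6.1, built by OR-ing
# the minterms, one per row in which the output is 1.

q3_column = [0, 1, 1, 1, 1, 1, 1, 0]     # Q3 against rows ABC = 000..111

minterms = [row for row, out in enumerate(q3_column) if out]

def q3(a, b, c):
    """Evaluate the standard sum of products directly."""
    return int((a << 2 | b << 1 | c) in minterms)

# The function is 1 exactly on the rows marked 1 in the truth table.
assert [q3(r >> 2 & 1, r >> 1 & 1, r & 1) for r in range(8)] == q3_column
```

Here the six minterms are rows 1 to 6 (001 through 110): the same six product terms whose reduction is pursued below.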
Note that this is not the simplest expression of the function. Like arithmetic
algebra, it is possible to reduce the number of terms and hence the number of
gates needed to implement the system.
It is quite possible to draw up a table which represents the function as a
standard product of sums. For some reason this does not appeal to intuition. Our
minds seem to prefer to AND together variables and OR together terms.
Both the standard sum of products and the standard product of sums form
equally valid expressions of the function. DeMorgan’s laws may be used to
interchange between the two.
Reduction
We now come to the first of three methods which may be employed to arrive at a
useful design of a combinational logic system to meet a specification given by
either…
• Truth table
• Boolean function(s)
6.1. COMBINATIONAL SYSTEM DESIGN 139
Examples
The reduction of Q3 proceeds as follows. First we note down all pairs of terms
which differ in just one variable. Numbering terms from left to right, these are…
• {1,5}
• {1,3}
• {2,6}
• {4,6}
Those terms which appear more than once in this list are duplicated, by using the
law of idempotence (a + a = a), until one copy exists for each pairing. Pairs are then
combined to reduce the expression.
…yielding…
The disadvantages of this method of simplification are that it is not very easy to
spot the term pairs (especially given many input variables) and that it is prone to
error.
6.1.2
Karnaugh maps
Derivation
An alternative to presenting and manipulating information in textual form is to
use a graphical one. Where information is broken into a number of classes, such
as the values {0,1} of an output variable, a graphical form is usually much more
effective.
The graphical alternative to a truth table or Boolean function is the Karnaugh
map. A box is drawn for each state of the input variables and has a “1”, “0” or
“X” inscribed within it, depending on what is required of the output variable.
“X” denotes that we don’t care! In other words, each box represents a possible
minterm. Each “1” inscribed represents an actual minterm, present in the
expression.
The K-map graph in fact wraps around from top to bottom and from right to left.
In reality it has a toroidal topology, like that of a doughnut. It must, however, be
drawn in the fashion shown in Figure 6.1. A separate K-map must be drawn for
each output variable. Figure 6.1 shows an example of a Karnaugh map for three
input variables and thus has eight (2^3) boxes.
It is simple to derive a K-map from a truth table specification by transferring
each 1 (minterm) into its appropriate box. The K-map makes the process of
reduction much easier because it renders as neighbours minterms differing in
one variable only.
Reduction
Each minterm is said to be an implicant of a function Q because it implies Q=1.
A prime implicant of Q is one such that the result of deleting a single literal is
not an implicant.
The minimal implementation of Q is the smallest sum of prime implicants
which guarantees the specified function correct for all possible input states.
Examples
Figure 6.2 depicts the K-maps for Q3 and Q4 as specified in Table 6.1. In the case
of Q3, no group with more than two minterms is possible. The solutions depicted
may be presented as Boolean expressions…
Which of the two possible solutions is optimal (if any) will depend on practical
circumstances.
Just one solution is possible for Q4…
Hazards
One potentially serious problem emerges from the physical nature of switches
and hence logic gates. The finite gate propagation delay, together with the
unsynchronized nature of communication between gates, gives rise to the
possibility of momentarily invalid output.
Such hazards occur when a single variable appearing in a pair of prime
implicants changes value. One becomes 0 as the other becomes 1. In any real
physical system it is possible that, for a brief period, both will be 0.
On a K-map potential hazards may be identified as vertical or horizontal
transitions between prime implicant groups. The solution is to add a redundant
prime implicant which covers the transition.
Limitations
K-maps for more than three variables are possible. The form for four variables is
shown in Figure 6.4. They become cumbersome for more than four or five variables.
6.1.3
Quine-McCluskey
Reduction
This method is generally applicable to problems with large as well as small
numbers of input variables. It is also suited to automation due to its tabular,
rather than graphical, approach.
There are two phases, one of reduction to a complete set of prime implicants
and a second to eliminate those which are unnecessary. The reduction phase
begins by tabulation of minterms according to the number of 1s. Terms differing
by just one variable are thus located nearby rendering them quick to find.
You should verify for yourself that further reduction is impossible in the
examples shown.
Minimization
Once the process of reduction has proceeded as far as possible, a minimal set of
prime implicants must be selected. Not all prime implicants are necessary, in the
sum, in order to assert all minterms required by the specification. A minimal set
may be selected with comparative ease using a table of prime implicants (row)
against minterms (column). A selection is made which is no larger than is
necessary to provide a complete set of minterms. One method is to follow up a
random initial selection avoiding those which duplicate a tick in a column
already dealt with.
The table and selection for Q4 are shown first (Table 6.4) because it is simpler. A
unique solution is possible as can be deduced through the use of a K-map.
The table and selection for Q3 (Table 6.5) demonstrate the derivation of both
solutions determined previously by K-map. One follows an initial selection of
the topmost prime implicant, the other the bottom one. Each solution is arrived at
quickly using the method mentioned above.
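The reduction phase lends itself to automation, as noted. A sketch of one possible implementation (my own, not the book's): terms are strings over '0', '1' and '-' (the dash marking an eliminated variable), and terms differing in exactly one fixed bit are repeatedly combined until no further combination is possible.

```python
# Sketch of the Quine-McCluskey reduction phase, run on Q3 of Table 6.1
# (minterms 1..6 of three variables).

from itertools import combinations

def combine(a, b):
    """Combine two terms differing in exactly one fixed bit, else None."""
    diff = [i for i in range(len(a)) if a[i] != b[i]]
    if len(diff) == 1 and '-' not in (a[diff[0]], b[diff[0]]):
        return a[:diff[0]] + '-' + a[diff[0] + 1:]
    return None

def prime_implicants(minterms, nbits=3):
    terms = {format(m, f'0{nbits}b') for m in minterms}
    primes = set()
    while terms:
        combined, used = set(), set()
        for a, b in combinations(sorted(terms), 2):
            c = combine(a, b)
            if c:
                combined.add(c)
                used.update({a, b})
        primes |= terms - used          # uncombined terms are prime
        terms = combined
    return primes

# Q3 = 1 on minterms 1..6: six prime implicants of two minterms each.
assert prime_implicants({1, 2, 3, 4, 5, 6}) == {'0-1', '-01', '01-',
                                                '-10', '10-', '1-0'}
```

The minimization phase then selects a cover; for instance {'0-1', '-10', '10-'} (i.e. A̅C + BC̅ + AB̅) covers all six minterms with three prime implicants, matching one of the two K-map solutions found earlier.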
6.2
Sequential system design
6.2.1
Moore machines
State assignment
We begin this section with a brief recapitulation of the nature of a Moore state
machine. Figure 5.13 shows that the internal state of the machine is updated by a
functional mapping from a value input upon each tick of a clock. Values output
are similarly derived and synchronously communicated.
The property which distinguishes the Moore machine from the Mealy machine
is that output is strictly a function of state alone. It is decoupled from the input by
the internal state. The consequences of this are primarily…
State machines are used within computers to cause some predetermined process
to run upon receipt of a single simple instruction. The next and final chapter in this
part of the book will demonstrate how they are used to execute a stored
program, i.e. a sequence of instructions.
The example used below, to illustrate a logical approach to designing both
Moore and Mealy state machines, turns this idea around. Instead of causing a
sequence of events following a signal, we shall investigate how to cause a signal
following receipt of a given sequence. In other words, the machine has to
recognize a pattern. It illustrates some very important ideas which have wide
technological application.
Any state machine may be designed using the logical approach employed
here. The first thing to do is to assign state. This entails deciding just what
events you wish to occur and in what order. Once that is decided, each state is
given a label to provide a brief notation. It is sensible to include in the state
assignment table a brief description of the event corresponding to each state. The
number of states will dictate the number of flip-flops required.
Table 6.6 shows the state assignment table for the Moore state machine which
is to detect (recognize) the bit string 1001. Since there are five states three flip-
flops will be needed6. The binary representations of each state are completely
arbitrary. It is easiest to simply fill that part of the table with a binary count, thus
assuring that each state is unique. Here, each state is just a memory of events that
have already occurred. In this instance they provide memory of various partial
detections of the bit string sought.
6 Recall that the number of flip-flops i must be such that 2^i ≥ n where n is the number of
states.
Q2 Q1 Q0 Y
0 1 1 0
1 0 0 1
Table 6.9: Truth table for flip-flop excitations for Moore sequence detector
X Q2 Q1 Q0 D2 D1 D0
0 0 0 0 0 0 0
0 0 0 1 0 1 0
0 0 1 0 0 1 1
0 0 1 1 0 0 0
0 1 0 0 0 1 0
1 0 0 0 0 0 1
1 0 0 1 0 0 1
1 0 1 0 0 0 1
1 0 1 1 1 0 0
1 1 0 0 0 0 1
Finally the reduced Boolean algebra specification for the combinational logic for
a Moore state machine which detects the sequence 1001 is as follows…
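The intended behaviour of the detector can be modelled at a higher level. This is a behavioural sketch of my own, not derived gate-by-gate from the excitation table: the state is simply the longest suffix of the input seen so far which is also a prefix of the target 1001. Its five reachable states ("", "1", "10", "100", "1001") correspond to the five states A to E.

```python
# Behavioural sketch of a Moore detector for the bit string 1001.

def moore_detect(bits, target="1001"):
    state, outputs = "", []
    for b in bits:
        s = state + b
        while not target.startswith(s):
            s = s[1:]               # fall back to the longest suffix that
        state = s                   # is still a prefix of the target
        outputs.append(1 if state == target else 0)
    return outputs

# Overlapping occurrences are both detected (the shared 1 is reused).
assert moore_detect("1001001") == [0, 0, 0, 1, 0, 0, 1]
assert moore_detect("11001") == [0, 0, 0, 0, 1]
```

The second assertion exercises the subtle case: after 11 the machine must remain in the "seen 1" state, so that the 1001 embedded in 11001 is still caught.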
6.2.2
Mealy machines
Figure 6.6: K-maps for Moore machine to detect the sequence 1001
The state transition table (Table 6.11) appears a little more complicated, with the
addition of an extra column specifying the precise dependency of output on
input.
Table 6.12: Truth table for output and flip-flop excitations of Mealy sequence detector
X Q1 Q0 D1 D0 Y
0 0 0 0 0 0
0 0 1 1 0 0
0 1 0 1 1 0
0 1 1 0 0 0
1 0 0 0 1 0
1 0 1 0 1 0
1 1 0 0 1 0
1 1 1 0 1 1
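Table 6.12 can be executed directly, which is a useful check on a design. A sketch (Python used purely for illustration): each row of the table becomes one entry of a dictionary keyed on (X, Q1, Q0).

```python
# Sketch: driving the Mealy sequence detector straight from Table 6.12.
# Keys are (X, Q1, Q0); values are (D1, D0, Y).

table = {
    (0, 0, 0): (0, 0, 0), (0, 0, 1): (1, 0, 0),
    (0, 1, 0): (1, 1, 0), (0, 1, 1): (0, 0, 0),
    (1, 0, 0): (0, 1, 0), (1, 0, 1): (0, 1, 0),
    (1, 1, 0): (0, 1, 0), (1, 1, 1): (0, 1, 1),
}

def run(bits):
    q1 = q0 = 0
    outputs = []
    for x in bits:
        q1, q0, y = table[(x, q1, q0)]
        outputs.append(y)
    return outputs

# Overlapping occurrences of 1001 are both reported.
assert run([1, 0, 0, 1, 0, 0, 1]) == [0, 0, 0, 1, 0, 0, 1]
```

Note that, being a Mealy machine, the output 1 appears on the very tick in which the final 1 arrives, rather than one state later as in the Moore design.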
6.2.3
Summary
There follows a brief summary of the process of designing a state machine. It
applies equally to either Moore or Mealy types.
Figure 6.8: K-maps for Mealy machine to detect the sequence 1001
1. State assignment table: Decide the events which are to occur and arbitrarily
assign them binary states. Their number will dictate how many flip-flops are
required.
2. State transition diagram: Decide, and specify pictorially, how you wish the
system to behave.
3. State transition table: Tabulate the relationship of present state to next state
and output.
4. Truth tables for output & flip-flop excitation: These are derived directly
from state assignment table and state transition table.
6.3
Components
6.3.1
Logic units
Introduction
Before we are in a position to discuss typical components of a contemporary
computer architecture we must first discuss commonly used sub-components.
Sometimes these form part of a component, sometimes they are found in the
“glue” which joins components together. Only then will we approach the
components themselves which may be divided into two categories…
Several different types of register are discussed, some of which are capable of
carrying out operations on data words and thus are of great assistance in
implementing machine instructions.
Binary decoder
The first sub-component of interest is a system which decodes a binary value. By
“decoding” is meant the assertion of a unique signal for each possible value
input. Hence a 2-bit decoder has an input consisting of two signals, one for each
bit of a binary word, and four separate outputs, one for each input value.
Figure 6.9 shows a schematic diagram and Table 6.13 the truth table of a 2-bit
decoder.
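A sketch of the decoder's function (illustrative only, generalized to n bits): exactly one of the 2^n output lines is asserted for each input value.

```python
# Sketch: an n-bit binary decoder, one output line per possible input value.

def decode(value, n=2):
    """Assert exactly one of the 2**n output lines."""
    return [1 if i == value else 0 for i in range(2 ** n)]

assert decode(0) == [1, 0, 0, 0]      # the 2-bit case of Table 6.13
assert decode(2) == [0, 0, 1, 0]
```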
Binary encoder
It is sometimes necessary to binary encode the selection of one signal from a
collection of them. The device shown in Figure 6.10 effectively does the reverse
of a decoder. The diagram has been kept simple by showing only two bits. It
should be fairly obvious how to extend the system. One OR gate is required for
each output bit. Its truth table is shown in Table 6.14.
Each input bit which requires an output bit set is connected to the appropriate
OR gate. For example, the gate for the LSB must have inputs connected to every
alternate input bit so that the output LSB is set for all odd-valued inputs.
There exist two major problems with the simple system shown…
• Invalid output: Should more than one input signal be active, output will be
invalid
• Ambiguity: Identical output is achieved for both the least significant input
signal active and no input signal active
A simple solution would be to XOR all pairs of input lines to derive a valid
output signal8. An alternative is to prioritize inputs and design the system so that
the output is the consequence of the highest priority input. Such a system is
referred to as a priority encoder and proves to be much more useful9 than the
simple system shown in Figure 6.10. The design of such a system is left as an
exercise in combinational logic design.
Figure 6.11: Read-Only-Memory (ROM) with 3-bit address and 4-bit data
Read-Only-Memory (ROM)
Read-Only-Memory is a contradiction in terms, or at least a description of
something which would be totally useless. In truth the system described here is
Write-Once-Only-Memory but the author does not seriously intend to challenge
convention. Unfortunately, computer engineering is full of misleading (and
sometimes downright wrong) terms.
A ROM may be thought of as a linear memory system which receives an
address as its input and produces as output the data word “contained” in the
“register” at that address. The width of address and data words form parameters
of the system.
It may also be thought of as a decoder/encoder system. Binary code is used at
the input. The code at the output is decided by the complete “memory contents”.
ROM is in fact implemented in this fashion, as shown in Figure 6.11.
As stated above, in order to be used as a memory, the system must be written
at least once. ROM is written (once only) by fusing links between the inputs of
the output gates and the output of the address decoder. EPROM (erasable,
programmable ROM) is a widely
used erasable and reprogrammable ROM which allows multiple attempts to get
the contents right!
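The decoder/encoder view of Figure 6.11 can be sketched directly. The contents below are made up purely for illustration: the decoded address line drives those data outputs whose links are left intact, modelled here as an OR over the selected word.

```python
# Sketch: a ROM with 3-bit address and 4-bit data, as a decoder/encoder pair.

contents = [0b1010, 0b0001, 0b1111, 0b0000,    # eight 4-bit words,
            0b0110, 0b1001, 0b0011, 0b0101]    # one per 3-bit address

def rom_read(address):
    select = [int(i == address) for i in range(8)]   # the address decoder
    word = 0
    for line, data in zip(select, contents):
        if line:                                     # intact links OR onto
            word |= data                             # the output gates
    return word

assert rom_read(2) == 0b1111
assert rom_read(5) == 0b1001
```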
Perhaps the most important use of ROM is as non-volatile memory for
bootstrapping computers, i.e. giving them a program to run when they are first
8 This may be achieved in parallel by 2n XOR and one OR gates (fast but expensive) or in
ripplethrough fashion by n−1 XOR gates (slow but cheap).
9 e.g. to encode an interrupt vector in a vectored interrupt system (see Chapter 9).
switched on. We will see in the next chapter that ROM is also sometimes used in
the implementation of processor control units.
Multiplexer (MUX)
The purpose of a multiplexer is to switch between channels. It is simply a multi-
way switch. An example of a MUX is the channel selector on a television. The
source of information for the channel to the system which builds images on the
screen is switched between the many available TV broadcast channels. Its truth
table is shown in Table 6.15 and schematic diagram in Figure 6.12.
Table 6.15: Truth table for multiplexer (MUX) with four inputs
S1 S0 Q
0 0 D0
0 1 D1
1 0 D2
1 1 D3
Demultiplexer (DEMUX)
A DEMUX performs the inverse function of the MUX. A single input value is
routed to an output channel selected by the two select control inputs. Figure 6.13
shows the combinational logic required. Its truth table is shown in Table 6.16.
It was described above how a decoder might enable (switch on) a system
selected from a number of others by an address. If all the systems are connected
to a common data bus, only the one enabled will read or write a data word. This
assumes parallel transmission of all data word bits.
Table 6.16: Truth table for demultiplexer (DEMUX) with four outputs
D S1 S0 Q3 Q2 Q1 Q0
d 0  0  0  0  0  d
d 0  1  0  0  d  0
d 1  0  0  d  0  0
d 1  1  d  0  0  0
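Both devices reduce to routing by the select inputs, which a short sketch (illustrative, not the book's notation) makes plain. Here the output list is ordered Q0 to Q3, the reverse of the table's column order.

```python
# Sketch: the 4-way MUX of Table 6.15 and the DEMUX of Table 6.16.

def mux(d, s1, s0):
    """Route the selected one of four inputs to the single output."""
    return d[s1 * 2 + s0]

def demux(d, s1, s0):
    """Route the single input to the selected one of four outputs (Q0..Q3)."""
    q = [0, 0, 0, 0]
    q[s1 * 2 + s0] = d
    return q

assert mux([7, 8, 9, 10], 1, 0) == 9        # S1 S0 = 10 selects D2
assert demux(1, 0, 1) == [0, 1, 0, 0]       # S1 S0 = 01 selects Q1
```

Connecting a MUX output to a DEMUX input with the same select values reproduces the time-multiplexed channel of the telephone example.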
the address bits to its select control inputs. Refer back to the telephone example
of time-multiplexing given above. Assuming the signal is digitally encoded,
make an estimate of how rapidly the MUX and DEMUX must switch to form a
useful speech communication system12.
Half-adder
The truth table for 1-bit binary addition is shown in Table 6.17.
It is easily seen that 1-bit binary addition is equivalent to a XOR operation and
that the carry may be generated by an AND. The system which performs a 1-bit
binary addition with carry output is called a half-adder and is depicted in
Figure 6.14.
A half-adder cannot be used in the addition of multiple-bit words because it
cannot add in a carry from the next less significant bit.
12
For “hi-fi” sound quality approximately 50,000 12-bit samples of a microphone output
must be transmitted per second.
Full-adder
The full-adder adds one bit from each of two data words and an incoming carry.
It is easily made from two half-adders. As is readily seen from the truth table
below, the outgoing carry is the OR of the carry outputs from both half-adders.
Figure 6.15 shows the complete combinational logic system of a full-adder. Its
truth table is shown in Table 6.18.
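The construction from two half-adders can be verified exhaustively in a few lines. A sketch (Python for illustration), following the description above of Figure 6.15:

```python
# Sketch: a full-adder built from two half-adders, the outgoing carry being
# the OR of their two carry outputs.

def half_adder(a, b):
    return a ^ b, a & b               # sum is XOR, carry is AND

def full_adder(a, b, cin):
    s1, c1 = half_adder(a, b)
    s, c2 = half_adder(s1, cin)
    return s, c1 | c2                 # carry out: OR of both half-adders

# Exhaustive check against binary addition of three bits.
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = full_adder(a, b, cin)
            assert cout * 2 + s == a + b + cin
```

Chaining the carry output of each full-adder to the carry input of the next gives the ripple-through adder whose speed limitation is discussed below.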
Select Operation
10 A−B
11 Not allowed!
There is but one serious disadvantage with this system. The result output will not
be valid until the carry has rippled through from LSB to MSB. The time taken
for this to finish would limit the speed of arithmetic computation. For this
reason, although it is quite instructive, the ripple-through arithmetic unit is rarely
used nowadays. There is a better way.
Figure 6.17 shows the logic for the system which replaces the full-adder, in the
arithmetic unit, for generating the sum. A separate system to generate carry must
also be included for each bit of the result word. Figure 6.18 shows the schematic
for the logic required. The products may be generated merely by adding an extra
input to the AND gate for each term. A complete 4-bit LAC adder is shown in
Figure 6.19.
LAC adders, as shown, may be connected together so that the carry output of
each is connected to the subsequent carry input. The final carry output relies on
its predecessors' carry values rippling through. It is, however, possible to design a
multiple-bit LAC adder which allows for larger scale carry generation. In
addition to the sum outputs, group generate and group propagate outputs are
produced, formed from the Gi and Pi. These are then used as inputs to a carry
generator as before to generate, bit-parallel, the carry inputs to each adder and
the final carry output. The deduction of these combinational functions is left as
an exercise.
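The look-ahead principle can be illustrated with a Python sketch of a 4-bit LAC adder. The bit-level definitions used here, Gi = Ai·Bi (generate) and Pi = Ai⊕Bi (propagate), are the conventional ones and are assumed rather than quoted from the text; each carry is a sum of products of the Gi, Pi and the incoming carry, computed without any ripple.

```python
def cla_adder_4bit(a, b, c0=0):
    """4-bit look-ahead carry adder.  a and b are lists of bits, LSB first."""
    g = [a[i] & b[i] for i in range(4)]   # generate:  Gi = Ai.Bi
    p = [a[i] ^ b[i] for i in range(4)]   # propagate: Pi = Ai xor Bi
    # Every carry is formed directly from the Gi, Pi and c0 -- in parallel,
    # with no ripple from bit to bit.
    c = [c0, 0, 0, 0, 0]
    c[1] = g[0] | (p[0] & c0)
    c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) \
         | (p[2] & p[1] & p[0] & c0)
    c[4] = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) \
         | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0)
    s = [p[i] ^ c[i] for i in range(4)]   # sum bit: Si = Pi xor Ci
    return s, c[4]

# Check every pair of 4-bit operands against integer addition.
for x in range(16):
    for y in range(16):
        a = [(x >> i) & 1 for i in range(4)]
        b = [(y >> i) & 1 for i in range(4)]
        s, cout = cla_adder_4bit(a, b)
        assert sum(s[i] << i for i in range(4)) + (cout << 4) == x + y
```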
6.3.3
Registers
Data register
A register is simply a collection of flip-flops wide enough to contain one complete
word of data. Registers are read onto, and written from, a data bus, which is of
width equal to that of the registers themselves. Some means must be provided to
enable the connection of any one particular register to the bus and to ensure that
it either reads or writes the bus but not both simultaneously.
The gates connected between the flip-flop outputs and the bus may be
regarded merely as switches. Their connection to the bus is said to form a tristate
channel because three different events may occur upon each clock tick…
• Logic 1 passed
• Logic 0 passed
• Nothing passed (Flip-flop completely disconnected from bus)
The need for tristate buffering is clear. Without it the outputs of a register may
be connected via the bus to the outputs of another, leaving the state of the bus
(and hence the inputs of any third register enabled for reading) undecided.
Figure 6.20 shows the two least significant flip-flops of a data register and their
direct input, and tristate output, connection to the data bus.
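The bus-resolution rule behind tristate buffering can be modelled in a few lines of Python (a toy sketch under the three-event description above; the class and method names are invented for illustration):

```python
class TristateBus:
    """Toy model of a bus line driven through tristate buffers.  Each
    driver either passes logic 0 or 1, or is disconnected (None,
    i.e. high impedance)."""

    def resolve(self, drivers):
        active = [v for v in drivers if v is not None]
        if len(active) == 0:
            return None            # no register enabled: bus floats
        if len(active) > 1:
            # Two registers enabled for reading at once: bus state undecided.
            raise ValueError("bus contention: two registers enabled at once")
        return active[0]
```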
• R/W (Read/Write)
• En (Enable)
R/W means “read not write” when set and therefore “write not read” when clear.
It is standard practice to refer to control signals by their function when asserted.
En serves to decide whether a read or write operation is to be allowed or not.
Remember that, unlike the data inputs, control inputs are asynchronous. They
must be set up prior to the clock tick when the operation is required.
…according to the value of the MUX select inputs, which form the control
inputs to the register. As well as shifting data right or left, the bidirectional shift
register can read and write the data bus. A schematic diagram is shown in
Figure 6.21.
Flip-flops at each end of the register are able to read from an external source,
which may then be shifted further through the word. The external source may be
a data source external to the computer in the form of a serial channel. If the two
ends are connected together the shift operation is then known as register rotation.
The control inputs to the bidirectional shift register are two MUX select bits.
• S0
• S1
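One clock tick of the bidirectional shift register can be sketched in Python. The encoding of the two select bits used here (0 hold, 1 shift right, 2 shift left, 3 parallel load from the bus) is an assumption for illustration, not taken from the text; rotation is obtained by feeding the shifted-out bit back in as the serial input.

```python
def shift_register_step(q, s, bus=None, serial_in=0):
    """One clock tick of an N-bit bidirectional shift register.
    q is the word as a list of bits, LSB first; s is the 2-bit MUX select
    (encoding assumed: 0 hold, 1 shift right, 2 shift left, 3 load)."""
    if s == 0:
        return list(q)               # hold present state
    if s == 1:
        return q[1:] + [serial_in]   # shift right (towards the LSB)
    if s == 2:
        return [serial_in] + q[:-1]  # shift left (towards the MSB)
    return list(bus)                 # parallel load from the data bus
```

Connecting the two ends (`serial_in=q[0]` on a right shift) gives register rotation.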
Up/down counter
Now consider the design of a register which is capable of counting up or down in
binary. As we shall see in the next chapter, such a register is very useful indeed
in the design of a processor.
A counter is a state machine since it has state which varies in a controlled
manner. We shall consider a Moore machine design. Real counters have finite
state and hence can only count from 0 up to M−1, where M is the modulus of the
register (2^N for a register of width N). In order not to obscure an example with
unnecessary detail, we shall follow the design of a modulo-8 counter.
There is no need for a state assignment table since we know there are eight states
and will choose to label them by decimal enumeration. Figure 6.22 shows the
state transition diagram for a Moore state machine with a single input, U, which
decides the direction of count. Notice that the state returns to 0 after reaching 7.
From this we deduce the state transition table, Table 6.21. Having chosen to use
T flip-flops, a truth table may be deduced for the flip-flop inputs (Table 6.22).
It is quite sufficient to employ K-maps in the design of the count logic since
only four input variables need be considered. Figure 6.23 gives them for the
excitation logic for the two most significant bit flip-flops. It is obvious from the
truth table that T0=1. From this observation and the K-maps we conclude that…
Table 6.21: State transition table for Moore modulo-8 up/down counter
State Next state
U=0 U=1
Q2 Q1 Q0 Q2 Q1 Q0 Q2 Q1 Q0
0 0 0 1 1 1 0 0 1
0 0 1 0 0 0 0 1 0
0 1 0 0 0 1 0 1 1
0 1 1 0 1 0 1 0 0
1 0 0 0 1 1 1 0 1
1 0 1 1 0 0 1 1 0
1 1 0 1 0 1 1 1 1
1 1 1 1 1 0 0 0 0
If you are wide awake you just might have noticed something very useful. It is
possible to rewrite the expression for T2 as…
It is simple to combine these to yield the combinational logic specified above for
the up/down counter shown in Figure 6.24. The count logic may be specified via
recursion as…
One further useful observation may be made. Should T0=0, counting would cease
since all subsequent T inputs are effectively combined via conjunction (AND)
with T0. Hence we may utilize T0 as an enable input. When it is asserted the
counter counts up or down according to the value of U, otherwise it simply
maintains its state.
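The recursion just described can be sketched in Python: T0 is the enable, and each subsequent toggle input is the previous one ANDed with the lower flip-flop's state (counting up) or its complement (counting down). The function name and bit ordering are chosen for illustration.

```python
def updown_step(q, u, en=1):
    """One tick of a modulo-8 up/down counter built from T flip-flops.
    q is [Q0, Q1, Q2]; u=1 counts up, u=0 counts down; en is T0.
    Recursion: T0 = En,  Ti = T(i-1) AND (Q(i-1) if up else NOT Q(i-1))."""
    t = [en, 0, 0]
    for i in (1, 2):
        t[i] = t[i - 1] & (q[i - 1] if u else q[i - 1] ^ 1)
    # A T flip-flop toggles exactly when its T input is asserted.
    return [q[i] ^ t[i] for i in range(3)]

# Verify the whole state transition table in both directions.
for v in range(8):
    q = [(v >> i) & 1 for i in range(3)]
    assert sum(b << i for i, b in enumerate(updown_step(q, 1))) == (v + 1) % 8
    assert sum(b << i for i, b in enumerate(updown_step(q, 0))) == (v - 1) % 8
    assert updown_step(q, 1, en=0) == q     # En clear: state maintained
```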
Lastly, it must be said that JK flip-flops could have been used with J and K
inputs connected together (at least when counting). Toggle flip-flops simply make
the design of counters easier and more natural.
Counters are extremely important to the construction of computers and related
systems. They represent a means of generating states which is relatively easy to
understand intuitively. They have very many uses outside the world of
computers as well.
To summarize, the control inputs to the up/down counter are…
• U
• En
• Clr
Active register
The final register considered here (Figure 6.25) forms a very powerful system at
the cost of a fairly high degree of complexity. It combines an up/down counter with
a data register to yield a register which could relieve much of the processing
burden from an arithmetic unit. Use of several of them would allow exploitation
of word parallelism by allowing several words to be processed simultaneously.
It is possible to extend the register to include shift and rotate operations as
well through the addition of a MUX for each bit. This might be of assistance in
implementing a multiply instruction.
The control inputs for the active register are summarized below (only one
control input may be asserted upon any single clock tick!)…
• Increment • Write
• Decrement • Zero
• Read • Complement
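A minimal Python sketch of such an active register follows, using the six control inputs listed above. Here "read" drives the register's word onto the bus and "write" loads it from the bus, following the usage earlier in the section; the class shape itself is an assumption for illustration.

```python
class ActiveRegister:
    """Data register merged with an up/down counter.  At most one control
    input may be asserted per clock tick."""

    def __init__(self, width=8):
        self.width = width
        self.q = 0

    def tick(self, op, bus=None):
        """Apply one asserted control input for one clock tick."""
        mask = (1 << self.width) - 1
        if op == "read":
            return self.q                    # register read onto the bus
        if op == "write":
            self.q = bus & mask              # register written from the bus
        elif op == "increment":
            self.q = (self.q + 1) & mask
        elif op == "decrement":
            self.q = (self.q - 1) & mask
        elif op == "zero":
            self.q = 0
        elif op == "complement":
            self.q = self.q ^ mask           # ones' complement of the word
        return None
```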
Exercises
Question one
Derive a combinational logic design for the two truth functions specified in
Table 6.23 using each of…
Question two
i Prove using truth tables, and then Boolean algebra, that two half-adders and an
OR gate may be used to implement a full-adder.
ii Using Boolean algebra, derive the combinational logic for a 4-bit adder with
look-ahead carry generation.
iii Show how two 4-bit adders with look-ahead carry may be combined to
yield an eight bit adder with full look-ahead carry (i.e. no ripple through between
adders).
Question three
A linear memory map is a one dimensional array of data registers whose index
is known as an address. The address appears as a binary word on a channel
known as an address bus. The data, read or written, appears on a channel known
as a data bus.
Using a memory map of only eight words as an example, show how a binary
decoder may be used to derive the enable input of each data register from the
address.
Question four
i Binary coded decimal (BCD) representation often proves useful because of the
human preference for decimal representation. Show, using examples, how you
think such a code is implemented.
ii Design a one (decimal) digit BCD up-counter.
Question five
Design a 4-input priority encoder. Output consists of two binary digits plus a
valid output signal.
Chapter 7
Control units
7.1
Function of the control unit
7.1.1
Processor organization
Internal communication
In Chapter 3 the organization of a processor was presented from the programmer
or compiler perspective. This consisted of a number of special registers, which
were reserved for a specific purpose, and a file of general purpose registers, used
for the manipulation of data. Now we are to be concerned with processor
organization from the machine perspective.
There are a number of ways in which a processor may be organized for
efficient computation. The subject will be treated in detail later in the book. In
order to discuss the function of the processor control unit it is necessary to
consider some fundamental ideas of the subject.
A processor may be pictured as a collection of registers, all of which
communicate together via an internal data bus1 and share a common word
width. Each register is connected to the bus for…
• Read
• Write
…operations.
Connecting registers onto the bus is very much like plumbing. The read and
write register control inputs act to turn on or off valves as shown in Figure 7.1.
The control inputs of all processor registers collectively form part of the
internal control bus of the processor. It is this which allows the implementation
of a protocol for the communication between registers. To transmit a value from
one register to another the read input of the sender and the write input of the
receiver must be asserted together. Two modes of communication are possible…
• One-to-one (Point-to-point)
• One-to-many (Broadcast)
The control bus is usually omitted from diagrams since it adds clutter and
conveys little information. One must simply remember that all registers have
control inputs.
The problem for the graphic artist is unfortunately shared by the design
engineer. Current technology is essentially restricted to two dimensions. A third
dimension is necessary in order to connect registers to both data and control
bus2. Use of a third dimension is currently expensive and hence must somehow
be minimized. One way to picture a processor is as three planes (Figure 7.3). The
middle one contains the registers, the others the control and data buses.
Connections are made, from buses to registers, from the appropriate outer plane
to the inner one. It is the function of the control unit to cause the correct internal
control bus signals to be asserted through each clock cycle.
The fact that executing any given machine instruction requires a sequence of
events is almost wholly the result of all communication internal to the processor
sharing a single communication channel in the form of the internal data bus. Full
connectivity, where every register is connected to every other, would allow
execution of any instruction within a single clock cycle.
Logic units
In addition to registers the processor also contains combinational logic
components such as the arithmetic logic unit (ALU). These systems possess
asynchronous inputs and outputs whose values must be communicable to any
register via the internal data bus.
1 If the technology is electrical, each data bus bit-channel may simply be a conductor.
2 If the system is electrical short circuits would otherwise result.
External communication
In the strict sense, any system which may be characterized by a process may be
referred to as a processor. Within a computer the term is usually taken to mean…
• Program
• Data
Unfortunately, there are serious practical problems with this approach. Firstly,
the width of the control bus grows in direct proportion to the size of memory and
quickly becomes unmanageable. Secondly, it is by no means obvious how data
may be bound to their location or how instructions might be conveniently
sequenced. Contemporary computers separate memory and (programmable)
processor. A single memory is used for both program and data.
It is advantageous to consider the activities of both processor and memory
each as a process when considering processor to memory communication
(Figure 7.5). The bandwidth of a channel is a measure of the frequency of
communication transactions it can support. The bandwidth of the processor to
memory communication channel largely limits performance in any computer.
The processor to memory channel is typically made up of data, address and
control buses. The protocol includes synchronization (with respect to a clock)
and a means of establishing the direction of communication3. The combinational…
7.1.2
Machine language interpretation
Microcode
We have established above that the control unit is responsible for generating the
correct sequencing of the control signals which make up the control bus.
In a programmable processor, the correct sequencing is determined by the
program. The program is itself a sequence of instructions. The control unit must
cause the correct sequence of control words to be output onto the control bus to
execute each instruction. Because they are sequenced just like instructions, they
may be thought of as micro-instructions and any sequence of them as microcode.
The sequence of events they bring about may be termed a microprocess. Each
event is termed a micro-operation.
A micro-instruction, at its simplest, is just a control word placed on the internal
control bus which causes a micro-operation to be performed. The micro-
instruction format is then just the arrangement of control signals within the
control word. For example, an architecture which has just…
• Two data registers with read/write control inputs r1, w1, r0, w0
• Arithmetic unit with control inputs af1, af0…
• …and register control inputs aw1, aw0, ar
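The example the text begins here can be sketched in Python. The nine control signals named in the bullets are taken from the text, but the bit position assigned to each within the control word is an arbitrary assumption for illustration.

```python
# One bit per control signal; the bit positions are an assumed layout.
SIGNALS = ["r1", "w1", "r0", "w0", "af1", "af0", "aw1", "aw0", "ar"]
BIT = {name: i for i, name in enumerate(SIGNALS)}

def micro_instruction(*asserted):
    """Assemble a control word with just the named signals asserted;
    all other signals are left at 0 (not asserted)."""
    word = 0
    for name in asserted:
        word |= 1 << BIT[name]
    return word

# e.g. a register transfer r0 <- r1: assert read on r1 and write on r0
# together upon the same clock tick.
move = micro_instruction("r1", "w0")
```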
Instruction format
An instruction format comprises a number of distinct fields, each containing a
binary code as follows…
• Opcode
• Addressing modes
The addressing mode field may be further broken down. The most significant
two bits being zero indicates register mode, and the remaining bits binary encode
which register.
State generation
We have now established that the control unit must decode instructions and
generate a sequence of micro-instructions, or control words, for each one in
order to bring about the appropriate microprocess. We now turn to how the
necessary state generation may be achieved.
Each micro-instruction corresponds to a state of the internal control bus. Some
of these states may be repeated as the microprocess runs, so fewer states than
events may suffice. Nor is it necessary to generate the control word directly from
the flip-flop states: in each case the majority of its signals will not be asserted (i.e.
are 0), so to do so would be inefficient. The designer need only
enumerate the distinct states required and design a state machine to cause these
to occur in the sequence required. Control words are then decoded from each
state. Other approaches are possible as we shall see in the next section.
State decoding
Each control bus state must be decoded from the flip-flop states of the state
machine. This is usually rendered fairly easy, despite the control word often being
very wide, because few states are required per instruction and most
signals in the control word are typically not asserted. The general form of a
control unit is thus that shown in Figure 7.9.
7.1.3
Fetch/execute process
Microprocesses
The operation of any computer may be described as the execution of the
following procedure…
The loop forms the fetch/execute process, iteratively (or recursively) defined in
terms of the loop body which is traditionally known as the fetch/execute cycle,
which may be expanded into four microprocesses…
Fetch instruction
Any microprocess may be reduced to a number of events, expressed using a
register transfer language (RTL). We shall continue to use the RTL introduced in
Chapter 3. Instruction fetch reduces to…
(1) EAR ← PC
(2) DIR ← Mem, BCU(read)
This assumes a processor organization with the following functional unit and
registers, of which only the PC would be of concern to the machine language
programmer,…
The first micro-operation copies the contents of the PC into the EAR.
Subsequently, upon a read command, the BCU copies the contents of the
memory location, whose address is in the EAR, into the DIR. It is assumed here
that both BCU and memory operate fast enough that the instruction at the
address in the EAR arrives in the DIR on the next clock tick so that the final
operation can then go ahead and copy it into the IR, ready for execution. At the
same time, the PC is incremented to point to the next instruction to be fetched.
Note the parallelism by division of function made possible by the fact that the
PC can increment itself without the need to be transferred to an arithmetic logic
unit (ALU). Only three clock-cycles are therefore needed.
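The three-cycle fetch can be simulated directly in Python, one dictionary update per clock tick (registers modelled as a dict, memory as a list; destination on the left, as in the RTL). The overlap of the PC increment with the final transfer follows the prose above.

```python
def fetch(regs, mem):
    """Instruction fetch in three clock cycles, per the RTL in the text."""
    regs["EAR"] = regs["PC"]            # (1) EAR <- PC
    regs["DIR"] = mem[regs["EAR"]]      # (2) DIR <- Mem, BCU(read)
    regs["IR"] = regs["DIR"]            # (3) IR <- DIR, and in parallel
    regs["PC"] = regs["PC"] + 1         #     the PC increments itself

# A tiny program memory: the instruction at address 1 is fetched.
mem = [10, 20, 30]
regs = {"PC": 1, "EAR": 0, "DIR": 0, "IR": 0}
fetch(regs, mem)
```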
Fetch operand
Operands may be classified according to their location, i.e. they belong to one or
other storage class. Typical storage classes are…
• Program memory
• Register
• Workspace
(1) EAR ← PC
(2) DIR ← Mem, BCU(read), PC(increment)
(1) EAR ← PC
(2) DIR ← Mem, BCU(read), PC(increment)
(3) EAR ← DIR
(4) DIR ← Mem, BCU(read)
The first two steps load the address, which is assumed to be an instruction
extension, following the basic instruction in program memory. Indirect
addressing requires repeating steps 3 and 4 once for each level of indirection.
After indirection, DIR is in exactly the same state as for direct addressing.
Register addressing is very efficient because operand fetch cycles are not
required. The full specification of the operand location may be encoded within
the addressing mode field of the basic instruction itself. No (slow) transaction
with external memory need take place. The operand may be copied directly to
the destination required by the opcode.
Execute instruction
There will be one unique execute microprocess for each opcode. The extent to
which this is separate from operand fetch microprocesses depends very much on
implementation4.
A very simple example of an execute microprocess is that implementing the
register move instruction of Figure 7.8. Only a single micro-operation (register
transfer) is required.
(1) r1 ← r0
• Sequence of micro-operations
Only the control signals asserted upon inspection of the condition, or the
subsequent sequence itself, may differ.
Interrupt
It is usually necessary for the processor to be capable of switching between a
number of distinct processes upon receipt of appropriate signals. Each process
generates its own signal demanding attention. These signals are known as
interrupt requests since they each ask the processor to interrupt the process
currently running.
The processor itself may be considered as a process receiving interrupt
requests and whose state changes according to the current running process.
Switching between processes is termed alternation. A useful analogy is that of a
person playing several chess games concurrently. When ready, an opponent
signals the “multi-player” who must then switch to that game until some other
opponent transmits a similar signal. The “multi-player” must however always
finish a move before responding to a new signal.
Processors often manage using a single interrupt request signal allowing only
two processes and hence just two programmed procedures. The procedure which
is switched in, following receipt of an interrupt request, is called an interrupt
routine and must have a start address known to the processor. It is usually either
a fixed value or contained in an internal special register.
The arrival of an interrupt request is recorded as a bit in the processor status
register which is interrogated after the execute microprocess, at the start of the
interrupt microprocess. If set then the following (fixed) microprocess is
executed…
(1) EAR ← SP
(2) DOR ← PC, SP(increment)
(3) Mem ← DOR, BCU(write)
(4) PC ← INT
4 In some architectures, any addressing mode may be used with (almost) any operation.
The contents of the PC are first saved at a known location, usually the top of a
stack located by a special register SP. Subsequently the contents of another
special register (INT), containing the address of the interrupt routine, are copied
into the PC. Program execution proceeds from there.
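The fixed interrupt microprocess can be sketched in Python in the same style as the fetch simulation (registers as a dict, memory as a list, destination on the left):

```python
def interrupt_entry(regs, mem):
    """Push the PC on the stack located by SP, then vector to the
    interrupt routine whose address is held in INT."""
    regs["EAR"] = regs["SP"]            # (1) EAR <- SP
    regs["DOR"] = regs["PC"]            # (2) DOR <- PC, and in parallel
    regs["SP"] = regs["SP"] + 1         #     SP(increment)
    mem[regs["EAR"]] = regs["DOR"]      # (3) Mem <- DOR, BCU(write)
    regs["PC"] = regs["INT"]            # (4) PC <- INT

mem = [0] * 8
regs = {"SP": 4, "PC": 100, "INT": 7, "EAR": 0, "DOR": 0}
interrupt_entry(regs, mem)
# The old PC is now saved at the old top of stack, and execution would
# proceed from the interrupt routine address.
```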
7.2
Implementation
7.2.1
Introduction
There are three functional components to a control unit…
• Opcode decoding
• State generation
• State decoding
• Minimum state
• Shift register
• Counter
Each method tends to dictate how opcode decoding (to trigger state generation)
and state decoding (to derive control word) are carried out.
Timing requires careful consideration. The control word must be ready before
each operation is to take place. Each processor component must have the correct
values at its control inputs before a clock tick causes it to execute an operation.
This is easiest to implement if we distinguish two clocks (“tick” and “tock”)
which interleave. We need use just a single oscillating bistable system as before
to generate instead a two-phase clock where alternate state changes form each
phase (Figure 7.10). The two phases might be termed…
• Command phase
• Operation phase
7.2.2
Minimum state method
Sequencing
The first means of state generation to be considered here is the minimum state
method. This is the standard method, discussed in the previous chapter for the
design of state machines. It may be summarized as follows…
Where a fixed sequence of states is required, usually only a single input signal is
needed which triggers execution. Once the first state is attained, the
microprocess continues until termination. The exception to this occurs when the
design calls for an instruction to be interruptible. A second (interrupt request)
input will then be required which must cause execution to be aborted and an
interrupt microprocess to start.
Here is an example of the design of a control unit to implement the single
instruction…
add r0, r1
Table 7.1: State assignment table for add instruction control unit
Label State Micro-operation
a 00 Inactive
b 01 au.i0 ← r0
c 10 au.i1 ← r1, alu(add)
d 11 r0 ← au.0
Next the state transition table must be deduced and the output function
specified. Table 7.2 assumes the use of a Moore machine and the micro-
instruction format given in the previous section. The 1-bit input signals the start
of instruction execution. Subsequent input values are ignored until completion. If
the unit is not executing, its output is forced inactive.
It now remains to deduce how to generate the required output values and
excite each flip-flop. For this we require a truth table (Table 7.3).
Design of the necessary combinational systems may proceed as described in
Chapter 6. Note that C8, C3, C1 are never asserted and hence require no design
effort.
Table 7.2: State transition table for add instruction control unit
Input Present state Next state Micro-instruction
0 a a 000₈
1 a b 000₈
X b c 021₈
X c d 244₈
X d a 102₈
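The Moore machine of Table 7.2 can be simulated in Python: the micro-instruction output depends only on the present state, the input starts execution from the idle state and is ignored until completion.

```python
# Moore machine for the add instruction: output is a function of the
# present state alone (micro-instructions given in octal, per Table 7.2).
OUTPUT = {"a": 0o000, "b": 0o021, "c": 0o244, "d": 0o102}
NEXT = {"b": "c", "c": "d", "d": "a"}   # fixed sequence once started

def step(state, start):
    """One clock tick: return (next state, micro-instruction output).
    In the inactive state 'a' the input triggers execution; elsewhere
    it is a don't-care (X in the table)."""
    if state == "a":
        nxt = "b" if start else "a"
    else:
        nxt = NEXT[state]
    return nxt, OUTPUT[state]
```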
Conditional sequencing
Conditional sequencing of states will be required to implement the conditional
branch instructions required to implement selection and iteration constructs of a
procedural programming language. It is achieved through the use of an
additional input to the state machine which possesses the value of the condition.
Condition states are stored within the processor state register (PSR), each as a 1-
bit field known as a flag.
Selection of micro-instruction sequencing is also required according to opcode.
Hence inputs to the state machine may be of two kinds…
• Processor state
• Opcode
If the opcode is used directly, any slight change in its definition will require
considerable redesign effort. The use of a decoder to generate state machine inputs
from the opcode is thus very advantageous. Remember also that adding a single
extra input bit doubles the size of the state transition and truth tables!
7.2.3
Shift register method
Sequencing
An alternative to the above is to adopt the following approach…
For a reason made clear below, this is commonly called the shift register, delay
element or one flip-flop per state approach. The timing signals are easily
achieved by feeding a 1 through a shift register, leading to the structure shown in
Figure 7.11.
Figure 7.11: The basic structure of a shift register control unit for a single instruction
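Feeding a 1 through a shift register is easily simulated (an illustrative Python sketch): the i-th timing signal is asserted on the i-th tick after the start pulse, one flip-flop per state.

```python
def timing_chain(length, ticks):
    """Feed a single 1 through a chain of D flip-flops and record the
    chain's state after each clock tick.  Exactly one timing signal is
    asserted per tick until the 1 falls off the end."""
    chain = [0] * length
    history = []
    start = 1                         # start pulse on the first tick only
    for _ in range(ticks):
        chain = [start] + chain[:-1]  # each flip-flop copies its neighbour
        start = 0
        history.append(list(chain))
    return history
```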
Conditional sequencing
As mentioned previously, there are two kinds of selection…
The second form of selection upon value of an input, where the choice is
between two distinct sequences, requires switching between two chains of flip-
flops. A simple instance of this is shown in Figure 7.13. Here the assertion of
control signal n occurs at one of two times depending on the value of condition m.
Obviously completely independent microprocesses might be implemented for
each value of m. Each solution would be known as a timing chain. This is the
kind of selection used to implement conditional branching. The input selecting a
chain would be a flag from the processor state register. Rather than implement a
separate chain for each flag, a MUX would typically be employed to provide a
single selection input to the control unit. The selection inputs of the MUX itself
might be provided by an appropriate field of the IR.
Integration
The design of a shift register control unit to implement a given processor
operation is easy compared to that of a minimum state one. The reason is that a
control flow analysis may be performed to produce a flowchart which has a one-
to-one correspondence with the hardware. Figure 7.14 shows a set of symbols
which may be employed in the flowchart together with their corresponding
hardware implementation. The design is thus intuitive, a fact which greatly
eases development.
The timing signals are developed from timing chains of flip-flops. Some of them
will be conditionally asserted according to the state of a PSR flag. Operation
signals are the result of decoding the current opcode. Only one will be asserted at
a time, corresponding to the opcode. The combinational logic for driving each
control word signal has the general structure shown in Figure 7.15.
• Easy to design: Use of standard forms to establish all necessary timing chains
• Easy to develop: Intuitive interpretation speeds development and debugging
Figure 7.14: Flowchart symbols for shift register control unit design
• Inefficient to implement: One flip-flop per event per chain is expensive
7.2.4
Counter method
Sequencing
We have so far seen how either a specially designed state machine or a shift
register may be used for generating state within a control unit. It should not be
surprising that another standard component, the counter, may also be employed.
Each time the counter is incremented a new distinct state is arrived at. Two
advantages are immediately apparent. Firstly, counters are standard components,
so there is no need to design a special one for each instruction or each processor.
Secondly very much fewer flip-flops are needed for a given number of states
(log2 n for counter, n for shift register).
There is a further advantage. Control words may be simply stored as values in
memory since the state generated by the counter may be treated as an address.
They are thus rendered easily understood as micro-instructions, sequenced
exactly as are instructions.
ROM may be used to conduct the decoding from state to control word.
Alternatively, the development of such a control unit may be rendered very
convenient through the use of a writable microcode memory instead.
Figure 7.15: Combinational logic structure driving each control word signal
Integration
Microprocedures for the interpretation of all instructions may be held within the
microcode ROM. Execution of the correct one is ensured by loading the counter
with the correct start address as the first event in the execute microprocess.
• Control word: The only field output from the control unit, composed of sub-fields
of control signals (decoded where necessary) controlling the operation of
each register and combinational logic unit
• Jump/Enter: Single bit field which provides the control input for switching the
CAR input (AMUX) between a new start address (from microcode mapping
ROM) or jump destination (see below)
5 Note that such encoding is not necessary with the minimum state or shift register
approaches since neither calls for the storage of microcode.
Figure 7.17: Relationship between microcode ROM and microcode mapping ROM
• Jump destination: Field containing a new value for the CAR and used to
implement either constructs or microprocedure invocation/return
• Flag select: Field used as control input to the CMUX to select a condition
signal input or force the CAR to unconditionally increment or load new start
address
Selection and iteration may be achieved via the jump micro-instruction. The
Jump/Enter bit must be set in the micro-instruction executed immediately prior
to the first of the microprocedure. A valid jump destination field must also be
supplied. The last requirement is that the flag select field must be set
appropriately to select 1 in order to force the CAR to load a new value from the
AMUX.
Termination of each execute microprocedure is effected by invoking the fetch
microprocedure, which will usually be stored at the base of the MROM, i.e. at
address zero. In turn, zero must then be stored at address zero in the MMROM.
Ensuring zero in the opcode field of the IR when the machine is first switched on
will cause the first instruction to be fetched6.
Correct termination therefore requires all execute microprocedures to end with
the micro-instruction…
jump #0000
Microprocedures may be invoked and returned from using the jump mechanism.
They may be shared between execute microprocedures and thus reduce the total
microcode required in implementing an entire instruction set.
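The sequencing just described can be sketched as a toy microprogram sequencer in Python. This collapses the AMUX/CMUX and flag-select machinery into a single jump field, and the data-structure shapes are assumptions for illustration: each micro-instruction is a (control word, jump target) pair, with target None meaning "increment the CAR".

```python
def run_microcode(mrom, mmrom, opcode, steps):
    """Toy microprogram sequencer.  mmrom (the mapping ROM) maps an
    opcode to the start address of its microprocedure in mrom."""
    car = mmrom[opcode]              # load CAR with the start address
    trace = []
    for _ in range(steps):
        control_word, jump = mrom[car]
        trace.append(control_word)   # control word placed on control bus
        # jump field taken, else the CAR simply increments
        car = jump if jump is not None else car + 1
    return trace

# Fetch microprocedure at address 0; an 'add' microprocedure at address 2
# terminates with jump #0000, re-entering fetch.
mrom = [("F0", None), ("F1", None), ("ADD0", None), ("ADD1", 0)]
mmrom = {0: 0, 1: 2}
```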
• Easy to design: Standard components (counter, ROM, MUX) are used, all of
which have a simple, regular structure
Exercises
Question one
Suggest a control word format for a processor with the following components…
The active registers are as described in Chapter 6 except that they have two extra
control inputs, one for shift left and one for shift right. The carry flag in the PSR
is arranged to appear at the end of either register so that, after a shift, it will
contain the value of the previous LSB, if a right shift, or MSB, if a left shift.
Question two
Design a minimum state control unit (from a Moore state machine) for the
processor described in question one, which negates the contents of a single active
register (assuming twos complement representation).
Question three
i Pseudocode for a shift and add algorithm for multiplication of unsigned
integers, which requires no dedicated hardware, is as follows…
6 Typically, all registers are automatically cleared to zero when a processor is switched
on. This offers a simple way of obtaining predictable behaviour on power-up. Hence the
system designer must ensure the power-up boot procedure begins at address zero in main
memory since the program counter will be cleared also.
…where denotes the shift left operator and the shift right operator7. Carry
being set after a shift right is equivalent to the multiplier ending in 2⁻¹ which is
taken into account by adding half the new multiplicand (simply the old unshifted
version) to the result register. Satisfy yourself, using examples, that you
understand how the algorithm works.
Suggest an additional step which precedes the algorithm to improve its
performance over a large random set of pairs of operands. How might it be
further extended to deal with signed operands?
ii Assuming the processor architecture described in question one, give the
microprocedure, expressed in RTL, which implements a multiplication
instruction employing the algorithm shown above (without extensions) given
that the result, multiplier and multiplicand already occupy the active registers.
(Remember to terminate it correctly.)
Question four
i Establish a horizontal micro-instruction format for the microprogrammable
control unit shown in Figure 7.18 which includes the control word format given
as your answer to question one. (Make a reasonable assumption for the width of
the MROM address and that of the PSR.)
ii Give the complete microcode, in binary, octal or hex, to implement the shift
and add multiplication microprocedure described in answer to question three.
Take care to exploit all possible word-level parallelism.
Question five
Design a shift register control unit which implements the algorithm for unsigned
integer multiplication given in question three. Make the maximum possible use of
parallelism between micro-operations.
7The algorithm relies on the fact that a left shift is equivalent to multiplication by two,
and a right shift division by two.
Part III
Computer organization
Chapter 8
Processor organization
8.1
Requirements
8.1.1
General requirements
There are no hard and fast rules applying to processor organization. In this
chapter, four designs are considered as distinct. Commercially available designs
are either hybrids, employing a mixture of the ideas presented here, or a
combination of more than one philosophy allowing the system designer freedom
of choice over which to exploit. This is illustrated in Chapter 10 with a
discussion of three commercially available designs.
Two requirements dominate long term design aims…
Efficient execution requires the minimization of the time taken to translate
each element of a software source module into micro-operations. Virtually all
software is now written in a high level language. It is important to understand
that the architecture must be optimally designed for the compiler code generator.
The fact that an alternative architecture may permit more rapid execution of a
process, given machine level (assembly language) programming, is completely
immaterial.
Efficient development requires architectural support for software partitions
such as procedures and modules. Software partitions should correspond to
problem partitions in the most direct fashion possible. Separate development of
each partition requires separate compilation which in turn requires support for
code and data referencing across partition boundaries. With the procedural
programming model this implies the ability for a procedure in one module to
reference variables and procedures within another, ideally without the need to
recompile or relink more than just the one (referencing) module.
8.1. REQUIREMENTS 201
Instructions
The purpose of a processor is to translate machine language instructions into
microoperations. The speed of execution depends on the average time taken to
translate each instruction. The instruction throughput1 is simply the mean number
of instructions executed per unit time.
A number of methods are known which improve instruction throughput, and they
are often the first thing of which a designer or manufacturer will boast.
Prominent among them currently are the instruction cache and pipelining, both of
which are discussed below. Caution is however required, since some instruction
sets achieve more per instruction than others at the problem level!
Data
Like the Turing Machine, a contemporary computer employs separate processor
and memory. The two are connected by some kind of communication channel.
Unlike the Turing Machine processor, which can only reference a single location
and move to its successor or predecessor, a contemporary processor has random
access to its memory2. It may reference any element in memory, whenever it
chooses.
The limit on the instruction throughput of a processor is largely determined by
its data throughput which is the number of memory references per unit time that
the processor to memory communication channel is capable of supporting. This
is referred to as its bandwidth3. The channel is typically a bus, thus an important
factor will be the bus width, equal to the number of bits transferred per bus cycle.
Most contemporary computers employ a shared memory for both instructions
and data. This philosophy is credited to J. von Neumann4 [von Neumann 46].
Thus the bandwidth of the processor to memory channel is important because…
8.1.3
Real-time systems
In addition to these universal requirements there is the applicability of the design.
It is now becoming cost effective to design processors which are optimized to a
specific application. Arguably the most important growth is in real-time systems
which are capable of synchronous communication with their environment. Such
systems require event-driven software. An event occurring external to the system
immediately causes appropriate processing to generate the required response. The
delay between the event and execution of the first relevant instruction is known
as the latency and must be minimized. Such a processor must also possess the
capability of alternation between distinct environment processes. As a result it
must itself be capable of scheduling multiple processes5.
Applications of real-time systems range from aircraft flight control to robotics.
The principles are also highly applicable to the human/computer interface
process of a computer work-station.
3 In its strict sense the term refers to the difference between the lowest and highest
frequency of sinusoidal analogue signal which may be transmitted over the channel.
4 …who is also credited with the very idea of the stored program. Previously, computers
had to be physically altered in order to change their program. Those machines which
employ separate memories for instructions and data are said to possess a Harvard
architecture.
5 See discussion in Chapter 1.
8.2
Accumulator machine
8.2.1
Programming constructs
The most basic construct, the sequence, is implemented using a counter called
the program counter (PC). It contains the address of the next instruction to be
fetched from memory and then executed. It is incremented after each instruction
fetch microoperation.
Useful computations require selection and iteration constructs both of which
may be implemented by means of conditional branch instructions. Branching6
occurs if a specified condition flag is set7 in the processor state register (PSR).
Processor state depends on the results of previously executed instructions. For
example, if the result of a preceding addition instruction was zero, a flag (usually
denoted by Z) will be set. The arithmetic logic unit (ALU) will automatically
have caused it to be so.
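As an illustration, an 8-bit addition might set the flags as in this Python sketch (the flag names beyond Z are typical but assumed):

```python
def alu_add(a, b, width=8):
    """Add two unsigned words and compute processor state flags."""
    mask = (1 << width) - 1
    total = (a & mask) + (b & mask)
    result = total & mask
    flags = {
        "Z": result == 0,                  # zero flag, interrogated by branches
        "C": total > mask,                 # carry out of the most significant bit
        "N": bool(result >> (width - 1)),  # sign bit of the result
    }
    return result, flags
```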
The organization of the accumulator machine is shown in Figure 8.3 and its
programmer’s architecture in Figure 8.4. Note that there is no obvious means to
reuse software partitions of any kind.
8.2.2
Data referencing
The accumulator is so called because it is said to accumulate the value of the
computed function. Its contents repeatedly form one argument of a function
computed by the ALU. The result of each ALU operation is communicated back
into the accumulator. For example, summation of an array of integers may be
performed as a sequence of binary addition operations each of which takes the
result of the last as one operand.
The X register gets its name from “eXtension” or “indeX” and may serve as
either. For example, multiplication of two integers will require a result register
twice as large as required for each operand and will use X as an extension of the
accumulator. X also renders multiple precision arithmetic more efficient by
reducing the number of memory references required by acting as a temporary
store. It would also be used as the array and loop index in the summation
example mentioned above. An instruction to decrement X would be employed in
the loop body. The loop would terminate with a conditional branch instruction
which interrogates the Z flag in the PSR.
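The summation loop just described might behave as follows in Python, with `acc` standing for the accumulator and `x` for the X register (a sketch; the exact instruction sequence is assumed):

```python
def sum_array(mem, base, n):
    """Sum n integers starting at mem[base], indexing with X."""
    acc = 0      # accumulator holds the running total
    x = n        # X register doubles as loop counter and array index
    while True:
        acc += mem[base + x - 1]  # indexed reference into the array
        x -= 1                    # decrement X; the ALU sets Z when x reaches 0
        if x == 0:                # conditional branch interrogating Z
            break
    return acc
```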
Accumulator machine instructions need take only a single operand. Typical
addressing modes are…
Immediate mode is used for referencing constants and absolute mode for scalar
variables. The others support the access of elements within sequence-associated
and pointer-associated structured variables.
8.2.3
Booting
By booting or bootstrapping we mean the correct initiation of the process we
wish to run. This first of all requires that all registers possess a suitable state. In
particular the PC must point to the first instruction of the boot code. Traditionally
this is first achieved by clearing all registers, which implies that the first
instruction to be executed must exist at the memory location whose address is
zero8. Similarly, judicious design can ensure that a satisfactory boot state is
achieved with all flags in the PSR cleared to zero.
8.2.4
Summary
The accumulator machine described here has many similarities with the von
Neumann machine [von Neumann 46]. It supports programming a complete set of…
8.3
Stack machine
8.3.1
Software partitions
The stack machine directly supports a stack data object (see Chapter 2) through
the inclusion of a stack pointer (SP) register to contain the address of the top-of-
stack (TOS). It may be automatically incremented after each push and
decremented after each pop instruction. In addition it should also be possible to
reference TOS without actually removing the top item, i.e. without affecting SP.
A subroutine is simply a section of code which may be executed repeatedly
when a program runs, avoiding the need for code repetition. When combined
with a mechanism for parameter passing it becomes a procedure. With a further
mechanism to return a value it implements a function9. Two special instructions
facilitate subroutine invocation and return…
(1) PC → (PC+length(bsr))
(2) EAR → SP
(3) DOR → PC, SP(increment)
(4) PC → (PC+<offset>), BCU(write)
Note that the semantic level of the RTL in use has now been raised by extending
it to allow the right-hand side to be specified as an arithmetic sum, whose
(1) EAR → SP
(2) DIR → Mem, BCU(read), SP(decrement)
(3) PC → DIR
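The two microprocedures can be modelled in Python roughly as follows (a stack growing upward, with SP pointing at the first free location, is an assumed convention):

```python
def bsr(cpu, offset):
    """Branch to subroutine: save the updated PC on the stack, then branch."""
    cpu["pc"] += 1                      # PC now points past the bsr itself
    cpu["mem"][cpu["sp"]] = cpu["pc"]   # write the return address at TOS
    cpu["sp"] += 1                      # SP incremented after the push
    cpu["pc"] += offset                 # transfer control

def rts(cpu):
    """Return from subroutine: pop the return address back into the PC."""
    cpu["sp"] -= 1                      # SP decremented on the pop
    cpu["pc"] = cpu["mem"][cpu["sp"]]
```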
8.3.2
Data referencing
It should be clear from the above that a stack provides direct support for
subroutines. In order to implement procedures and functions the correct
reference scoping for the following classes of data must be guaranteed…
• Global data
• Local data
• Parameters
• Function value
Note that all data are dynamic except global data. The stack frame is the
mechanism for referencing all dynamic data. The frame pointer (FP) register
contains the address of the first free location at the bottom of the frame, which
contains all local variables and whose size must be decided by the compiler by
inspecting their number and type. Directly beneath the stack frame is found the
old frame pointer value and return address followed by parameters and possibly
space reserved for a function value return. This structure is created by the
compiler generated code in implementing a procedure or function call and is
shown in Figure 8.5.
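A sketch in Python of the call and return sequences that build and dismantle such a frame (a list models the stack, growing upward; the helper names are assumed):

```python
def call(m, args, n_locals, returns_value=False):
    """Build the stack structure of Figure 8.5 for a procedure call."""
    if returns_value:
        m["stack"].append(None)           # space reserved for the function value
    for a in args:
        m["stack"].append(a)              # parameters
    m["stack"].append(m["pc"])            # return address
    m["stack"].append(m["fp"])            # old frame pointer
    m["fp"] = len(m["stack"])             # FP marks the base of the new frame
    m["stack"].extend([None] * n_locals)  # space for local variables

def ret(m, n_args):
    """Dismantle the frame; any function value is left at TOS."""
    del m["stack"][m["fp"]:]              # discard local variables
    m["fp"] = m["stack"].pop()            # restore the caller's frame pointer
    m["pc"] = m["stack"].pop()            # resume at the return address
    if n_args:                            # guard: del stack[-0:] would empty it
        del m["stack"][-n_args:]          # discard parameters
```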
To summarize the compiler generated code for a procedure or function call…
The function value returned (if any) will now be easily accessible at TOS. There
follows a summary of addressing modes available with a stack machine…
The registers available for register relative addressing are detailed in Table 8.1.
The static base register (SB) contains a pointer to an area of memory reserved
for global access by all procedures of a software module.
One-address instructions are quite adequate. The stack machine uses TOS like
the accumulator machine uses its accumulator. TOS always serves as one
operand and the result of a binary operator so it remains only to specify the other
operand.
All data referencing is therefore made relative to addresses stored in processor
registers. Variables are bound to offsets, not absolute addresses! This has the
very important consequence that code may run independent of position in
memory. No absolute addresses need appear in the code at all. It is thus known
as position independent code.
210 CHAPTER 8. PROCESSOR ORGANIZATION
8.3.3
Expression evaluation
Expressions may be evaluated with very few instructions purely using the
stack10. In a given assignment, the left-hand side is a variable which is referenced
via a special register and an offset to which it is bound. Which register (FP or SB)
depends on the scope of the variable as discussed above. The right hand side is
the expression which is computed at TOS. Each variable appearing in the
expression is pushed onto the stack as it is required. When the computation is
complete the value of the expression rests at TOS. The assignment is then
completed by a pop to the location of the variable.
Instructions take the form of operators which operate on the stack. Typically
they pop the top two items as operands and push a result back. This may then be
used as an operand to the next instruction and so on until the value of the whole
expression is to be found at TOS, e.g. to compute (9+3)×(7−4)…
push #7 7
push #4 7, 4
sub 3
push #9 9, 3
push #3 3, 9, 3
add 12, 3
mul 36
“#” denotes immediate addressing mode. The status of the stack is shown on the
right hand side.
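A minimal Python model of this evaluation (the instruction encoding is assumed) confirms the trace:

```python
def evaluate(program):
    """Run one-address stack code; binary operators pop two items, push one."""
    stack = []
    for instr in program:
        op = instr[0]
        if op == "push":
            stack.append(instr[1])           # '#' immediate operand
        else:
            b, a = stack.pop(), stack.pop()  # a is the earlier-pushed operand
            stack.append({"add": a + b, "sub": a - b, "mul": a * b}[op])
    return stack[-1]

code = [("push", 7), ("push", 4), ("sub",),
        ("push", 9), ("push", 3), ("add",), ("mul",)]
```

Calling `evaluate(code)` yields 36, the value of (9+3)×(7−4).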
A great deal of code is (or may be) written with only short expressions to be
evaluated. On some machines stacks have been implemented specifically and
solely for the purpose of expression evaluation, taking the form of a small
number of processor registers. By eliminating the need for memory references a
significant performance increase is obtained.
It is quite possible to survive without an index register to reference array
elements by using a variable to act as an indirection vector. The vector into the
array (base+ index) is maintained and use made of indirect mode to recover or
set the element value. This does have the unfortunate consequence however of
incurring two memory accesses per element reference, one to acquire the vector,
a second to acquire the operand.
8.3.4
Alternation
Alternation between environment processes is provided by the mechanism
known as the interrupt. A hardware channel, which supports a signal protocol, is
connected to the control unit. When a signal is received, the control unit completes
the current instruction, increments the PC, then saves it on the stack11. The
address of the interrupt routine is copied from the INT register into the PC. The
activity is very much like that of calling a subroutine. Return from interrupt
routine is achieved by a pop of the return address back to the PC.
In this way the processor is able to switch rapidly between two routines,
although the means of switching in each direction are not symmetric. The
interrupt allows rapid switching from a “main” routine into the interrupt routine
but subsequent interrupts are disabled upon entry and execution continues until
programmed return.
To communicate with multiple environment processes, polling must be used.
On entry into the interrupt routine, all such processes are interrogated (polled) until
the one which sent the interrupt signal is found.
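A sketch of interrupt entry, a polling interrupt routine and the return (the device records and register names are assumptions for illustration):

```python
def interrupt(cpu):
    """Control unit action on receipt of an interrupt signal."""
    cpu["stack"].append(cpu["pc"])  # save the incremented PC on the stack
    cpu["pc"] = cpu["int"]          # copy the INT register into the PC

def poll(devices):
    """Interrogate each environment process until the signaller is found."""
    for dev in devices:
        if dev["pending"]:
            dev["pending"] = False
            return dev["name"]
    return None

def return_from_interrupt(cpu):
    cpu["pc"] = cpu["stack"].pop()  # pop the return address back to the PC
```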
8.3.5
Summary
Figure 8.7 shows the organization of the stack machine, Figure 8.6 the
programmer’s architecture.
Stack machines permit software modularity by supporting procedures and
functions at machine level. However, each invocation requires considerable time
overhead, reducing performance for the sake of both reducing code size and
easing the development and maintenance of source. They are very efficient, with
respect to code length, at expression evaluation and require only one-address
instructions.
Stack machines require only a small instruction set composed solely of the
operators required together with push and pop. An additional “generate address”
instruction is required to support access within sequence-associated or pointer-
associated structured data. They are compact and simple in architecture and
machine code. The instructions themselves may be compactly encoded and
implemented simply because there are few of them. Their single severe
disadvantage is that every data reference is to main memory which is time and
power consuming.
A form of alternation is possible, via the mechanism of the interrupt, which
requires further support (e.g. software polling).
8.4.1
Exploitation of locality
Both stack and accumulator machines suffer from the same handicap of the need
to access memory for each variable reference. This is particularly severe when
evaluating an expression. The inclusion of a file of processor registers, directly
controlled by the control unit, may greatly reduce the problem by housing all
variables local to a procedure while it runs. Registers may be used to contain any
type of data for any purpose and hence are collectively referred to as a general
purpose register file.
The efficiency derived from a register file is a result of structured
programming. Research has shown that the use of structured programming
results in temporal locality, i.e. the processor tends to access the same group of
variables within a certain size time window. Spatial locality is the tendency to
reference variables closely situated within memory. The exploitation of locality
is discussed in Chapter 3.
When the procedure is entered all local variables are initialized within the
register file. If the register file is not large enough, only those most frequently
referenced are housed there. In order that the calling procedure does not have its
local variables corrupted, the new procedure must, as its very first act, save the
register file at a known location. Parameters may be passed and values returned
in the register file but great care must be taken. A procedure implementation
must pay regard to register usage. As a result conventions exist with some
compilers and architectures, for instance the use of r0 in the “C” language for a
function return value.
Register files are usually combined with a stack architecture. The stack now
provides two facilities…
…the memory references required on every procedure call that are due to the
architecture and not to the problem. We now move on to discuss the removal of
this inefficiency.
8.4.2
Software partitions
The advantages of a stack machine over an accumulator one are attractive but the
penalty can be reduced performance. Each procedure invocation requires
parameters, return address and frame pointer to be written to the stack. The two
latter items must later also be read on exit and return. All this implies memory
access over a bus which can only transmit one item at a time. There is a way to
avoid as many references and greatly speed up those remaining.
A large register file is provided within the processor. A register window is
notionally laid over the register file so that only those visible through the window
are accessible. The window now takes the place of the stack in implementing
local variables, parameters and return value as depicted in Figure 8.8.
Accessing data in processor registers is very much faster than accessing memory
locations because each one is accessed directly without the need for address
decoding and bus protocol. The problem with registers is one of interconnection.
Bussed memory requires only one extra signal connection each time the number
of locations doubles! Registers require another signal pair (at least) for each and
every one added. For this reason it is costly, particularly with a 2-d technology,
where overlapping connections are not allowed. A large area of silicon “real
estate” is given over to interconnect that could just as easily be used for a large
bussed memory or even another processor!
A compromise is to break the register file into equal sized windows which
overlap to allow parameter passing. Only one window is active at a time; its
address is held in the current window pointer (CWP), for which a register is
provided. For example, we might have eight windows of twenty-two registers.
Each window may overlap its successor or predecessor by six registers. Hence the
window advances across sixteen registers on each procedure call. In this example
the register file thus consists of…
…decoded will provide the select signal to choose which signals each register must
obey.
8.4.3
Data referencing
Global data, although undesirable since it allows side effects to occur, is still
perceived by many as essential for an architecture to support. A register window
machine deals with this in an elegant fashion. Figure 8.10 shows how it is
achieved. A partition at one extreme of the logical register file remains
permanently bound to the same set of physical registers and holds global
variables visible to all procedures. Given ten global registers, the total size of the
logical register file in our example will be thirty-two.
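Using the chapter's figures (eight windows of twenty-two registers, an overlap of six, ten globals), the logical-to-physical mapping might be sketched so (a linear, non-circular register file is assumed):

```python
WINDOWS, WINDOW_SIZE, OVERLAP, GLOBALS = 8, 22, 6, 10
ADVANCE = WINDOW_SIZE - OVERLAP   # the window advances 16 registers per call

def physical(cwp, logical):
    """Map a logical register number to a physical one."""
    if logical < GLOBALS:
        return logical            # globals: permanently bound registers
    return GLOBALS + cwp * ADVANCE + (logical - GLOBALS)
```

The last six windowed logical registers of window n map to the same physical registers as the first six of window n+1, which is how parameters are passed; the logical register file size is GLOBALS + WINDOW_SIZE = 32.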
There need only be a single principal addressing mode for a register window
machine! Absolute, register relative and indexed register relative access to main
memory may all be obtained from it…
• Absolute: reg is any register containing the value zero, arg2 is an immediate
operand specifying the required absolute address
• Register relative: reg contains the address of the operand, arg2 is either any
register containing the value zero or a zero immediate value
Figure 8.10: Logical register file supporting global, local and parameter variables
• PC relative: reg is the logical register number of the PC, arg2 is an immediate
operand whose value is the required offset
(Used solely for conditional branching)
Arguments may be used to derive either one or two operands depending on the
effect of the operation encoded within the opcode. The single operand effect will
be restricted to load/store instructions which transfer any data to be processed
within a procedure into local register variables. In this way memory references
are kept down to the absolute minimum necessary. Logical and arithmetic
operators will require two arguments.
8.4.4
Parallel instruction execution
In Chapter 7 the process performed by the processor control unit was described
in pseudocode by…
There is an assumption here that the two components of the loop body must be
executed sequentially when in fact such is not the case. The sub-processors
which perform each sub-process are in fact independent except for a single
asynchronous channel of communication…the instruction register. In machines
where the above is a true description of control unit behaviour, each sub-
processor is idle while the other does its job. Borrowing some notation from the
Occam 2 programming language [Inmos 88#1] we can describe the
transformation we require thus…
SEQ                          PAR
  fetch instruction      →     fetch instruction
  execute instruction          execute instruction
Things are not quite this simple however. The processor responsible for the fetch
process is the bus control unit (BCU) which is required to perform other tasks
which are subsidiary to the execute process. This leaves the possibility that
execution might be delayed until a fetch terminates. Also a fetch
typically takes less time than an execute which leads to valuable machinery lying
idle once again.
The answer is to have a queue of instructions called an instruction cache
which is topped up whenever the bus is not in use for instruction execution. A
fresh instruction is always present, waiting at the front of the queue, whenever a
new execute process is ready to start provided the queue is large enough. The
use of a queue data object in this fashion is called buffering and is vital for
efficiency in asynchronous communication. The idea of pipelining is shown in
Figure 8.11 and is precisely that employed on a production line such as is found
in a car factory. Parallel execution of fetch and execute is but a simple example.
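The benefit of the buffer can be seen in a toy simulation (the timings of one cycle per fetch and two per execute are assumptions for illustration):

```python
from collections import deque

def run(n_instructions, queue_size=4, exec_time=2):
    """Count the cycles taken to execute n instructions with a prefetch queue."""
    queue, fetched, executed = deque(), 0, 0
    cycles = busy = 0
    while executed < n_instructions:
        cycles += 1
        if busy == 0 and queue:          # execute unit idle: despatch the next
            queue.popleft()
            busy = exec_time
        if busy:
            busy -= 1
            if busy == 0:
                executed += 1
        if fetched < n_instructions and len(queue) < queue_size:
            queue.append(fetched)        # top up whenever the bus is free
            fetched += 1
    return cycles
```

With these timings, `run(4)` takes 9 cycles where a strictly sequential fetch-then-execute loop would take 12.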
One other problem remains. The result of the execution of a conditional
branch instruction may change the identity of the following instruction
depending on processor state. The contents of the entire instruction cache must
then be discarded and replaced. However, a good standard of structured
programming should make this comparatively rare. Of the order of only 1% of
instructions are conditional branches. Also the condition will fail on average only
50% of occasions, leaving the contents of the instruction cache still valid.
An interrupt will also invalidate the contents of the instruction cache. Again
interrupt events are comparatively rare.
Further pipeline parallelism may be obtained by breaking down execute
microprocesses. However, if this is pursued too far the overhead incurred in
buffer transactions and growth in control unit complexity can defeat performance
gained.
Parallel operation of different functional units is obviously desirable to avoid
machinery lying idle unnecessarily. If a number of instructions present in the
instruction cache each require to operate different functional units (e.g.
multiplier, ALU) on independent operands there is no reason why they should not.
Ensuring that the operands concerned are indeed independent is not always easy
however.
Multiple identical functional units become useful once an instruction cache is
included in the design. Typically this means multiple ALU architectures. A
significant increase in instruction throughput may be gained in many
applications at the cost of a reasonable increase in control unit complexity.
8.4.5
Summary
The organization of the instruction cache+register window machine is shown in
Figure 8.12. Note that only two special registers are required, PC and CWP. The
most significant performance advantage is that very much less external memory
access is required upon…
• Subroutine invocation
• Expression evaluation
This fact gives rise to an observed performance increase with very little added
complexity compared to an accumulator or stack machine.
The programmer’s architecture is shown in Figure 8.13 and is clearly very
simple both to understand and to generate code for14.
There is no technical reason for associating the presence of an instruction
cache with register windowing in processor design. They are quite independent…
14Remember that the compiler code generator author need not be concerned with how the
windowing operates and need only consider the logical register file.
…each instruction has its own processor. Much more promising is the data-led
approach where each variable has its own processor! Further comments would
be outside the scope of this text.
Another idea which suggests itself given this design is that of establishing
memory as purely for code. The idea of separate memory for data and code
predates that of a single memory shared by both. Machines with separate
memories for each are referred to as Harvard architectures. The compiler would
obviously have to attempt to ensure that sufficient procedure activation depth
exists. Recursion would be risky, but then again it is anyway given a stack
machine. The implementation technology would have to permit a sufficiently
large register file. This requires reducing the cost of interconnect implicit with a
2-d technology.
The origin of many of these ideas is (historically at least) associated with the
drive for the RISC15 architecture. A good survey of RISCs is to be found in
[Tabak 87].
8.5
Queue + channel machine
8.5.1
Process scheduling
Process networks
This machine is designed fully to support the software model of a process
network which communicates, internally and externally, by passing messages
over channels16. Processes may be hierarchically reduced until the structure of
each process is purely sequential and composed of the following primitives…
• Assignment
• Input
• Output
Scheduling algorithms
The support required for multiprocessing is some kind of scheduling algorithm.
The two simplest are…
The FIFO algorithm simply stores ready processes in a ready queue. A process
is served (or despatched) to the processor and then runs until it terminates. Round
Robin also employs a ready queue but each process is timed-out after one
timeslice if it has not already terminated or suspended. The magnitude of the
timeslice depends upon the overhead in context switching between processes and
may be either fixed or variable. If variable it might be made proportional to the
priority of the process, though that would not be easy to implement.
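Both algorithms can be sketched with a double-ended queue (the process representation as name/work pairs is an assumption):

```python
from collections import deque

def round_robin(processes, timeslice):
    """processes: (name, work) pairs; returns the order of termination."""
    ready = deque(processes)
    finished = []
    while ready:
        name, work = ready.popleft()     # despatch the front of the ready queue
        work -= timeslice                # run for at most one timeslice
        if work > 0:
            ready.append((name, work))   # timed out: rejoin at the back
        else:
            finished.append(name)        # terminated
    return finished
```

FIFO is then simply the degenerate case of an unbounded timeslice.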
Scheduling support
The minimum support necessary for process scheduling is simply a front pointer
and a back pointer for a ready queue. The process is represented in the queue
simply by its workspace pointer (WS). When suspended, the PC value is located…
8.5.2
Message passing
Channels
Channels which venture to processes running on external processors are referred
to as hard channels. They clearly require special hardware mechanisms. The
mechanism implementing the transmission and reception of data over a hard
channel may be built around a bidirectional shift register (see Chapter 6). The
transmission must be inhibited until the receiving processor is ready. A
subsidiary channel of signal protocol must be provided, alongside the data
channel, to effect this17. Input and output instructions are unaffected by whether
the partner process is running on the same or another processor.
Soft channels are those which serve to connect processes which share a
processor. They are implemented in a fashion which appears identical with hard
channels to the processor at the instruction level. Both employ rendezvous and
differ only in the rendezvous location and the mechanism by which data is
transferred.
Rendezvous
Rendezvous requires a rendezvous location agreed a priori by both processes
partaking in the transaction. This may simply be a memory location which is
initialized with a value denoting empty (Figure 8.14). The first process to arrive
at the rendezvous, on finding it empty, leaves behind its identifier (WS value)
and suspends itself by simply quitting the processor, causing the despatch of the
next ready process from the queue. When the second process arrives it completes
the transaction by direct access to the workspace of the first, which it
subsequently reschedules (enqueues). Note that it is quite irrelevant whether the
sender or receiver arrives first. The second process to arrive will either read or
write the workspace of the first, depending on which one it is.
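The rendezvous can be modelled in Python as below; the workspace is reduced to a dictionary and the scheduler to return values (all names are assumptions):

```python
EMPTY = None

def output(chan, sender_ws, value):
    """Sender arrives at the rendezvous location."""
    if chan["slot"] is EMPTY:
        chan["slot"] = ("sender", sender_ws, value)  # leave WS id and suspend
        return "suspend"
    _, receiver_ws, _ = chan["slot"]                 # receiver arrived first
    chan["slot"] = EMPTY
    receiver_ws["mailbox"] = value                   # write its workspace
    return ("reschedule", receiver_ws)               # enqueue the first arrival

def input_(chan, receiver_ws):
    """Receiver arrives at the rendezvous location."""
    if chan["slot"] is EMPTY:
        chan["slot"] = ("receiver", receiver_ws, None)
        return "suspend"
    _, sender_ws, value = chan["slot"]               # sender arrived first
    chan["slot"] = EMPTY
    receiver_ws["mailbox"] = value                   # read from its workspace
    return ("reschedule", sender_ws)
```

Note that, as in the text, it makes no difference which of the two processes arrives first.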
8.5.3
Alternation
Alternation is natural and simple to implement on the queue/channel machine.
Concurrent processes may be programmed, each to look after communication
with a single environment process. Any which are not currently communicating
will be suspended. Any which are will be scheduled to run or be running.
For very low latency in real-time systems, a high priority process may
preempt the one running in response to a signal on a special hard channel (which
supports signals only).
8.5.4
Summary
The design of a processor to support both local multiprocessing and connection
within a processor network need not be complicated. All that is needed to
support static nonprioritized processes is a pair of queue pointer registers and the…
Exercises
Question one
i As described in the text, the implementation of a function invocation by a
compiler on a pure stack machine implies the creation of a suitable stack
structure containing…
• Local variables
• Parameters
• Return value
• Value parameters
• Reference parameters
Question two
Many architectures build their stacks downward in memory. Use your solution to
question one to show in each case how each parameter and each variable would
be referenced using only the frame pointer.
Note: Use the notation “<offset>(fp)”. Also assume that any address is three
bytes long and that the least significant byte of anything is always found at the
lowest address.
Question three
Consider three processors…
For all three machines assume a basic instruction length of one byte and an
address length of four bytes. An immediate operand may be encoded within one,
two or four bytes as required using twos-complement integer representation.
Stack machine Push and pop alone take a single address locating the operand
to be moved. The arithmetic instructions all operate on the topmost two items on
the stack, leaving the result on top. Where ordering of operands is important, the
top item is equivalent to that on the left of a written expression. Assume all
variables are located as byte offsets from either frame pointer or static base
register. Hence both move instructions will be just two bytes long (one byte for
an instruction plus one byte for an offset).
Accumulator machine One operand is always the accumulator, corresponding
to that on the left of a written expression. The result is always placed in the
accumulator.
Register file machine The second address identifies both the righthand operand
and the destination of the result of an arithmetic operation. In the case of a move
instruction, the first and second address identify source and destination
respectively. Assume sixteen registers allowing up to two to be specified within
a single byte, additional to the basic instruction and an address or immediate
operand (if required). Note that there is no register windowing, so all variables
are bound to memory and not to registers.
i The execution time of an instruction is almost completely determined by the
number of bus cycles required. Why does it take much longer to perform a (read
or write) bus cycle than any other processor operation? Need it always be true?
ii Assuming a unit of time equal to that taken to perform a bus cycle,
enumerate the time taken to execute each instruction of each processor described
above.
iii Summarize the advantages and disadvantages of the stack machine in the
light of your answers to parts i and ii of this question. How might the worst
disadvantage be alleviated?
Question four
i For each of the processors described in question three compose the code which
a compiler might produce to implement the following assignment…
…given that the order of operations will obey the following rules…
Quantify the improvement in the execution time on the stack machine yielded by
arranging that the topmost three stack items are always to be found in special
dedicated processor registers.
Note: You may assume that a > 0 and that single precision arithmetic is
sufficient (i.e. forget about carry).
Question five
i Enumerate all bus operations required for a stack machine to invoke, enter, exit
and return from a function. Assume that three parameters are passed and that all
parameters and return value each require just a single bus operation to read or
write.
ii Compare your result with the operations required to achieve the same thing
on a register windowing machine. What are the limitations of register
windowing?
Question six
What is the difference between a procedure and a process and what (if any) are
the similarities? Include a brief comparison between procedure oriented and
process oriented software design.
Chapter 9
System organization
9.1
Internal communication
9.1.1
System bus
Bus devices
A top-level functional decomposition of a computer may be made yielding a
requirement for the following components…
• Processor
• Memory
• External communication (Input/Output or I/O)
Bus structure
The term system bus is used to collectively describe a number of separate
channels. In fact it may be divided into the following subsidiary bus channels…
• Address
• Data
• Control
Address bus and data bus are each made up of one physical channel for each bit
of their respective word lengths. The control bus is a collection of channels, usually
of signal protocol (i.e. single bit), which collectively provide system control. The
structure of the system bus is depicted in Figure 9.2.
Control signals may be broken down into groups implementing protocols for
the following communication…
• Arbitration
1 This approach is reductionist and may not be the only way to approach constructing a
computer. Artificial neural systems [Vemuri 88] [Arbib 87], inspired by models of brain
function, offer an example of a holistic approach.
2 We have met the bus before as a means of communication inside a processor (see
Chapter 7).
3 The purchaser of many a processor “upgrade” has been sadly disappointed to find only a
marginal increase in performance because of this.
Figure 9.2: Subdivision of system bus into address, data and control buses
• Synchronous transaction
• Asynchronous transaction (Events)
9.1.2
Bus arbitration
Arbitration protocol
While a bus transaction occurs, an arbiter decides which of the devices requesting use
of the bus will become master of the next transaction. The master controls the
bus during the whole of a bus cycle, deciding the direction of data transfer and the
address of the word which is to be read or written.
Arbitration must take into account the special demands of the processor
fetching code from memory. As a result most commercial processors combine
the tasks of arbitration with that of executing code by implementing both in a
single device called a central processing unit (CPU).
The arbitration protocol operates cyclically, concurrent with the bus cycle,
and is composed of two signals…
• Bus request
• Bus grant
One physical channel for each signal is connected to each potential master
device. A device which requires to become master asserts bus request and waits
for a signal on bus grant, upon receipt of which it deasserts bus request and
proceeds with a transaction at the start of the next bus cycle. Note that bus
request is a signal which is transmitted continuously whereas bus grant is
instantaneous. A useful analogy is the distinction between a red stop sign, which
Device prioritization
How does the arbiter decide which device is to be the next master if there are
multiple requests pending? As is often the case, the answer to the question lies
waiting in the problem as a whole. Here we find another question outstanding…
How do we ensure that the more urgent tasks are dealt with sooner? Both
questions are answered if we assign a priority to each device and provide a
mechanism whereby bus requests are granted accordingly. The simplest such
mechanism is that of the daisy chain shown in Figure 9.3.
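The daisy chain of Figure 9.3 can be modelled in a few lines. This is a sketch, not any particular arbiter: the grant enters at the device electrically closest to the arbiter, and each device passes it on unless it has a request pending.

```python
def daisy_chain_grant(requests):
    """Return the index of the device that captures the bus grant.

    `requests` lists pending-request flags in daisy-chain order,
    index 0 being the device closest to the arbiter (and therefore
    the highest priority).  Returns None if no device is requesting."""
    for position, requesting in enumerate(requests):
        if requesting:
            return position   # this device absorbs the grant
        # otherwise the grant is passed on down the chain
    return None

# The nearest requesting device wins, regardless of who else asks.
assert daisy_chain_grant([False, True, True]) == 1
assert daisy_chain_grant([False, False, False]) is None
```

Priority is thus fixed purely by position in the chain, which is why devices must be connected in order of urgency.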
9.1.3
Synchronous bus transactions
Bus cycle
A transaction is a single communication event between two bus devices. Each
transaction is synchronous, i.e. each participating device must complete a
transaction before proceeding. Asynchronous communication may also be
achieved as we shall see later in this section. On each transaction a single device
becomes master and communicates a message to or from just one slave. One
transaction occurs on each bus cycle. A single transaction may be subdivided into
two phases…
• Address
• Data transfer
• Address
• Read/Write
The protocol of the address channel is simply made up of timing and word
length. That of the read/write channel is, once again, its timing and just a single
bit which indicates the direction of data transfer.
Since bus transactions occur iteratively, the operation of the two phases is
together referred to as a bus cycle.
Both address and data must remain valid long enough to physically traverse their
channel and be successfully latched4.
Figure 9.4: Bus cycle showing address and data transfer phases
Synchronous transaction protocol for slow slaves
The time taken by a memory device to render valid data onto the data bus varies
according to the device concerned. Because of this, a transaction protocol
specifying a fixed interval between valid address and valid data would require
that interval to be appropriate to the slowest slave device ever likely to be
encountered. This would imply an unnecessarily slow system since, as pointed
out earlier, bus bandwidth limits overall system performance.
Wait states are states introduced in the bus cycle, between address valid and
data valid, by slow devices to gain the extra time they need (Figure 9.5). Any
number of wait states are permitted by most contemporary bus transaction
protocols. Note that an extra control signal (Rdy) is necessary and that such a
protocol implies a slight increase in processor complexity.
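The effect of wait states on cycle length can be illustrated with a simple timing model. The figures here are invented for the example; the point is only that the slave withholds Rdy, stretching the cycle, until its own access time has elapsed.

```python
def bus_cycle_length(base_states, slave_access_time, state_time):
    """Total states in one bus cycle, inserting wait states until the
    slave's access time has elapsed (illustrative figures only)."""
    elapsed = base_states * state_time
    waits = 0
    while elapsed < slave_access_time:
        waits += 1                 # slave holds Rdy false for one more state
        elapsed += state_time
    return base_states + waits

# A fast slave needs no wait states; a slow one stretches the cycle.
assert bus_cycle_length(4, 100, 50) == 4      # 200 ns cycle, 100 ns device
assert bus_cycle_length(4, 350, 50) == 7      # three wait states inserted
```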
Address/data multiplexing
The fact that the address bus and data bus are active at distinct times may be used
to reduce the cost of the physical system at the expense of a slight reduction in
system bus bandwidth.
Multiplexing is the technique of unifying two (or more) virtual channels in a
single physical one. It was introduced and discussed in Chapter 6 where it was
shown how to construct a multiplexer and demultiplexer. Time-multiplexing
4 The physical means of sending and receiving messages is discussed in Chapter 1 and
Chapter 5. The timing restrictions of accessing any physical memory device are discussed
in Chapter 5. In summary these are the setup time, hold time and propagation delay.
Figure 9.6: Event signal (interrupt request) and daisy chained event acknowledge
(interrupt grant)
9.1.4
Asynchronous bus transactions
System events
There is another form of communication which requires a physical channel. The
behaviour of the running processes will typically be conditionally dependent
upon events occurring within the system. Devices must communicate the
occurrence of an event to the processor as shown in Figure 9.6.
Note that events can occur within the processor as well6. For example, the
control unit should be capable of detecting an attempt to divide by zero. Such a
processor event is typically dealt with by exactly the same mechanism as for
system events. The set of system events and that of processor events are
collectively known as exceptions.
System events are associated with communication, both internal and external.
Completion or failure of asynchronous communication transactions must be
signalled to the processor. A signal that an event has occurred is called an
interrupt since it causes the processor to cease executing the “main” program7
and transfer to an interrupt service routine.
Event protocol
Before a system event can be processed it must first be identified. There are two
principal methods…
• Polling
• Vectoring
Event polling means testing each and every event source in some predetermined
order (see discussion of event prioritization below). Clearly this will occupy the
processor with a task which is not directly getting the system task done. Care
5 See Chapter 7.
must be taken to test the most active sources first to minimize the average time
taken to identify an event.
Given a pure signal, there exists no choice but polling to identify an event.
Commonplace in contemporary system architectures is a more sophisticated
protocol which includes transmission of the event identity by the source.
Whether or not a running process will be interrupted depends on the event
which caused the attempted interrupt. The interrupt signal is thus more properly
referred to as an interrupt request. In order to decide
whether interruption will indeed occur, the event protocol of the system must
include some form of arbitration. If no event is currently being serviced, the
request will be successful and an interrupt grant signal be returned.
Thus a complete picture of the required event protocol may now be presented.
There are three phases…
The symbol used to identify the event may be chosen so as to also serve as a
pointer into a table of pointers to the appropriate interrupt service routines. This
table is called the interrupt despatch table. Its location must be known to the
processor and hence a base pointer is to be found at either of the following…
The event protocol and its efficient means of vectoring a processor to the
required interrupt service routine are depicted in Figure 9.7.
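The vectoring mechanism can be sketched as a table indexed by the transmitted event identity. The vector numbers and routine names below are invented for the example; in a real machine the table holds the addresses of the interrupt service routines and its base pointer is known to the processor.

```python
# Hypothetical interrupt despatch table: vector number -> service routine.
despatch_table = {
    0: lambda: "spurious",
    1: lambda: "serviced disc",
    2: lambda: "serviced uart",
}

def vector_dispatch(vector):
    """Use the identity transmitted by the event source as an index
    into the despatch table and run the routine found there."""
    isr = despatch_table[vector]
    return isr()

assert vector_dispatch(2) == "serviced uart"
```

Identification thus costs a single table look-up rather than a polling loop over every source.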
Event arbitration
Event protocols must include some means of deciding which event to service
given more than one pending. There are three fundamental schemes…
• FIFO
• Round robin
• Prioritization
Figure 9.8: Event prioritization and control using an interrupt control unit (ICU)
Prioritized arbitration is the simplest to implement in hardware and is the one
depicted in Figure 9.6. Event acknowledge channels are arranged in a daisy
chain. Each device passes on any signal received that it does not require itself.
Devices, regarded as event sources, must simply be connected in the daisy chain
in such a way that higher priority processes are closer to the processor. Software
prioritization is also extremely simple. The order of polling is simply arranged
such that sources are inspected according to priority. Note that this may well
conflict with the efficiency requirement that sources be inspected in order of
event frequency. Daisy chain and prioritized polling require only signal event
protocol.
Prioritization of a vectored event protocol, as depicted in Figure 9.7, requires a
little more hardware but still uses standard components. A priority encoder is
used to encode the interrupt vector/event identity and thus ensures that the one
transmitted is the highest priority provided that event sources are connected
appropriately.
An interrupt control unit, Figure 9.8, is an integrated device which will
usually provide a prioritized vectored event protocol as well as FIFO and round
robin arbitration.
Figure 9.10: Direct memory access controller (DMAC) connected to CPU and system
bus
is ready to transfer data, and the DMAC may assert a busy signal continuously
until it is free to begin. In addition a R/W channel will be required to indicate the
direction of transfer.
A simplified picture of the programmable registers is shown in Figure 9.11.
One control bit is shown which would determine whether an event is generated
upon completion of a transfer. Other control parameters which may be expected
are channel priority and transfer mode. The transfer mode defines the way in
which the DMAC shares the system bus with the CPU. The three fundamental
modes are as follows…
• Block transfer mode…completes the whole transfer in one operation and thus
deprives the CPU of any access to the bus while it does so
• Cycle stealing mode…transfers a number of bytes at a time, releasing the bus
periodically to allow the CPU access
• Transparent mode…makes use of bus cycles that would otherwise go unused
and so does not delay the CPU but does seriously slow up the speed of data
transfer from the device concerned
There are some devices which require block transfer mode because they generate
data at a very high rate, once started, and are inefficient to stop and restart.
Magnetic disc and tape drives usually require this mode.
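The trade-off between the three modes can be made concrete with a toy timing model. The burst size and the assumption of half-idle bus cycles for transparent mode are inventions for the illustration, not properties of any particular DMAC.

```python
def transfer_elapsed(transfer_cycles, mode, burst=4):
    """Bus cycles that elapse before a DMA transfer completes, under
    the three fundamental transfer modes (an illustrative model only).
    `burst` is the cycles taken per bus grant when cycle stealing."""
    if mode == "block":
        return transfer_cycles            # bus held throughout; CPU locked out
    if mode == "stealing":
        bursts = -(-transfer_cycles // burst)   # ceiling division
        return transfer_cycles + bursts         # one CPU cycle yielded per burst
    if mode == "transparent":
        # only otherwise-idle cycles are used; assuming half the
        # cycles are idle, the transfer takes about twice as long
        return transfer_cycles * 2
    raise ValueError(mode)

assert transfer_elapsed(8, "block") == 8        # fastest, CPU starved
assert transfer_elapsed(8, "stealing", burst=4) == 10
assert transfer_elapsed(8, "transparent") == 16 # slowest, CPU undisturbed
```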
Although a degree of efficiency is possible by careful design, the fact remains
that the system bus is a shared resource and currently sets the limit to overall
system performance.
9.2
Memory organization
9.2.1
Physical memory organization
Requirements
The memory sub-system of a computer must fulfill some or all of the following
requirements, depending on application…
The first three items apply to all systems, regardless of application. The last two
really only apply to work-stations (which represent just a small proportion of
working systems).
Minimum mean access time (per access)…of memory partially determines the
bandwidth of the system bus and thus the performance of the entire system (see
preceding section). It is not necessary to have all memory possessing the
minimum attainable access time. That would certainly conflict with other
requirements, particularly that of minimum mean cost. The mean access time
should be considered over all accesses, not over locations. It is possible to
minimize mean access time by ensuring that the fastest memory is that most
frequently accessed. Memory management must operate effectively to ensure
that the data most frequently referenced is placed in the memory most rapidly
accessed. We shall see how to achieve this later on in this section.
Minimum mean cost (per bit)…over all memory devices employed largely
determines the cost of contemporary machines. This is because the majority
(~90%) of the fundamental elements (e.g. switches) contained therein are
Technological constraints
It is impossible to fulfill all memory requirements with a single memory device.
It probably always will be because different physical limits are imposed on
optimization against each requirement.
Access time is limited first by the switching necessary to connect the required
memory register to the data bus and secondly by the response time of the memory
element itself. Dynamic memory stores each bit as charge on a capacitor; such
capacitances are hard to avoid! Roughly four times the memory may be rendered
on the same area of silicon for the same cost.
Nothing comes for free. The problem is leakage. Any real physical capacitor
is in fact equivalent to an ideal one in parallel with a resistance, which is large
but not infinite. A dynamic memory may be thought of as a “leaky bucket”. The
charge will slowly leak away. The memory state is said to require periodic
refreshing. Contemporary electronic dynamic memory elements require a refresh
operation approximately every two milliseconds. This is called the refresh
interval. Note that as bus bandwidth increases, refresh intervals remain constant
and thus become less of a constraint on system performance.
Refreshing is achieved using a flip-flop as shown in Figure 9.12. First the flip-
flop is discharged by moving the switches to connect to zero potential. Secondly
the switches are moved so as to connect one end of the flip-flop to the data
storage capacitor and the other to a reference capacitor, whose potential is
arranged to be exactly half way between that corresponding to each logic state.
The flip-flop will adopt a state which depends solely on the charge in the data
capacitor and thus recharge, or discharge, it to the appropriate potential. Flip-
flop, reference capacitor and switches are collectively referred to as a sense
amplifier. Note that memory refresh and read operations are identical.
A cost advantage over static memory is only apparent if few flip-flops are
needed for sense amplification. By organizing memory in two dimensions, as
discussed below, the number of sense amplifiers may be reduced to the square
root of the number of memory elements. Thus as the size of the memory device
grows so does the cost advantage of dynamic memory over static memory. For
small memory devices, static memory may still remain cost effective. The
since data may be erased. It thus offers competition for purely magnetic memory
with two distinct advantages…
…in addition to the non-volatility and extremely low cost per bit which both
technologies offer. Access time and transfer rate are similar, although magnetic
devices are currently quicker, largely because of technological maturity. The
portability advantage of the optical device arises out of the absence of any
physical contact between medium and read/write heads. Further reading on
optical memory is currently scarce, except for rather demanding journal
publications. An article in BYTE magazine may prove useful [Laub 86].
There is an enormous gulf separating an idea such as that outlined above and
making it work. For example, a durable material with the necessary physical
properties must be found. Also, some means must be found to physically
transport the heads over the disc while maintaining the geometry. A myriad of
such problems must be solved. Development is both extremely risky and
extremely costly.
Memories whose medium takes the form of a tape impose a severe overhead for
random access but are extremely fast for sequential access. A magnetic tape
memory can immediately present the storage location whose address is just one
greater than the last one accessed. An address randomly chosen will require
winding or rewinding the tape…a very slow operation. Tape winding is an example
of physical transport (see below).
An alternative arrangement is a two-dimensional array of words. Figure 9.16
shows this for a single bit layer. Other bits, making up the word, should be
visualized lying on a vertical axis, perpendicular to the page. No further
decoding is required for these, only a buffer connecting each bit to the data bus.
Each decode signal shown is connected to a vertical slice, along a row or
column, through the memory “cube”.
Externally the 2-d memory appears one-dimensional since words are accessed
by a single address. Internally this address is broken into two, the row address
and column address which are decoded separately. The advantage of row/column
addressing is that the use of dynamic memory can yield a cost advantage, using
current technology, as a result of needing just one sense amplifier for each entire
row (or column). This implies that only √n flip-flops are required for n memory
elements.
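Row/column decoding amounts to splitting the address field in two. The sketch below assumes a square array and an even split of address bits, purely for illustration.

```python
import math

def row_column(address, row_bits):
    """Split a 1-d address into the row and column addresses that are
    decoded separately inside a 2-d memory array."""
    row = address >> row_bits                  # high-order bits select the row
    column = address & ((1 << row_bits) - 1)   # low-order bits the column
    return row, column

# A 64K-element square array: 256 rows x 256 columns, so only
# 256 sense amplifiers (the square root of 65536) are needed.
elements = 65536
assert math.isqrt(elements) == 256
assert row_column(0x1234, 8) == (0x12, 0x34)
```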
Extra external hardware is required to ensure that every element is refreshed
within the specified refresh interval. A refresh counter is used to generate the
refresh address on each refresh cycle. Assuming that a sense amplifier is
provided for each column, it is possible to refresh an entire row simultaneously.
Care must be taken to guarantee a refresh operation for each row within the
given refresh interval.
Note that row address and column address are multiplexed onto a local address
bus which may thus be half the width of that of the system bus. Reduction of
interconnect reduces cost at the expense of speed and complexity. The following
signals must be provided in such a manner as not to interfere unduly in system
bus cycles…
These are used to implement a protocol for the communication of row and column
addresses, i.e. to latch them.
Most contemporary memory devices which offer…
…require some kind of physical transport. For example, devices using magnetic
technology require transport of both read and write heads which impose and
Access time is the time taken to physically transport the read/write heads over the
area of medium where the desired location is to be found. This operation is known
as a seek. The data transfer rate is the rate at which data, arranged sequentially,
may be transferred to or from the medium. This also requires physical transport,
but in one direction only, and without searching.
Figure 9.17 shows the arrangement of the Winchester disc, which possesses a
number of solid platters, each of which is read and written by an independent
head. Each sector is referenced as though it belonged to a 1-d memory via a
single address which is decoded into three subsidiary values…
• Head
Note that the term sector is used with two distinct meanings…a sector of the
circular disc and its intersection with a track!
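Decoding the single address can be sketched as successive division. The list above is cut short after “Head”, so the remaining two subsidiary values (track and sector), the ordering, and the geometry figures below are all assumptions made for the example.

```python
def decode_disc_address(block, heads, tracks, sectors):
    """Decode a single linear block address into head, track and
    sector (an assumed decomposition and ordering, for illustration)."""
    sector = block % sectors
    track = (block // sectors) % tracks
    head = block // (sectors * tracks)
    assert head < heads, "address beyond the end of the disc"
    return head, track, sector

# 4 heads, 600 tracks, 17 sectors per track (an invented geometry).
assert decode_disc_address(0, 4, 600, 17) == (0, 0, 0)
assert decode_disc_address(17 * 600, 4, 600, 17) == (1, 0, 0)
```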
Whatever the arrangement, as far as the system bus is concerned, each
individual memory element may be visualized simply as a flip-flop, internally
constructed from a pair of normally-open switches or, if one prefers, a pair of
inverter gates. Only one connection is actually required for both data in and data
out (level) signals. However these must be connected to the data bus in such a
way as to…
Figure 9.18 shows how this may be achieved. Note that it is the output buffer
which supplies the power to drive the data bus and not the poor old flip-flop!
This avoids any possibility of the flip-flop state being affected by that of the bus
when first connected for a read operation.
Associative cache
The associative cache (Figure 9.21) is a means of reducing the average access
time of memory references. It should be thought of as being interposed between
processor and “main” memory. It operates by intercepting and inspecting each
address to see if it possesses a local copy. If so a hit is declared internally and
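The intercept-and-inspect behaviour can be sketched as below. This is a toy fully-associative cache: capacity, line size and replacement policy are all simplified away, and the interface is invented for the example.

```python
class AssociativeCache:
    """A toy fully-associative cache: every stored address is compared
    with the incoming one, and a hit returns the local copy without a
    main-memory bus cycle."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = {}              # address -> cached word

    def read(self, address, memory):
        if address in self.lines:    # hit: intercepted, no bus cycle needed
            return self.lines[address], True
        word = memory[address]       # miss: go to main memory
        if len(self.lines) < self.capacity:
            self.lines[address] = word
        return word, False

memory = {0x100: 42}
cache = AssociativeCache(capacity=4)
assert cache.read(0x100, memory) == (42, False)   # first reference misses
assert cache.read(0x100, memory) == (42, True)    # second reference hits
```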
• Even parity
• Odd parity
…according to whether the parity bit denotes that an even or odd number of 1s
are present in the word. Even parity may be computed in a single exclusive-or
(XOR) operation according to…
Only single errors may be detected using a single parity bit. No correction is
possible, only an event reported to the processor whose response may be
programmed as an interrupt service routine if an interrupt is enabled for such an
event. Any memory error is as likely to be systematic as random nowadays.
Hence, on small systems, it is now often considered satisfactory simply to report
a memory device fault to the poor user or systems manager.
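The parity computation described above folds the data bits together with exclusive-or. A minimal sketch of that reading:

```python
def even_parity_bit(word, width=8):
    """Fold a word down to one bit with exclusive-or; the result is
    the parity bit that makes the stored word-plus-parity contain an
    even number of 1s."""
    bit = 0
    for i in range(width):
        bit ^= (word >> i) & 1
    return bit

assert even_parity_bit(0b1011_0010) == 0   # four 1s: already even
assert even_parity_bit(0b1011_0011) == 1   # five 1s: parity bit set
```

Flipping any single bit changes the computed parity, which is why exactly one error is detectable and none correctable.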
Hamming code offers single error correction/double error detection
(SECDED). Although only a single error may be corrected, double errors may be
detected and reported as an event. First, consider what is required of an additional
syndrome word which is capable of indicating whether a word is correct and, if
not, which of all possible errors has occurred. Given a data word length of n bits
and a syndrome word length of m bits there are n+m possible error states.
Including the correct state, there are thus n+m+1 possible states of a memory
read result. Thus the syndrome word length is defined by…
This implies the relationship shown in Table 9.1 between data word length,
syndrome word length and percentage increase of physical memory size.
It is obvious from this that such a code is only economic on systems with large
data bus width. Further, it would be useful if the syndrome word possessed the
following characteristics…
Table 9.1: Relationship between parameters for Hamming error detection code
Data bits Syndrome bits Percentage increase in memory size
8 4 50
16 5 31
32 6 19
64 7 11
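The condition stated in the text, that the n+m+1 possible states must be distinguishable by m syndrome bits, amounts to requiring 2^m ≥ n+m+1. A short check that this reproduces Table 9.1:

```python
def syndrome_bits(data_bits):
    """Smallest m satisfying 2**m >= n + m + 1, the condition for
    distinguishing every single-bit error state from the correct one."""
    m = 1
    while 2 ** m < data_bits + m + 1:
        m += 1
    return m

# Reproduce Table 9.1: data bits, syndrome bits, % increase in memory.
for n, m, pct in [(8, 4, 50), (16, 5, 31), (32, 6, 19), (64, 7, 11)]:
    assert syndrome_bits(n) == m
    assert round(100 * m / n) == pct
```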
Hamming code achieves all these desirable features by forming subsets of the
data word and recording the parity of each. Each bit of the syndrome word in
fact just represents the parity of a chosen subset of the data bits. It is possible to
choose these sets in such a way that any change in parity, detected by an XOR
between recorded and calculated syndrome words, indicates, not just an error,
but precisely which bit is in error. For example, the 5-bit odd parity syndrome of
a 16-bit data word is calculated via…
…where pi are stored as the syndrome and p′i are computed after a read. E=0
indicates no error; E≠0 indicates one or two errors, each (or each pair) of which
will cause E to take a unique value, allowing the erroneous bit(s) to be identified
and corrected.
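The subset-parity idea can be demonstrated with the classic Hamming construction, in which check bits occupy power-of-two positions and the syndrome is the XOR of the positions of all set bits. This is a textbook single-error-correcting sketch, not the book's particular 16-bit subset choice (and double-error detection would need one further overall-parity bit).

```python
def hamming_encode(data, data_bits=8):
    """Place data bits at non-power-of-two positions, then fill each
    power-of-two position with the parity of the subset of positions
    sharing that bit in their binary index."""
    word = {}                        # position (from 1) -> bit
    pos, i = 1, 0
    while i < data_bits:
        if pos & (pos - 1):          # not a power of two: a data position
            word[pos] = (data >> i) & 1
            i += 1
        pos += 1
    p = 1
    while p < pos:                   # fill in the check positions
        word[p] = sum(b for q, b in word.items() if q & p) & 1
        p <<= 1
    return word

def hamming_correct(word):
    """Recompute the syndrome as the XOR of the positions of all set
    bits; zero means no error, otherwise it names the bit to invert."""
    syndrome = 0
    for position, bit in word.items():
        if bit:
            syndrome ^= position
    if syndrome:
        word[syndrome] ^= 1          # correct the single erroneous bit
    return syndrome

cw = hamming_encode(0b1011_0101)
assert hamming_correct(dict(cw)) == 0    # an undamaged word: syndrome zero
bad = dict(cw)
bad[6] ^= 1                              # flip one bit "in the memory"
assert hamming_correct(bad) == 6         # syndrome names the error position
assert bad == cw                         # ...and the word has been repaired
```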
9.2.2
Virtual memory organization
Requirements
Virtual memory simply means the memory as it appears to the compiler and
programmer. The requirements may be summarized…
Single linear memory map…hides the complexity and detail of physical memory
which should be of no interest whatsoever to the compiler or machine level
programmer. Physical memory organization design aims should be considered
distinct from those of virtual memory organization. In other words, as far as the
compiler is concerned all of memory is unified into a single memory map. This
memory map will be typically of a volume approximately equal to that of the low
cost (mass storage) device and of access time approximately equal to that of
“main” memory. Memory management encapsulates the task of ensuring that this
is so. All virtual memory should be considered non-volatile.
These requirements pose severe problems for conventional programming
techniques where the poor user, as well as the programmer, is required to
distinguish between memory devices by portability (e.g. “floppy” vs. “hard” disc)
and by volatility (“buffer” vs. “disc”). It is hardly surprising that computers are
only used by a tiny proportion of those who would benefit (e.g. ~7% for business
applications). The most promising new paradigm, which might unify user and
programmer classes, is that of the object.
Objects14 are internal representations of “real world” entities. They are
composed of…
• Methods
• State
State simply means variables which are private to the object. Methods are
operations which affect state. Objects communicate by message passing.
The important point is that neither user nor compiler need give consideration
to physical memory organization. Objects are persistent, a fact consonant with the
non-volatility of virtual memory. By rendering communication explicit the need
for a visible filing system is obviated. The intention here is to point out the
relationship between objects and virtual memory15. It is not appropriate to give a
full introduction to object oriented systems here. The reader is referred to an
excellent introduction in BYTE magazine [Thomas 89] and to the “classic” text
[Goldberg & Robson 83].
Security…of access becomes essential when a processor is a shared resource
among multiple processes. The simplest approach, which guarantees
effectiveness given a suitable compiler, is for each process to be allocated
private memory, accessible by no other. The problem is that no such guarantee is
necessarily possible at the architecture level. A careless or malicious
programmer can easily create code (e.g. using an assembler) which accesses the
private memory of another process unless the architecture design renders this
physically impossible.
To achieve a secure architecture, memory access must be controlled by the
process scheduler which is responsible for memory allocation to processes when
they are created.
14See Chapter 2.
15 A full treatment of the subject of virtual memory implementation more properly
belongs in a text on operating system, e.g. [Deitel 84].
• Block number
• Offset into block
All device memory maps are divided into block frames, each of which holds some
or other block. Blocks are swapped between frames as required, usually across
Paged memory
For the moment we shall simplify the discussion by assuming a paged memory,
though what follows also applies to a segmented memory. The size of a page will
affect system performance. Locality has been shown to justify a typical choice of
512 bytes.
In fact the block map table must also record the physical memory device
where the block is currently located. If it is not directly addressable in main
memory a page fault event is reported to the processor. Whether software or
hardware, the memory manager must respond by swapping in the required page
to main memory. This strategy of swapping in a page when it is first referenced
is called demand paging and is the most common and successful. Anticipatory
paging is an alternative strategy which attempts to predict the need for a page
before it has been referenced.
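Translation under demand paging can be sketched as follows, using the 512-byte page size quoted above. The table entry format (frame number, or None when the page is not resident in main memory) is an assumption made for the illustration.

```python
PAGE_BITS = 9                        # 512-byte pages, as in the text

class PageFault(Exception):
    """Raised when the referenced page is not in main memory."""

def translate(virtual_address, page_map_table):
    """Split the virtual address into page number and offset, look the
    page up in the page map table, and concatenate frame and offset."""
    page = virtual_address >> PAGE_BITS
    offset = virtual_address & ((1 << PAGE_BITS) - 1)
    frame = page_map_table.get(page)
    if frame is None:
        raise PageFault(page)        # memory manager must swap the page in
    # "addition" of frame base address to offset is mere concatenation
    return (frame << PAGE_BITS) | offset

table = {0: 7, 1: 3}                 # pages 0 and 1 resident, page 2 not
assert translate(0x205, table) == (3 << 9) | 5   # page 1, offset 5
try:
    translate(2 << 9, table)         # page 2: demand paging takes over
except PageFault:
    pass
```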
Security may be afforded by each scheduled process possessing its own
distinct page map table. This may be used in such a way that no two processes
are physically able to reference the same page frame even if their code is
identical. It is only effective if no process other than the operating system is able
to initialize or modify page map tables. That part of the operating system which
does this is the memory allocation component of the process scheduler. It alone
must have the ability to execute privileged instructions, e.g. to access a page map
table base address register. Note that data or code may be shared by simply
mapping a page frame to pages in more than one page map table.
A page replacement strategy is necessary since a decision must be taken as to
which page is to be swapped out when another is swapped in. Exactly the same
arguments apply here as for the replacement strategy used for updating the
contents of an associative cache (see above). As before, the principle of locality
suggests that the least recently used (LRU) strategy will optimize performance
given structured code. Unfortunately it is very difficult to approximate
efficiently. See [Deitel 84] for a full treatment of this topic.
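The LRU strategy itself is simple to state, even if hard to approximate in hardware. A sketch using an ordered structure, with invented page numbers:

```python
from collections import OrderedDict

def touch(page_frames, page):
    """Record a reference: move the page to the most-recent end."""
    page_frames.move_to_end(page)

def lru_choose_victim(page_frames):
    """Pick the least recently used resident page to swap out;
    `page_frames` is kept ordered oldest-reference first."""
    victim = next(iter(page_frames))
    return victim

frames = OrderedDict.fromkeys([4, 9, 2])    # 4 referenced longest ago
touch(frames, 4)                            # 4 just referenced again
assert lru_choose_victim(frames) == 9       # now 9 is least recently used
```

The difficulty in practice is maintaining this ordering on every memory reference without slowing the reference itself.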
16 Note that this activity used to require software implementation, which would form part
of the operating system. It is now typically subject to hardware implementation.
Lastly, because address translation must occur for every single memory
reference, speed is of the highest importance. As pointed out above, a table look
up and an addition is required for each translation. Addition of page frame
address to offset merely requires concatenation17. Hence it is the table look up that
limits performance. Because of this it is common to employ a dedicated
associative cache for page map table entries deemed most likely to be referenced
next. This typically forms part of a processor extension called a memory
management unit (MMU), which is also usually capable of maintaining the entire
page map table without software intervention by independently responding to all
page fault events.
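The translation step can be sketched in C. The 512-byte page size follows the footnote's example; the table layout and fault signalling are assumptions made for this sketch, not details of any particular MMU.

```c
#include <stdint.h>

/* Assumed parameters: 512-byte pages, hence a 9-bit offset. */
#define OFFSET_BITS 9
#define PAGE_SIZE   (1u << OFFSET_BITS)

/* One page map table entry per virtual page: the frame number,
   plus a presence flag used here to signal a page fault. */
typedef struct {
    uint32_t frame;   /* page frame number in physical memory */
    int      present; /* 0 => reference causes a page fault   */
} pmt_entry;

/* Translate a virtual address: one table look-up, then mere
   concatenation of frame number and offset (no full addition). */
int translate(const pmt_entry *pmt, uint32_t vaddr, uint32_t *paddr)
{
    uint32_t page   = vaddr >> OFFSET_BITS;
    uint32_t offset = vaddr & (PAGE_SIZE - 1);

    if (!pmt[page].present)
        return -1;                       /* page fault event */
    *paddr = (pmt[page].frame << OFFSET_BITS) | offset;
    return 0;
}
```

Note that forming the physical address is a shift and an OR, which is exactly the concatenation the text describes.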
Segmented memory
Everything said above about paged memory also applies to segmented memory,
which makes security of access easier to achieve, at the cost of significantly
more difficult memory management due to possible fragmentation of the memory
map of each physical memory device.
Security is easier to achieve since the logical entities which require protection
(e.g. the state of a process or object) will naturally tend to vary in size. It is easier
to protect one segment than a number of pages.
Fragmentation is the term for the break up of a memory map such that free
memory is divided into many small areas. An example schematic diagram is
shown in Figure 9.25. It arises due to repeated swapping in and out of segments
which, by definition, vary in size. The damaging consequence is that, after a
period of operation, a time will arrive when no contiguous area of memory may
be found to frame a segment being swapped in.
At the expense of considerable complexity, it is possible to enjoy the best of
both worlds by employing a paged segmented memory. Here the memory may be
17 Only the page frame number is needed, rather than its complete base address, because it
is sufficient to completely specify page frame location on account of the fixed size of a
page. The physical base address of a page frame is just its number followed by the
appropriate number of zeros, e.g. nine zeros for a page size of 512 bytes.
262 CHAPTER 9. SYSTEM ORGANIZATION
• Segment
• Page offset within selected segment
• Word offset within selected page
The segment number selects the page map table to be used. The page number
selects the page, offset from the segment base. Finally the word number selects
the word, offset from the page base. Note that, although only a single addition is
required, two table look ups must be performed. That is the potential
disadvantage. Very fast table look ups must be possible. Benefit from caching
both tables is impossible if frequent segment switching occurs, e.g. between code
and data.
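The two look-ups and single addition can be sketched in C. The flat array holding every segment's page map table entries end to end is an assumption made purely for illustration, as are the field widths.

```c
#include <stdint.h>

#define WORD_BITS 9                 /* assumed 512-word pages */

/* Segment table entry: base index of that segment's page map
   table within one global array of frame numbers (an assumed
   layout, chosen only to keep the sketch short). */
typedef struct { uint32_t pmt_base; } seg_entry;

/* Two table look-ups, one addition: the segment table yields the
   page map table base, the page number is added to index it, and
   the word offset is simply concatenated onto the frame number. */
uint32_t translate_seg(const seg_entry *segs, const uint32_t *frames,
                       uint32_t seg, uint32_t page, uint32_t word)
{
    uint32_t pmt_index = segs[seg].pmt_base + page;   /* the addition */
    return (frames[pmt_index] << WORD_BITS) | word;   /* concatenation */
}
```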
Figure 9.27 allows comparison of the appearance of the three different virtual
memory organization schemes discussed.
9.3
External communication (I/O)
9.3.1
Event driven memory mapped I/O
Ports
A port, in the real world, is a place where goods arrive and depart. In a computer
the same purpose is fulfilled except that it is data which is received or
transmitted. The most efficient way for a processor to access a port is to render it
addressable on the system bus. Each port has its own distinct address. A read
operation then receives data while a write transmits it. Because ports thus appear
within the main memory map this technique is known as memory mapped I/O.
Communication is inherently asynchronous since the port acts as a buffer. Once
data is deposited there, either a processor or external device, depending on
direction of data transfer, may read it whenever it is ready. Synchronous
communication is possible if…
• Port arrival
• Port departure
Device drivers
The process to which arrival and departure events are reported is called a device
driver. Any system with more than one port for input or output must be
multiprocessing, at least at the virtual level. In the now rare case where no
interrupt generation is possible, all ports must be iteratively polled to
determine when events have occurred and to select18 the appropriate
device driver to generate a response.
The software architecture for a collection of synchronous communication port
drivers is shown below expressed in Occam…
PAR i = 0 FOR devices
  WHILE running
    SEQ
      c.event[i] ? signal
      port[i] ? data
      process (data)
This code assumes the availability of a separate channel for each event. If only a
single such channel is available it will be necessary to wait for a signal upon it
and subsequently poll event sources.
If a compiler is unavailable for a language which supports such real-time
systems programming, interrupt service routines must be individually coded and
placed in memory such that the interrupt control mechanism is able to vector
correctly. Note that at least two separate routines are required for each device.
Portability is lost. It is not sufficient that a language supports the encoding of
interrupt routines. It must also properly support programming of multiple
concurrent processes.
In many real-time applications the process which consumes the data also acts
as the device driver. This is not usually the case with general purpose computer
workstations. Most programs for such machines could not run efficiently
conducting their I/O via synchronous communication. In the case of input from a
keyboard the running program would be idle most of the time awaiting user key
presses. The solution is for the keyboard device driver to act as an intermediary
and communicate synchronously with the keyboard and asynchronously with the
program, via a keyboard buffer19.
The function of the event driven device drivers, which form the lowest layer
of the operating system in a work-station, is to mediate between running programs
and external devices. They usually communicate synchronously with the devices
and asynchronously with running processes.
Protocol
External communication channel protocols may be divided into two classes…
• Bit serial
• Bit parallel
Parallel protocols support the transfer of all data bits simultaneously. Bit serial
protocols support the transfer of data bits sequentially, one after the other.
Serial protocols must each include a synchronization protocol. The receiver
must obviously be able to unambiguously determine exactly when the first data bit
is to appear, as well as whether it is to be the most or least significant bit. One
method of achieving synchronization is to transmit a continuous stop code until
data is to be sent, preceded by a start code of opposite polarity. The receiver
need only detect the transition between codes. However it must still know fairly
accurately the duration of a data bit.
Serial interfaces are easily and cheaply implemented, requiring only a
bidirectional shift register20 at each end of a 1-bit data channel.
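The start/stop framing described can be sketched in C. The polarity (low start bit, high stop code) and least-significant-bit-first order follow common asynchronous serial practice; they are assumptions here, not something fixed by the text.

```c
#include <stdint.h>

/* Frame one data byte for asynchronous serial transmission: a
   start bit of opposite polarity to the idling stop code, then
   the data bits least significant first, then a stop bit. */
int frame_byte(uint8_t data, int *bits /* room for 10 */)
{
    int i, n = 0;
    bits[n++] = 0;                       /* start bit             */
    for (i = 0; i < 8; i++)
        bits[n++] = (data >> i) & 1;     /* LSB first             */
    bits[n++] = 1;                       /* stop bit (idle level) */
    return n;                            /* bits to transmit      */
}
```

The receiver detects the transition from the stop level to the start bit, then samples at the known bit duration, which is why both ends must agree fairly accurately on that duration.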
Both serial and parallel protocols require a transaction protocol. Perhaps the
simplest is the busy/ready protocol. Each party emits a level signal
indicating whether it is busy or ready to proceed. Each transaction commences
with the sender asserting ready. When the receiver ceases to assert busy, data
transfer commences and the receiver re-asserts busy, so indicating
acknowledgement to the sender. The entire cycle is termed a handshake. Finally
when the receiver port has been cleared it must resume a ready signal, allowing
the next word to be transmitted.
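A minimal model of the handshake in C follows. Signal levels are plain variables and all names are illustrative; real hardware would of course change these levels concurrently rather than in one call.

```c
/* A minimal model of the busy/ready transaction protocol. */
typedef struct {
    int sender_ready;
    int receiver_busy;
    int data;            /* the buffered word             */
    int transferred;     /* count of completed handshakes */
} channel;

/* One handshake: the sender asserts ready; if the receiver is not
   busy the word is transferred and the receiver re-asserts busy as
   acknowledgement, then clears the port and resumes ready. */
int handshake(channel *ch, int word)
{
    ch->sender_ready = 1;          /* transaction commences     */
    if (ch->receiver_busy)
        return -1;                 /* must wait: no transfer    */
    ch->data = word;               /* data transfer             */
    ch->receiver_busy = 1;         /* acknowledgement           */
    ch->transferred++;
    ch->receiver_busy = 0;         /* port cleared: ready again */
    ch->sender_ready = 0;
    return 0;
}
```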
Note that all protocols are layered. Layers of interest are…
• Bit
• Word
• Packet (Frame)
• Message
Only the first two have been discussed here. The rest are more properly treated in
a text on digital communications.
Figure 9.29: Parallel port registers of the 6522 VIA mapped into memory
20 See Chapter 5.
21 Their practical exploitation would require data sheets which are easily obtainable from
electronic component suppliers.
22 …which may, with a little difficulty, be used as a serial port.
bit in the interrupt enable register. The device driver (interrupt service routine)
must respond by polling the status of the VIA by reading the interrupt flag
register which records the event which has occurred (Figure 9.30).
The auxiliary control register decides whether or not data is latched as a result
of a handshake, which would usually be the case. It also controls the shift
register and the timers.
Timer control allows for free running, where it repeatedly counts down from a
value stored in it by the processor, or one shot mode, whereby it counts down to
zero just once. Timers are just counters which are decremented usually by the
system clock. An extremely useful option is to cause an event on each time-out,
allowing the processor to conduct operations upon timed intervals. Timers may
even be used to generate a waveform on a port output pin, by loading new values
after each time-out, or count edge signals arriving on a port input pin.
Shift register control allows for the shift timing to be controlled by…
…events. It also determines whether the shift direction is in or out and allows the
shift register to be disabled. Note that no provision is made for a transaction
protocol. This would have to be implemented in software using parallel port bits.
Serial communication is much better supported by the 6551 Asynchronous
Communications Interface Adaptor (ACIA), whose memory mapped registers are
shown in Figure 9.31.
Bit level synchronization is established by use of an accurate special clock and
predetermining the baud rate (the number of bits transferred per second). The
Figure 9.31: Serial port registers of the 6551 ACIA mapped into memory
control register allows program control of this and other parameters, such as the
number of bits in the stop code (between one and two) and the length of the data
word (between five and eight).
Like the interrupt flag register of the VIA, the status register encodes which event has occurred and
brought about an interrupt request (Figure 9.32). The device driver must poll it to
determine its response. Note that parity error detection is supported.
The command register provides control over the general function of the
interface device. Parity generation and checking may be switched on or off.
Automatic echo of incoming data, back to its source, is also an option. Other
options are the enabling/disabling of interrupt request generation on port arrival/
departure events and the enabling/disabling of transactions altogether.
Serial communication has been traditionally used for the communication
between a terminal and a modem, which connects through to a remote computer.
For this reason the handshake signals provided on commercial serial interface
devices are called…
Terminal communication
Thus far the mechanisms whereby the familiar terminal communicates data both
in, from the keyboard, and out, to the “screen”, remain unexplained in this volume.
Here is an overview of how a keyboard and a raster video display are interfaced to
the system bus in a memory mapped fashion.
Every keyboard is basically an array of switches, each of which activates one
row signal and one column signal, allowing its identity to be uniquely
characterized by the bit pattern so produced. An encoder is then employed to
produce a unique binary number for each key upon a “key press” event. It is not
a difficult matter to arrange the codes produced to match those of the ASCII
standard.
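For an 8×8 matrix the encoding amounts to concatenating two 3-bit fields, one from the row encoder and one from the column encoder, as this C fragment sketches. It is a model of the encoder logic only, not driver code.

```c
/* Encode a key press on an 8x8 switch matrix: the active row and
   column signals each pass through a 3-bit encoder, and the two
   fields are concatenated into a unique 6-bit key code. */
int key_code(int row, int col)      /* each in the range 0..7 */
{
    return (row << 3) | col;        /* 6-bit code, 0..63      */
}
```

Arranging the matrix wiring so that these codes line up with ASCII is then a matter of key placement rather than extra logic.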
Figure 9.33 shows the encoder connected to the system by means of a VIA
port. The handshaking is not shown. Here the keyboard ready signal equates with
key press event and should cause the VIA to generate an interrupt request and a
handshake response (acknowledge).
The key matrix need not be large since certain bits in the character code
output by the encoder are determined by the three special keys…
• Control (bits 5, 6)
A minimum of 64 character keys, plus the four special keys, are usually
required. Thus an 8×8 matrix would be sufficient.
The raster video display is much more difficult. Current technology relies on
the cathode ray tube (CRT) for physical display. It may be briefly summarized as
an electron beam scanned across a very large array of phosphor dots, deposited
on the inside of a glass screen, causing them to glow. The beam is scanned raster
fashion (Figure 9.34), typically with approximately one thousand lines. By
varying the intensity of the beam with time, in accordance with its position, the
screen is made to exhibit a desired brightness pattern, e.g. to display characters.
The rapidity with which the intensity may be varied depends upon the quality of
the CRT and determines the maximum number of picture elements or pixels
which may be independently rendered of different brightness.
The screen is divided up into a two-dimensional array of pixels. It is arranged
that this array be memory mapped so that a program may modify the brightness
pattern displayed simply by modifying the values stored in the array
(Figure 9.35). Typically, given a word width of one byte, a zero value results in
the corresponding pixel being black (unilluminated) and FF16 results in it being
white (fully illuminated). The digital values must be converted to an analogue of
the intensity (usually a voltage) by a digital-to-analogue converter (DAC). Such a
system would offer monochromatic graphics support. Colour graphics support
uses one of two possible techniques…
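For the monochromatic case just described, the memory mapped scheme reduces to an array write, as sketched below in C. The screen width is an assumed value for the sketch; zero is black and FF16 white, as in the text.

```c
#include <stdint.h>

/* Memory mapped monochrome graphics: the screen is a flat array
   of one byte per pixel. Writing the array is all a program must
   do to change the display; the DAC does the rest. */
#define WIDTH 640                        /* assumed resolution */

void set_pixel(uint8_t *screen, int x, int y, uint8_t level)
{
    screen[y * WIDTH + x] = level;       /* 0x00 black, 0xFF white */
}
```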
Figure 9.37: Use of raster display controller to interface with a video display
The raster display controller generates the addresses of locations for digital-to-
analogue conversion synchronized with display beam position. Synchronization
with the display is achieved via…
• Horizontal sync
• Vertical sync
23 A very high quality system might define characters using grey levels as well.
24 Font marketing is a rather telling example of a new product which is pure information.
One day a major share of the world economy might be the direct trade of pure information
products over public communication networks.
25 …more commonly referred to as a cathode ray tube controller (CRTC). An example is
…signals. Upon receipt of horizontal sync the beam moves extremely rapidly
back to the left-hand side of the screen and begins a new scan line. Upon receipt
of vertical sync it flies back to the top left-hand corner to begin scanning a new
frame. Three parameters alone are enough to characterize a raster display…
• Frame rate
• Line rate
• Pixel rate
Figure 9.37 shows how a raster display controller is connected to the system bus,
the dual port memory holding the screen map and the display itself.
In a pure graphics system the font ROM would be bypassed and the whole
address required for the (much larger) screen map memory. Mutual exclusion of
the screen memory may be achieved by inserting a wait state into the bus cycle if
necessary. The necessary buffering of each memory connection (port) is not
shown in the diagram.
In the character-oriented system shown, the screen map is made up of a much
smaller array of character values, each one a code (usually ASCII) defining the
character required at that position. A typical character-oriented display would be
twenty-four lines of eighty characters. The code is used as part of the address in a
font ROM26 which stores the graphical definition of every character in every
available font. The least significant bits determine the pixel within the character
definition and are supplied by the controller. Just as a colour graphics display
requires multiple planes, an extra plane is required here to define the font of each
character location. For example, address interleaving may be employed to give
the appearance of a single two-dimensional array of 2-byte words. The least
significant byte holds the character code, the most significant holds the font
number, used as the most significant address byte in the font ROM.
Systems which are capable of overlaying, or partitioning, text and graphics on
the same display are obviously more complicated but follow the same basic
methodology.
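The font ROM address formation just described can be sketched in C. The 8×8 character cell, and hence the field widths, are assumptions made for illustration; the principle of concatenating font number, character code and scan line is as in the text.

```c
#include <stdint.h>

/* Form a font ROM address: the font number is the most
   significant field, then the character code, then the scan line
   within an assumed 8x8 character cell (3 least significant
   bits, supplied by the display controller). */
uint32_t font_rom_addr(uint8_t font, uint8_t code, uint8_t line)
{
    return ((uint32_t)font << 11) | ((uint32_t)code << 3) | line;
}
```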
9.3.2
External communication (I/O) processors
Figure 9.38: Connection of system bus to external communications bus via an IOP
systems, via a network interface, or communication with users, for example via a
laser printer.
One approach, which has become commonplace, is to connect all external
devices together onto an external bus interfaced to the system bus by an I/O
processor (IOP). Figure 9.38 shows such an arrangement. An example is
the Small Computer System Interface (SCSI)27 [ANSI 86], which is well defined
by a standards committee and well supported by the availability of commercially
available integrated interface devices (e.g. NCR 5380). A detailed account of a
hardware and software project using this chip may be found in [Ciarcia 86].
Every device must appear to have a single linear memory map, each location
of which is of fixed size and is referred to as a sector or block. Each device is
assigned a logical unit number whose value determines arbitration priority. Up to
eight devices are allowed including the host adaptor to the system bus. The
system then appears on the external bus as just another device, but is usually
given the highest priority. A running program may in turn program the SCSI to
undertake required operations, e.g. the read command shown in Figure 9.39.
Each SCSI bus transaction is made up of the following phases…
1. Bus free
2. Arbitration
3. Selection
4. Reselection
5. Command
6. Data transfer
7. Status
8. Message
Arbitration is achieved without a dedicated arbiter. Any device requiring the bus
asserts a BSY signal on the control subsidiary bus and also that data bit channel
whose bit number is equal to its logical unit number. If, after a brief delay, no
higher priority data bit is set then that device wins mastership of the bus.
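The priority rule can be modelled in C as follows. This is a model of the arbitration logic only, not of bus timing; the eight-device limit is from the SCSI description above.

```c
/* SCSI-style self arbitration: each requesting device asserts the
   data bit whose number equals its logical unit number; after the
   settling delay, the highest bit set wins mastership. */
int arbitrate(unsigned requests)    /* bit i set => unit i requests */
{
    int winner = -1, i;
    for (i = 0; i < 8; i++)
        if (requests & (1u << i))
            winner = i;             /* a higher unit overrides */
    return winner;                  /* -1 if the bus stays free */
}
```

No dedicated arbiter is needed: every device applies this same rule to the data bus it can already see.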
Selection of slave or target is achieved by asserting a SEL control signal
together with the data bit corresponding to the required target and, optionally,
that of the initiator. The target must respond by asserting BSY within a specified
interval of time. If the target is conducting a time intensive operation such as a
seek it may disconnect and allow the bus to go free for other transactions.
Afterwards it must arbitrate to reselect the initiator to complete the transaction.
SCSI seeks to make every device appear the same by requiring that each obeys
an identical set of commands. The command set is said to be device independent
and includes the following commands…
• Read
• Write
• Seek
An example of the format of a typical command is shown in Figure 9.39. Note that
by setting a link flag, commands may be chained together to form an I/O
program. Chained commands avoid the time consuming process of arbitration.
Integrated SCSI interfaces, such as the NCR 5380, are capable of reading an I/O
program in the memory of the host automatically via DMA.
Status and message phases are used to pass information about the progress and
success of operations between initiator and target.
27 …pronounced “scuzzy”!
See Chapter 10 for more about Transputers, which represent a genuine change
of paradigm, and not just in the area of external communications.
Exercises
Question one
i Show by an example why it is that a two-dimensional memory is most
efficiently rendered square.
ii The cost per bit (ci), size (si) and access time (ti) of memory device i in a
given memory hierarchy are such that…
28 See Chapter 8.
What is the overall cost efficiency and access efficiency for the memory
hierarchy described below (typical for a contemporary work-station)…
What would the hit ratio of the topmost device have to be to yield an overall
access efficiency of 10%?
Question two
i Summarize the component signal channels of the system control bus. Include
channels for all signals mentioned in this chapter.
ii Draw a timing diagram for daisy chain bus arbitration. Explain how a lower
priority device, which requests the bus simultaneously with a higher priority one,
eventually acquires mastership.
iii Some system bus implementations use polled arbitration whereby, when
the bus is requested, the arbiter repeatedly decrements a poll count which
corresponds to a device number. As soon as the requesting device recognizes its
number, it asserts a busy signal and thus becomes bus master. It is therefore
ensured that, if more than one device requests the bus, the one with the
highest number is granted it.
Contrast the advantages and disadvantages of the following three bus
arbitration protocols.
• Daisy chain
• Polled
• SCSI bus method
Question three
i Draw a schematic diagram showing how an interleaved memory, three DACs, a
raster display controller and a RGB video monitor are connected together to
yield a three plane RGB colour graphics display.
ii Show how the following components…
• VIA
• Modulo 8 counter
• 3-bit decoder
• 3-bit encoder
…may be connected in order to read a very simple 64-key unencoded key matrix
which consists simply of an overlaid row and column of conductors such that the
intersection of each row and column pair may be shorted (Figure 9.42). Explain
how it is read, to produce a 6-bit key code, upon a key press event.
Question four
The LRU replacement strategy, for either an associative cache or a demand
paged virtual memory, is difficult and slow to implement since each entry must
be labelled with a time stamp and all entries consulted to determine when one is
to be replaced.
Devise an alternative implementation which approximates LRU yet is
efficient both in extra memory for entry labelling and in the rapidity with which
the entry to be replaced may be determined. The minimum amount of
combinational logic must be employed.
Question five
i Using the Hamming code described in this chapter, derive the syndrome for the
data word FAC916.
ii Show that every possible error, both in data and in syndrome, produces a
distinct error vector when the Hamming code described in this chapter is
employed.
10.1
Introduction
The objective of this chapter is not to provide the reader with sufficient
knowledge to author a compiler code generator or design a hardware system.
Rather it is intended to illustrate ideas conveyed throughout the text as a whole
and demonstrate that they really do appear in real commercial devices. However,
a clear overview should result of the major features of each example and
references are provided.
It is part of the philosophy of this book not to render it dependent on any
single commercial design but to concentrate attention on those concepts which
are fundamental. The reader should bear this in mind. Lastly, each of the sections
below is intended to be self-contained and self-sufficient in order to allow the
possibility of consideration of any one system alone. As a result some material is
necessarily repeated. However, the reader is strongly advised to read all three.
Much may be learned from the comparison of the three machines.
Note: The NS32000 is considered as a series of processors whereas only the
M68000 itself is presented, and not its successors…M68010, M68020, etc. This
is justified on two counts. First, all NS32000 series processors share the same
programmer’s architecture and machine language. This is not true of the
successors to the M68000. Secondly, it seemed desirable to first consider a
simpler architecture without “enhancements” and “extensions”.
282 CHAPTER 10. SURVEY OF PROCESSOR ARCHITECTURE
10.2
Motorola 68000
10.2.1
Architecture
Design philosophy
The Motorola 68000 was a direct development of an earlier 8-bit microprocessor
(the 6800) and is fabricated in a self-contained integrated device using VLSI
electronic technology. It has succeeded in both the work-station and real-time
control markets.
From a software engineering point of view it satisfies the following
requirements…
Instruction opcodes may require one or two operands. Most instructions executed
operate on two operands. Operands may be of one, two or four bytes in length3.
Vectored interrupts are prioritized and may be masked to inhibit those below a
certain priority specified within the processor state register (PSR).
Programmer’s architecture
Figure 10.1 shows the programmer’s architecture of the M68000. Two register
files are provided, one for addresses and one for data. a7 is in fact two registers
1 It is one of the criticisms of the CISC approach that compilers cannot easily optimize
code on a CISC architecture because of the extent of choice available. See [Patterson &
Ditzel 80], [Tabak 87].
2 A trap is a software generated processor interrupt or exception.
each of which is used as a stack pointer (SP). Two stacks are maintained, one for
supervisor mode, which is used for interrupt service routines, and one for user
mode, which is used for
Table 10.1: Flags in the M68000 processor state register and their meaning (when set)
Flag Write access Meaning when set
T Privileged Trace in operation causing TRC trap after every instruction
S Privileged Supervisor mode (supervisor stack in use, not user stack)
I0…2 Privileged Interrupt priority (lower priority interrupts inhibited)
X Any Extension beyond word length following arithmetic, logical or
shift operation
N Any Negative result of twos-complement arithmetic operation
Z Any Zero result of arithmetic operation
V Any Overflow in twos-complement arithmetic operation
C Any Carry after an addition, borrow after a subtraction
Addressing modes
Table 10.2 summarizes the addressing modes available on the M68000 together
with their assembly language notation and effective address computation. Note
that “[…]”, in the effective address column, should be read as “contents of…”.
An addressing mode is specified within the basic instruction in a 6-bit field.
This is divided into two sub-fields…mode and reg. The latter may be used either
3 In M68000 terminology, “word” refers to two consecutive bytes and “long” refers to four.
10.2. MOTOROLA 68000 285
Byte or word operations affect only the lower order fields within data registers.
Hence move.b d0,d1 will copy the contents of the least significant byte in d0 into
that of d1. None of the upper three bytes in d0 will be affected in any way. It is
as though they did not exist and the register was simply one byte long. The same
applies to arithmetic and logical operations. In the case where memory is
addressed, word or long word alignment is required: such operands must begin at an even address.
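The byte-wide register behaviour described above can be modelled in C. The function name and types are illustrative; this is a model of the effect of move.b on a data register, not vendor code.

```c
#include <stdint.h>

/* The effect of move.b on a 32-bit data register: only the least
   significant byte is replaced; the upper three bytes are left
   entirely unaffected, as though the register were one byte long. */
uint32_t move_b(uint32_t dst, uint32_t src)
{
    return (dst & 0xFFFFFF00u) | (src & 0xFFu);
}
```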
Instruction set
Tables 10.3, 10.4 and 10.5 summarize the M68000 instruction set. Where
relevant, instruction variants are provided for operation on byte, word and long
operands.
Program control is facilitated by a suite of instructions for condition evaluation
and conditional branch. A compare memory (cmpm) instruction is
included to allow simple, optimized implementation of array and string
comparison4. Table 10.6 shows all possible conditional branches, with the
processor state flag on which they depend, and their meaning.
In the case of two-address instructions the first is referred to as the source and
the second the destination. The result of the operation will be placed in the
destination which usually must be a register. Where operand order is important,
for example with subtraction or division, one must take care since ordering is
right to left. Hence sub.w (a0),d0 means subtract the contents of the memory
location whose address is in a0 from the contents of d0 into which register the
result is to be placed. Similarly divu #4, d7 means
Table 10.4: Instruction set of the M68000: Expression evaluation (continued in next
Table)
Group Mnemonic Operation
Moves move. <b|w|l> Move
movea. <w|l> Move address
movem. <w|l> Move multiple
movep. <w|l> Move peripheral
moveq Move quick operand
lea Load effective address
pea Push effective address
exg Exchange content of two registers
swap Swap upper & lower words of register
4 Among architectures in general, it is not necessarily the case that a cmpm instruction
will execute faster, or be easier to code-generate, than a sequence of cmp instructions. The
compiler author should always verify these things.
Table 10.5: Instruction set of the M68000: Expression evaluation (continued from last
Table)
Group Mnemonic Operation
Logical and Boolean and. <b|w|l> And
andi. <b|w|l> And immediate
or. <b|w|l> Or
ori. <b|w|l> Or immediate
eor. <b|w|l> Exclusive or
eori. <b|w|l> Exclusive or immediate
not. <b|w|l> Not (complement)
noti. <b|w|l> Not immediate
Shifts lsl. <b|w|l> Logical shift left
lsr. <b|w|l> Logical shift right
asl. <b|w|l> Arithmetic shift left (preserve sign)
asr. <b|w|l> Arithmetic shift right
rol. <b|w|l> Rotate left
ror. <b|w|l> Rotate right
divide the contents of d7 by four. Note that instructions for short “quick”
operands exist for addition and subtraction but not for multiplication and
division.
A load/store programming approach may be taken with the M68000 since…
• Immediate to register
• Memory to register
• Register to memory
…moves are efficiently supported and encouraged. Only move instructions allow
memory to memory movement. All arithmetic and logical operations place their
result in a data register which all but forces a load/store approach.
A problem remaining with the M68000, although it represents a great
improvement over earlier machines, is that the instruction set is not wholly
symmetric with respect to addressing modes. Care must be taken to ensure that
use of a selected addressing mode is permitted with a given instruction. This can
cause complication for the compiler author.
The instruction format typically includes fields for…
• Opcode
• Index word
• Immediate value
• Displacement
• Absolute address
Zero, one or two extensions are allowed. Figure 10.6 shows the instruction
extension formats.
An excellent concise summary of the M68000 architecture, together with
information required for hardware system integration, is to be found in [Kane
81]. A complete exposition which is suitable for a practical course on M68000
system design, integration and programming, is to be found in [Clements 87].
10.2.2
Organization
Processor organization
The organization of the M68000 is a fairly standard contemporary design
consisting of a single ALU and a fully microcoded control unit. External
communication must be fully memory-mapped since no explicit support exists
10.2.3
Programming
Constructs
Some closure of the semantic gap has been obtained by the designers both by
careful selection of addressing modes for data referencing and by the inclusion
of instructions which go far in implementing directly the commands of a high
level language. Code generation is intended to produce fewer instructions.
However, the instructions themselves are more complex. In other words the
problem of efficiently implementing selection and iteration is to be solved once
and for all in microcode instead of code.
A “for” loop always requires a signed addition of a constant, a compare and a
conditional branch on each iteration. This is a very common
construct indeed. The constant is usually small, very frequently unity. The
designers of the M68000 included a single instruction to this end (db<cond>,
with condition set to false) optimizing its implementation in microcode once and
for all. To take advantage of this instruction, a slight extra burden is thus placed
upon the compiler to isolate loops with unity index decrements. There is more to
db<cond> however. It checks a condition flag first, before decrementing the
index and comparing it to −1. This may be used to implement loops which are
terminated either by the success of a condition or by the index expiring at −1.
An example is that of repeatedly reading data elements into a buffer until either
the buffer is full or an end of stream symbol is encountered. Unfortunately it is
not always easy for a compiler to detect this kind of loop. There follows a code
skeleton for each kind of loop discussed.
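The combined-termination behaviour of db<cond> can be modelled in Python. This is a behavioural sketch of the published semantics, not any particular compiler's output; the function and variable names (dbcc, read_into_buffer, d0) are invented for illustration.

```python
def dbcc(cond, counters, reg):
    """Model of M68000 db<cond>: if the condition holds, fall through
    (exit the loop); otherwise decrement the 16-bit index and branch
    back unless it has reached -1."""
    if cond:
        return False                                  # condition ends the loop
    counters[reg] = (counters[reg] - 1) & 0xFFFF      # 16-bit decrement
    return counters[reg] != 0xFFFF                    # branch back unless index == -1

def read_into_buffer(stream, size, end_marker):
    """Read items until the buffer is full or an end marker arrives:
    the two-way loop termination discussed above."""
    counters = {"d0": size - 1}      # index register primed with count - 1
    buffer = []
    items = iter(stream)
    while True:                      # loop head: the db<cond> branch target
        item = next(items)
        buffer.append(item)
        if not dbcc(item == end_marker, counters, "d0"):
            break
    return buffer
```

With a large buffer the end marker terminates the loop early; with a small one the index reaching −1 terminates it.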
Below are shown two alternative implementations of a case construct, each with
its disadvantages. The result of the computation of the case expression is first
moved into a data register where it may be efficiently manipulated. Comparisons
are then performed in order to detect which offset to use with a branch. Each
offset thus directs the “thread of control” to a code segment derived from the
high level language statement associated with a particular case label. Each
selectable code segment must end with a branch to the instruction following the
case construct end. It is usual to place case code segments above the case
instruction. The disadvantage of this implementation is that quite a lot of code
must be generated and hence executed in order to complete the branch and exit
from the code segment selected. Its advantage is that the code produced is
relocatable without effort since it is position independent.
The method shown on the left is inefficient if the number of case labels is large
(greater than about ten). However, for a small number it is more compact and
hence usually preferred. In effect it is simply a translation of case into multiple
if…then…else constructs at the machine level.
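The trade-off between the two case translations can be sketched in Python. The labels and helper names here are invented purely for illustration.

```python
def classify_chain(c):
    """Compare-chain translation: one compare-and-branch per label.
    Compact for a few labels, but cost grows linearly with their number."""
    if c == 0:
        return "zero"
    elif c == 1:
        return "one"
    elif c == 2:
        return "two"
    else:
        return "other"

def classify_table(c):
    """Offset-table translation: a single indexed branch, constant time,
    but a table entry is needed for every label."""
    table = {0: "zero", 1: "one", 2: "two"}
    return table.get(c, "other")    # anything unlisted goes to the else segment
```

Both give the same result for every input; only the cost profile differs.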
10.2. MOTOROLA 68000 295
Procedures
Invocation of a procedure is very straightforward since the instruction set offers
direct support, through dedicated instructions, for saving and restoring registers
and for creating and destroying a stack frame for local variables. link should be
used at the start of a procedure; it creates a stack frame of the size quoted (in
bytes). unlk should appear at the procedure end; it automatically destroys the
stack frame by copying the frame pointer into the stack pointer and restoring the
frame pointer itself (from a value saved by link on the stack). Any of the address
registers may be employed as frame pointer. In order to save registers on the
stack which are to be used within the procedure, movem may be employed as
shown in the code segments which follow the next paragraph.
Figure 10.8 depicts the stack contents following execution of movem and link,
on entry to a procedure. Finally, the last instructions in a procedure should be
movem, lea and rts: these restore registers and throw away items on the stack
which were passed as parameters, and are thus no longer required, simply by
adjusting the value of the stack pointer. A return from the procedure is then
effected by copying the return address back into the program counter (PC). In the
case of a function procedure6 one must take care to leave the return value on the
top of the stack.
Caller (call and return):

    …
    move.w  <parameter n>, -(SP)
    bsr     <offset>
    …
    move.w  (SP)+, <result>

Procedure exit:

    unlk    a<r>
    movem.l (SP)+, d<n>-d<m>/a<p>-a<q>
    lea.l   +<parameter block size>(SP), SP
    rts
The above code skeletons show how a function procedure call and return may be
effected. Prior to the bsr (branch to subroutine) space is created for the return
value by pushing an arbitrary value of the required size (long shown). Parameters
are then loaded onto the stack in a predetermined order. On procedure entry, any
registers to be used within the function procedure are saved, so that they may be
restored on exit, and the stack frame then created.
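The effect of link and unlk on the stack can be modelled in Python. This is a behavioural sketch only: stack cells are list elements, the register names are illustrative, and addresses are simply list indices.

```python
class Stack:
    """Stack modelled as a Python list; append == push."""
    def __init__(self):
        self.mem = []
    def push(self, value):
        self.mem.append(value)
    def pop(self):
        return self.mem.pop()

def link(stack, regs, fp="a6", frame_size=2):
    """link: save the old frame pointer on the stack, point FP at the
    frame base, then reserve frame_size cells for local variables."""
    stack.push(regs[fp])
    regs[fp] = len(stack.mem)       # FP marks the frame base
    for _ in range(frame_size):
        stack.push(None)            # uninitialized locals

def unlk(stack, regs, fp="a6"):
    """unlk: copy FP back into SP (discarding the locals), then restore
    the old frame pointer saved by link."""
    del stack.mem[regs[fp]:]
    regs[fp] = stack.pop()
```

A link followed by an unlk leaves the stack and frame pointer exactly as they were, which is what makes nested procedure calls safe.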
Expression evaluation
The M68000 is a register machine for the purposes of expression evaluation. For
example7, the following code segment may be used to evaluate
6 …using the Modula-2 terminology. Pascal users would normally use the term
“function”.
The processor was not designed to perform expression evaluation on the stack.
There are two reasons why it would not be sensible to attempt it. Firstly, it would
be inefficient. Only very rarely would the compiler require more than eight data
registers. Registers are accessed without bus access cycles. Secondly, the
arithmetic instructions are designed to leave the result in a data register. Stack
evaluation simply is not supported. The instruction set is designed with the
intention that registers be used to the maximum effect.
Data referencing is usually performed using address register indirect with
displacement. The register used is the…
10.3
National Semiconductor 32000
10.3.1
Architecture
Design philosophy
The National Semiconductor 32000 was designed in the early 1980s to meet the
demand for very high performance systems in both the real-time control and
workstation markets. Principal characteristics of the design are…
The three most distinctive characteristics are listed at the top. Both instruction
set and addressing modes are designed to reduce code size and execution time of
high level language statements. Single instructions replace several required in
earlier machines. Most revolutionary, however, is explicit support for modular
software. Items in external modules may be directly referenced, be they
procedures or data.
There follows a list of features present…
Among these the only truly original feature is that of symmetry. Almost any
instruction may employ any addressing mode. Any register may be used for any
Programmer’s architecture
Figure 10.9 shows the NS32000 programmer’s architecture. Eight 32-bit general
purpose registers are provided; this has been shown to be adequate for the vast
majority of expression evaluations.
Six special purpose registers (SPR) define the memory map (Figure 10.10) for
any running program. Three SPRs point to areas of (virtual) memory containing
data: the static base register (SB) points to the base of static or global memory,
and the frame pointer (FP) points to local memory, where variables local to the
currently executing procedure are dynamically stored in a stack frame. The
program counter (PC) points to the next instruction to be executed. Figure 10.11 shows the
Table 10.7: Flags in the NS32000 processor state register and their meaning (when set)
Flag Write access Meaning when set
I Privileged Inhibit all interrupts except NMI (Traps unaffected)
P Privileged Prevent a trace trap occurring more than once per instruction
S Privileged User stack, not supervisor stack
form one kind of entry in the module link table to describe procedures referenced
which belong to other modules. The other kind of entry in the link table is simply
the absolute address of a variable belonging to another module. Hence whenever
an application is loaded, the descriptors and link tables for all its component
software modules must be initialized in memory.
• Bytes
• Words
• Double words
• Quad words
Instruction set
Tables 10.9 and 10.10 show almost all of the NS32000 instructions together with
the operation caused by their execution. The <i> suffix notation denotes one of the
following operand lengths…
• Byte (i=b)
• Word (i=w)
• Double word (i=d)
The instruction set is thus also symmetric with respect to data length. For
example movw means “move a word”.
Table 10.11 lists all the possible branch conditions of the processor state and
the associated branch instruction mnemonic. The possibility of branching
according to the simultaneous state of two flags helps close the semantic gap
with if…then…else selection. Note that semantic gap closure for selection and
iteration is also assisted via the inclusion of add, compare & branch and case
instructions (see below).
The general instruction is composed of a basic instruction (Figure 10.15) of
length one, two or three bytes possibly followed by one or two instruction
extensions containing one of the following…
• Index byte
• Immediate value
• Displacement
…depending on both the instruction, which may have an implied operand, and
each of the two addressing modes, one or both of which may require
qualification. The basic instruction encodes…
• Opcode
• Operand length
• Addressing mode for each operand
Figure 10.16 shows the format of a displacement extension which may be one,
two or four bytes in length.
A complete description of the NS32000 instruction set and addressing modes
may be found in [National Semiconductor 84].
10.3.2
Organization
Processor organization
Figure 10.17 shows the organization of the NS32332. This is an evolved member of
the NS32000 series. The design is a hybrid stack+register machine, offering the
convenience of a stack for procedure implementation and the speed and code
compactness afforded by register file expression evaluation. An instruction
cache queues instructions fetched when the bus is otherwise idle. A dedicated
barrel shifter and adder are provided for rapid effective address calculation.
Address and data are multiplexed on a common bus. Additional working
registers are provided. These will be invisible even to the compiler and are used
by the microcode in implementing instructions.
8 NS32000 documentation reserves the term “word” to mean two bytes, “double word”
four bytes and “quad word” eight bytes. In the text here the term is used more generally.
Figure 10.21 shows the bus timing modified to allow address translation. Note
that only one extra clock cycle per transaction is required provided a hit is
obtained by the associative translation cache, which is 98% efficient!
10.3.3
Programming
Constructs
Closure of the semantic gap has been obtained by the designers both by careful
selection of addressing modes for data referencing and by the inclusion of
instructions which go far in implementing directly the commands of a high
level language. Code generation is intended to produce fewer instructions.
However, the instructions themselves are more complex. In other words the
problem of efficiently implementing selection and iteration is solved once and
for all in microcode instead of code.
A “for” loop always requires signed addition of constant, compare and
conditional branch operations on each iteration. Since this is a very common
construct indeed, and the constant is usually small, the designers of the NS32000
Above are code “skeletons” for the implementation of both for loop and case
constructs. case effects a multi-way branch where the branch offset is selected
according to the value placed previously in r<n>. This is used as an index into a
table of offsets which may be placed anywhere but which it is sensible to locate
directly below the case instruction. The argument to case is the location of an
offset to be added to the PC which is addressed using PC memory space mode.
Each offset thus directs the “thread of control” to a code segment derived from
the high level language statement associated with a particular case label. Each
selectable code segment must end with a branch to the instruction following the
case construct end. It is usual to place such code segments above the case
instruction.
The compiler must generate code to evaluate the case expression which may
then simply be placed in the index register. It should also generate the case label
bounds, an offset table entry for every value within the bounds and code to verify
that the case expression value falls within them. If it fails to do so an offset
should be used which points to a code segment generated from the else clause in
the case construct. Any offset not corresponding to a case label value should also
point to the else segment. You should be able to see why widely scattered case
label values indicate an inappropriate use of a case construct. In such
circumstances it is better to use a number of if…then…else constructs (perhaps
nested).
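The compiler's bounds-checked offset table can be sketched in Python. This is illustrative only: the label values and helper names are invented, and Python callables stand in for the selectable code segments.

```python
def case_dispatch(value, lower, upper, table, else_segment):
    """Sketch of the compiler's case scheme: verify the expression lies
    within the label bounds, then take a single indexed branch through
    the offset table."""
    if value < lower or value > upper:   # generated bounds check
        return else_segment()
    return table[value - lower]()        # single indexed branch

def else_segment():
    return "else clause"

# Labels 1 and 3 only: value 2 lies within the bounds but has no label
# of its own, so its table entry points at the else segment.
table = [lambda: "label 1", else_segment, lambda: "label 3"]
```

Note how every value in the bounds needs a table entry, which is why widely scattered labels make the table wastefully large.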
Procedures
Invocation of a procedure is very straightforward since the instruction set offers
direct support, via dedicated instructions, for saving and restoring registers and
for creating and destroying a stack frame for local variables. enter should be used
as the first instruction of a procedure; it saves a nominated list of registers and
creates a stack frame of the size quoted (in bytes). exit should be the last but one
instruction; it restores a nominated list of registers and automatically destroys the
stack frame by copying the frame pointer into the stack pointer and then
restoring the frame pointer itself (from a value saved by enter on the stack).
Figure 10.22 depicts the stack contents following execution of enter on entry
to a procedure. Finally, the last instruction in a procedure should be ret<i>, which
throws away the items on the stack which were passed as parameters, and are thus
no longer required, by simply adjusting the value of the stack pointer. It then effects
The above code skeleton shows how a function procedure call and return may be
effected. Prior to the bsr (branch to subroutine) space is created for the return
value by pushing an arbitrary value of the required size onto the stack (double
word shown). Parameters are then loaded onto the stack in a predetermined
9 …using the Modula-2 terminology. Pascal users would normally use the term function.
order. On procedure entry, any registers to be used within the function procedure
are saved, so that they may be restored on exit, and the stack frame created.
External procedures, i.e. those which reside in other software modules of the
application, may be invoked using either cxp (call external procedure) or cxpd
(call external procedure via descriptor) instead of bsr. The argument to cxp is
simply an offset (displacement) within the current module link table. That of
cxpd is an external procedure descriptor (see above). rxp (return from external
procedure) must be used in place of ret.
Expression evaluation
The NS32000 is a register machine for the purposes of expression evaluation.
For example10, the following code segment may be used to evaluate
The processor was not designed to perform expression evaluation on the stack.
There are two reasons why it would not be sensible to attempt it. Firstly, it
would be inefficient. Only very rarely would the compiler require more than
eight registers. Registers are accessed without bus access cycles. Secondly, the
stack is only modified if a tos operand access class is read, as is the case with the
first, but not the second, operand in an arithmetic instruction. Hence add 4(SP),
tos will leave the stack size unaltered. The first operand will remain. The
instruction set is designed with the intention that registers be used to the
maximum effect.
Data referencing is usually performed using memory space mode, in
particular…
10.4
Inmos Transputer
10.4.1
Architecture
Design philosophy
The introduction of the Transputer represents nothing less than a revolution in
computer architecture. Although a single Transputer is capable of a higher
instruction throughput than almost any other processor integrated into a single
device, its real power is extracted when used as a single node in a homogeneous
network. Parallel processing computers may be constructed extremely easily and
with a cost/performance ratio that puts supercomputing within the purchasing
power of the individual, or small organization, rather than just huge centralized
units. However, its main market is for the many new real-time embedded control
applications which it makes possible.
The Transputer is the first device to exploit electronic VLSI11 technology to
integrate an entire computer in a single device (processor, memory and
communication channels) rather than simply to expand the power and
complexity of one or other of its components. It is this approach which gives rise
to the advent of affordable parallel computation. Also included is an external
memory interface which permits easy, cheap “off-chip” memory expansion. This
• Division of load
• Division of function
is of obvious value when the product is new, and because some of its
operations are inherently impossible to perform in a single cycle anyway.
The principal points of Transputer design philosophy may be summarized as…
although an assertion therein, that ~80% of executed instructions take just one clock
cycle, is mistaken. This is the correct figure for the proportion which are encoded in a
single byte, [Inmos 88#5, page 23], but only an unknown proportion of instructions
executed do so in a single cycle. About half of the 1-byte instructions available require
only a single cycle.
For full documentation of the Transputer see [Inmos 88#2, Inmos 88#3] and for
Occam see [Inmos 88#1, Burns 88]. Very useful ancillary information and
documentation of example applications may be found in [Inmos 89].
Programmer’s architecture
Figure 10.23 depicts the programmer’s architecture of the T414 Transputer. The
O (operand) register acts rather like an accumulator in that it is the default source
of the operand for almost all instructions. A, B, C form an evaluation stack which
is affected both by special load/store and by arithmetic instructions. Whenever
data is loaded it is in fact pushed into A, whose content is pushed down into B,
whose content is in turn pushed down into C. The content of C is lost, so it is the
responsibility of the compiler to save it if necessary. Similarly, a store instruction
moves C into B and B into A. The new content of C is undefined. Only O, A, B, C
are directly manipulated by instructions. The rest are arranged to be taken care of
automatically14. One further register, E, is hidden and used by Transputer block
move instructions.
I contains a pointer to the next instruction to be executed and hence performs
the function of a program counter. A departure from traditional program control
is that none of the familiar processor state flags are present (e.g. carry, zero).
Instead A is used to contain the value of a Boolean expression encoded as 1 for
true and 0 for false. Multi-precision arithmetic is expected to be performed solely
using dedicated instructions which use the evaluation stack exclusively without
the need for the usual carry, overflow and negative flags.
15 Upon a halt the transputer will either idle, waiting for link communication, or reboot
of time and in fact is two registers, one for each priority. The low priority clock
ticks once every 64 µs, the high priority clock every 1 µs. The full cycle times are
respectively ~76 hours and ~4 ms, independent of processor clock rate! The low
priority clock gives exactly 15625 ticks per second. TNext indicates the time of
the earliest awaited event and allows the process which awaits it to be “woken
up” and rescheduled. The process is located by means of the timer list
(Figure 10.25), a pointer to the start of which is kept in a reserved memory
location (see below).
Figure 10.26 shows the use of reserved workspace locations for process status
description. W−2 and W−4 form the links in the dynamic process ready queue
and timer list respectively, should the process be currently on either one. It
cannot be on both structures since it will be suspended if awaiting a timer event.
W−5 contains the time awaited, if any. When the process is suspended, W−1
houses the value to be loaded into I when eventually rescheduled and run. An
Occam PAR process spawns a number of subsidiary “child” processes and
cannot terminate until they do. The number of offspring which have still to
terminate plus one is recorded in W+1. Lastly, W+0 is used like an extra register
by certain instructions. If these are in use it must be kept free.
In addition to processor registers, a number of reserved memory locations are
employed to facilitate communication and process scheduling (Figure 10.27).
The bottom eight words are the rendezvous locations for eight hard channels
(links). This is the only way in which links are visible to the compiler. Other than
I = 7FFFFFFE₁₆
W = MemStart ∨ 1
A = Iold
B = Wold
C is undefined
If BootFromROM is clear then the Transputer listens to the first link to receive a
byte. Sending a reset Transputer a zero byte, followed by an address and then
data, effects a poke of that data into a memory location. Similarly a value of one,
followed by an address, effects a peek where the contents of that address are
returned on the corresponding output link. Any value greater than one is
interpreted as the length of a string of bytes forming the boot code which is
loaded into internal memory starting at MemStart. It then executes that code
starting with the following state…
I = MemStart
W = First free word ∨ 1
A = Iold
B = Wold
C = Pointer to boot link
“First free word” is the first word in internal memory whose address is
≥ MemStart + code length. The OR of W with 1 ensures that the boot code runs as
a low priority process.
Analysing the Transputer state is highly desirable when diagnosing faults on a
network. By simultaneously asserting the Analyse and Reset hardware signals the
Transputer is persuaded to reboot. testpranal (test processor analysing) may be
used at the start of ROM boot code to determine whether to reboot the most
senior system process or to perform state analysis. If booting from link, peek and
poke messages may be employed to examine state (see above).
Following Analyse/Reset a Transputer will halt program execution as soon as
a suspension point for the current process priority is reached (Table 10.14).
However the current process is not suspended. Subsequently both clocks are
stopped; the I and W values are then to be found in A and B respectively. State available
for analysis includes…
• Error
• W and I
• Channel status
• Ready queue
• Timer list
• Any low priority process interrupted by one of high priority
saveh and savel save the high and low priority queue registers respectively in a
pair of words pointed to by A. This facilitates inspection of the ready queue.
Addressing modes
One of the most RISC-like features of the Transputer is that each instruction
defines the operand addressing mode. In this sense it may be said to have just
one mode. However, overall there is more than one way in which the location of
an operand is determined. Table 10.15 shows these.
Instructions which specify a constant operand may be said to use immediate
mode. As far as variable data is concerned, local mode is the one used for
referencing scalar data in compiled Occam. Non-local mode is provided (by
means of load/store non-local instructions) for accessing within vector data. An
offset must be derived in O and a pointer in A before loading or storing data.
All references are relative to allow position independent code and avoid
relocation editing. Data is referenced relative to W or A and code relative to I.
Although instruction mnemonics j and cj stand for “jump” and “conditional
jump”, they in fact represent branch instructions. Their operands are added to the
content of I and do not replace it.
Structured immediate mode data may be referenced by means of ldpi (load
pointer to instruction) whose operand is an offset from the current value of I.
Instruction set
A somewhat cunning approach gives the Transputer instruction set the following
qualities.
At the simplest level each instruction is just one byte, consisting of 4-bit opcode
and operand fields (Figure 10.28). The thirteen most commonly used instructions
are encoded in a single nibble. One of the remaining three possible codes
(operate) executes its operand as an opcode, allowing in all twenty-nine
effective operations to be encoded within a single byte.
Table 10.16: Instruction set of the Transputer: Expression evaluation (continued in next
Table)

Group         Mnemonic   Operation        Nibbles   Cycles
Data access   ldc        Load constant    1         1
              ldl        Load local       1         2
              stl        Store local      1         1
              ldnl       Load non-local   1         2
Table 10.17: Instruction set of the Transputer: Expression evaluation (continued from last
Table)

Group     Mnemonic   Operation      Nibbles   Cycles
Logical   and        And            4         1
          or         Or             4         1
          xor        Exclusive or   4         1
Two prefix instructions allow the operand to be extended, right up to the word
length limit, by loading their own argument into O and then shifting O up four
bits. (All instructions begin by loading their operand into O and, except for
prefixes, clear it before terminating.)
To generate negative operands, a negative prefix instruction complements O
prior to the left shift. Its operation may be described…
In short, the argument to nfix appears in the least significant nibble (LSN) of O
and is complemented (via BITNOT) before being shifted left by a nibble. It is not
at all obvious how one acquires a desired (twos-complement) value in O, so one
doesn’t: one leaves it to a compiler or assembler to work out! For the sake of
illustration we will investigate how to acquire an operand value of −256. It
requires just nfix #F, then the required operator with argument zero. Using the
above description, the least significant sixteen bit field of O evolves as follows…
0000.0000.0000.0000
0000.0000.0000.1111
1111.1111.1111.0000
1111.1111.0000.0000
Note that the result will still be correct regardless of word width!
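The prefixing mechanism is easy to model in Python. A 16-bit word is assumed purely to match the worked example above (the Transputer applies the same rule whatever the word width), and the function names mirror the instruction mnemonics.

```python
MASK16 = 0xFFFF   # 16-bit word assumed only to match the worked example

def pfix(o, arg):
    """pfix: merge the argument into O, then shift O up a nibble."""
    return ((o | arg) << 4) & MASK16

def nfix(o, arg):
    """nfix: merge the argument into O, complement it, then shift O up
    a nibble."""
    return (~(o | arg) << 4) & MASK16

# Building the operand -256: nfix #F, then an operator with argument 0.
o = nfix(0, 0xF)   # O evolves 0000 -> 000F -> FFF0 -> FF00
o = o | 0x0        # the final instruction merges its own argument
```

Interpreted as a twos-complement word, FF00₁₆ is −256, as the bit patterns above show.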
The principal design aim responsible for the operand register mechanism is to
minimize the number of bits required to describe a set of operations. As a result
the useful work done by the processor is made much less dependent on the
bandwidth of the processor-to-memory communication channel. For equivalent
“horsepower”, less money need be spent on acquiring fast memory.
Tables 10.16, 10.17, 10.18 and 10.19 show the instruction set of the
Transputer divided into expression evaluation, program control and scheduling/
communication categories. Single instructions are provided which implement
input, output and assignment…the three primitives of Occam. In addition, no
(run-time) operating environment is needed to conduct any of the work
associated with process scheduling. The onus is on the compiler to generate code
Table 10.19: Instruction set of the Transputer: Process scheduling & communication

Group           Mnemonic   Operation                         Nibbles   Cycles
Communication   in         Input message (length w words)    2         2.w+18 or 20
                out        Output message (length w words)   2         2.w+20 or 20
                outbyte    Output byte                       2         25
                outword    Output word                       2         25
                move       Move message (length w words)     4         2.w+20 or 20
                lb         Load byte                         2         5
                sb         Store byte                        4         4
                enbc       Enable channel                    4         →7
                disc       Disable channel                   4         8
                resetch    Reset channel                     4         3
Timers          tin        Timer input                       4         ?
                ldtimer    Load timer                        4         2
Briefly, alt sets the ALT state (W−3) to Enabling.p (a predefined value). enbc
and/or enbs (enable channel, enable SKIP) instructions may then be employed to
check whether any guard is ready to proceed, in which case the ALT state is
modified to Ready.p (another predefined value). altwt will suspend the process if
no guard was ready. On rescheduling, due to a channel being ready to
communicate, disc and/or diss (disable channel, disable SKIP) are used to
determine which guard succeeded and to place its process descriptor in
workspace (W+0). altend then causes that process to be immediately executed.
A problem occurs if more than one guard is ready during either enabling or
disabling. Both Occam and CSP call for nondeterministic selection in this
circumstance. The Transputer lacks a mechanism for this, so selection is
according to the first ready guard disabled.
10.4.2
Organization
Processor organization
The Transputer is an entire computer integrated into a single device and
consists of a processor, memory and a number of communication links (currently
four input and four output) which may be considered as i/o processors and use
DMA (Figure 10.29). These all communicate on a very fast internal bus
(currently thirty-two bits wide) which completes a transaction in just one clock
cycle (50ns for current 20MHz devices).
A comparatively low frequency clock is required by the Transputer (5MHz).
Much higher speed clocks are derived internally from it. This approach has two
valuable advantages…
• All Transputers may use the same clock regardless of their internal speed
• The nasty problem of distributing a high speed clock signal is avoided
The signals shown are adequate for static memory. Further signals are
provided by the EMI to refresh and synchronize dynamic memory. Memory
configuration is programmable, including many timing parameters, by setting a
configuration map located at the top of external memory. A full treatment of
interfacing T414 and T800 Transputers to all kinds of memory may be found in
[Inmos 89, pages 2−25].
10.4.3
Programming
Expression evaluation
The Transputer uses a stack machine model for expression evaluation. ldl and stl
push and pop data between the evaluation stack (A, B, C) and workspace. Recall
that their operand is an offset into workspace. Constants are loaded using ldc.
ldl b
ldl b
mul
ldl a
ldl c
mul
ldc 4
mul
sub
ldl a
ldc 2
mul
div
stl RootSquared
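Reading the instructions off, the sequence computes (b·b − 4·a·c) div (2·a). It can be checked with a toy evaluator in Python — illustrative only: a real Transputer evaluation stack is just three registers deep, and the arithmetic operators here follow the second-minus-top convention the listing relies on.

```python
def run(program, workspace):
    """Toy stack evaluator for the listing above (a sketch; the real
    evaluation stack holds only A, B and C)."""
    stack = []
    for op, *args in program:
        if op == "ldc":
            stack.append(args[0])
        elif op == "ldl":
            stack.append(workspace[args[0]])
        elif op == "stl":
            workspace[args[0]] = stack.pop()
        else:
            top, second = stack.pop(), stack.pop()
            if op == "mul":
                stack.append(second * top)
            elif op == "sub":
                stack.append(second - top)
            elif op == "div":
                stack.append(second // top)   # integer division assumed

program = [
    ("ldl", "b"), ("ldl", "b"), ("mul",),      # b*b
    ("ldl", "a"), ("ldl", "c"), ("mul",),
    ("ldc", 4), ("mul",),                      # 4*a*c
    ("sub",),                                  # b*b - 4*a*c
    ("ldl", "a"), ("ldc", 2), ("mul",),        # 2*a
    ("div",),
    ("stl", "RootSquared"),
]
```

Running it with a = 1, b = 6, c = 2 stores (36 − 8) div 2 = 14 in RootSquared.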
ldc 0 ldc 0
ldl xlo ldl xlo
ldl ylo ldl ylo
lsum ldiff
stl zlo stl zlo
lmul and ldiv support multiple precision multiplication and division. This time C
holds the carry word of the result and is added to the product of A and B. The
least significant word of the result is held in A and the most significant (carry) in
B. Shown below is the encoding of a double precision unsigned multiplication.
(read the left column first, then the right)

    ldc  0          …
    ldl  xlo        ldl  xhi
    ldl  ylo        ldl  yhi
    lmul            lmul
    stl  z0         rev
    ldl  xlo        stl  z3
    ldl  yhi        ldc  0
    lmul            rev
    rev             ldl  z2
    stl  z2         lsum
    ldl  xhi        stl  z2
    ldl  ylo        ldl  z3
    lmul            sum
    stl  z1         stl  z3
    …
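A functional model of the same scheme can be written in Python. This is not a line-by-line transcription of the listing: lmul is modelled from the description above (result = a·b + c, split into a low word and a carry word), 32-bit words are assumed, and the carry folding that the listing performs with lsum/sum is done at the end.

```python
WORD = 32
MASK = (1 << WORD) - 1

def lmul(a, b, c):
    """Model of lmul: form a*b + c and split it into a least significant
    word and a carry word (returned low first, carry second)."""
    product = a * b + c
    return product & MASK, product >> WORD

def dmul(xlo, xhi, ylo, yhi):
    """Unsigned double-precision multiply built from lmul: four partial
    products, with every carry word folded into the next stage."""
    z0, k0 = lmul(xlo, ylo, 0)
    t,  k1 = lmul(xlo, yhi, k0)
    z1, k2 = lmul(xhi, ylo, t)
    z2, z3 = lmul(xhi, yhi, k1)
    s = z2 + k2                       # the remaining carry (cf. lsum/sum)
    return z0, z1, s & MASK, (z3 + (s >> WORD)) & MASK
```

Reassembling z0..z3 reproduces the full 128-bit product, with no carry flag needed anywhere.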
Sequential constructs
There is only one processor state flag in the Transputer (Error). There are none of
the familiar arithmetic flags to assist in evaluating conditions. There are two
instructions which evaluate arithmetic conditions, eqc and gt, which use A to
record their result using the convention True=1, False=0.
Above is shown how these may be employed to encode (left to right) ¬(cond),
x=y, x>y and x≥y.
Below is shown the encoding of both for loop and case constructs. Indexed
iteration requires two contiguous words in memory: the first is the loop index
and the second the number of iterations to be performed. lend accesses the index
and count via a pointer in B. It decrements the count and, if further iteration is
required, increments the index and subtracts A from I. In other words it causes a
backward branch, whose offset is found in A and is interpreted as negative. This
avoids the need for nfix instructions.
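The index/count scheme can be sketched in Python. This is behavioural only: the two workspace words and the backward branch are modelled, not the instruction encoding, and at least one iteration is assumed (as lend decrements before testing).

```python
def lend_loop(body, start, count):
    """Model of a for loop compiled around lend: the loop index and the
    remaining iteration count occupy two adjacent workspace words."""
    ws = {"index": start, "count": count}   # the two contiguous words
    assert count > 0                        # at least one iteration assumed
    while True:                             # loop head: lend's branch target
        body(ws["index"])
        ws["count"] -= 1                    # lend decrements the count...
        if ws["count"] == 0:
            break                           # ...and falls through when done,
        ws["index"] += 1                    # ...else bumps the index and branches back
```

Calling it with start 3 and count 4 hands the body the indices 3, 4, 5 and 6 in turn.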
Procedures
Procedure invocation on the Transputer, like expression evaluation, is
stack-oriented. This time though W is used as a stack pointer. The evaluation stack
is used only for passing the first three parameters and for a return. Invocation begins
with depositing the first three parameters in A, B, C. W is then adjusted upwards
and parameters four onwards are stored in the “bottom” of workspace as shown
in Figure 10.31.
C, B, A, I are then pushed onto the “stack” by call. The return address is thus
found on top of the stack and is restored to I by ret. W must be readjusted to its
original state after the return. Code skeletons for both invocation and entry are
shown below…
stl <return>
Process scheduling
The Occam programming model currently includes only static processes.
However the Transputer fully supports dynamic process start and termination.
Concurrency is expressed in Occam using the PAR construct. Where only a
single processor is concerned it may be encoded as shown below…
Finally, when the last child process has terminated, the parent may terminate.
As with its children, [W+0] must point to the address of the next code to be
executed (the continuation code of its parent). This assumes that the whole PAR
process was itself a child of another and that it was not a component of a SEQ
process. Hence there is no code to be executed directly after its conclusion.
If that were the case, one of the “child” processes could be implemented
simply as a continuation of the parent, sharing the same workspace
pointer value. Only two new processes would then need to be “spawned”. Three
control threads would exist instead of the four as in the above implementation. In
fact the Occam compiler encodes PAR in this way since it reduces the
scheduling overhead and makes it easier to compile source code with PAR
within SEQ¹⁸.
Communication
Rendezvous to synchronize communication over a channel is implemented by the
Transputer instructions in and out. The evaluation stack is used to describe channel
and message in exactly the same manner for both instructions: A contains the
message length in bytes, B a pointer to the channel and C a pointer to the
message area in memory. Usage is identical
whether the instructions are used with hard channels or soft channels, except
that hard channels are found at the base of memory.
Figure 10.32 depicts the first and second process arriving at the rendezvous
location. The first finds a reserved value there (Empty=NotProcess.p) and
deschedules leaving its process “id” (workspace pointer). The second process, on
finding a value which is not NotProcess.p, is able to complete the transaction,
regardless of its direction, since the message area address of the first process is
recorded in its workspace (see Figure 10.26).
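The channel-word protocol can be sketched as follows. This is an illustrative Python model (the names chan_out and chan_in are inventions of the sketch): the first party at the rendezvous leaves its identity and message pointer in the channel word and "deschedules"; whichever party arrives second completes the transfer, whatever its direction.

```python
EMPTY = None   # plays the role of NotProcess.p in the channel word

def chan_out(chan, pid, msg):
    """Output over a channel: first arrival waits, second completes."""
    if chan["word"] is EMPTY:
        chan["word"] = ("out", pid, msg)   # record id and message; deschedule
        return "descheduled"
    kind, other, buf = chan["word"]        # an inputter was already waiting
    buf.append(msg)                        # copy into its message area
    chan["word"] = EMPTY
    return ("resume", other)

def chan_in(chan, pid, buf):
    """Input over a channel, symmetric with chan_out."""
    if chan["word"] is EMPTY:
        chan["word"] = ("in", pid, buf)
        return "descheduled"
    kind, other, msg = chan["word"]        # an outputter was already waiting
    buf.append(msg)
    chan["word"] = EMPTY
    return ("resume", other)

ch = {"word": EMPTY}
inbox = []
assert chan_out(ch, "P", 42) == "descheduled"     # first at the rendezvous
assert chan_in(ch, "Q", inbox) == ("resume", "P") # second completes it
assert inbox == [42]
```

Note the symmetry: neither function cares which direction it is completing, just as the text observes for in and out.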
18 However such source could never be configured over a Transputer network, since there
is no way for processors executing component processes of the PAR to know when the one
executing the parent has reached the point where they themselves may start. It should
usually be possible to transform the source so that PAR has the broader scope (see
[Burns 88, page 141]).

; link0in        ; link0out
mint             mint
mint             mint
stnl #10         stnl #00

The code above shows the initialization of two channels which just happen to be
link0in (left) and link0out (right). Code for sending a message from link0out on
one Transputer to link0in on another is shown below. The two processes will
automatically synchronize for the communication transaction.
Timers
There is a clock register for each priority (Figure 10.23). The high priority clock
ticks every 1 μs, the low one every 64 μs, giving full cycle times of ~4295 s and
~76 hrs respectively. There are exactly 15625 low priority ticks per second.
Timers are abstractions of the hardware clocks. Loading a timer simply means
reading the current priority clock. In Occam timers are declared like channels or
variables. They may be input from but not output to.
Above is shown both the Occam and assembly language code to generate a delay
of one second in the running of a (low priority) process. The time is first read
(loaded) from a timer. One second’s worth of ticks is added (modulo 2³²) and then a
timer input performed. tin suspends its parent process unless the value of the
current priority clock is already later than the value in A. The process will
then be inserted in the timer list (see above) to be automatically added to the
ready queue when the clock reaches the value awaited. That value is recorded in
the process workspace. When it becomes the next time awaited, at the front of
the timer list¹⁹, it is copied into TNext until the wait is over. Note that the process
must wait its turn in the ready queue after the awaited time; hence when it
runs again the clock will read some later time. Note also that tin does not affect any
variables. If the time in A is in the past it has no effect whatsoever; if not, its only
effect is to suspend its process and add it to the timer list.
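The behaviour of tin on a cyclic clock can be sketched in Python. The helper names are invented for the sketch; the 15625 ticks per second and the modulo-2³² arithmetic are as described above.

```python
M = 1 << 32                  # clock registers cycle modulo 2^32
TICKS_PER_SECOND = 15625     # low priority ticks in one second

def later_than(a, b):
    """Occam AFTER: is time a later than time b on a cyclic clock?"""
    return a != b and ((a - b) % M) < M // 2

def tin(clock, a, timer_list, pid):
    """Timer input: no effect at all if the time in A is already past."""
    if later_than(clock, a):
        return "run on"
    timer_list.append((a, pid))   # suspend until the clock reaches a
    timer_list.sort()             # earliest time awaited at the front
    return "suspended"

now = 4_000_000_000
delay = (now + TICKS_PER_SECOND) % M    # one second from now
waiting = []
assert tin(now, delay, waiting, "P") == "suspended"
assert tin((delay + 1) % M, delay, waiting, "P") == "run on"
```

Keeping the comparison modular is what lets the delay work correctly even when the addition wraps around the top of the clock's range.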
Alternative construct
The ALT construct of Occam is a form of selection additional to IF and CASE.
Whereas IF selects according to the values on a list of Boolean expressions and
CASE according to the value of a single general one, ALT selects a process
according to the success of its guard.
A guard in its general sense is any primitive process. Here it is assumed to
mean an input. Hence the process selected is the one whose first action may be
performed, where this is an input.
In Occam any guard may be qualified by a Boolean expression which allows
that input channel to become “deaf”, and that process never to be selected, as
long as it is false. It is allowable for the guard to always succeed, by use of
SKIP, so that the entry’s selection depends only on a Boolean expression. This
also may be chosen to be TRUE allowing a default (“else”) TRUE & SKIP
process. Also a guard may be a timer input. In summary the following guard
types are possible…
• Skip guard
• Timer guard
• Channel guard
One great problem exists with implementing the ALT construct. Should more
than one process have its guard succeed, and hence become ready, the
specification calls for non-deterministic selection to ensue. The Transputer, like
all other contemporary processors, lacks a mechanism for this. It cannot roll
dice. Hence the encoding will dictate an order of priority for process selection.
Encoding of ALT takes the form…
1. Enable guards
2. IF no guard is ready THEN suspend
3. Disable guards
4. Start process associated with (first) ready guard
Enabling means finding out if any are able to succeed. Prior to enabling, the alt
instruction deposits a reserved value (Enabling.p) in [W−3]. enbs and enbc are
used to enable skip and channel guards. Should any be ready, [W−3] is set to
Ready.p. Should any guard be a timer guard, talt must be used, which also
initializes [W−4] to TimeNotSet.p. The first enbt instruction will alter this to
TimeSet.p. The first and subsequent enbt instructions will ensure that the earliest
time awaited is entered in [W−5].
altwt or taltwt will check to see if the ALT state is Ready.p. If not, the state is set
to Waiting.p, [W+0] is set to NoneSelected.o to indicate that no process has been
selected, and the process is suspended. Any communication or timeout with a
waiting ALT will set its state to Ready.p and add its process to the ready queue.
Disabling means locating the guard which became ready. It is the disabling
order which dictates selection priority, given more than one ready guard. The
offset of the code guarded by the first ready guard encountered is
placed in [W+0] by the diss, dist or disc instructions.
altend simply branches by the offset stored in [W+0].
Processes branched to by altend must end with a branch to the continuation
code after the ALT. Those which had a channel guard must begin with an input
from that channel. Enabling and disabling do not perform the input operation;
it must be explicitly encoded.
Occam                   Enable          Disable
ALT                     talt            …
  clock? AFTER time     ldl time        ldl time
    …                   ldc 1           ldc 1
  bool & c? v           enbt            ldc <offset 1>
    …                   ldl c           dist
  TRUE & SKIP           ldl bool        ldl c
    …                   enbc            ldl bool
                        ldc 1           ldc <offset 2>
                        enbs            disc
                        taltwt          ldc 1
                        …               ldc <offset 3>
                                        diss
                                        altend
indicating its own readiness to communicate, and then suspends itself, as though
the other party in the communication were not ready.
Workspace locations with offsets 0 to −3 are affected by the {alt enbs enbc altwt
diss disc altend} instructions. Those with offsets 0 to −5 are affected by the group
{talt enbs enbc enbt taltwt diss disc dist altend}.
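Selection under ALT can be caricatured in Python. The function name and data shapes below are inventions for the sketch: each branch is a (qualifier, ready) pair, enabling discovers whether any guard may proceed, and disabling scans in a fixed order, so the first ready guard encountered wins.

```python
def alt_select(guards):
    """guards: list of (qualifier, ready) pairs, one per ALT branch."""
    # Enabling: the ALT becomes Ready.p if any qualified guard may proceed
    if not any(q and r for q, r in guards):
        return None          # Waiting.p: the process would be suspended
    # Disabling: scan in a fixed order; priority is the disabling order
    for offset, (q, r) in enumerate(guards):
        if q and r:
            return offset    # placed in [W+0]; altend branches by it

assert alt_select([(True, False), (True, True), (True, True)]) == 1
assert alt_select([(True, False)]) is None
```

The sketch makes the limitation plain: with two ready guards, the same one is always selected, which is exactly the deterministic priority that the encoding imposes.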
Booting
As described above, the Transputer will boot either from ROM or from the first link to
receive a byte value greater than one. If booting from ROM there should be a
jump instruction, located at the ResetCode address (7FFFFFFE₁₆ on the T414),
branching backwards into memory.
Assuming the Transputer state is not to be analysed, the following actions may
be performed by the boot code…
The following boot code is suggested, which branches to another routine if the
Transputer is being analysed,…
; Analyse or boot?
testpranal
cj 4
pfix <analyse hi>
pfix <analyse mi>
ldc <analyse lo>
gcall
; Initialize Error &
; HaltOnError flags
testerr
sethalterr
…

; Initialize timer
; queue & link words
ldc 0
stl 0
ldl 11
stl 1
; loop start
mint
mint
ldl 0
sum
stnl 0
ldlp 0
ldc 8
lend
…

; Initialize process
; queue front pointers
mint
sthf
mint
stlf
; Start both timers
mint
sttimer
Reserved values
Process state is recorded in negative offsets from the base of the workspace of
each scheduled process. In order to identify the state of each scheduled and
suspended process, and that of each communication channel in use, certain word
values are reserved to have special meaning. These are summarized in
Table 10.20.
Exercises
Question one
i When and why is it undesirable for a programmer to employ a case construct
with many, and widely scattered, case label values?
Comment on any implications of this issue for the relationship between the
design of machine architecture and that of its programming model.
ii Either jump or case<i> NS32000 instructions may be used, in conjunction
with program memory space addressing mode and scaled index modifier, to
implement a case construct. Explain the differences in implementation and
performance. Which is preferable and why?
Question two
i Use the information given in Table 10.21²⁰ to compare the code size and
execution time between the NS32000 and Transputer for a local procedure
invocation, entry and exit, where three parameters are passed and no return is
made. Assume that value parameters are passed as copies of local variables
located within sixteen bytes of the frame, or workspace, pointer, and assume a two byte
branch offset. Comment on your result.
ii Use the information given in Table 10.21 to compare the code size and
execution time between the NS32000 and Transputer for the implementation of a
case construct with twelve case labels. Assume that the case expression has been
evaluated and the result stored within sixteen bytes of the frame, or workspace,
pointer and that the case label lower bound takes a value between zero and
fifteen. Comment on your result.
Question three
i Using hex notation, give the Transputer machine code (T-code) required to load
an operand into the A register whose twos-complement value is −241₁₀.
ii Using pseudocode, give the algorithm for the T-code generation of any
general signed operand in the O register.
20 The information in Table 10.21 is for a NS32016 16-bit processor and assumes word
alignment of operands and the absence of virtual memory address translation. enter and
exit are assumed to save and restore just three registers. In fact the execution times shown
may not correspond very well with those actually observed, because the instruction
queue is cleared by some instructions; hence execution time will depend on subsequent
instructions.
iii Using assembly language notation, give the T-code implementation of a for
loop where the required number of iterations may be zero.
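For cross-checking answers of this kind, the standard recursive prefixing scheme can be sketched in Python. This is a sketch of the well-known pfix/nfix algorithm (the function names are inventions of the sketch), together with a small interpreter for the O register so that encodings can be verified by round trip.

```python
def encode(op, v):
    """Emit (function, nibble) pairs which leave v in O for instruction op."""
    if 0 <= v < 16:
        return [(op, v)]
    if v >= 16:
        return encode("pfix", v >> 4) + [(op, v & 0xF)]
    return encode("nfix", (~v) >> 4) + [(op, v & 0xF)]   # v negative

def run(seq):
    """Interpret a prefix sequence, returning the signed operand formed in O."""
    O = 0
    for fn, data in seq:
        O |= data
        if fn == "pfix":
            O = (O << 4) & 0xFFFFFFFF
        elif fn == "nfix":
            O = (~O << 4) & 0xFFFFFFFF
        else:                                   # the terminal instruction
            return O - (1 << 32) if O >> 31 else O

seq = encode("ldc", -241)
assert seq == [("nfix", 15), ("ldc", 15)]   # i.e. the two bytes 6F 4F in hex
assert run(seq) == -241
for v in (0, 15, 16, 1000, -1, -4000):
    assert run(encode("ldc", v)) == v
```

Note how a single nfix suffices for −241: complementing the operand register before shifting propagates the sign without a long chain of prefixes.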
Appendix A
ASCII codes
The tables below show the American Standard Code for Information Interchange
and the function of some of the control codes it specifies…
Control…is used to generate control codes. For example, the result of pressing D
whilst holding down control, on a keyboard, is to despatch the EOT control
character. The code which results is 40₁₆ less than that obtained from the upper
case key, or 60₁₆ less than that obtained by pressing the lower case alphanumeric key alone.
Shift…is used to generate upper case characters. The effect of maintaining it
down whilst pressing an alphabetic key is to despatch a code with value 20₁₆
less than that obtained by pressing the key alone, and 10₁₆ less with a numeric
key alone.
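These offsets are easy to confirm mechanically; the helper below is illustrative only.

```python
def control_code(key):
    """Code despatched when Control is held down with a letter key."""
    return ord(key.upper()) - 0x40     # 40 hex below the upper case code

assert control_code("d") == 0x04               # Control-D despatches EOT
assert ord("d") - 0x60 == ord("D") - 0x40 == 0x04
assert chr(ord("a") - 0x20) == "A"             # Shift with an alphabetic key
assert chr(ord("1") - 0x10) == "!"             # Shift with a numeric key
```

The regularity is deliberate: Control clears bits 6 and 5 of the character code, and Shift clears a single bit, which made these keys cheap to implement in early terminal hardware.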
Appendix B
Solutions to exercises
Computation
Question one
i Stream protocol is used for the communication between author and reader.
Writer and reader processes are successive. Communication is buffered by the
book itself. It is also layered into…
ii Two examples of communication used in everyday life are the telephone and
postal system. The telephone offers synchronous communication between
concurrent processes while the postal system offers asynchronous
communication between successive processes. Telephone cables form channels
and letters form buffers respectively.
The protocols used are those for spoken and written natural language
respectively. Spoken language makes use of pauses for EOT of sentences and
places heavier reliance on rules of syntax. Otherwise the protocol for both is as
above.
Question two
i The instruction set is (e.g.)…
• …etc
Question three
A NAND gate may be implemented using normally-open switches as shown in
Figure B.1.
Question four
i It is easy to think of processes whose subordinates are concurrent. Most
systems of interest are thus. Objects of scientific study, e.g. ecological and
biological systems, are composed of concurrent communicating sequential
processes. Scientists learn a great deal from simulations which it is often too
difficult or inefficient to implement sequentially on a single processor computer.
The most obvious example, which is benefitting us all dramatically, is weather
prediction through simulation of meteorological systems.
It actually proves difficult to think of a natural process, composed only of
successive processes, which is of any interest. Successive processes are naturally
associated with queue structures. The manufacture of a product may be naïvely
regarded as sequential…
PART I. FROM SOFTWARE TO HARDWARE 351
The two subordinates would communicate using a buffer such as a box to contain
the components. This would not be passed to the assembly worker until full. Either
the worker making components or the one assembling the product will be idle
while the other completes his task. Obviously this is very inefficient and bears no
resemblance to any modern production line.
ii Figure B.2 illustrates the problem and its solution. Communication from A
to B and C to D must be asynchronous and thus employ buffers. Communication
from B to D and C to A must be synchronous and thus employ channels. The
mapping of processes to processors results from the fact that only channel
communication is possible between distinct processors. In fact there exists a
useful transformation between two descriptions which are equal in abstract
meaning but where only one is physically possible…
SEQ                        PAR
  PAR                        SEQ
    A (c.1, v.1)               A (c.1, v.1)
    C (c.1, v.2)        →      B (c.2, v.1)
  PAR                        SEQ
    B (c.2, v.1)               C (c.1, v.2)
    D (c.2, v.2)               D (c.2, v.2)
iii With the information available it must be concluded that deadlock will not
occur. However, if the data dependency between C and D were reversed then a
dependency cycle would exist and deadlock could occur. A would be unable to
proceed until D had terminated, allowing C to transmit. The system would
deadlock.
Software engineering
Question one
Modules are the means of defining and implementing tasks. Each task may be
from any level of the top-down diagram. Definition modules are written by the
design engineer who passes them to a programmer who creates corresponding
implementation modules. These are then compiled, the resulting object modules
linked with the necessary library modules and then tested to verify that the
requirements specification is met.
Top-down diagrams are decompositions of tasks into more manageable sub-
units. They constitute the first phase of system design. No more than three or
four levels of decomposition should be expanded before a new analysis is
undertaken for each terminal task. The final terminals should be capable of
implementation by a single engineer.
Data flow diagrams depict the communication inherent in the requirement.
Data flows between processes (nodes) via either channels or variables, depending
on whether the processes concerned run concurrently or successively. They are
useful for process-oriented design.
Question two
i The phases in the software life-cycle are…
1. Requirements analysis
2. Requirements specification
3. Design analysis
4. Design specification
5. Implementation
6. Verification
Question three
i Library modules are packages of useful procedures which reduce the time and
effort taken to implement applications. As such they greatly reduce the cost of
software production because they reduce the total software which needs to be
designed, implemented and tested for a new application. They also greatly
reduce the time taken to deliver a new application to the market.
ii Library modules are pre-packaged software. It is wasteful and unreliable to
recompile them each time they are used. Hence the machine language should
support the invocation of procedures within them. However, the design engineer
will encounter them via their definition modules which are all he/she needs to
know! The implementation engineer will simply need to link them to his own
module for purposes of verification.
Question four
The principal motivation for machine language support for software partitions
is to reduce the overhead incurred in crossing an inter-module boundary (e.g.
when invoking an imported procedure or referencing an imported variable).
Machine language
Question one
i When a book is translated, the process is compilation, since translation is
completed before anything is read in the destination language.
ii The semantic gap is the difference between a programming language and a
machine language. The main argument for closing it is to gain more rapid
execution speed by implementing high level language statements directly in
hardware, minimizing the number of actual hardware operations performed. The
main argument against closure is that compiler code generators rarely use such
sophisticated instructions and addressing modes. It is difficult to design one
which selects the correct CISC instruction and addressing mode from many
different alternatives.
Question two
The various operand parameters are tabulated below…
Question three
Program to compute x AND y…
1. load x
2. nand y
3. store x
4. nand x
5. store x
Note that store must not affect the processor state but load might without any ill
effect. To implement load so that it did clear/set the processor state might prove
useful in computing other functions.
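The five-step program can be traced directly; a quick Python check of its effect on every input pair (the accumulator is modelled by the variable a):

```python
def nand(a, b):
    return 1 - (a & b)

def and_via_nand(x, y):
    a = x               # 1. load x    (A := x)
    a = nand(a, y)      # 2. nand y    (A := NOT(x AND y))
    x = a               # 3. store x
    a = nand(a, x)      # 4. nand x    (A := NOT(A AND A) = x AND y)
    x = a               # 5. store x
    return x

for x in (0, 1):
    for y in (0, 1):
        assert and_via_nand(x, y) == (x & y)
```

Step 4 is the crucial one: NANDing the stored intermediate with itself acts as a NOT, undoing the inversion introduced at step 2.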
Question one
i A 12 bit word maps onto four octal and three hex digits exactly. (Each octal digit
maps exactly onto three bits. Each hex digit maps exactly onto four bits.)
ii 011001010100001100100001 is equivalent to…
• 654321₁₆
• 31241441₈
• 1F58D1₁₆
• 07654321₈
• 1000.0000.0000.0000.0000.0000.0010.1100
• 111.110.101.100.011.010.001.000
Question two
i The twos complement representation (state) of a value v, given a word width N
(modulus M = 2^N), is v mod M…

Representation   Value
                 Sign-magnitude   Excess-128   2s complement
FF₁₆             −127             +127         −1
C9₁₆             −73              +73          −55

Note that the carry is ignored. (Remember the “clock” picture of labelling
states!)
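The three readings of each byte can be checked mechanically; these helper names are inventions for the check:

```python
def sign_magnitude(b):
    return ((-1) ** (b >> 7)) * (b & 0x7F)   # top bit is the sign

def excess_128(b):
    return b - 128                           # bias of 128

def twos_complement(b):
    return b - 256 if b >= 128 else b        # wrap the upper half negative

assert (sign_magnitude(0xFF), excess_128(0xFF), twos_complement(0xFF)) == (-127, 127, -1)
assert (sign_magnitude(0xC9), excess_128(0xC9), twos_complement(0xC9)) == (-73, 73, -55)
```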
Question three
i The most important inadequacies of ASCII as a terminal protocol are…
ASCII is a 7-bit code. If the eighth bit in a byte is not used for parity error protection
it may be used to switch to an alternative character set.
For extra control, ESC is sent followed by an escape sequence of bytes. There
exists an ANSI³ standard for an extended terminal protocol which provides some
facilities, such as cursor control. The world awaits new standards!
ii The problem with transmitting raw binary data is that some values will be
misinterpreted by the receiver as ASCII control codes.
One solution is to pre-encode each nybble of the raw binary as a hex digit,
encoded in ASCII. One can then transmit an ASCII file and decode it to raw
binary at the destination.
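The nybble-to-hex scheme described is exactly what the standard library's hex encoding performs; a minimal round trip:

```python
import binascii

data = bytes([0x00, 0x1B, 0x04, 0xFF])   # raw bytes a receiver could mistake
text = binascii.hexlify(data)            # b'001b04ff': pure printable ASCII
assert binascii.unhexlify(text) == data  # decoded back to raw binary
```

The cost is a doubling of the message length, which is why denser ASCII-safe encodings were later preferred for bulk transfer.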
Question four
i Two examples of number coding systems which exhibit degeneracy are…
• Sign-magnitude integer
• Un-normalized floating-point
Question five
i Floating-point representation should be chosen, in favour of fixed-point, when…

1.375    1.0110
2.75     10.1100
4.125    100.0010

fraction  = 000.0100.0000.0000.0000.0000₂
          = 04.0000₁₆
exponent  = 1000.0001₂
          = 81₁₆
result    = 4084.0000₁₆
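The packed result can be checked against IEEE-754 single precision, with which it coincides (4.125 = 100.001₂ = 1.00001₂ × 2²):

```python
import struct

# Pack 4.125 as a big-endian 32-bit float and read back the raw bit pattern
bits = struct.unpack(">I", struct.pack(">f", 4.125))[0]
assert bits == 0x40840000   # sign 0 | exponent 1000.0001 | fraction 00001…0
```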
Element level
Question one
A standard sum of products may be written…
Question two
i With sign-magnitude representation apply the XOR operator with the bit mask
1000.0000, i.e. msb=1.
ii With twos complement representation apply the XOR operator with the bit
mask 1111.1111, i.e. all 1s, to complement the value, then add 1.
Question three
Figure B.3 shows how NAND and NOR gates may be used to implement {AND,
OR, NOT}.
Question four
i In each case, DeMorgan’s laws may be employed to expand the expression as
follows…
ii To prove that Qaoi may be employed alone to express any truth function it is
enough to show that the {AND, OR, NOT} sufficiency set may be implemented…
A single Qaoi may be used for negation in the implementation of both AND and
OR.
Question five
i Figure B.4 shows Figure 5.16 (lower) expanded into normally-open switches.
ii An engineer responsible for implementation of the RS-latch shown in
Figure B.4 would require specifications of…
• Fan-out
• Setup time
• Hold time
• Propagation delay
• Power consumption
• Power dissipation
Question six
The operation of the clock system shown in Figure 5.28 is easily explained if we
consider each side of the capacitor separately. The top gate is referred to below
as A, the lower left and right as B and C respectively.
Assume that every point in the system commences at logic 0. All gates will
attempt to set their outputs to 1. After tpd+trc gate C will succeed, causing the
system output to become 1⁴.
The logic value will now propagate through each gate in turn so that, after
3·tpd, gate C output, and thus that of the system too, will attempt to become 0. The
capacitor must now discharge into the output terminal of gate C. Hence the
system output returns to 0 after 3·tpd+trc.
The system behaves cyclically in this way.
Question seven
i A latch is enabled and contains only a single buffer whereas a flip-flop is
clocked and contains a double buffer. The flip-flop may engage in synchronous
communication whereas a latch cannot because it is transparent, i.e. when
enabled the value at the input is visible at the output.
ii The edge-triggered D-type flip-flop shown in Figure 5.24 is constructed
from three RS-latches and an AND gate. RS-latches have the property of
retaining their state when S̄ = R̄ = 1.
Ck=0 implies S̄out = R̄out = 1. Hence the output of the final latch, and thus of
the whole system, will remain unchanged while the clock input is zero. On a
positive clock state transition we must consider the two possible values of D
independently.
D=0 implies S̄out=1 still, while R̄out=0, since the upper latch remains in its
previous state, having both inputs set. Hence Qout=0.
D=1 implies S̄out=0 while R̄out=1 since, this time, the lower latch retains its
state. The upper latch changes state so that S̄out=0, hence Qout=1.
Use is made of the property of a latch made from NAND gates to maintain a
stable state, with Q = Q̄ = 1, when S̄ = R̄ = 0. This allows S̄out = R̄out = 1, regardless
of D, as long as Ck=0.
Should D change 0 → 1 while Ck=1, the lower latch will retain its previous
state since S̄ = R̄ = 1; hence so will the whole system. Should D change 1 → 0
while Ck=1, the upper latch will retain its state since the lower one will adopt the
“forbidden” state where both inputs are zero and both outputs are one. (The AND
gate will effectively disconnect Ck from the input of the lower latch after
the output of the upper one changes to zero.) See Figure B.5.
4 trc is the time taken for the capacitor to charge through the resistor.
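The explanation can be animated with a small gate-level simulation. This Python sketch uses the usual six-NAND positive-edge D flip-flop network (a standard topology assumed here, not reproduced from Figure 5.24) and settles the network by repeated in-place evaluation until it reaches a fixed point.

```python
def nand(*inputs):
    return int(not all(inputs))

class DFlipFlop:
    """Six-NAND positive-edge-triggered D flip-flop, settled iteratively."""
    def __init__(self):
        self.n = {"n1": 0, "n2": 1, "n3": 1, "n4": 1, "q": 0, "qb": 1}

    def step(self, clk, d):
        n = self.n
        for _ in range(50):                      # iterate to a fixed point
            before = dict(n)
            n["n1"] = nand(n["n2"], n["n4"])
            n["n2"] = nand(n["n1"], clk)
            n["n3"] = nand(n["n2"], clk, n["n4"])
            n["n4"] = nand(n["n3"], d)
            n["q"] = nand(n["n2"], n["qb"])      # output latch
            n["qb"] = nand(n["q"], n["n3"])
            if n == before:
                break
        return n["q"]

ff = DFlipFlop()
ff.step(0, 1)                  # set D up while the clock is low
assert ff.step(1, 1) == 1      # positive edge captures D=1
assert ff.step(1, 0) == 1      # changing D while Ck=1 leaves Q unchanged
ff.step(0, 0)
assert ff.step(1, 0) == 0      # next positive edge captures D=0
```

The middle assertion demonstrates the edge-triggered property argued above: once the clock is high, changes on D cannot propagate to the output latch.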
Question one
Boolean recursive absorption…
Question two
i Truth tables for two half adders and one full adder demonstrate their
equivalence (see Table B.1).
iii A 4-bit parallel adder with look-ahead carry must generate the following
outputs.
•
•
The carry input for the second adder and the carry out are given by…
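The generate/propagate recurrence behind look-ahead outputs of this kind can be checked exhaustively. Real hardware expands c(i+1) = g(i) + p(i)·c(i) into two-level logic; this Python sketch evaluates the same recurrence serially, which is functionally identical.

```python
def cla4(a, b, c0=0):
    """4-bit adder built from generate and propagate terms."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]     # g_i = a_i . b_i
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]   # p_i = a_i XOR b_i
    c = [c0]
    for i in range(4):
        c.append(g[i] | (p[i] & c[i]))                  # c_{i+1} = g_i + p_i.c_i
    s = sum((p[i] ^ c[i]) << i for i in range(4))       # s_i = p_i XOR c_i
    return s, c[4]

for a in range(16):
    for b in range(16):
        s, cout = cla4(a, b)
        assert (cout << 4) | s == a + b
```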
Question three
i BCD representation uses four bits. Only the first ten binary states are valid. For
example 1001₂ represents 9₁₀ whereas 1010₂ is invalid and does not represent
anything. 45₁₀ is represented, using eight bits, by 0100.0101₂; 938₁₀, using
twelve bits, by 1001.0011.1000₂.
ii The state table for a BCD counter, made from T-type flip-flops, is shown in
Table B.2. Four flip-flops are required since 2⁴ > 10 > 2³.
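A packing routine (the name is an invention of this sketch) makes the examples above easy to check:

```python
def to_bcd(n):
    """Pack each decimal digit of n into its own 4-bit group."""
    result, shift = 0, 0
    while True:
        result |= (n % 10) << shift   # least significant digit first
        n //= 10
        shift += 4
        if n == 0:
            return result

assert to_bcd(9) == 0b1001
assert to_bcd(45) == 0b0100_0101          # eight bits
assert to_bcd(938) == 0b1001_0011_1000    # twelve bits
```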
Question four
An octal binary decoder⁵ may be employed to transform three binary address
digits into eight binary signals which each enable a single data register to read or
write data. Figure B.10 illustrates the connections required.
Question one
Control signals making up the control word for the processor described are as
follows…
Active registers a2r, a2w, a2i, a2d, a2z, a2c, a2sl, a2sr (8)
a1r, a1w, a1i, a1d, a1z, a1c, a1sl, a1sr (8)
a0r, a0w, a0i, a0d, a0z, a0c, a0sl, a0sr (8)
Program counter PCr, PCw (2)
Processor state register PSr (1)
Arithmetic unit au2r, au1w, au0w, auf1, auf0 (5)
There are 32 signals making up the whole control word. It is usually advisable to
group related signals together. Here, it is appropriate to group signals into two
fields…
• Active registers
• Arithmetic unit
…as shown in Figure B.12. Thus the control word may be represented by eight
hex digits, e.g. 0000.0000₁₆.
Question two
First we compose the state assignment table (Table B.4). Only three states are
necessary, hence only two flip-flops are required. Secondly the state transition
table must be derived, (Table B.5). Lastly, Table B.6 shows the truth table for
flip-flop excitation and output generation.
The following are the irreducible expressions which require implementation…
Question three
A simple step to increase performance would be to ensure that multiplier takes
the smaller of the two values and multiplicand the larger. This will minimise the
number of loop iterations. This may be achieved by preceding the algorithm
with…
Note that the architecture described in question one could easily be extended to
allow the conditional negations to be conducted in parallel. A further active
register control input (A=Absolute value) could enable the MSB to signal a
control system (the solution to question two) to start. The A control inputs of
both registers may be simultaneously activated.
A microprogram which implements the unextended given algorithm for the
architecture of question one is…
zero a2
move a0, au0
add a0
IF zero THEN jump #0000
right a0
IF carry THEN move a2, au0
add a1
move au2, a2
left a1
jump <start>
Question four
The above microprogram contains thirteen micro-operations. It is reasonable to
assume that few instructions would prove so complicated to implement. For
example, a register move requires only a single micro-operation. Given say
twenty instructions for our processor, a reasonable assumption for the size of the
MROM would be 256 locations dictating an address width of 8 bits.
The arithmetic unit and active registers give rise to little processor state. A
carry and zero flag will suffice. In order to force or deny jumping, both logic 1
and logic 0 must be selectable. Hence two bits are required in the micro-
instruction format.
To summarize, a possible micro-instruction format for a counter-based control
unit is…
• Jump/Enter=0 ⇒ Jump
• Flag select field:
  – 0 → 0
  – 1 → 1
  – 2 → Zero
  – 3 → Carry
• Start address=10₁₆
Note that, as a result, Jump/Enter and Flag Select fields being clear will cause a
jump on condition 0, which always fails, causing the CAR to simply increment.
Hence no jump or entry to new microprogram occurs if the most significant
three bits are clear.
Question five
Figure B.14 shows a design of the required control unit, achieved using a simple
analysis of the flow of control. The END signal would be used to trigger the fetch
microprocess. START would be asserted by the instruction decoder on finding the
multiplication opcode at its input.
The control word signals asserted on each timing signal (as enumerated on the
figure) are tabulated in Table B.8.
Table B.8: Control signals asserted vs. timing signal for shift register control unit
Timing signal Control signals asserted
1 a2z, a0r, au0
2 a0r, au1, af0
3 a0sr
4 a2r, au0
5 a1r, au1, af0
6 au2, a2w
7 a1sl
Processor Organization
Question one
i Value parameters are initialized by the compiler by copying the value of the
actual parameter into the stack location of its formal parameter counterpart.
Reference parameters are initialized by copying the address of the actual
parameter.
ii Figure B.15 depicts the stack structures on invocation of both procedure
HexStringToCard and function Real.
Question two
The register relative addressing mode references for procedure
HexStringToCard and function Real are…
String +12(fp)
Value +9(fp)
Error +6(fp)
Index −3(fp)
NextSeed +8(fp)
Seed +12(fp)
Random +16(fp)
Question three
i It takes longer to perform a bus operation than an internal processor operation
because of the necessarily sequential protocol of a bus communication
transaction. This is basically composed thus…
• Address decoding
• Physical remoteness of memory
ii The number of bus cycles required for each of the two instruction classes,
move and arithmetic operator is tabulated below versus architecture…
Question four
i Typical code generated for the accumulator machine would be…

load a
mul #2
store temp.1
load a
mul c
mul #4
store temp.2
load b
mul b
sub temp.2
div temp.1
store RootSquared

Typical code for the stack machine and register machine would be…

push a          load a, r0
push #2         mul #2, r0
mul             load c, r1
push a          mul a, r1
push c          mul #4, r1
mul             load b, r2
push #4         mul r2, r2
mul             sub r2, r1
ii The number of instructions, code length and number of bus operations required
by the above implementations is tabulated below…
The register file machine clearly offers an advantage in execution speed, over the
stack machine for expression evaluation at the expense of code length. In order to
alleviate the problem with code length, which arises not because of the number
of instructions but because of the need for absolute addressing, and to afford
support for procedure invocation and return, it has become common for
commercial architectures to include both a register file and a stack.
Note just how greatly the execution speed of a stack machine is improved by
holding the top three stack items in processor registers.
Question five
i The following operations must be performed on a stack machine for function
invocation, entry, exit and return (number of bus operations shown in brackets)…
Note that space for the return value etc. may be allocated by simply
incrementing the stack pointer, without the need for a bus operation. A total of
twelve bus operations is thus required for any function call where three
parameters are passed and the result returned may be accommodated within a
single memory word.
ii A register windowing machine must still perform some similar operations…
1. Move parameters
2. Branch to subroutine
3. Increment current window pointer
4. Decrement current window pointer
5. Move return
…but the moving of data is all between registers and does not require a single bus
operation. The main limitations of register windowing are…
These limitations are known not to greatly affect operational efficiency. Research
has shown that…
These results have been reported by more than one research group, [Tanenbaum
78], [Katevenis 85].
Question six
A procedure is a software partition which has two main motivations…
System organization
Question one
i The cost of a memory is substantially affected by the address width required.
Therefore a two-dimensional memory should be rendered square since this
minimizes the number of address bits required. For example, given that…
6 …as opposed to a function, which abstracts a quantity. The two are sadly often
confused, particularly when the programming language provides one and not the other
(e.g. “C”, which provides functions only). The result is identifier names which
correspond poorly with abstract (problem level) actions or quantities, rendering the
program unreadable.
It can be shown that A is a minimum with respect to N if and only if x=y,
provided that N is a perfect square. However we shall be content with the
example of N=16, where we observe that…
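The claim can be checked numerically. The sketch below assumes, as one reasonable reading of the text, that A is the multiplexed address width, i.e. the larger of the row and column address widths when row and column addresses share the same pins; it enumerates every x-by-y arrangement of N=16 cells:

```python
# For each x-by-y arrangement of N = 16 memory cells, compute the address
# width needed when row and column addresses are multiplexed onto the same
# pins: max(bits to select a row, bits to select a column).

from math import ceil, log2

def bits(n):
    # address bits needed to select one of n rows or columns
    return ceil(log2(n)) if n > 1 else 0

N = 16
widths = {(x, N // x): max(bits(x), bits(N // x))
          for x in (1, 2, 4, 8, 16)}
best = min(widths, key=widths.get)      # arrangement with fewest lines
```

The 4-by-4 (square) arrangement needs only 2 multiplexed address lines, against 3 for 2-by-8 and 4 for the linear 1-by-16 arrangement, consistent with the argument above.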
Question two
i There follows a summary of control bus signal channels…
• Bus request
• Bus grant
• Event
• Event acknowledge
• Clock
• Sync
• Read/Write
• Ready
• Address strobe (if multiplexed)
ii Figure B.16 depicts the timing for a successful daisy chain bus arbitration.
A higher priority device, on completing its transaction, will cease to assert bus
request in order to release the bus to another master on the subsequent bus cycle
(whose start is marked by sync). The arbiter will keep bus grant asserted if a lower
priority device is still holding bus request asserted. It is up to the higher
priority device to pass bus grant on down the chain as soon as it ceases
to drive bus request.
iii Daisy chain arbitration has the advantage of being very cheap and simple to
implement. It is unfortunately rather unfair to low priority devices, which run
the risk of lock-out. Some unreliability is due to a “weakest link” vulnerability:
should any one device fail, e.g. to pass on bus grant, all those further down the
chain will also suffer. No software control of priority is possible. The ability
to be extended almost indefinitely, without overhead in cost, is an advantage.
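Both the priority behaviour and the “weakest link” vulnerability can be seen in a small sketch. The model below is an abstraction of the scheme described, not a timing-accurate one: the grant enters at the highest-priority device and is passed down only by devices that are not requesting the bus.

```python
# Sketch of daisy-chain grant propagation. Index 0 is the device nearest
# the arbiter (highest priority). A device that wants the bus keeps the
# grant; one that does not passes it on; a failed device passes nothing.

def arbitrate(requests, broken=()):
    """requests: list of booleans, one per device in chain order.
    Returns the index of the device that wins the bus, or None."""
    for i, wants_bus in enumerate(requests):
        if i in broken:
            return None        # failed device: grant never propagates
        if wants_bus:
            return i           # device keeps the grant for itself
    return None                # grant falls off the end of the chain

winner = arbitrate([False, True, True])   # device 1 beats device 2
blocked = arbitrate([False, False, True], broken={1})
```

Device 2 can request forever yet never win while device 1 competes (lock-out), and a single failure at device 1 denies the bus to everything beyond it.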
Polled arbitration offers software control of priority and greater reliability, but
requires the whole system to be upgraded if extended beyond the limit set by the…
Question three
i Figure B.17 shows the use of an interleaved memory in implementing a 3-plane
colour graphics display.
System bus access may select a data word from just one module, asserting just
one select signal. The raster display controller requires all three RGB memory
modules to be selected simultaneously. For the sake of simplifying the diagram,
this is shown using a separate select input; on each module the two would be
OR’d together internally.
ii Figure B.18 shows how the key matrix is connected to the system bus. It is
read by the modulo 8 counter activating each row (or column) in turn via the
decoder. Any key press event will cause the VIA to latch the (encoded) active
column and the current row number (counter value), which together form the key
code.
The counter will need to run quickly enough to catch any key press. A key will
remain depressed for a time which is very long compared to the system clock
cycle, so this will not usually be a problem. However the system must respond to
the interrupt and clear the latched port A in time for the next key press. A
monostable7 should be used on the event signal for two purposes…
• Debouncing of switch
• Pulse generation
Debouncing means preventing the multiple edge signals that result from the switch
making contact more than once when depressed (bouncing). A pulse is preferred
to a continuous level, since a continuous level would prevent other event sources
from sending event signals.
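The scan itself is simple enough to sketch. In the Python model below the 8-by-8 matrix size follows the modulo 8 counter above, while the exact composition of the key code (row number in the high bits, encoded column in the low bits) is an assumption for illustration:

```python
# Sketch of the key-matrix scan: the counter activates one row at a time
# through the decoder; a pressed key connects its row to a column, and the
# row number together with the encoded column forms the key code. The
# row:column packing used here is an illustrative assumption.

def scan(pressed):
    """pressed: set of (row, column) keys currently held down.
    Returns the key code latched for the first key found, or None."""
    for row in range(8):                  # modulo 8 counter drives rows
        for col in range(8):              # encoder reports active column
            if (row, col) in pressed:
                return (row << 3) | col   # key code = row : column
    return None

code = scan({(5, 2)})                     # key at row 5, column 2
```

A full scan of an idle matrix returns nothing; in hardware the event signal (and monostable) would only fire when some key closes a crosspoint.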
Question four
The purpose of this exercise was to provoke thought. To have arrived at an
efficient, practical implementation would earn praise indeed! However it has
been observed that solutions have been found for problems previously thought
intractable when those seeking them were unaware that they were so deemed.

7 Device with a single stable state, to which it returns after a known interval
when disturbed, thus allowing generation of a pulse from an edge.
A known good approximation to LRU, which is efficient and practical to
implement, is as follows. Each entry ei is assigned a single reference bit ri,
which is set whenever the entry is referenced. Whenever a miss occurs, entry ej is
loaded and all reference bits ri such that i ≠ j are cleared. The entry replaced
is the first one for which ri=0 in a search whose order is predetermined and fixed.
If all ri=1 then the entry possessing the smallest address is replaced.
The reader should verify, by thought and experimentation with an imaginary
list of addresses, that entries replaced do tend to be those least recently used.
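Such an experiment is easy to carry out in a few lines of Python. The sketch below implements the approximation exactly as described (the class and method names are invented):

```python
# The reference-bit approximation to LRU described above: one bit per
# entry, set on access; on a miss, load into the first entry (in fixed
# search order) whose bit is clear, fall back to the smallest address if
# all bits are set, and clear every other entry's reference bit.

class PseudoLRU:
    def __init__(self, nentries):
        self.entries = [None] * nentries
        self.refbit = [0] * nentries

    def access(self, tag):
        if tag in self.entries:                 # hit: set reference bit
            self.refbit[self.entries.index(tag)] = 1
            return True
        # miss: fixed-order search for a clear reference bit
        clear = [i for i, r in enumerate(self.refbit) if r == 0]
        j = clear[0] if clear else 0            # smallest address if all set
        self.entries[j] = tag
        self.refbit[j] = 1
        for i in range(len(self.refbit)):       # clear all other bits
            if i != j:
                self.refbit[i] = 0
        return False

cache = PseudoLRU(4)
for tag in ("a", "b", "c", "a", "d"):
    cache.access(tag)
```

Tracing the five accesses by hand shows each victim is indeed an entry not referenced since the previous miss, which is the “tend to be least recently used” behaviour claimed above.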
Question five
i The following table shows the working which gives an error syndrome of
10101₂ for the data word FAC9₁₆=1111.1010.1100.1001₂.
ii The following table shows the error vector produced for each possible bit in
error. As can clearly be seen each one is distinct and may be decoded (e.g. using
a ROM) to indicate the bit requiring inversion.
Note that a syndrome error can be distinguished from a data error by the
syndrome parity! The reader should verify that double errors produce a non-zero
error vector, but unfortunately one which is not unique to each error, and that
triple errors cannot be detected.
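Syndrome decoding can also be demonstrated in software. The sketch below uses the standard Hamming layout, with check bits at power-of-two positions and the syndrome formed as the XOR of the positions of all set bits; the book's particular check-bit assignment may differ, so the syndrome values here follow the standard scheme rather than the table above.

```python
# Generic single-error-correcting Hamming code over 16 data bits, using
# the standard layout (check bits at positions 1, 2, 4, 8, 16). This is
# an illustration of syndrome decoding, not the book's exact bit scheme.

def encode(data, ndata=16):
    """Place data bits (LSB first) at non-power-of-two positions and set
    each check bit to even parity over the positions it covers."""
    word = {}
    pos, i = 1, 0
    while i < ndata:
        if pos & (pos - 1):              # not a power of two: data bit
            word[pos] = (data >> i) & 1
            i += 1
        pos += 1
    for p in (1, 2, 4, 8, 16):           # check bit p covers positions q
        word[p] = sum(v for q, v in word.items() if q & p) % 2
    return word

def syndrome(word):
    """XOR of the positions of all set bits: zero for a valid codeword,
    otherwise the position of a single bit in error."""
    s = 0
    for p, v in word.items():
        if v:
            s ^= p
    return s

w = encode(0xFAC9)
clean = syndrome(w)        # valid codeword: syndrome is zero
w[11] ^= 1                 # inject a single-bit error at position 11
fault = syndrome(w)        # syndrome names the position to invert
```

Flipping a second bit still gives a non-zero syndrome, but one equal to the XOR of the two positions and hence not unique to the error, matching the double-error observation above.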
Question one
i Code generation to implement a case construct with more than about a dozen
case labels will usually employ a jump table approach. A table of…
• Offsets
• Absolute addresses
• Jump or branch instructions
…to the code segment corresponding to each case label is created within the
executable file. The size of the table will be decided by the upper and lower case
label bounds. An entry is present for every value between the bounds. Those not
corresponding to case labels point to a code segment which corresponds to the
“else” clause in the construct.
In practice the result of the case expression evaluation will be checked against
the table bounds, which must thus also be recorded. If outside the bounds the
“else” code must be invoked.
Evidently many widely scattered case label values may easily result in an
unacceptably large table. For example, an integer case expression with more than
a dozen case label values scattered from zero to a million will give rise to a table
with a million entries! If there are fewer than a dozen entries a compiler will often
substitute a series of compare and branch instructions, resulting in code
length directly proportional to the number of case labels rather than their range.
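The jump-table strategy can be sketched directly, with Python functions standing in for the compiled code segments (the labels and handler bodies below are invented for illustration):

```python
# Jump-table dispatch for a case construct: a dense table spans the label
# bounds, unlisted values inside the bounds map to the "else" segment,
# and a bounds check catches values outside the table entirely.

def build_case(handlers, default):
    """handlers: {case label: code segment}. Returns a dispatch function."""
    lo, hi = min(handlers), max(handlers)
    # one entry per value between the bounds, gaps filled with "else"
    table = [handlers.get(v, default) for v in range(lo, hi + 1)]

    def dispatch(value):
        if value < lo or value > hi:     # bounds check, as in the text
            return default()
        return table[value - lo]()       # one indexed jump, no compares
    return dispatch

case = build_case({3: lambda: "three", 7: lambda: "seven"},
                  lambda: "else")
```

Note that the table here has hi−lo+1 = 5 entries for only two labels; labels of 0 and 1 000 000 would force a million-entry table, which is the hazard discussed above.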
The programmer should only ever need to be aware of the programming
model and not the underlying architecture. This rule breaks down with the use of
the case construct (unless the programmer can afford not to care about code
length).
Question two
i The following code (in order of execution) is required for procedure invocation,
entry and exit by the NS32000 and Transputer respectively…
The comparison between the two processors of code size and execution time is
shown below…
While the code storage requirement differs significantly, but not greatly, the
performance overhead in procedure invocation is clearly very much more severe
on the NS32000 than on the Transputer. The principal reason for the enormous…
The comparison between the two processors of code size and execution time is
shown below…
This shows that, despite a dedicated NS32000 instruction, the Transputer does
not require greater execution time when implementing case, although it does
require greater code length. The better performance of the Transputer is mainly
due to caseb taking a long time to calculate its effective address, during which
the Transputer could have executed jump several times.
Question three
i The code required to generate an operand of −241₁₀ and load it into the A
register is…

nfix #F ; encoded as #6F
ldc  #F ; encoded as #4F

ldc <start value>
stl index
ldl ntimes
stl index+1
ldl index+1
pfix <exit offset hi>
cj <exit offset lo>
; loop start
…
ldlp index
pfix <start offset hi>
ldc <start offset lo>
lend
Note that an additional conditional branch has to be added at the loop start. No
comparison of the loop count with zero is necessary since cj will execute a
branch if the value in A is zero. Recall that its meaning is “jump if false”, where
0 represents the logical value false.
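The operand-building mechanism behind nfix and pfix can be verified with a short simulation. Each Transputer instruction byte carries a 4-bit data nibble; pfix shifts the nibble into the operand register, nfix complements and then shifts, and ldc consumes the result:

```python
# Simulation of the Transputer's operand register: pfix extends the
# operand, nfix complements then extends (for negative constants), and
# ldc loads the final value. Executing nfix #F, ldc #F builds -241.

MASK = 0xFFFFFFFF                        # 32-bit operand register

def signed(x):
    # interpret a 32-bit pattern as a two's complement integer
    return x - (1 << 32) if x & 0x80000000 else x

def run(code):
    """code: list of (mnemonic, nibble) pairs; returns what ldc loads."""
    O = 0
    for op, data in code:
        if op == "pfix":
            O = ((O | data) << 4) & MASK
        elif op == "nfix":
            O = ((~(O | data)) << 4) & MASK
        elif op == "ldc":
            return signed(O | data)      # load constant, operand cleared
    return None

value = run([("nfix", 0xF), ("ldc", 0xF)])
```

Here nfix #F leaves the operand register holding FFFFFF00₁₆, and ldc #F completes it to FFFFFF0F₁₆, i.e. −241, confirming the two-byte sequence given above.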
Bibliography
[ANSI 86] American National Standards Institute: 1986, Small Computer Systems
Interface, X3T9.2 (available from…X3 Secretariat, Computer and Business
Equipment Manufacturers Association, Suite 500, 311 First Street NW, Washington,
D.C. 20001, U.S.A.)
[Arbib 87] Arbib M.A.: 1987, Brains, machines, and mathematics, Springer-Verlag
[Bell & Newell 71] Bell C.G. & A.Newell: 1971, Computer structures: Readings and
examples, McGraw-Hill
[Bentham 83] Bentham, J.Van: 1983 The logic of time, Reidel
[Burns 88] Burns A.: 1988, Programming in Occam 2, Addison-Wesley
[Church 36] Church A.: 1936, “An unsolvable problem in elementary number theory”,
American Journal of mathematics, 58, 345
[Ciarcia 86] Ciarcia S.: 1986, “Adding SCSI to the SB180 computer”, Byte, 11, 5, 85 and
11, 6, 107
[Clements 87] Clements A.: 1987, Microprocessor systems design: 68000 hardware,
software & interfacing, PWS
[Colwell et al. 85] Colwell R.P., C.Y.Hitchcock, E.D.Jensen, H.M. Brinkley Sprunt &
C.P.Kollar: 1985, “Computers, complexity & controversy”, IEEE Computer, 18, 9, 8
[Conway 82] Conway J.H., E.R.Berlekamp, R.K.Guy: 1982, Winning Ways for your
Mathematical Plays, Academic Press
[Deitel 84] Deitel H.M.: 1984, An introduction to operating systems, Addison-Wesley
[Dowsing et al. 86] Dowsing R.D., V.J.Rayward-Smith, C.D.Walter: 1985, A first course
in formal logic and its applications in computer science, Blackwell Scientific
Publications
[Goldberg & Robson 83] Goldberg A., D.Robson: 1983, Smalltalk-80: The language and
its implementation, Addison-Wesley
[Halpern et al. 83] Halpern J., Z.Manna & B.Moszkowski: 1983, “A hardware semantics
based on temporal intervals”, Proc. 19th Int. Colloq. on Automata, languages and
programming, Springer Lecture Notes in Computer Science, 54, 278
[Harland 86] Harland, D.M.: 1988, Recursiv: Object-oriented computer architecture, Ellis
Horwood
[Hayes 88] Hayes J.P.: 1988, Computer architecture and organization, 2nd edition,
McGraw-Hill
[Hayes 84] Hayes J.P.: 1984 Digital system design and microprocessors, McGraw-Hill
[Hoare 78] Hoare C.A.R.: 1978, “Communicating Sequential Processes”, CACM, 21,
8, 666
[Hoare 85] Hoare C.A.R.: 1986, Communicating Sequential Processes, Prentice-Hall
[IEEE CS 83] IEEE Computer Society: 1983, Model program in computer science &
engineering, IEEE Computer Society Press, (available from IEEE Computer Society,
Post Office Box 80452, Worldway Postal Center, Los Angeles, CA 90080, USA)
[Inmos 88#1] Inmos Ltd: 1988, Occam 2 reference manual, Prentice Hall
[Inmos 88#2] Inmos Ltd: 1988, Transputer reference manual, Prentice Hall
[Inmos 88#3] Inmos Ltd: 1988, Transputer instruction set: A compiler writer’s guide,
Prentice Hall
[Inmos 88#4] Inmos Ltd: 1988, Transputer development system, Prentice Hall
[Inmos 88#5] Inmos Ltd: 1988, Communicating process architecture, Prentice Hall
[Inmos 89] Inmos Ltd: 1989, Transputer technical notes, Prentice Hall
[Kane 81] Kane G.: 1981, 68000 microprocessor handbook, Osborne/McGraw-Hill
[Kane, Hawkins & Leventhal 81] Kane G., D.Hawkins & L.Leventhal: 1981, 68000
assembly language programming, Osborne/McGraw-Hill
[Katevenis 85] Katevenis M.: 1985, Reduced instruction set computer architectures for
VLSI, MIT Press
[Knepley & Platt 85] Knepley E., R.Platt: 1985, Modula-2 programming, Reston
[Laub 86] Laub L.: 1986, “The evolution of mass storage”, Byte, 11, 5, 161
[Lister 84] Lister A.M.: 1984, Fundamentals of operating systems, Macmillan
[Little 86] Little G.B.: 1986, Mac assembly language: A guide for programmers, Brady/
Prentice-Hall
[Mano 88] Mano M.M.: 1988, Computer engineering, hardware design, Prentice-Hall
[Martin 87] Martin C.: 1987, Programming the NS32000, Addison Wesley
[Meyer 88] Meyer B.: 1988, Object oriented system design, Prentice-Hall
[Moszkowski 83] Moszkowski B.: 1983, “A temporal logic for multi-level reasoning
about hardware”, Proc. IFIP 6th Int. Symp. on Computer hardware description
languages and their applications, Pittsburgh, Pennsylvania
[National Semiconductor 84] National Semiconductor Corp.: 1984, Series 32000
Instruction set reference manual, National Semiconductor Corp., (available, as are
NS32000 device data sheets, from…National Semiconductor Corp., 2900
Semiconductor Drive, Santa Clara, California 95051, U.S.A.)
[National Semiconductor 87] National Semiconductor Corp.: 1987, Series 32000 GNX
Release 2.0 assembler reference manual, National Semiconductor Corp., (available,
as are NS32000 device data sheets, from…National Semiconductor Corp., 2900
Semiconductor Drive, Santa Clara, California 95051, U.S.A.)
[von Neumann 46] Burks A.W., H.H.Goldstine & J.von Neumann: 1946, “Preliminary
discussion of the logical design of an electronic computing instrument”, in [Bell &
Newell 71]
[von Neumann 66] von Neumann J.: 1966, Theory of self reproducing automata,
A.W.Burks (ed.), University of Illinois Press
[Patterson & Ditzel 80] Patterson D.A., D.R.Ditzel: 1980, “The case for the RISC”,
Computer architecture news, 18, 11, 25
[Pountain & May 87] Pountain D., D.May: 1987, A tutorial introduction to Occam
programming, Blackwell Scientific Publications
[Radin 83] Radin G.: 1983, “The 801 minicomputer”, IBM J. R & D, 27, 3, 237